I Really Don’t Care About TCO … It’s RCO I am Worried About

By Adam Marko, Director of Life Science Solutions – Panasas, Inc.

A look at what we are calling Research Cost of Ownership (RCO) and the effect of HPC storage downtime and reduced productivity on the overall scientific mission.

Consistent uptime is critical so researchers can run more experiments, bringing them closer to discovery. But if you can’t get things done in a fast, productive way, then you’re courting failure. The problem is exacerbated by the fact that many HPC storage technologies are vulnerable to downtime and, as a result, hurt productivity and slow progress.

Unfortunately, scientists now grapple with the same problems already familiar to many HPC users, where high maintenance requirements and regular service interruptions get in the way of getting things done. In some cases, though, system downtime doesn’t just interfere with project timetables; it can actually delay time to a cure.

Pain points in HPC storage

Over the years, users grew accustomed to the fact that HPC storage deployments were notoriously hard to manage. Organizations had to devote considerable staffing resources to specialists who could master the intricacies of these complicated storage systems. Those specialists were the only people inside the organization capable of running such large installations.

As an industry, we cannot assume that HPC data center managers will be able to expend the time, money, and staff needed to purchase and maintain clunky, complex HPC storage systems. That arrangement is long past its expiration date.

Change should have come to the HPC storage space a long time ago. But for years, most storage buyers weren’t interested in metrics around total cost of ownership. That is now changing. Consider these findings from a Hyperion survey of data managers that Panasas recently commissioned:

  • Nearly half of the respondents experienced storage system failures once a month, with buyers coming to expect downtime as the norm in HPC storage.
  • After a system failure, 40% of HPC sites typically require more than two days to restore their storage system to full functionality.
  • The most often-named challenges for HPC storage operations are recruiting and hiring qualified staff, followed by the time and cost needed to tune and optimize the storage systems.
  • More than 75% of respondents experienced reduced productivity in the past year due to storage-related issues. One in eight sites experienced this more than 10 times in the past 12 months.
  • Some outages lead to downtime that lasts as long as a week. A single day of downtime can cost anywhere from $100,000 to more than $1 million.

Clearly, issues with existing storage solutions continue to negatively impact organizational goals.
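
To put the survey figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that cost scales linearly with the length of an outage, which the survey does not claim. The function name `downtime_cost_range` and its parameters are hypothetical. The upper figure is really a floor on the high end, since the survey reports "more than $1 million" per day.

```python
def downtime_cost_range(days_down, cost_per_day_low=100_000, cost_per_day_high=1_000_000):
    """Return a (low, high) dollar estimate for an outage lasting days_down days.

    Assumption: cost grows linearly with outage length. The per-day defaults
    are the range quoted in the survey; the high value is a lower bound on
    the upper end ("more than $1 million").
    """
    if days_down < 0:
        raise ValueError("days_down must be non-negative")
    return days_down * cost_per_day_low, days_down * cost_per_day_high


# Two outage lengths mentioned in the survey: a two-day restore and a week-long outage.
for days in (2, 7):
    low, high = downtime_cost_range(days)
    print(f"{days} days down: ${low:,} to ${high:,} or more")
```

Even under this simple model, a single week-long outage lands in the millions of dollars, before any cost to the research itself is counted.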

TCO needs to be thought of differently in the Life Sciences

It is tempting to make storage decisions based entirely on minimizing the initial purchase price. An open source system appears inexpensive initially, but as noted in the survey, storage issues after installation are common and often costly. The implications of downtime for an organization, however, are measured in more than dollars. For a researcher, the financial losses incurred from storage issues are not always relatable, but the cost to research progress is very apparent.

Source: 2019 Hyperion Research Pulse Survey, “New Study Details Importance of TCO for HPC Storage Buyers”

Reduced productivity from infrastructure failures results in a delayed time to discovery. While this does have financial implications at an organization pursuing a new pharmaceutical, for example, calculating TCO in the traditional way is less straightforward at an academic center or in a collaborative effort. The reality is that there is no metric that accurately relates TCO to the overall cost of lost research time at any organization. Instead, I prefer to consider something I call Research Cost of Ownership (RCO).

RCO is the effect that downtime and reduced productivity have on the overall scientific mission. While RCO is not immediately quantifiable, it needs to be a major consideration in the storage purchasing process. The goal of research is to contribute to the collective body of knowledge for humankind. It is how humanity makes discoveries and moves forward with innovation. When this is compromised by short-sighted financial decision making, it has a ripple effect across the global body of scientific knowledge.

It is time for IT staff to consider the RCO implications of storage issues. Maybe a few dollars per TB were saved on the initial purchase, but the cost to the scientific community can be much higher. We need to bring down the curtain on the recurring drama where researchers suffer repeated downtime. Predictability, resilience and reliability should become our new standard. Science is too important for it to be delayed by avoidable technical issues.

In this era, everyone is examining their infrastructure to find the best technology approach for getting their work done quickly and reliably. That means things need to change in the HPC storage world. If we do it right, we’ll all be heroes.

Don’t compromise your RCO

The Hyperion findings constitute a wake-up call for the HPC industry to pivot in the direction of deploying accessible and reliable storage with high-performance parallel file systems that can support the challenging workloads of modern research. A reliable, scalable, commercially supported solution can meet this goal.

TCO is a fine, quantifiable accounting exercise to consider when purchasing storage, but this financial aspect is a small part of the overall picture. The goal of research is to advance human understanding of our world. Unlike TCO, RCO is somewhat qualitative, but it is essential for evaluating your infrastructure choices. Remember the immense cost to discovery if storage issues persist at your organization.

When making your next storage decision, RCO must be a key evaluation metric. We need agile, flexible storage systems that can adapt to new challenges, enabling researchers to achieve their goals. The implications here are serious. Time to results matters.
