HPC: Stop Scaling the Hard Way


By Bjorn Kolbeck, CEO of Quobyte 

Three decades or so ago, at the dawn of high-performance computing, there was no such thing as commodity hardware in the server world. Organizations had their proprietary platforms, and they had to make do with squeezing as much performance from them as possible. Reliability was generally an afterthought, because the architectures of the day left little room to improve it one way or the other. And yet, thanks to that era’s low data volumes and limited algorithms, early HPC architectures worked.

As supercomputers and HPC clusters grew over time, those proprietary systems failed to scale effectively, particularly in their storage architectures. Two computers would form a pair that shared a back-end store, but those pairs were meant to operate in relative isolation. Stretching this design into hundreds of pairs exposed its architectural weaknesses.

Fast forward, and today’s situation is clear: HPC is struggling with reliability at scale. Well over 10 years ago, Google proved that commodity hardware was both cheaper and more effective for hyperscale processing when controlled by software-defined systems, yet the HPC market persists with its old-school, hardware-based paradigm. Perhaps this is due to prevailing industry momentum or working within the collective comfort zone of established practices. Either way, hardware-centric approaches to storage resiliency need to go. Google’s revolutionary software-centric approach to scale-out architecture opened the gates for far greater operational efficiency, and now it’s time for the HPC world to learn Google’s lessons.

Google Rejected Hardware Headaches

In contrast to the HPC world, Google started out with an engineering crew full of distributed systems experts, and with them came a different perspective. Of course, Google has had its share of missteps, but when it comes to scale-out computing, there’s no arguing with the search giant’s accomplishments.

First, Google bucked the trend of relying on specialized and even custom servers built to painstaking quality levels in an (often unsuccessful) attempt to be highly resilient. Instead, the company deployed legions of commodity, off-the-shelf servers that achieved cluster-level resilience through software. This platform approach became known as warehouse-scale machines. From a scaling perspective, this was a sensible play. The larger and more complex systems grow, the more common and costly their failures become. This is why HPC storage failures tend to cause far more disruption than comparable failures in fields that have adopted software-centric architectures.

Google determined that it was pointless to ask how to prevent failures. Rather, the point became to anticipate failure and mitigate its impact. That one shift in perspective completely undermined the hardware-centric paradigm. Google proved that with commodity hardware, it was possible to move beyond trying to bridge hundreds of “island” pairs and instead build one single landmass where failure at any point was allowed and expected.


The Lustre file system, which launched in 2003 and remains very popular in HPC environments, illustrates the point. As a product of its time, Lustre required that resiliency be handled in hardware. This added complexity and difficulty, since conventional RAID controllers alone wouldn’t suffice. To span multiple servers, engineers had to hardwire systems so that paired servers could see, mount, and share each other’s drives, yet a drive can never safely be mounted by two systems at the same time. The two servers therefore had to be able to switch each other off, which required a power hub with a remotely manageable socket so that one server could fence the other.
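To make those mechanics concrete, here is a minimal sketch of the fencing-and-takeover logic such a high-availability pair relies on. The pdu-ctl command, the addresses, and the device paths are hypothetical placeholders; real deployments use dedicated HA stacks such as Pacemaker/Corosync rather than a hand-rolled script.

```python
# A minimal sketch of the hardware-fencing logic behind a classic HPC
# high-availability pair, assuming Linux and a managed power distribution
# unit. The "pdu-ctl" command, addresses, and device paths are hypothetical
# placeholders; real deployments use HA stacks such as Pacemaker/Corosync.
import subprocess
import time

PEER_ADDRESS = "10.0.0.2"           # hypothetical partner node
PEER_PDU_OUTLET = "outlet-7"        # hypothetical socket feeding the partner
SHARED_DEVICE = "/dev/mapper/ost0"  # hypothetical shared block device
MOUNT_POINT = "/mnt/ost0"
HEARTBEAT_TIMEOUT = 15              # seconds of silence before takeover
POLL_INTERVAL = 5


def peer_is_alive() -> bool:
    """Crude liveness check: a single ping to the partner node."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", PEER_ADDRESS],
        stdout=subprocess.DEVNULL,
    ) == 0


def fence_peer() -> None:
    """Cut power to the partner via the managed socket (hypothetical CLI)."""
    subprocess.check_call(["pdu-ctl", "off", "--outlet", PEER_PDU_OUTLET])


def take_over() -> None:
    """Mount the shared device locally; only safe once the partner is off."""
    subprocess.check_call(["mount", SHARED_DEVICE, MOUNT_POINT])


def main() -> None:
    silent_for = 0
    while True:
        if peer_is_alive():
            silent_for = 0
        else:
            silent_for += POLL_INTERVAL
            if silent_for >= HEARTBEAT_TIMEOUT:
                fence_peer()  # guarantee the drive cannot be mounted twice
                take_over()
                break
        time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    main()
```

Every piece of this depends on dedicated hardware: the shared drive enclosure, the crossed cabling, and the managed power socket.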

This approach was workable when HPC clusters were handling a petabyte of data, but it becomes highly constrained with modern datasets reaching into the tens or hundreds of petabytes. Managing storage across systems at that scale is a nightmare, and the persistence of hardware-centrism is the reason why. Google realized this and showed that resiliency becomes significantly easier when it is managed in software.

Superior in Software

In 1989, computer scientist Leslie Lamport wrote a paper defining the Paxos algorithm, named after a Greek island where a (fictional) parliament had to reach consensus even as members constantly entered and exited the legislative chamber. This followed Lamport’s earlier work on “The Byzantine Generals Problem” (solutions for which would go on to influence everything from modern flight control to Bitcoin), continuing his explorations into fault tolerance for distributed systems. Lamport’s first exploration of Paxos languished, but he published a simplified treatment in 2001 that found traction.
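For readers who have not met Paxos before, the sketch below shows the single-decree version in roughly the form of that 2001 treatment: a proposer must win a majority of acceptors in two phases before a value is chosen. It is an in-memory illustration only, with networking, retries, and crashes deliberately left out.

```python
# An in-memory sketch of single-decree Paxos in the spirit of Lamport's 2001
# "Paxos Made Simple": a proposer needs a majority of acceptors in two
# phases (prepare/promise, then accept/accepted). Networking, retries, and
# crashes are deliberately left out.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                           # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None   # (proposal number, value)

    def prepare(self, n: int):
        """Phase 1b: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """Phase 2b: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises and any previously accepted values.
    responses = [a.prepare(n) for a in acceptors]
    granted = [prior for ok, prior in responses if ok]
    if len(granted) < majority:
        return None

    # If any acceptor already accepted a value, adopt the highest-numbered one.
    priors = [p for p in granted if p is not None]
    if priors:
        value = max(priors, key=lambda p: p[0])[1]

    # Phase 2: ask the acceptors to accept the (possibly adopted) value.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


if __name__ == "__main__":
    cluster = [Acceptor() for _ in range(5)]
    print(propose(cluster, n=1, value="node-b owns volume ost0"))
```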

Google implemented the Paxos algorithm in its Chubby distributed lock service. Ceph, Apache Cassandra, Amazon Elastic Container Service, and others followed suit. In every case, a main objective is to remove the dependency on specialized hardware and let software manage data redundancy across machines while also handling any failures. Now, instead of small silos of high-availability servers, hundreds or even thousands of nodes can act together as a single, cohesive, self-healing entity.
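Chubby itself is not public, but the pattern it enables is easy to show. The sketch below uses ZooKeeper via the kazoo client as a stand-in for a Chubby-style lock service: whichever storage node holds the lock serves the volume, and if that node dies its session expires and the lock frees itself, with no power sockets involved. The ensemble address, lock path, and serve_volume() function are illustrative assumptions.

```python
# A sketch of the same failover decision as the fencing example, arbitrated
# by a consensus-backed lock service instead of a power switch. Chubby is not
# public, so ZooKeeper (via the kazoo client) stands in here; the ensemble
# address, lock path, and serve_volume() are illustrative assumptions.
from kazoo.client import KazooClient


def serve_volume(name: str) -> None:
    """Hypothetical placeholder: start serving the named volume's data."""
    print(f"now the active server for {name}")


zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # assumed ensemble
zk.start()

# Whichever node holds this lock is the active server for the volume. If the
# holder crashes, its session expires and the lock is released automatically;
# nothing has to be powered off.
lock = zk.Lock("/storage/ost0/leader", identifier="node-b")

with lock:          # blocks until this node becomes the leader
    serve_volume("ost0")
```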

Software-defined resiliency has been critical to Google’s scaling and thus its market dominance. The operational efficiency that stems from software-based infrastructure hinges on making human intervention the exception. Updates should not require people. Systems dropping offline should not (immediately) require people. Instead, a minimal admin staff should be able to manage cluster resources orders of magnitude greater than what can be achieved with a hardware-centric resiliency model.
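In practice, this means a reconciliation loop rather than a pager: the system continuously compares the desired replica count against what actually sits on live nodes and repairs the difference itself. The toy in-memory model below is only a sketch of the idea; the node names and replication factor are illustrative assumptions.

```python
# A toy sketch of the self-healing loop that keeps people out of the critical
# path: compare the desired replica count with what is actually on live nodes
# and repair the gap automatically. Node names and the replication factor are
# illustrative assumptions; a real system would stream data, not edit sets.
import random
import time

REPLICATION_FACTOR = 3

# Toy cluster state: which nodes are alive, and where each chunk's copies live.
live_nodes = {"node-a", "node-b", "node-c", "node-d"}
chunk_replicas = {
    "chunk-001": {"node-a", "node-b", "node-e"},  # node-e has failed
    "chunk-002": {"node-a", "node-c", "node-d"},
}


def reconcile() -> None:
    """One pass: re-replicate any chunk that lost a copy to a dead node."""
    for chunk, replicas in chunk_replicas.items():
        healthy = replicas & live_nodes
        missing = REPLICATION_FACTOR - len(healthy)
        candidates = list(live_nodes - healthy)
        for target in random.sample(candidates, min(missing, len(candidates))):
            healthy.add(target)  # stand-in for copying the chunk to `target`
        chunk_replicas[chunk] = healthy


def control_loop() -> None:
    """Run forever; no pager and no 4 a.m. shift required."""
    while True:
        reconcile()
        time.sleep(30)


if __name__ == "__main__":
    reconcile()
    print(chunk_replicas)  # chunk-001 is healed back to three live replicas
```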

In the same vein, cluster admins should be able to work independently from the system. In 2021, no one should have to be hands-on with an HPC cluster at 4:00 in the morning to shut down the system, just to perform an update. The cluster should not dictate operations, nor should it dictate when users can and can’t use the system. Once you decouple maintenance from the system and allow software to automate the process, efficiency increases for everyone involved.
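A rolling update is the canonical example of that decoupling. The sketch below assumes a hypothetical orchestration API (the cluster and node objects are not any specific product’s interface) and upgrades one node at a time while the rest of the cluster keeps serving data.

```python
# A sketch of maintenance decoupled from operations: a rolling update that
# upgrades one node at a time while the cluster stays online. The `cluster`
# and `node` objects describe a hypothetical orchestration API, not any
# specific product's interface.
import time


def rolling_update(cluster, version: str) -> None:
    for node in cluster.nodes():
        cluster.drain(node)                   # stop routing new I/O to the node
        node.install(version)                 # hypothetical package/image upgrade
        node.restart()
        while not cluster.is_healthy(node):   # wait until replicas are in sync
            time.sleep(10)
        cluster.undrain(node)                 # rejoin, then move to the next node
```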

HPC, It’s Time for a Better Paradigm

One can view this hardware-versus-software dichotomy as a question of operational models, since operations are dominated by manpower costs. Hardware-centric HPC solutions from the 1990s justified their high manpower costs because the specialized hardware of the time was very expensive, and because deployments were small, those costs stayed contained. But when manpower costs grow linearly with system size, the hardware-centric model clearly presents an operational problem. Data loads continue to grow exponentially, yet HPC persists with 20-year-old storage architectures and all the manpower costs that go with them.

Outside the HPC world, the industry migrated to Hadoop for mass-scale data analysis, but Hadoop may be more suited to Google-type workloads than HPC. Similarly, this isn’t an argument for scientists to run their applications with MapReduce. This is about Google’s strategy, not its tactics. It’s about leaving behind HPC’s legacy architectures to enable scalable growth and leaps in operational efficiency. HPC did successfully embrace object storage because it was available, scalable, manageable, and easy. Now, the same can be done for HPC file systems if operators accept systems based around true hyperscaler design.

There’s no need to carry on with proprietary or legacy storage approaches. If your HPC efforts are still constrained by hardware-based resiliency and inefficient maintenance, perhaps now is the time to follow Google’s example and take a profound leap forward.

Bjorn Kolbeck is CEO of Quobyte.

