Relief for the Solution Architect: Pushing Back on HPC Cluster Complexity with Warewulf and Apptainer

[SPONSORED CONTENT] How did you, at heart and by training a research scientist, financial analyst or product design engineer doing multi-physics CAE, end up as a… systems administrator? You set out to be one thing and became something else entirely. You finished school and began working with some hefty HPC-class clusters. One day, there’s a system problem and you, poor soul, step forward and put in a fix. Someone – probably someone more senior – throws you a compliment: “Wow, that’s impressive. Man, I could never have figured it out myself …,” that sort of thing.

Word gets around and it isn’t long before you’re the go-to person when something goes wrong with the cluster, which is often enough. Soon, there you are, sitting in front of a bank of screens monitoring the system while everyone else is doing science, balancing hedge fund portfolios or simulating cool new product designs. And you may ask yourself, “Well, how did I get here?”**

Organizations that rely on clusters – be they 100 nodes or 1,000 – would be nowhere without systems administrators, a.k.a. solution architects. It’s head-splitting, painstaking work that lacks the glamour that comes from actually using clusters. But everyone from the CEO on down knows that without good solution architects their organizations would grind to a halt.

And there aren’t nearly enough of them. Clusters are bigger, more complicated, more powerful and more heterogeneous than ever, and they’re getting harder to manage as they take on bigger, more complex jobs.

“You don’t start out thinking, ‘I’m going to get into system administration of clusters,’” said Glen Otero, a Ph.D. who is Director of Scientific Computing, Genomics AI and Machine Learning at CIQ, a technology firm with expertise in HPC-class clusters. “You start out as somebody who’s going to go do something huge in science. But you wind up in this space, because – we joke about it – you volunteered to set up the system. And then once you did, it’s like, ‘Hey, can you also do this? Can you also do that?’ And then you wake up one day and you’re like, ‘Where did my life go? I was supposed to do research.’”

[Image: CIQ at SC22]

For as long as clusters have existed, cluster provisioning and management have demanded solutions that smooth and automate – or at least partially automate – those processes. Three prominent open source projects have taken on cluster complexities, and all three are the brainchild of Greg Kurtzer, the founder and CEO of CIQ. The three projects are:

– The Rocky Linux operating system, based on the CentOS Linux distribution, which Kurtzer started and for which Red Hat announced the end of support in December 2020 (see related insideHPC story). Rocky Linux has been widely adopted by organizations that build large, complex, HPC-class clusters.

– Warewulf, a cluster provisioning solution developed by Kurtzer starting in 2001, when he was running Linux clusters at Lawrence Berkeley National Laboratory for the Department of Energy.

– Apptainer, a secure, performant application container system, also created at Berkeley Lab by Kurtzer, that began life as “Singularity,” an HPC-tailored response to Docker.

Kurtzer started CIQ to provide support, services, tools and other value adds for Rocky Linux, Warewulf and Apptainer, and the company is a driving force behind the open source communities contributing to the three projects. CIQ provides traditional HPC-related solutions and support, and it’s behind HPC-2.0, a computing paradigm leading the way toward cloud-native, hybrid, federated computing (to be discussed in a later article on this site).

[Image: Greg Kurtzer]

“Building and running clusters is hard, there’s no getting around that,” Brock Taylor, CIQ’s vice president of high performance computing and strategic partners, told us. “A cluster has thousands of components. When you add up all the hardware and software, the operating system alone has loads and loads of things in it. It takes a lot of effort to get there, a lot of expertise.”

When Beowulf clusters began in the early 1990s, provisioning was script-based, hands-on and build-it-yourself. Open source tools such as OSCAR, Rocks and Warewulf soon became available.

“So you have these provisioning systems that help make it easier to deploy clusters,” Taylor said, “but over time, the complexity keeps expanding. It’s like entropy, right? With clusters, it never gets simpler, it gets harder. The complexity always runs ahead of the solution.”

Commercial software offerings also came to the market, such as those from Platform Computing (later acquired by IBM), based in large part on Rocks, and from Bright Computing, which NVIDIA added to its enterprise stack in January 2022.

But for advocates of the open source movement, there’s value in Warewulf and Apptainer remaining community-supported and vendor-neutral. That said, they aren’t panaceas – cluster entropy always remains, and there’s the problem of too few solution architects to meet the demand, particularly those who can successfully wade into the HPC cluster alligator pit.

“This is a big problem in HPC,” Taylor said. “Finding people who can stay on top of all the technology, and retaining them, it’s a shrinking pool. And as they gain more expertise in managing HPC systems, their price can go up and they have plenty of opportunities to go elsewhere.”

Warewulf helps with cluster management in part by simplifying the addition of new cluster nodes through the use of “images,” which, as Taylor said, “is where all the magic happens.” An image contains a complete software stack, a “golden snapshot,” of the resources within a node – the software that harnesses the performance of compute, memory, networking, everything. Images enable the addition of new nodes that are exact copies of the other nodes they will work with, ensuring that all the “plumbing and wiring is hooked up correctly and consistently,” Taylor said, “which is a fairly difficult job.”

In Rocky Linux-Warewulf-Apptainer shops, Warewulf images are delivered as containers to spin up compute nodes on the cluster. These can also contain variations on existing cluster nodes – say, a node with GPUs and CPUs, whereas the other nodes are CPU-only – but can still function as part of the cluster.

Jonathan Anderson, CIQ’s lead HPC solution architect, explains why Apptainer and Warewulf are a potent combination.

“Apptainer brings scientific computing end-users into the container ecosystem, giving them full control over the operating environment that their applications run in,” he said. “Warewulf 4 brings cluster administrators into that same container ecosystem by basing compute node images on standard operating system containers. Bringing both users and administrators together in the same ecosystem allows them to better collaborate and build on each other’s work.”

This is where CIQ can play an invaluable role at HPC shops. The company has expertise not only at the foundational level of the operating system but also with Warewulf and Apptainer.

“Warewulf helps you keep your compute node software consistent while all your individual users run different applications, ‘snowflake’ applications, in containers,” Otero said. Combined into an integrated whole, the three (Rocky, Apptainer, Warewulf) mean that organizations can build and expand clusters rapidly, at scale and in a lightweight way.

“Applications run in containers, and since those are platform-independent – because everything is wrapped up in a container – it allows the administrator to manage these nodes as all being the same,” Otero said. “Snowflake applications do arise, some nodes have GPUs in them, for instance, and the administrator may want to use Warewulf to create a slightly different Linux image that will work on those nodes. Warewulf allows them to push that container out to the node with the GPUs, and then Warewulf can just as easily reinstall that node back to its previous state.”
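For readers who think in code, the workflow Otero describes can be pictured with a small toy model. This is only an illustrative sketch, not Warewulf’s actual interface: the `Cluster` class, its methods, and the image names `rocky-base` and `rocky-gpu` are all hypothetical, chosen to show the idea of every node booting a shared image, one node being switched to a variant, and that node later being put back.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model (hypothetical, not Warewulf's API): each node boots
    whichever named image it is currently assigned."""
    default_image: str
    assignments: dict = field(default_factory=dict)  # node -> current image
    previous: dict = field(default_factory=dict)     # node -> image before last change

    def add_node(self, node: str) -> None:
        # New nodes receive the same golden image as every other node.
        self.assignments[node] = self.default_image

    def assign_image(self, node: str, image: str) -> None:
        # Remember the old image so the node can be put back later.
        self.previous[node] = self.assignments[node]
        self.assignments[node] = image

    def revert(self, node: str) -> None:
        # Restore the image the node ran before the last reassignment.
        self.assignments[node] = self.previous.pop(node)


# Hypothetical node and image names, for illustration only.
cluster = Cluster(default_image="rocky-base")
for name in ("n01", "n02", "gpu01"):
    cluster.add_node(name)

cluster.assign_image("gpu01", "rocky-gpu")  # GPU node gets a variant image
print(cluster.assignments)

cluster.revert("gpu01")                     # ...and is later put back
print(cluster.assignments)
```

The point of the sketch is that the administrator manages a mapping from nodes to images rather than hand-tuning each machine, which is what makes the “all nodes are the same, except where they deliberately aren’t” approach workable.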

Node flexibility, scalability, provisioning and expansion of clusters, the easing of systems admin tasks – all these are coming within the grasp of organizations that depend on HPC clusters to get their work done.

And who knows, maybe some of those researchers, analysts and designers who morphed into systems administrators can spend more time doing what they were always meant to do in the first place.

** Talking Heads, “Once in a Lifetime”