How the HPC-AI Rocky Linux Server Operating System Rose from the CentOS Ashes

[SPONSORED CONTENT]  CentOS disappeared in the dead of winter. On December 8, 2020, the day with the earliest sunset of the year in northern latitudes, Red Hat announced it would no longer support the Linux server operating system, and for many CentOS users “what instruments we have agree the day of (its) death was a dark cold day.” **

If you were an advocate of CentOS Linux, you knew all about it. You knew its traits, its ways, its bugs, its quirks. You knew its personality. You knew how to tease the best out of it, and how to avoid the things it did that drove you crazy. You developed a CentOS skill set that became second nature. Together with CentOS, big things got done, systems and careers were built, successes achieved. So when CentOS passed into end-of-life two Decembers ago, something central was taken out of the work life of its users.

At that time, CentOS (Community Enterprise Operating System) was an open source, production-ready downstream version of Red Hat Enterprise Linux (RHEL), which built up a huge and devoted following in its 18-year existence. It was the No. 1 Linux distribution in the enterprise, and users included Toyota, GoDaddy, Disney, Rackspace and Verizon, organizations that build large, complex, HPC-class AI clusters.

Then, suddenly, Red Hat announced it was forsaking – that’s the word CentOS fans would use – the operating system for a new “distribution,” CentOS Stream. Red Hat stated up front that CentOS Stream wasn’t a CentOS replacement, “rather, it’s a natural, inevitable next step intended to fulfill the project’s goal of furthering enterprise Linux innovation.” But many CentOS users weren’t interested in a next step. According to an Ars Technica news report, “the comments on the community announcement are legion and are overwhelmingly negative.”

But if the news left despondent CentOS users in its wake, they went through the stages of grief quickly and resiliently. Just two hours after the Red Hat news, CentOS founder and Linux open source guru Gregory Kurtzer stepped to the fore and announced via a comment on the CentOS website that he would again start an open source, community-owned effort, a new distribution that would be bug-for-bug compatible with RHEL and would continue on with CentOS’s mission. Its name: Rocky Linux, a tribute to CentOS co-founder Rocky McGaugh.

“Rocky was a big response to the ending of CentOS, which is very important to the HPC and AI community,” Brock Taylor, CIQ’s vice president of high performance computing and strategic partners, told us. “CentOS was the backbone for so many systems out there, especially when you’re thinking about multi-node environments, HPC clusters and AI running in multi-node environments. CentOS was the OS of choice in that space, and when support ended for it, an entire community wondered how they were going to move forward. It was a massive shock to the system.”

Kurtzer not only got the Rocky Enterprise Software Foundation (RESF) off the ground, he also founded and became CEO of CIQ, a technology company providing Rocky Linux support, services, and value adds, and a driving force behind the nascent operating system. Beyond traditional HPC solutions and support, CIQ is also promoting a computing paradigm built around cloud-native, hybrid, federated computing (HPC-2.0).

Gregory Kurtzer, CIQ

Kurtzer, colleagues and members of the rapidly forming Rocky Linux community, some of whom came out of the CentOS community, quickly built momentum behind the upstart project. Taylor said that within a short period of time thousands of developers were gathering to stand up Rocky as the CentOS replacement. By December 12, just four days after the Red Hat announcement, the Rocky Linux code repository had become the top-trending repository on GitHub. Another project aspiring to fill the vacuum in the marketplace, AlmaLinux, was released on March 30, 2021 and beat Rocky Linux to market due to the sharing of infrastructure, secure boot, and engineering from its parent company CloudLinux. By July of last year, the RESF released Rocky Linux 8.4.

Rocky’s momentum has continued. RESF reports that in a typical month, there are at least 250,000 OS image downloads, with some months spiking to 750,000. The OS has enjoyed broad acceptance in the enterprise, across academic institutions and in the cloud sector, including Amazon Web Services, Microsoft Azure, Google Cloud Platform and Oracle Cloud Infrastructure. All this before the one-year commemoration of CentOS’s demise. That’s resilience.

Taylor attributes Rocky’s uptake to its steadfast dedication to RHEL compatibility, a hallmark of CentOS, and to bettering the community and project capabilities. In support of that objective, Rocky Linux Release 9, announced last July, includes Peridot, which enables development groups to reproduce and extend any version of Rocky Linux. (Incidentally, Release 9 does not mean there were eight previous versions of Rocky Linux; it indicates the new version is binary compatible with, yes, RHEL 9.)

Taylor said Peridot works as a cloud-native stack for building Rocky Linux with tools designed to simplify working with the source code.

Brock Taylor, CIQ

“A key function of Peridot is to make sure Rocky Linux is actually bug-for-bug compatible with Red Hat Enterprise Linux, and that’s a big value to this community,” he said. “This is very similar to how CentOS stayed in sync with Red Hat, and it provides huge value to the open source community and in particular, a lot of the HPC infrastructure community, to make sure the OS is very solid. It’s tracking all the various things, the thousands of software components that need to be tracked, for compatibility.”

He cited Rocky Linux’s ability to track the operations of the kernel, which manages and connects resources across the OS, and how that capability can come into play with a chip such as AMD’s EPYC CPU.

“The EPYC architecture has eight cores on a chiplet and eight chiplets on a processor, giving you 64 cores,” said Taylor. “Those cores are sharing cache so if you have eight threads that are scheduled on the processor then you want one thread per chiplet. But you can have cases where, over time, the threads wind up migrating on to just one or two of the chiplets, meaning they’re competing for resources while other chiplets are idle, so you get inefficient performance.”

The latest Rocky Linux kernel has updates that mitigate problems of this type, Taylor said, ensuring equal distribution of processing resources, which in demanding HPC and AI workloads is particularly critical. And Peridot ensures kernel improvements of this type get into OS distributions.
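To make the placement problem Taylor describes concrete, here is a minimal Python sketch of the "one thread per chiplet" idea. It is not the kernel scheduler change he refers to; it illustrates manual placement, a different and much cruder technique. The function names are hypothetical, and the core numbering (cores 0 to 7 on the first chiplet, 8 to 15 on the second, and so on) is an assumption for illustration. On a real system the actual cache-sharing layout should be read from the machine rather than assumed.

```python
import os


def one_core_per_chiplet(n_threads, chiplets=8, cores_per_chiplet=8):
    """Pick a core ID for each thread, spreading threads across chiplets.

    ASSUMPTION: core IDs are contiguous per chiplet (0-7 on chiplet 0,
    8-15 on chiplet 1, ...). Real hardware may number cores differently.
    """
    if n_threads > chiplets * cores_per_chiplet:
        raise ValueError("more threads than available cores")
    cores = []
    for t in range(n_threads):
        chiplet = t % chiplets   # round-robin over chiplets first
        slot = t // chiplets     # only then reuse a chiplet
        cores.append(chiplet * cores_per_chiplet + slot)
    return cores


def pin_current_thread(core_id):
    """Restrict the calling thread to one core (Linux only)."""
    os.sched_setaffinity(0, {core_id})


# Eight threads on the 8x8 layout from the quote: one core on each chiplet.
placement = one_core_per_chiplet(8)
print(placement)  # [0, 8, 16, 24, 32, 40, 48, 56]
```

Each worker thread would call `pin_current_thread` with its assigned core so it cannot drift onto a chiplet that is already busy. Hard pinning like this trades flexibility for predictability, which is why a scheduler that keeps threads spread out on its own is the more attractive fix for general workloads.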

Taylor explained how CIQ supports Rocky Linux users wrestling with complex, heterogeneous HPC-AI clusters. Those environments are commonly designed and maintained by HPC cluster administrators, those proverbial masters-of-all-IT-trades who are, naturally enough, so hard to find and hire. In fact, it’s common for these administrators to actually be researchers or data scientists themselves, or postdocs who have lost a bet with their peers. The role of CIQ consultants is to call on their Rocky Linux knowledge and reduce the computer science expertise that would otherwise be required of researchers or data scientists.

Taylor himself spent 22 years at Intel and AMD doing the same thing as these users, he said, “looking at the solution space” – i.e., figuring out how big, multi-architecture clusters come together.

“The data scientist wants to do data science, not cluster administration or Linux administration,” Taylor said. “They need the tooling that allows them to focus on that type of work. It starts with a strong basis in the operating system. What CIQ does with our strong connection to the OS and Rocky Linux is layer in the technologies that are specific to high performance computing and AI.

“We face a world of ever-expanding silicon solutions,” Taylor continued, “and the developers who are building the applications and the frameworks may be running on a general-purpose CPU, or on an accelerator GPGPU, or an FPGA. And they’ve got to figure out when and where they support all the different form factors and operations. They don’t want to discuss how they’re configuring the software and middleware elements, or how the drivers for their fabric are integrated into the operating system.”

It all calls for a tremendous amount of coordination across the software stack and this, Taylor said, is where CIQ can help.

** W.H. Auden, “In Memory of W.B. Yeats”