LLNL, IBM and Red Hat to Explore Standardized Resource Management Interface for Cloud HPC

Lawrence Livermore National Laboratory (LLNL), IBM and Red Hat are joining forces to develop best practices for interfacing high-performance computing (HPC) schedulers and cloud orchestrators, an effort designed for supercomputers that take advantage of cloud technologies.

Under a recently signed memorandum of understanding (MOU), the organizations said researchers aim to enable next-generation workloads by integrating LLNL’s Flux scheduling framework with Red Hat OpenShift — an enterprise Kubernetes platform — to allow more traditional HPC jobs to utilize cloud and container technologies. A new standardized interface would help satisfy an increasing demand for compute-intensive jobs that combine HPC with cloud computing across a wide range of industry sectors, researchers said.

“Cloud systems are increasingly setting the directions of the broader computing ecosystem, and economics are a primary driver,” said Bronis R. de Supinski, chief technology officer of Livermore Computing at LLNL. “With the growing prevalence of cloud-based systems, we must align our HPC strategy with cloud technologies, particularly in terms of their software environments, to ensure the long-term sustainability and affordability of our mission-critical HPC systems.”

LLNL’s open source Flux scheduling framework builds upon the Lab’s extensive experience in HPC and allows new resource types, schedulers and services to be deployed as data centers continue to evolve, including the emergence of exascale computing. Its ability to make smart placement decisions and its rich resource expression make it well suited to facilitate orchestration using tools like Red Hat OpenShift on large-scale HPC clusters, which LLNL researchers anticipate becoming more commonplace in the years to come.

“One of the trends we’ve been seeing at Livermore is the loose coupling of HPC applications and applications like machine learning and data analytics on the orchestrated side, but in the near future we expect to see a closer meshing of those two technologies,” said LLNL postdoctoral researcher Dan Milroy. “We think that unifying Flux with cloud orchestration frameworks like Red Hat OpenShift and Kubernetes is going to allow both HPC and cloud technologies to come together in the future, helping to scale workflows everywhere. I believe co-developing Flux with OpenShift is going to be really advantageous.”

Red Hat OpenShift is an open source container platform based on the Kubernetes container orchestrator for enterprise app development and deployment. Kubernetes is an open source system for automating deployment, scaling and management of containerized applications. Researchers want to further enhance Red Hat OpenShift and make it a common platform for a wide range of computing infrastructures, including large-scale HPC systems, enterprise systems and public cloud offerings, starting with commercial HPC workloads.

“We would love to see a platform like Red Hat OpenShift be able to run a wide range of workloads on a wide range of platforms, from supercomputers to clusters,” said IBM Research Staff Member Claudia Misale. “We see difficulties in the HPC world from having many different types of HPC software stacks, and container platforms like OpenShift can address these difficulties. We believe OpenShift can be the common denominator, like Red Hat Enterprise Linux has been a common denominator on HPC systems.”

The impetus for enabling Flux as a Kubernetes scheduler plug-in began with a successful prototype that came from a Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) and Centers of Excellence project between LLNL and IBM to understand the formation of cancer. The plug-in enabled more sophisticated scheduling of Kubernetes workflows, which convinced the team that Flux could be integrated with Red Hat OpenShift, researchers said.

Because many HPC centers use their own schedulers, a primary goal is to “democratize” the Kubernetes interface for HPC users, pursuing an open interface that any HPC site or center could use and into which it could incorporate its existing schedulers.

“We’ve been seeing a steady trend toward data-centric computing, which includes the convergence of artificial intelligence/machine learning and HPC workloads,” said Chris Wright, senior vice president and chief technology officer, Red Hat. “The HPC community has long been on the leading edge of data analysis. Bringing their expertise in complex large-scale scheduling to a common cloud-native platform is a perfect expression of the power of open source collaboration. This brings new scheduling capabilities to Red Hat OpenShift and Kubernetes and brings modern cloud-native AI/ML applications to the large labs.”

The researchers plan to initially integrate Flux to run within the Red Hat OpenShift environment, using Flux as a driver for other commonly used schedulers to interface with OpenShift and Kubernetes, eventually facilitating the platform for use with any HPC workload and on any HPC machine.

“This effort will make it easy for HPC workflows to leverage leading HPC schedulers like Flux to realize the full potential of emerging HPC and cloud environments,” said Dong H. Ahn, lead for LLNL’s Advanced Technology Development and Mitigation Next Generation Computing Enablement project.

The team has begun working on scheduling topology and anticipates defining an interface within the next six months. Future goals include exploring different integration models, such as co-location, and extending advanced management and configuration beyond the node.

Founded in 1952, Lawrence Livermore National Laboratory (www.llnl.gov) provides solutions to our nation’s most important national security challenges through innovative science, engineering and technology. Lawrence Livermore National Laboratory is managed by Lawrence Livermore National Security, LLC for the U.S. Department of Energy’s National Nuclear Security Administration.