Dell: Omnia Copes with Configuring HPC-AI Environments

[SPONSORED GUEST ARTICLE]  Computer scientist Alan Perlis once said that fools ignore complexity while pragmatists suffer it. But when it comes to the complexity of configuring HPC, AI and data analytics environments, Perlis might say Dell Technologies walks the path of geniuses: it removes complexity.

The convergence of HPC and AI is driven by a proliferation of advanced computing workflows that combine simulation, AI and data analytics techniques. Data scientists and researchers are developing new processes for solving scientific and analytical problems at massive scale, and those processes require HPC-class systems.

Translate all this into configurations of servers, storage and networking, with IT managers moving nodes between clusters to deliver the resources that shifting workloads demand, and it adds up to serious complexity.

This is where Dell’s Omnia software stack comes in. Open source Omnia is built to speed and simplify deployment and management for high-demand, mixed workload environments. It’s designed to abstract away the manual steps that make provisioning complicated and cause configuration errors. It automates the deployment of Slurm and Kubernetes workload management software along with libraries, frameworks, operators, services, platforms and applications.
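To make that idea concrete, here is a minimal, purely illustrative sketch of the kind of declarative, automated provisioning Omnia performs. It is not Omnia's actual interface (the real project is distributed as a collection of Ansible playbooks on GitHub); the ClusterSpec type, the provision() function, and the node and framework names are hypothetical, chosen only to contrast a single declarative description of a cluster with the manual, step-by-step sysadmin work it replaces.

```python
# Hypothetical sketch of declarative cluster provisioning.
# NOT Omnia's real interface; all names and fields are illustrative only.

from dataclasses import dataclass, field

@dataclass
class ClusterSpec:
    """Desired state of a cluster, described once instead of built by hand."""
    name: str
    scheduler: str                                     # e.g. "slurm" or "kubernetes"
    head_node: str
    compute_nodes: list = field(default_factory=list)
    frameworks: list = field(default_factory=list)     # libraries/frameworks to roll out

def provision(spec: ClusterSpec) -> None:
    """Walk the spec and perform each deployment step automatically."""
    print(f"Configuring head node {spec.head_node} for cluster '{spec.name}'")
    print(f"Installing {spec.scheduler} controller on {spec.head_node}")
    for node in spec.compute_nodes:
        print(f"Joining {node} to {spec.scheduler} as a compute node")
    for fw in spec.frameworks:
        print(f"Deploying {fw} across the cluster")

if __name__ == "__main__":
    # A single spec replaces a long manual runbook.
    spec = ClusterSpec(
        name="lab-cluster",
        scheduler="slurm",
        head_node="head01",
        compute_nodes=["node01", "node02", "node03"],
        frameworks=["pytorch", "tensorflow"],
    )
    provision(spec)
```

The point of the sketch is the design choice, not the code: once the cluster is described declaratively, rebuilding it, repeating it for the next customer, or swapping Slurm for Kubernetes becomes a change to the description rather than another round of manual configuration.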

Dell Validated Designs are pre-configured, workload-optimized, rack-level systems composed of servers, software, networking, storage and services, letting organizations scale faster with an engineering-tested solution. Validated Designs for HPC are scalable systems tested and tuned for specific vertical-market applications, such as life sciences, digital manufacturing and research. Last week, Dell announced the Dell Validated Design for Government HPC, AI, and Data Analytics with Omnia for AI inferencing workloads.

Omnia, like so many good ideas in IT, sprang from a need perceived by Dell technicians a few years ago as they helped customers deploy HPC-class compute environments. In this case, it came out of work done at Dell’s HPC and AI Innovation Lab.

“It was incubated at Dell in partnership with Intel,” said John Lockman III, Dell Distinguished Engineer – Artificial Intelligence and High Performance Computing. “We were doing rapid prototyping and proof-of-concept work in the HPC and AI Innovation Lab. We were standing up an example cluster and handing it over to a customer; we had them come into the lab and try things out before they bought them. We found ourselves constantly doing this over and over, and then it occurred to us: ‘You know, there’s a better way to do this. We don’t have to manually be sysadmins; we can automate a lot of this process away.’ And that’s where it started, allowing more customers to get onto these clusters and get them going.”

Dell’s John Lockman

Since then, the Omnia idea has grown into a much larger effort at Dell, said Lockman, “with much larger goals of creating production-grade clusters on demand, and at exascale. So what we started with was building an open source framework. And we were very influenced by the concept of GitOps, or treating your infrastructure as code. And we wanted to bring in those best practices from HPC of building clusters, but start to adopt the GitOps mentality so that we can rapidly build different types of clusters.”

Crucially, Omnia is an open source project, so organizations and users within the Omnia community can leverage the lessons learned and experience gained by others who have encountered the same provisioning challenges. Lockman said more than 40 people have contributed to the Omnia project so far, with more joining all the time.

“We have a couple of dozen organizations that have contributed in some way,” he said, “whether they’re trying out code or making suggestions for how things can work better. Having an open source project like this brings these tools to everybody. For example, the national labs do a lot of very interesting things to make their systems work. With an open source project like this, where all sorts of organizations can have some input, it feels like we get the best of all worlds. We start to hear a lot more than just the voice of the customer; we’re hearing the voice of all the labs, the voice of the users, to help foster the future growth of the project, to stand up clusters faster, to automate away some of the mundane processes that we have to do on a regular basis.”

On GitHub, Omnia can be found at https://github.com/dell/omnia. Learn more about Dell HPC solutions at https://www.dell.com/HPC.