Managing Complexity in the New Era of HPC

By Bill Wagner, CEO of Bright Computing

Until recently, High Performance Computing (HPC) was a fairly mature and predictable area of information technology. It was characterized by a narrow category of applications used by a largely fixed set of industries running on predominantly Intel-based, on-premises systems. But over the last few years, all of that has begun to change. New technologies, cloud, edge, and a broadening set of commercial use cases in data analytics and machine learning have set in motion a tsunami of change for HPC. It is no longer a tool for rocket scientists and the research elite. HPC is quickly becoming a strategic necessity for all industries that want to gain a competitive advantage in their markets, or at least keep pace with their peers in order to survive.

While HPC has given commercial users a powerful set of new tools that drive innovation, it has also introduced a variety of challenges for those organizations, including increased infrastructure costs, the complexities of new technologies, and a shortage of HPC know-how to take advantage of them. The challenges of this new era of HPC have changed how companies execute their HPC strategies, with most embarking on a steep and risky learning curve to the detriment of their IT staff and budget.

On the technology side, options have never been more plentiful. With a wide range of choices in hardware, software, and even consumption models, organizations face a daunting array of decisions. New processing elements (Intel, AMD, ARM, GPUs, FPGAs, IPUs), container technologies (Docker, Singularity, and Kubernetes orchestration), and cloud options (hybrid and multi-cloud) have disrupted the HPC industry, challenging organizations to pick infrastructure solutions, both hardware and software, that can tackle their diversifying workloads while working together seamlessly.

In the past, HPC clusters were built with a fairly static mindset. The notion of combining x86 and ARM architectures in the same cluster was not even a consideration. Furthermore, extending your HPC cluster to the public cloud for additional capacity was something you planned to do “down the road.” Hosting containerized machine learning and data analytics applications on your HPC cluster harmoniously alongside traditional MPI-based modeling and simulation applications was “on the wish list.” Offering end users bare metal, VMs, and containers on the same cluster was unheard of, and deploying edge compute as an integral part of your core HPC infrastructure fell under the category of “maybe someday.” However, in today’s new world of HPC, IT managers and infrastructure architects are feeling the pressure to make all of these things happen right now. The availability of new, highly specialized hardware and software is both enticing and intimidating. If organizations don’t take advantage of all that HPC offers, someone else will, and losing the race for competitive advantage can deal a devastating blow to businesses vying for market share.

In the days of traditional HPC, you built a static cluster and focused your energy on keeping it up and running for its lifespan. Research institutions and commercial HPC practitioners alike could get by with custom scripts that stitched together a collection of open-source tools to manage their clusters. Integrating tools for server provisioning, monitoring, alerting, and change management is difficult, labor-intensive, and an ongoing maintenance burden, but it is possible for organizations with the staff and skills to do so. In the emerging new era of HPC, clusters are far from static, and far more complex as a result. The need to leverage new types of processors and accelerators and servers from different manufacturers, to integrate with the cloud, to extend to the edge, to host machine learning and data analytics applications, and to offer end users VMs and containers alongside bare-metal servers raises the bar dramatically for any organization contemplating a do-it-yourself cluster management solution.
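To make that maintenance burden concrete, here is a minimal sketch of the kind of glue script such DIY setups accumulate. Everything in it is a hypothetical placeholder, not any particular site’s tooling: the node names, the SSH-based health probe, and the alert address are all assumptions made for illustration.

```python
#!/usr/bin/env python3
"""One small piece of a hypothetical DIY cluster toolchain: probe nodes
over SSH and email an alert when any fail. A real setup would also need
retries, escalation, provisioning hooks, and change tracking."""
import smtplib
import subprocess
from email.message import EmailMessage

NODES = ["node01", "node02", "node03"]   # hypothetical node names
ALERT_TO = "hpc-admins@example.org"      # hypothetical alert address

def node_is_healthy(node: str) -> bool:
    """Treat any SSH failure or timeout as an unhealthy node."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", node, "uptime"],
            capture_output=True,
            timeout=10,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def main() -> None:
    down = [n for n in NODES if not node_is_healthy(n)]
    if down:
        msg = EmailMessage()
        msg["Subject"] = f"[cluster] {len(down)} node(s) unhealthy"
        msg["From"] = "monitor@example.org"
        msg["To"] = ALERT_TO
        msg.set_content("Unhealthy nodes:\n" + "\n".join(down))
        with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
            smtp.send_message(msg)

if __name__ == "__main__":
    main()
```

Multiply this by provisioning, image management, alert routing, and change tracking, and the ongoing cost of the DIY approach becomes clear.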

Now more than ever, there is a need for a professional, supported cluster management tool that spans hardware, software, and consumption models for the new era of HPC. Bright Cluster Manager is a prime example of a commercial tool with the features and built-in know-how to build and manage heterogeneous high-performance Linux clusters for HPC, machine learning, and analytics with ease. Bright Cluster Manager automatically builds your cluster from bare metal (setting up networking, user directories, security, DNS, and more) and sits across an organization’s HPC resources, whether on-premises, in the cloud, or at the edge, managing them across workloads. Bright can also react to increasing demand for different types of applications and instantly reassign resources within the cluster to service high-priority workloads based on the policies you set. Intersect360 Research states, “Fundamentally, Bright Computing helps address the big question in HPC: how to match diverse resources to diverse workloads in a way that is both efficient today and future-proof for tomorrow.” [1]
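To illustrate the idea of policy-driven reassignment conceptually (this is not Bright Cluster Manager’s actual API or configuration syntax), here is a minimal sketch; the Policy class, the priorities, and the node counts are hypothetical:

```python
"""Conceptual sketch of policy-based resource reassignment: every
workload keeps a guaranteed floor of nodes, and spare nodes flow to the
highest-priority workloads with unmet demand. Illustrative only."""
from dataclasses import dataclass

@dataclass
class Policy:
    workload: str    # e.g. "mpi-simulation" or "machine-learning"
    min_nodes: int   # floor this workload always keeps
    priority: int    # higher wins when demand exceeds capacity

def reassign(policies: list[Policy], demand: dict[str, int],
             total_nodes: int) -> dict[str, int]:
    # Give each workload its floor first.
    alloc = {p.workload: p.min_nodes for p in policies}
    spare = total_nodes - sum(alloc.values())
    # Hand spare nodes to the highest-priority unmet demand first.
    for p in sorted(policies, key=lambda p: p.priority, reverse=True):
        want = max(demand.get(p.workload, 0) - alloc[p.workload], 0)
        take = min(want, spare)
        alloc[p.workload] += take
        spare -= take
    return alloc

if __name__ == "__main__":
    policies = [
        Policy("mpi-simulation", min_nodes=8, priority=1),
        Policy("machine-learning", min_nodes=2, priority=2),
    ]
    # A burst of ML jobs arrives; the high-priority ML workload absorbs
    # the ten spare nodes: {'mpi-simulation': 8, 'machine-learning': 12}
    print(reassign(policies,
                   demand={"machine-learning": 12, "mpi-simulation": 8},
                   total_nodes=20))
```

In a production cluster manager, the same decision loop would also have to reprovision nodes between bare metal, VMs, and containers on the fly, which is precisely the complexity a supported product absorbs on your behalf.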

Bright Computing highlights the transition that one organization made from a home-grown approach to Bright Cluster Manager. The Louisiana Optical Network Infrastructure (LONI), a premier HPC and high-capacity middle-mile fiber-optic network provider for education and research entities in Louisiana, switched from its do-it-yourself HPC management setup to Bright Cluster Manager software to gain consistency, ease of use, and the ability to easily extend resources to the cloud.

“LONI had previously used a homegrown cluster management system that presented a myriad of challenges, including lack of a graphical user interface (GUI), daunting complexity for new employees, and proneness to out-of-sync changes and configurations,” said LONI Executive Director Lonnie Leger. “Likewise, the do-it-yourself infrastructure we had placed constraints on end users due to a lack of knowledge continuity concerning cluster health, performance, and capability. By leveraging a commercial solution such as Bright Cluster Manager, we now have an enterprise-grade cluster management solution that embodies the skills and expertise needed to effectively manage our HPC environment.”

The decision to move from in-house, piecemeal open source to a fully supported commercial cluster management solution was born out of necessity for LONI. Eager to diversify its services, the organization had quickly outgrown its DIY setup and its HPC expertise. Expansion wasn’t impossible, but it became a daunting task with limited internal personnel and HPC know-how. This example is but one of many in the new world of HPC. As more organizations navigate the challenges of managing the interdependency between hardware and software, dealing with hardware problems, isolating performance degradations, and keeping up with constant demand for change, the need for commercially supported cluster management solutions has never been greater.

All of this change, which both breaks and broadens how we think about HPC, makes it worth reminding ourselves what HPC really is. Intersect360 Research defines HPC as “the use of servers, clusters, and supercomputers―plus associated software tools, components, storage, and services―for scientific, engineering, or analytical tasks that are particularly intensive in computation, memory usage, or data management.” [1] This definition matters because it recognizes that HPC can be much broader than it has been traditionally, and with that broadening comes a whole new level of complexity. The harsh reality is that as organizations embrace a broader definition of HPC to propel their business, they must come to terms with the complexity that stands in the way of realizing it.

With Bright Cluster Manager software, complexity is automated away and replaced with flexibility. From a wizard based on your specifications, Bright builds and pre-tests a turnkey high-performance cluster and instruments it with health checks and monitoring. It provides detailed insight into resource utilization, dynamically assigns resources to end-user workloads based on demand, extends your cluster to the public cloud for additional capacity, and extends to the edge for centralized management of remote resources. It also supports mixed hardware environments, offers bare metal, VMs, or containers from the same cluster, and provides command-line, GUI, and API access to all functionality.

As stated by Intersect360 Research, “Data science and machine learning? Intel or AMD? GPUs or FPGAs? Docker or Kubernetes? Cloud, on-premise, or edge? AWS or Azure? Bright Cluster Manager lets users decide individually how to incorporate all of these transitions—some or all, mix and match, now or later—in a single HPC cluster environment. With so many independent trends continuing to push HPC forward, Bright Computing is aiming to be the company that helps users pull them all together.” [1]

For more information about Bright Computing solutions for HPC, visit www.brightcomputing.com or email info@brightcomputing.com.

[1] Intersect360 Research Paper: Bright Computing: Managing Multiple Paths to Innovation

Bright Computing is the leading provider of platform-independent commercial cluster management software. Bright Cluster Manager™, Bright Cluster Manager for Data Science™, and Bright OpenStack™ automate the process of installing, provisioning, configuring, managing, and monitoring clusters for HPC, data analytics, machine learning, and OpenStack environments.