Challenges to Managing an HPC Software Stack

This is the second article in a four-part series that explores using Intel HPC Orchestrator to solve HPC software stack management challenges. The complete report, available here, outlines some of these challenges in detail, and explores the benefits of Intel’s product that extends OpenHPC. 

It takes a lot to manage a productive HPC cluster. Because these systems are costly, maximizing utilization and maintaining high uptime are priorities. The HPC software stack tends to be complicated, assembled from a diverse mix of only somewhat compatible open source and commercial components. Finding and installing the best middleware for administering resources and supervising job workflow and scheduling, then tracking software updates, testing, and deployment, quickly becomes unmanageable. Meanwhile, experienced HPC engineers and administrators capable of navigating these complexities are in high demand and short supply.

The sheer number of independent components, and the expertise each one demands, can severely increase HPC system administrative overhead. With so many components on their own rapid release cycles, the time-consuming effort to keep everything synchronized and running smoothly risks system stability and taxes HPC staff. Resolving the many unique, and often unknown, interdependencies among components requires a level of testing that may be beyond the reach of most system administrators.

Likewise, as user workloads become more diverse, no single configuration or toolset fits every need. No longer the province of just science and engineering, HPC systems, and the challenges that go with them, are now a growing part of business and financial applications that employ big data and machine learning. Integrating the environments and tools specialized for these data-intensive workloads into traditional HPC frameworks adds still more challenges.

Ensuring that applications are running correctly and optimally on the latest hardware creates its own challenges. Some workloads might not perform as well or fail to take advantage of the advanced features of the latest hardware, requiring in-depth analysis of complex user applications to get the best performance. Achieving this requires additional staff expertise and tools focused on application and cluster performance analysis and tuning.

As a system grows, its sheer size, along with choices made early on regarding node configuration, fabric, and storage, can actually degrade overall performance. The challenge is to stay on top of system performance through constant monitoring of system health, and to resolve issues quickly. Efficient, balanced utilization of an HPC system requires scalable capabilities for provisioning software and managing dynamic workloads across thousands of nodes.

It’s been said that HPC is all about extremes, which rings truer now more than ever before.

Over the next few weeks, this series will cover additional topics from the report.

You can download the “insideHPC Special Report: Intel HPC Orchestrator Solves HPC’s Software Challenges,” courtesy of Intel.