Strategies for Managing High Performance GPU Clusters

As of June 2015, the second-fastest computer in the world, as measured by the Top500 list, employed NVIDIA® GPUs. Of the systems on that list that use accelerators, 60% use NVIDIA GPUs. The performance kick provided by computing accelerators has pushed High Performance Computing (HPC) to new levels. When discussing GPU accelerators, the focus is often on the price-to-performance benefits to the end user. The true cost of managing and using GPUs goes far beyond the hardware price, however. Understanding and managing these costs helps provide more efficient and productive systems.

Download the insideHPC Guide to Managing GPU Clusters

This is the first article in a series on managing GPU clusters. You can read the entire series or download the complete insideHPC Guide to Managing High Performance GPU Clusters, courtesy of NVIDIA and Bright Computing.

The Advantages of GPU Accelerators

The use of NVIDIA GPUs in HPC has given many applications a level of performance beyond what is possible with servers alone. In particular, the NVIDIA Tesla® line of GPUs is designed specifically for HPC processing. Offering up to 2.91 TFLOPS of double-precision performance (8.74 TFLOPS single precision) with ECC memory, they can be added to almost any suitably equipped x86_64 or IBM Power 8 computing server.
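
As a minimal sketch of how an administrator might confirm what is actually installed in such a server, the CUDA runtime API can enumerate the attached GPUs and report their memory size and ECC state. This assumes the CUDA toolkit is available on the node; compile with nvcc:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            fprintf(stderr, "No CUDA-capable devices found\n");
            return 1;
        }
        for (int i = 0; i < count; i++) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            /* Report the name, total memory, and ECC state of each GPU */
            printf("GPU %d: %s, %.1f GB, ECC %s\n", i, prop.name,
                   prop.totalGlobalMem / (double)(1 << 30),
                   prop.ECCEnabled ? "enabled" : "disabled");
        }
        return 0;
    }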

With the support of the NVIDIA Corporation, an HPC software ecosystem has developed and produced many applications, both commercial and open source, that take advantage of GPU acceleration. The NVIDIA CUDA® programming model, along with OpenCL and OpenACC compilers, has provided developers with the software tools needed to port and build applications in many areas, including Computational Fluid Dynamics, Molecular Dynamics, Bioinformatics, Deep Learning, Electronic Design Automation, and others.
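
To give a flavor of the CUDA programming model mentioned above, the following generic sketch (not taken from the guide) shows the typical pattern: the host allocates device memory, copies input data to the GPU, launches a kernel across many threads, and copies the result back:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Each GPU thread adds one pair of elements */
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *ha = (float *)malloc(bytes);
        float *hb = (float *)malloc(bytes);
        float *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

        /* Allocate on the device and copy inputs across the PCIe bus */
        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        /* Launch enough 256-thread blocks to cover all n elements */
        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        printf("hc[0] = %f\n", hc[0]);  /* expect 3.0 */
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }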

The Challenges Presented by GPU Accelerators

Any accelerator technology is, by definition, an addition to the baseline processor. In modern HPC environments, the dominant baseline architecture is the x86_64 server. Virtually all HPC systems use the Linux operating system (OS) and associated tools as a foundation. Both the Linux OS and the underlying x86_64 processors are highly integrated and heavily used in areas outside of HPC, particularly web servers.

Virtually all GPU accelerators are added via the PCIe bus. (Note: NVIDIA has announced NVLink™, a high-bandwidth, energy-efficient interconnect that enables ultra-fast communication between the CPU and GPU, and between GPUs.) This arrangement introduces a level of separation from the core OS/processor environment. Because of this separation, the OS cannot manage processes that run on the accelerator as if they were running on the main system. Even though accelerator processes are initiated by the main processors, the host OS does not track memory usage, processor load, power usage, or temperatures for the accelerators. In one sense, the GPU is a separate computing domain with its own distinct memory and computing resources.
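
This separation is easy to demonstrate: host-side tools such as free(1) or top(1) report nothing about device memory, so the GPU itself must be asked. A small sketch using the CUDA runtime (again assuming the toolkit is installed):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        size_t freeMem = 0, totalMem = 0;
        /* The OS memory accounting does not cover the device; only the
           CUDA runtime (or NVML) can report GPU memory usage */
        if (cudaMemGetInfo(&freeMem, &totalMem) != cudaSuccess) {
            fprintf(stderr, "Unable to query device memory\n");
            return 1;
        }
        printf("Device memory: %zu MB free of %zu MB total\n",
               freeMem >> 20, totalMem >> 20);
        return 0;
    }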

From a programming standpoint, the tools mentioned above provide an effective way to create GPU-based applications. In terms of management, however, the lack of direct access to the accelerator environment can lead to system-related concerns. In many cases, the default management approach is to assume GPUs are functioning correctly as long as applications seem to be working. Management information is often available separately, using specific tools designed for the GPU. For instance, NVIDIA provides the nvidia-smi tool, which can be used to examine the state of local accelerators. Monitoring GPU resources with tools like nvidia-smi and NVIDIA's NVML library gives administrators on-demand reports and data; however, this information must often be extracted with scripts and sent to a central collection location.
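
As a rough sketch of what such a collection script might gather, the following queries each local device through the NVML C API for metrics the host OS cannot see (link with -lnvidia-ml; the exact metrics a site collects would vary):

    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        unsigned int count, i;
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDeviceGetCount(&count);
        for (i = 0; i < count; i++) {
            nvmlDevice_t dev;
            unsigned int temp, power;
            nvmlUtilization_t util;
            nvmlDeviceGetHandleByIndex(i, &dev);
            /* Temperature, utilization, and power draw (milliwatts)
               are invisible to standard host monitoring tools */
            nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
            nvmlDeviceGetUtilizationRates(dev, &util);
            nvmlDeviceGetPowerUsage(dev, &power);
            printf("GPU %u: %u C, %u%% util, %.1f W\n",
                   i, temp, util.gpu, power / 1000.0);
        }
        nvmlShutdown();
        return 0;
    }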

Another challenge facing GPU developers and administrators is the ongoing management of the software environments needed for proper operation. There are several reasons for this situation. First, new versions of the NVIDIA CUDA software and drivers may offer better performance or features not found in the previous version. These new capabilities may need to be tested on separate machines within the cluster infrastructure before they can be placed into production. Second, some HPC clusters may have multiple generations of GPU hardware in production and must manage different kernel versions for specific hardware combinations. Both cluster provisioning and job scheduling must take these differences into account. Finally, some HPC applications require specific kernel/driver/CUDA versions for proper operation.
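
One hedged example of how such version constraints can be checked on a node: the CUDA runtime reports both the installed driver version and the runtime version, and the driver must be at least as new as the runtime an application was built against:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int driverVer = 0, runtimeVer = 0;
        /* Versions are encoded as 1000*major + 10*minor, e.g. 7050 = 7.5 */
        cudaDriverGetVersion(&driverVer);
        cudaRuntimeGetVersion(&runtimeVer);
        printf("CUDA driver: %d.%d, runtime: %d.%d\n",
               driverVer / 1000, (driverVer % 100) / 10,
               runtimeVer / 1000, (runtimeVer % 100) / 10);
        if (driverVer < runtimeVer)
            fprintf(stderr, "Warning: driver is older than the runtime\n");
        return 0;
    }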

These challenges often create administration issues, or “headaches,” when managing HPC clusters. Each new combination of hardware and software creates both a monitoring and a tool-management challenge that can reduce system throughput. Users and developers find managing tools tedious and error prone, while administrators need ways to make sure applications are running successfully on the right hardware.

Creating a GPU Computing Resource

The advantages and challenges of GPU accelerators have presented users and vendors with the opportunity to develop a set of best practices for maximizing GPU resources. As will be described in this guide, there are sound strategies that help minimize the issues mentioned above and keep users and administrators focused on producing scientific results. The goal is to transform a collection of independent hardware components and software tools into an efficiently managed production system.

Next week we’ll publish an article on ‘Best Practices for Maximizing GPU Resources in HPC Clusters.’

If you prefer, you can download the complete insideHPC Guide to Managing GPU Clusters, courtesy of NVIDIA and Bright Computing.