HPC System Management: Scheduling to Optimize Infrastructure

We continue our insideHPC series of features exploring new resource management solutions for workload convergence, such as Bright Cluster Manager by Bright Computing. This article discusses how scheduling can work to optimize infrastructure and improve HPC system management. 

Download the full report.

Scheduling to Optimize Infrastructure

Whether an application is floating-point intensive, integer-based, memory-hungry, heavy on I/O, or constrained by the number of purchased licenses, a system that assigns the right job to the right server is key to maximizing the computing infrastructure and improving HPC system management. A scheduler must be able to handle this wide range of application requirements in order to match cluster resources to application needs.

For example, if an HPC application runs best on 128 cores, the scheduling system must recognize that requirement and wait for all 128 cores to become available. But if only 32 cores are free, should a smaller application be assigned to them in the meantime? What happens when other cores free up: should the scheduler hold them for the waiting job or assign another one? And if an application is tuned for a specific instruction set architecture (ISA), should the scheduler wait for the target machine to become available, or assign the job to a system where it may not run optimally?
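
As a rough illustration of these trade-offs, the sketch below shows one way a scheduler might decide between holding out for a full 128-core allocation and backfilling a smaller job onto the cores that are free right now. The Job and Node classes and the placement policy are hypothetical, for illustration only, and are not Bright Cluster Manager's actual logic.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores_needed: int
    isa: str = "any"          # e.g. "x86_64", "aarch64", or "any"

@dataclass
class Node:
    name: str
    free_cores: int
    isa: str

def place_job(job, nodes):
    """Return a list of (node, cores) assignments, or None if the job must wait.

    Toy policy: place the job only if enough free cores exist on nodes whose
    ISA matches the job's requirement; otherwise leave it in the queue.
    """
    eligible = [n for n in nodes if job.isa in ("any", n.isa)]
    if sum(n.free_cores for n in eligible) < job.cores_needed:
        return None                               # not enough matching cores -- wait
    assignment, remaining = [], job.cores_needed
    for node in sorted(eligible, key=lambda n: n.free_cores, reverse=True):
        take = min(node.free_cores, remaining)
        if take:
            assignment.append((node.name, take))
            remaining -= take
        if remaining == 0:
            break
    return assignment

# A 128-core job must keep waiting, but a 32-core job can backfill immediately.
nodes = [Node("node01", 32, "x86_64"), Node("node02", 0, "x86_64")]
print(place_job(Job("large-sim", 128), nodes))    # None -> keep waiting
print(place_job(Job("small-sim", 32), nodes))     # [('node01', 32)]
```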

Accelerators

Advanced resource management systems, such as Bright Cluster Manager, can manage clusters on-premises or in the cloud with the same set of interfaces and monitoring tools.

Many applications can take advantage of accelerators to substantially increase their performance. These accelerators provide hundreds or thousands of cores and their own memory system. For certain applications, accelerators have been shown to deliver as much as a 100X performance increase compared with standard CPU-based systems. Intel and Nvidia both offer hardware accelerators that, when used properly, can greatly reduce the time to completion for a given application.

A critical consideration for maximizing performance is assigning an application to nodes equipped with the right accelerators, since code built for one accelerator type may not run on another. The chart below shows the intelligent placement of applications on the appropriate parts of a cluster.

Intelligent placing of applications on the cluster
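
Matching builds to accelerator types can be expressed very simply. The sketch below is a hypothetical illustration of that matching step: the node and job dictionaries, field names, and accelerator labels are assumptions, not a real resource manager's data model, but the principle is the same one the chart above conveys.

```python
# Hypothetical sketch: send each job to a node that has the accelerator
# type its binary was built for, rather than misplacing it.
nodes = [
    {"name": "gpu01", "accelerator": "nvidia-gpu", "free": True},
    {"name": "gpu02", "accelerator": "intel-accel", "free": True},
    {"name": "cpu01", "accelerator": None, "free": True},
]

jobs = [
    {"name": "cuda-train",  "needs_accelerator": "nvidia-gpu"},
    {"name": "oneapi-sim",  "needs_accelerator": "intel-accel"},
    {"name": "serial-post", "needs_accelerator": None},
]

def pick_node(job, nodes):
    """Return the first free node whose accelerator matches the job, else None."""
    for node in nodes:
        if node["free"] and node["accelerator"] == job["needs_accelerator"]:
            return node
    return None  # no matching node free -> queue the job instead of misplacing it

for job in jobs:
    node = pick_node(job, nodes)
    if node:
        node["free"] = False
        print(f"{job['name']} -> {node['name']}")
    else:
        print(f"{job['name']} -> queued (no matching accelerator free)")
```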

A significant portion of an organization’s computing budget may be spent on licensing the software needed for both simulations and machine learning. For example, if a license allows 128 cores to be used at once, a resource management system may, depending on user-defined rules, run four 32-core applications concurrently or wait until all 128 cores can be devoted to a single job. This consideration, along with the hardware requirements, demands an understanding of the entire HPC and machine learning stack as well as organizational deadlines, and it dictates when and where a given application should be placed and executed.
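
To make the license trade-off concrete, here is a minimal sketch of license-token accounting under the 128-core license described above. The LicensePool class and its policy are illustrative assumptions; production resource managers expose such choices as configurable site rules rather than hard-coded logic.

```python
class LicensePool:
    """Minimal sketch of core-based license accounting (assumed policy)."""

    def __init__(self, total_cores):
        self.total = total_cores
        self.in_use = 0

    def try_start(self, cores):
        """Start a job if it fits under the license; otherwise leave it queued."""
        if self.in_use + cores > self.total:
            return False              # would exceed the license -- queue the job
        self.in_use += cores
        return True

    def release(self, cores):
        """Return license tokens when a job finishes."""
        self.in_use -= cores

pool = LicensePool(128)
# Four 32-core jobs fit under the 128-core license ...
for i in range(4):
    print(f"job {i}: started={pool.try_start(32)}")
# ... so a fifth must wait until tokens are released.
print("job 4:", "started" if pool.try_start(32) else "queued (license exhausted)")
```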

Over the next few weeks, the insideHPC Special Report series on scheduling solutions for easier workload convergence and HPC system management in data centers will cover additional topics.

Download the full report, “insideHPC Special Report: Successfully Managing The Convergence of Workloads In Your Data Center,” courtesy of Bright Computing.