This article is part of the Five Essential Strategies for Successful HPC Clusters series, which was written to help managers, administrators, and users deploy and operate successful HPC clusters.
The basic HPC cluster consists of at least one management/login node connected to a network of many worker nodes. Depending on the size of the cluster, there may be multiple management nodes used to run cluster-wide services, such as monitoring, workflow, and storage services. The login nodes are used to accommodate users. User jobs are submitted from the login nodes to the worker nodes via a workload scheduler.
Cluster building blocks have changed in recent years. The hardware options now include multi-core Intel x86-architecture worker nodes with varying numbers of cores and amounts of memory. Some, or all, of the nodes may have accelerators in the form of NVIDIA GPUs or Intel Xeon Phi coprocessors. At a minimum, nodes are connected with Gigabit Ethernet (GbE), often supplemented by InfiniBand (IB). In addition, modern server nodes offer a form of Intelligent Platform Management Interface (IPMI), an out-of-band network that can be used for rudimentary monitoring and control of compute node hardware status. Storage subsystems providing high-speed parallel access to data are also a part of many modern HPC clusters. These subsystems use the GbE or IB fabrics to give compute nodes access to large amounts of storage.
On the software side, much of the cluster infrastructure is based on open-source software. In almost all HPC clusters, each worker node runs a separate copy of the Linux OS that provides services to the applications on the node. User applications employ message passing libraries (e.g., the Message Passing Interface, MPI) to collectively harness large numbers of x86 compute cores across many server nodes. Nodes that include coprocessors or accelerators often require user applications to use specialized software or programming methods to achieve high performance. An essential part of the software infrastructure is the workload scheduler (such as Slurm, Moab, Univa Grid Engine, or Altair PBS Professional), which allows multiple users to share cluster resources according to scheduling policies that reflect the objectives of the business.
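To make the idea of policy-driven sharing concrete, here is a minimal toy sketch in Python. It is not how Slurm, Moab, Grid Engine, or PBS Professional work internally; the `Job` class, the `schedule` function, the "lower number runs first" priority rule, and the first-fit placement are all assumptions chosen for illustration, and each job is assumed to fit on a single node.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    # Hypothetical job record; only priority and seq take part in ordering.
    priority: int                      # lower value runs first (assumed policy)
    seq: int                           # submission order breaks priority ties
    user: str = field(compare=False)
    cores: int = field(compare=False)


def schedule(jobs, free_cores):
    """Place jobs on nodes in priority order; return (placements, waiting).

    free_cores maps a node name to its idle core count and is not modified.
    """
    free = dict(free_cores)            # work on a copy of the node state
    queue = list(jobs)
    heapq.heapify(queue)               # min-heap ordered by (priority, seq)
    placements, waiting = [], []
    while queue:
        job = heapq.heappop(queue)
        # First fit: take the first node with enough idle cores.
        node = next((n for n, c in free.items() if c >= job.cores), None)
        if node is None:
            waiting.append(job.user)   # no node can host the job right now
        else:
            free[node] -= job.cores
            placements.append((job.user, node))
    return placements, waiting
```

For example, with two 16-core nodes and three jobs, `schedule([Job(2, 0, "alice", 8), Job(1, 1, "bob", 16), Job(2, 2, "carol", 12)], {"node01": 16, "node02": 16})` places bob's higher-priority job first and leaves carol's job waiting. Production schedulers add much more on top of this, such as multi-node jobs, fair-share accounting, and backfill.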
As clusters grew from tens to thousands of worker nodes, methods for efficiently deploying software emerged. Clearly, installing software by hand on even a small number of nodes is a tedious, error-prone, and time-consuming task. Network-based methods using Preboot Execution Environment (PXE) tools were developed to provision worker node disk drives with the proper OS and software. Methods to send a software image to compute nodes over the network exist for both diskful (resident OS disk on each node) and diskless servers. On diskful hosts, the hard drive is provisioned with a prepackaged Linux kernel, utilities, and application libraries. On diskless nodes, which may still have disks for local data, the OS image is placed in memory (RAM disk) to afford faster startup and to reduce or eliminate the need for hard disks on the nodes. In either case, node images can be centrally managed and distributed to the worker nodes. This eliminates the tendency for nodes to develop “personalities” due to administrative or user actions that alter nodes from their initial configuration.
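One way to picture what central image management guards against is a simple drift check: compare each node's file manifest with the manifest of the centrally managed image and flag any node that differs. The sketch below is a hedged illustration only; the manifest format (a mapping of file path to content hash) and the function names `manifest_digest` and `find_drifted_nodes` are hypothetical and do not correspond to any particular provisioning tool.

```python
import hashlib


def manifest_digest(manifest):
    """Hash a {path: content_hash} manifest in a stable, order-independent way."""
    h = hashlib.sha256()
    for path in sorted(manifest):      # sort so dict ordering does not matter
        h.update(f"{path}\0{manifest[path]}\n".encode())
    return h.hexdigest()


def find_drifted_nodes(golden, node_manifests):
    """Return sorted names of nodes whose manifest differs from the golden image."""
    expected = manifest_digest(golden)
    return sorted(
        name
        for name, manifest in node_manifests.items()
        if manifest_digest(manifest) != expected
    )
```

A node that was provisioned from the central image and never altered produces the same digest as the image itself, so it is not reported; a node where someone edited a file by hand shows up in the returned list and can be reprovisioned.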
Next week’s article will look at How to Manage HPC Cluster Software Complexity. If you prefer, you can download the entire insideHPC Guide to Successful HPC Clusters, courtesy of Bright Computing, by visiting the insideHPC White Paper Library.