How to Manage HPC Cluster Software Complexity

This article is part of the Five Essential Strategies for Successful HPC Clusters series, which was written to help managers, administrators, and users deploy and operate successful HPC cluster software.

HPC systems rely on large amounts of complex software, much of which is freely available. There is an assumption that because the software is “freely available,” there are no associated costs. This is a dangerous assumption. There are real configuration, administration, and maintenance costs associated with any type of software (open or closed). Freely available software does not eliminate these costs. Indeed, it may even increase them by requiring administrators to learn and master new packages and skills. Relying on freely available software also assumes that the administrator has picked the best package, one with an active development community behind it.

As previously mentioned, quick custom deployments may demonstrate an initial victory, but they are vulnerable to “breakage” when change is needed. They tend to be very “guru” dependent, requiring that the person who set up the system also maintain it. Replacing gurus can be very expensive and can introduce long downtimes while the entire system is reverse-engineered and re-architected by yet another guru.

Another cause of complexity is the lack of integration between software tools and utilities. There are many freely available tools for managing certain aspects of HPC systems, including packages like Ganglia (monitoring), Nagios (alerts), and IPMItool (out-of-band management). Each of these requires separate management access through a different interface (GUI or command line). Network switches are also managed through their own administrative interfaces.

A key ingredient in standing up your own cluster is an expert administrator. These administrators often have a specialized skill set that includes shell scripting, networking, package management and building, user management, node provisioning, testing, monitoring, network booting, and kernel-module management.

There are some freely available tools for managing cluster software complexity. These include projects such as Rocks, Warewulf, and oneSIS. While these help administrators manage cluster software deployment, they do not address tool integration. That is, they help manage the tools mentioned above, but do nothing to provide the administrator with a unified view of the cluster. If there are changes beyond the standard recipes offered by these packages, skilled administrators are often needed to fine-tune the cluster. This can involve scripts that make use of syntax and semantics specific to the management software itself. An example is configuration and operation of workload schedulers.
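To make the missing "unified view" concrete, here is a minimal Python sketch of what combining the output of separate tools into one record per node might look like. It is an illustration under stated assumptions, not how any of the packages named above work: the node names, the three input dictionaries, and the `unified_view` function are all hypothetical stand-ins for data an administrator would otherwise collect from each tool's own interface.

```python
# Hypothetical snapshots, one per tool. In practice each would be gathered
# through that tool's own interface; the values here are made up.
monitoring = {"node001": {"load": 0.42}, "node002": {"load": 7.90}}
alerts = {"node002": ["disk nearly full"]}
power = {"node001": "on", "node002": "on", "node003": "off"}


def unified_view(monitoring, alerts, power):
    """Merge per-tool snapshots into one record per node.

    A node that is missing from a tool's snapshot gets a neutral
    placeholder (None, an empty list, or "unknown") for that field.
    """
    nodes = sorted(set(monitoring) | set(alerts) | set(power))
    view = {}
    for node in nodes:
        view[node] = {
            "load": monitoring.get(node, {}).get("load"),
            "alerts": list(alerts.get(node, [])),
            "power": power.get(node, "unknown"),
        }
    return view


if __name__ == "__main__":
    for node, record in unified_view(monitoring, alerts, power).items():
        print(node, record)
```

Even this toy version has to decide how to handle nodes that one tool knows about and another does not, which is the kind of glue work that falls to the administrator when tools are not integrated.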

One of the best packages for managing software complexity is Bright Cluster Manager. This professional package allows complete control of the cluster software environment from a single point-and-click interface that does not require advanced systems-administration skills. Its provisioning features include the ability to install individual nodes or complete clusters from bare metal within minutes. Administrators can create and manage multiple, different node images that can be assigned to specific nodes or groups of nodes. Once the node images are in place, they can be changed or updated directly from the head node without the need to log in to or reboot the nodes. Packages on node images can be added or removed using standard RPM tools or YUM. Changes can be easily tracked, and old node images can be restored to any node. In addition, node images can be configured as either diskless or diskful, with the option to configure RAID or Logical Volume Management (LVM). Bright Cluster Manager is an all-encompassing solution that also integrates monitoring and management of the cluster, including the selection and full configuration of available workload schedulers. As shown in Figure 1, complex image management scenarios can be addressed through the Bright GUI; a similar capability is available from Bright’s command line interface (not shown).


Figure 1: Bright Cluster Manager manages multiple software images simultaneously. Additional modifications to default-image are about to be created and then registered for revision control.
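The idea of assigning images to specific nodes or to groups of nodes can be pictured as a simple lookup. The following minimal Python sketch is an assumption-laden illustration, not Bright Cluster Manager's actual interface or data model: the group names, node names, the image names other than default-image, and the rule that a per-node assignment takes precedence over a group assignment are all hypothetical.

```python
# Hypothetical assignments. Only "default-image" appears in the article;
# every other name here is invented for illustration.
group_images = {"compute": "default-image", "gpu": "gpu-image"}
node_groups = {"node001": "compute", "node002": "gpu", "node003": "compute"}
node_images = {"node003": "test-image"}  # per-node override (assumed rule)


def image_for(node):
    """Return the image a node should boot.

    Assumed precedence: an image assigned directly to the node wins;
    otherwise the node inherits the image of its group.
    """
    if node in node_images:
        return node_images[node]
    return group_images[node_groups[node]]


if __name__ == "__main__":
    for node in sorted(node_groups):
        print(node, "->", image_for(node))
```

Keeping the assignment in one place like this is what lets an image be updated once and picked up by every node that refers to it.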

Recommendations for Managing Software Complexity

  • Unless your goal is to learn about HPC cluster design, avoid creating your own cluster from scratch. If you choose this route, be sure your administrators have the required skill set. Also, document all changes and configuration steps for every package so other administrators can work with what you have created. This method is the most time-consuming path to creating a cluster and the least likely to deliver against the success factors identified previously.
  • Consider using a cluster management toolkit such as Rocks, Warewulf, or oneSIS. Keep in mind that these toolkits help manage the standard packages mentioned above, but do not provide integration. The skill set needed to use these systems, particularly if you want to make configuration changes, is similar to that needed to create a cluster from scratch. These tools help reduce the work needed for cluster setup, but still require time to “get it right.”
  • A fully integrated system like Bright Cluster Manager provides a solution that does not require advanced administrator skills or large amounts of time to set up a cluster. It helps eliminate the extra management costs associated with freely available software and virtually eliminates the need for expensive administrators or cluster gurus. This is the fastest way to stand up an HPC cluster and start doing production work.

Next week’s article will look at Recommendations to Manage HPC Cluster Growth. If you prefer, you can download the entire insideHPC Guide to Successful HPC Clusters, courtesy of Bright Computing, by visiting the insideHPC White Paper Library.
