Streamline Your HPC Setup with Intel Cluster Checker

Clusters can be complex systems. A team of engineers, often drawn from several different organizations, may contribute to the successful implementation of a productive cluster. With so many opinions and users involved, a high performance cluster can become quite complex.

Creating a high performance cluster involves many steps. The requirements may include hardware uniformity, bandwidth concerns, OS standardization, and many other issues. Intel Cluster Checker verifies the uniformity, performance, and functionality of Linux* OS-based clusters. If issues are found, Intel Cluster Checker diagnoses the problems and may provide recommendations on how to repair the cluster. In addition, Intel Cluster Checker offers an API so that its checks can be integrated into other applications.

The sequence of steps in an Intel Cluster Checker run is:

  • Collect Diagnostic Data
  • Analyze and then Apply Rules
  • Produce a List of Found Issues
  • Suggest Remedies
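
The four steps above can be sketched as a small rule-driven pipeline. This is only an illustration of the flow, not Intel Cluster Checker's implementation: the `Issue` type, the `zombie_rule` function, the `zombie_pids` field, and the canned sample data are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Issue:
    """Hypothetical record for one finding; not the tool's real data model."""
    node: str
    description: str
    remedy: str


# A rule inspects one node's collected data and returns an Issue or None.
Rule = Callable[[str, dict], Optional[Issue]]


def collect(nodes: List[str]) -> Dict[str, dict]:
    """Step 1: stand-in for data collection; returns canned sample data."""
    sample = {
        "node01": {"zombie_pids": []},
        "node02": {"zombie_pids": [4242]},
    }
    return {node: sample.get(node, {}) for node in nodes}


def zombie_rule(node: str, data: dict) -> Optional[Issue]:
    """Example rule: flag nodes whose collected data lists zombie processes."""
    zombies = data.get("zombie_pids", [])
    if zombies:
        return Issue(
            node=node,
            description=f"{len(zombies)} zombie process(es) found",
            remedy=f"clean up PIDs {zombies}",
        )
    return None


def analyze(collected: Dict[str, dict], rules: List[Rule]) -> List[Issue]:
    """Steps 2 and 3: apply every rule to every node and gather the issues."""
    issues = []
    for node, data in collected.items():
        for rule in rules:
            issue = rule(node, data)
            if issue is not None:
                issues.append(issue)
    return issues


# Step 4: print each issue together with its suggested remedy.
issues = analyze(collect(["node01", "node02"]), [zombie_rule])
for issue in issues:
    print(f"{issue.node}: {issue.description} -> suggested remedy: {issue.remedy}")
```

Keeping each rule as a separate function means new checks can be added without touching the collection or reporting code.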

Intel® Cluster Checker is modeled on clinical decision support systems. Given a symptom, such as a job running slower than expected, Intel Cluster Checker examines a number of items and then suggests an action, such as killing a zombie process that is stealing cycles from the application. It can check a wide range of issues, such as the uniformity and consistency of BIOS settings. Benchmarks with a known expected performance can be run and the timing results compared to the expected value. Intel Cluster Checker also covers the entire line of Intel CPUs as well as network fabrics, operating systems, and the storage system.
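
As a rough illustration of comparing benchmark timings to a known value, the sketch below flags nodes whose run time exceeds the expected time by more than a tolerance. The function name `slow_nodes`, the 10% tolerance, and the sample timings are assumptions made for this example; they are not taken from Intel Cluster Checker.

```python
from typing import Dict, List


def slow_nodes(timings: Dict[str, float],
               expected_seconds: float,
               tolerance: float = 0.10) -> List[str]:
    """Return nodes whose benchmark time exceeds the expected time
    by more than `tolerance` (a fraction, so 0.10 means 10%)."""
    limit = expected_seconds * (1.0 + tolerance)
    return sorted(node for node, seconds in timings.items() if seconds > limit)


# Hypothetical timings, in seconds, for the same benchmark on three nodes.
timings = {"node01": 10.2, "node02": 10.4, "node03": 13.9}
print(slow_nodes(timings, expected_seconds=10.0))  # prints ['node03']
```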

Intel Cluster Checker executes in two phases. The clck command invokes both phases together (it is also possible to invoke them separately and to customize their scope). The command collects data on the cluster, then analyzes the data in the database and produces the results of the analysis. Each issue and diagnosis found has an associated severity level. For example, if the amount of memory in a node falls outside a specified range, the node is flagged as non-uniform. In addition, simple benchmarks can be run on each node to determine whether all nodes are behaving the same and, if not, report back to the user. By default, Intel Cluster Checker verifies the overall health of the cluster using the health framework definition (an XML file that defines what data to collect and what kind of analysis to perform).
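
The memory example can be sketched as a simple uniformity check with a severity attached to each finding. This is one possible approach under stated assumptions: the median is used as the reference value, and the 2% allowed range, the 25% cutoff, and the "warning" and "critical" labels are all invented for illustration rather than taken from the tool.

```python
from statistics import median
from typing import Dict, List, Tuple


def memory_uniformity(mem_gib: Dict[str, float],
                      allowed_fraction: float = 0.02) -> List[Tuple[str, str, str]]:
    """Flag nodes whose memory differs from the cluster median by more than
    `allowed_fraction`. Severity labels are assumptions for this sketch."""
    reference = median(mem_gib.values())
    findings = []
    for node, mem in sorted(mem_gib.items()):
        deviation = abs(mem - reference) / reference
        if deviation > allowed_fraction:
            # Larger deviations get a higher (hypothetical) severity.
            severity = "critical" if deviation > 0.25 else "warning"
            findings.append(
                (node, severity, f"{mem} GiB vs reference {reference} GiB")
            )
    return findings


# Hypothetical installed memory per node, in GiB.
mem = {"node01": 192, "node02": 192, "node03": 192, "node04": 160, "node05": 96}
findings = memory_uniformity(mem)
for node, severity, detail in findings:
    print(f"[{severity}] {node}: non-uniform memory ({detail})")
```

Using the median as the reference keeps a single badly configured node from shifting the baseline for the rest of the cluster.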

Intel Cluster Checker provides an API so that an application can check the cluster health before starting a long-running job and take corrective measures before using valuable cluster time. In addition, scripts can be set up to run Intel Cluster Checker periodically to catch issues before users start to complain about poor performance.
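
A pre-flight gate of this kind might look like the sketch below. It does not use the actual Intel Cluster Checker API: the `check` callable is a placeholder for whatever wrapper a site writes around the tool, and the names `run_if_healthy` and `job_cmd` are hypothetical. The job command here is a trivial Python one-liner standing in for a real application.

```python
import subprocess
import sys
from typing import Callable, List, Sequence


def run_if_healthy(check: Callable[[], List[str]],
                   job_cmd: Sequence[str]) -> int:
    """Run `job_cmd` only if `check` reports no issues.

    `check` is a placeholder for a site-specific health check; it returns
    a list of human-readable issue descriptions (empty means healthy).
    Returns 1 without launching the job when issues exist, otherwise the
    job's own exit code.
    """
    issues = check()
    if issues:
        for issue in issues:
            print(f"cluster check failed: {issue}", file=sys.stderr)
        return 1
    return subprocess.run(list(job_cmd)).returncode


# Usage with a stand-in check that reports a healthy cluster.
rc = run_if_healthy(lambda: [], [sys.executable, "-c", "print('job ran')"])
print("exit code:", rc)
```

The same function could be called from a scheduled script with a job command that only logs the result, which would cover the periodic-check case.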

Understanding a cluster can be difficult without tools such as Intel Cluster Checker. Think of how many times users complain that their applications are not running with the expected performance, and how long it takes system administrators to diagnose the issue. With Intel Cluster Checker, diagnosing and debugging these issues is easier and less complex. By using this tool, customers will be more satisfied and a higher return on investment will be realized.

Download Intel® Cluster Checker for free 

Download Intel® Parallel Studio XE for free