Diagnose Cluster Health with Intel® Cluster Checker

Sponsored Post

If you’re administering a cluster with hundreds or thousands of nodes, how can you be sure that everything is running properly?

Even medium-sized clusters present administrators with a level of complexity many orders of magnitude greater than a single system. As clusters scale up, so do the potential problems facing the system administrators responsible for giving users the highest level of productivity.

Is the cluster performing as expected, or are there hidden issues caused by incompatible hardware or software configurations? Are the BIOS settings the same across all nodes? Does the cluster comply with a minimum reference architecture? And if there is an issue impeding performance, what is the root cause and how can it be resolved?

Intel® Cluster Checker, distributed as part of Intel® Parallel Studio XE 2018 Cluster Edition, provides a set of system diagnostics and analysis methods in a single tool to assist in managing clusters of any size.

Think of Intel Cluster Checker as a clinical system that detects signs of issues affecting the health of the cluster, diagnoses those issues, and suggests remedies. Using common diagnostic tools, it looks for signs that may indicate symptoms, leading to a diagnosis and a possible solution.

Intel Cluster Checker collects more than 100 characteristics about the cluster, looking for symptoms.

The Intel Cluster Checker expert system assesses firmware, kernel, storage, and network settings, and conducts high-level tests of node and network performance using the Intel® MPI Library benchmarks, the STREAM* memory bandwidth benchmark, the High-Performance Linpack Benchmark, and other benchmarks. The list of checks grows with each update release.
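To make the idea of a node performance check concrete, here is a minimal Python sketch that flags nodes whose measured memory bandwidth falls well below the cluster median. It is not how Intel Cluster Checker itself works internally; the function name `flag_slow_nodes`, the 90% threshold, and the bandwidth figures are all assumptions chosen for illustration.

```python
from statistics import median

def flag_slow_nodes(bandwidth_gbs, fraction=0.9):
    """Return nodes whose bandwidth is below `fraction` of the cluster median.

    `bandwidth_gbs` maps node name -> measured bandwidth in GB/s.
    The 0.9 default is an arbitrary illustrative threshold.
    """
    if not bandwidth_gbs:
        return {}
    threshold = fraction * median(bandwidth_gbs.values())
    return {node: bw for node, bw in bandwidth_gbs.items() if bw < threshold}

# Made-up results from a memory bandwidth benchmark run on each node.
results = {"node01": 95.1, "node02": 94.8, "node03": 61.3, "node04": 95.4}
slow = flag_slow_nodes(results)
print(slow)
```

Comparing each node against the median rather than a fixed number means the check does not need to know the expected bandwidth of the hardware in advance, only that the nodes are supposed to be alike.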

Consistency in hardware and software over the cluster is critical for cluster performance, and this is something Intel Cluster Checker verifies. For example, it checks that the firmware on network cards is the same on each node, as well as software versions, settings, and more. It highlights any differences in configurations of both hardware and software components.
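The consistency idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Intel Cluster Checker's implementation: the function `find_inconsistencies`, the setting names, and the version strings are hypothetical, and the sketch assumes the per-node settings have already been collected into a dictionary.

```python
from collections import Counter, defaultdict

def find_inconsistencies(node_settings):
    """Return {setting: {node: value}} for nodes that differ from the majority.

    `node_settings` maps node name -> {setting name: value}.
    Settings missing on a node are ignored in this simple sketch.
    """
    values_by_setting = defaultdict(dict)
    for node, settings in node_settings.items():
        for key, value in settings.items():
            values_by_setting[key][node] = value

    report = {}
    for key, per_node in values_by_setting.items():
        counts = Counter(per_node.values())
        if len(counts) > 1:  # more than one distinct value: inconsistent
            majority_value, _ = counts.most_common(1)[0]
            report[key] = {n: v for n, v in per_node.items() if v != majority_value}
    return report

# Hypothetical data: one node carries older network card firmware.
nodes = {
    "node01": {"nic_firmware": "2.42.5000", "kernel": "3.10.0-693"},
    "node02": {"nic_firmware": "2.42.5000", "kernel": "3.10.0-693"},
    "node03": {"nic_firmware": "2.40.7000", "kernel": "3.10.0-693"},
}
outliers = find_inconsistencies(nodes)
print(outliers)  # {'nic_firmware': {'node03': '2.40.7000'}}
```

Treating the most common value as the reference is a design shortcut: it needs no golden configuration, but it will point at the wrong nodes if most of the cluster is misconfigured.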

Using an API to control data collection and analysis, developers can extend capabilities by adding customized rules that inspect aspects of the system specific to an application, and inform the user of potential problems.
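The post does not show the extension API itself, so the following Python sketch only illustrates what a customized rule does conceptually: inspect collected data for something an application cares about and report a problem with a suggested remedy. The `Rule` class, the `evaluate` function, the `scratch_free_gb` fact, and the 50 GB limit are all hypothetical and do not correspond to the real Intel Cluster Checker interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    check: Callable[[Dict[str, object]], bool]  # returns True when the node passes
    remedy: str

def evaluate(rules: List[Rule], collected: Dict[str, Dict[str, object]]) -> List[str]:
    """Apply every rule to every node's collected facts and list the failures."""
    findings = []
    for node, facts in sorted(collected.items()):
        for rule in rules:
            if not rule.check(facts):
                findings.append(
                    f"{node}: {rule.name} failed; suggested remedy: {rule.remedy}"
                )
    return findings

# Hypothetical application-specific rule: the job needs 50 GB of scratch space.
scratch_rule = Rule(
    name="scratch-space",
    check=lambda facts: facts.get("scratch_free_gb", 0) >= 50,
    remedy="free space on the scratch filesystem",
)

# Made-up collected data for two nodes.
collected = {
    "node01": {"scratch_free_gb": 120},
    "node02": {"scratch_free_gb": 12},
}
findings = evaluate([scratch_rule], collected)
for line in findings:
    print(line)
```

Keeping the check, its name, and its remedy together in one object mirrors the detect, diagnose, and suggest flow described earlier in the article.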

Besides cluster health, Intel Cluster Checker will also verify that a cluster complies with the Intel Scalable System Framework reference architecture, which establishes the minimum system requirements to enable interoperability through a common platform-application interface.

In general, using Intel Cluster Checker makes it easier for sys admins to keep large clusters in a healthy state and performing well for users.

Download your free 30-day trial of Intel® Parallel Studio XE 2018
