In this special guest feature, Ian Lumb from Bright Computing discusses the need to modernize software monitoring of HPC clusters.
In an 1883 lecture on “The Practical Applications of Electricity”, Scottish physicist Lord Kelvin stated: “… when you can measure what you are speaking about, and express it in numbers, you know something about it …” High Performance Computing (HPC), therefore, inherited a healthy predisposition towards monitoring. Fast-forward to the present, and monitoring HPC clusters remains topical. And while I expect we can all agree on its ongoing relevance, it is clear that there are very different perspectives as to how monitoring should be modernized. Whereas passive monitoring using meta-toolkits may address needs temporarily, unified solutions that combine monitoring with provisioning and management deliver value on an ongoing, sustainable basis.
The Nature of Toolkits
As a set of software tools for the purpose of monitoring IT infrastructures, each toolkit has its own:
- User interface – Command Line Interfaces (CLIs) are typical, whereas Graphical User Interfaces (GUIs) are often absent
- Agent or agents for collecting metrics – each agent incorporates assumptions and biases regarding, for example, sampling
- Schema and database for organizing and storing raw as well as processed metrics
- Syntax and semantics for customization – meaning that involved scripting is typically required to ensure a useful implementation
- Inherent scalability limitations
- Inherent software dependencies – toolkits are dependent upon various software prerequisites, some of which may be fairly obscure and require building from source code
- Inherent interoperability limitations – since toolkits are implemented in isolation without a predisposition for interoperability
- Maintenance and roadmap considerations – dependent upon the engagement of its developers and other stakeholders
Because they are designed for the monitoring of IT infrastructures in a generic way, you will need to expend additional effort to adapt toolkits for use in your HPC environment. You may, for example, undertake multiple steps to enable the add-ons that allow accelerators (e.g., Ganglia and NVIDIA GPUs) and coprocessors (e.g., Ganglia and the Intel Xeon Phi) to be monitored by toolkits.
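To illustrate the glue code such add-ons typically require, the sketch below parses the CSV output of NVIDIA's `nvidia-smi` utility into per-GPU metric records that a toolkit agent could publish. The query fields named in the comment are real `nvidia-smi` options, but the parsing function and the sample output are illustrative assumptions, not Ganglia's actual GPU module.

```python
import csv
import io

# Fields as they would be queried with:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#              --format=csv,noheader,nounits
FIELDS = ("gpu_index", "gpu_util_pct", "gpu_mem_used_mib")

def parse_gpu_metrics(raw: str):
    """Turn nvidia-smi CSV output into a list of per-GPU metric dicts."""
    reader = csv.reader(io.StringIO(raw.strip()))
    return [
        dict(zip(FIELDS, (int(col.strip()) for col in row)))
        for row in reader
    ]

# Sample output for two GPUs (values are made up for illustration):
sample = "0, 87, 10240\n1, 12, 2048\n"
metrics = parse_gpu_metrics(sample)
```

A real agent would run the `nvidia-smi` command on a sampling interval and hand each record to the toolkit's metric-publishing API; the point is that none of this plumbing comes for free.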
The Nature of Meta-Toolkits
There is no question that meta-toolkits (toolkits about toolkits) allow monitoring data to be aggregated, and they might even address some of the concerns we’ve raised. For example, meta-toolkits might include:
- A GUI – a “single pane of glass” view for metrics aggregated from a number of toolkits
- A single schema and database for organizing and storing metrics – metrics extracted from toolkits and their respective agents
- Reports – based upon, and potentially derived from, metrics aggregated from a number of toolkits
However, meta-toolkits may obfuscate:
- Assumptions and biases involved in sampling and processing – for example, was interpolation or extrapolation required to produce aggregated, summarized metrics in a seemingly consistent fashion?
- Scalability limitations – for example, the meta-toolkit may be scalable well beyond the design limitations and/or implementation realities of one or more toolkits
- Existing capabilities within a specific toolkit
I refer to the last point as the Lowest Common Denominator Effect. To fully expose this concern, consider the case of workload managers. If you are designing a meta-toolkit for monitoring, how do you account for the breadth and depth of metrics across the spectrum of workload managers? You make compromises. Almost certainly defensible, your design compromises drive support in the meta-toolkit towards lowest-common-denominator metrics available from all the workload managers you support in your meta-toolkit. With attention forcibly focused on the common subset of metrics, the meta-toolkit approach demands that differentiating metrics be ignored. Seemingly subtle, the Lowest Common Denominator Effect can result in significant oversight that is difficult to explain in hindsight.
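The effect is easy to state in code. Treating each workload manager's metric catalogue as a set, the meta-toolkit can only safely rely on the intersection; everything differentiating falls away. The metric names below are hypothetical stand-ins, not actual scheduler counters.

```python
# Hypothetical metric catalogues for three workload managers.
slurm = {"jobs_pending", "jobs_running", "backfill_depth", "fairshare_usage"}
pbs   = {"jobs_pending", "jobs_running", "queue_wait_time"}
lsf   = {"jobs_pending", "jobs_running", "slot_utilization"}

# The meta-toolkit's lowest common denominator:
common = slurm & pbs & lsf

# Every differentiating metric is silently dropped:
ignored = (slurm | pbs | lsf) - common
```

Here only the two metrics common to all three schedulers survive aggregation; backfill depth, fairshare usage, queue wait time, and slot utilization are exactly the kind of differentiating detail the Lowest Common Denominator Effect discards.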
Finally, the management burden of toolkits and meta-toolkits requires careful consideration: in addition to maintaining each of the monitoring toolkits, the meta-toolkit itself requires maintenance, of course. The Heartbleed bug provides a compelling illustration. If you were using a monitoring framework based on meta-toolkits, you would need to:
- Assess each of the toolkits in isolation as well as the meta-toolkit itself for potential exploits relating to the SSL (Secure Sockets Layer) implementation in use
- Address the potential for exploits by patching each of the vulnerable toolkits in isolation as well as (potentially) the meta-toolkit
- Ensure that the patched toolkits interoperate appropriately with the (patched) meta-toolkit
Without even doing the math, it’s clear that there is an (n+1) management burden that escalates in proportion to the number n of toolkits in use, plus one for the meta-toolkit itself.
Meta-toolkits provide an approach for monitoring HPC environments.
A First-Principles Solution
In visualizing and reporting upon aggregated data acquired from a number of disparate toolkits, meta-toolkits permit passive monitoring of HPC clusters. Being able to actively or proactively respond to monitoring data in real time, however, requires a deeper level of interoperability. Although it certainly is an option to build an interoperable monitoring framework based upon existing tools, utilities, toolkits, etc., implementation complexity and effort render this approach untenable in practice.
Clearly an architectural alternative to the meta-toolkit approach is required. In seeking a unified solution that provisions and manages HPC clusters, in addition to monitoring them, unencumbered requirements analysis suggests that:
- All cluster-management capability must be provided by a single, lightweight agent
- All configuration and monitoring data must reside in a single database according to an appropriate schema
- All cluster-management capability must be equivalently accessible via a CLI or GUI
In November 2013, the Swiss National Supercomputing Centre (CSCS) conducted a technical evaluation of a dozen offerings for provisioning, monitoring and managing HPC clusters. In seeking a unified solution they stated: “Bright Cluster Manager offers a mature monitoring framework for existing resources. It allows monitoring current or past problems and collects trends that help the administrator to predict prospective issues. It can trigger alerts when certain thresholds are exceeded and can also launch an immediate action.” According to CSCS, Bright delivers a modernized monitoring solution for HPC. By unifying monitoring with management, CSCS highlighted that metrics provide the basis for action.
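The pattern CSCS describes — a threshold check that can trigger an alert and launch an immediate action — can be sketched in a few lines. The rule evaluator and the drain-node action below are hypothetical illustrations, not Bright Cluster Manager's actual API.

```python
from typing import Callable, Optional

def check_threshold(metric: str, value: float, limit: float,
                    action: Callable[[str, float], str]) -> Optional[str]:
    """Fire the supplied action when a sampled metric exceeds its threshold."""
    if value > limit:
        return action(metric, value)
    return None

def drain_node(metric: str, value: float) -> str:
    # Placeholder action; a real system might drain the node or page an admin.
    return f"ACTION: {metric}={value} exceeded limit, draining node"

# A node temperature of 92 C against an 85 C limit triggers the action;
# a reading of 60 C does not.
result = check_threshold("node_temp_c", 92.0, 85.0, drain_node)
quiet = check_threshold("node_temp_c", 60.0, 85.0, drain_node)
```

The essential point is architectural: because the metric, the threshold, and the action live in one framework, the response can be automatic rather than a manual follow-up to a passive report.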
If your needs are exclusively for monitoring your HPC environment, a meta-toolkit may suffice. However, if you seek a more comprehensive and future-proofed solution for monitoring that also includes provisioning and management capabilities, you need a unified solution that has these integrated capabilities architected in from the outset.