Why HPC Clusters Require Ultra-Low Latency Network Monitoring


By Vince Hill, cPacket Networks

High performance computing (HPC) requires an extremely high-powered network with ultra-low latency to move large files between HPC nodes quickly. IT and network operations (NetOps) teams in industries such as financial services, oil and gas, animation/3D rendering and pharmaceutical research need to monitor their networks in exacting detail to ensure they can support HPC workloads. But monitoring latency and other metrics at HPC-class performance levels creates a new set of challenges, including monitoring packets at 40Gbps and 100Gbps speeds, measuring latency at millisecond and nanosecond resolution, and detecting minuscule “microbursts” of traffic before they cause performance issues.

Let’s dig into those challenges in more detail.

Monitoring Packets at 10Gbps or Greater

As network speeds increase to 40 or 100Gbps, network monitoring tools, packet capture appliances and packet brokers will struggle to keep up unless they are specifically built for this use case. A general-purpose CPU architecture can’t capture packets at over 10Gbps without hardware assistance. The high-resolution measurement that HPC networks require often necessitates measuring key performance indicators (KPIs) like latency and jitter on each box, rather than at a central point (more on this later). This adds an extra layer of processing, which in turn adds delay. The packet broker must be powerful enough to acquire, process and distribute packets, absorbing this processing overhead, without slowing down the network. NetOps must ensure their monitoring hardware is designed for this high-speed, high-performance scenario.

Granular Latency Measurement

HPC workloads require extremely low network latency, usually less than a millisecond. Monitoring tools must measure latency at an even finer granularity than the workload’s tolerance (for example, if an HPC workload cannot tolerate more than 2 milliseconds of latency, then the monitoring tools must measure it in 1 millisecond intervals or less). This might seem obvious, but not all monitoring solutions are built for an ultra-low latency use case. NetOps must make sure their chosen solution is up to the challenge.
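As a minimal sketch of what that kind of measurement involves, the Python example below uses hypothetical capture data and packet IDs (real tools correlate packets by flow tuple or hash) to compute per-packet one-way latency from timestamps taken at two capture points, flagging anything over a 1 millisecond budget:

```python
# Illustrative sketch: per-packet one-way latency from two capture points.
# Assumes each point records (packet_id, timestamp_ns); data is hypothetical.

LATENCY_BUDGET_NS = 1_000_000  # 1 ms expressed in nanoseconds

def one_way_latencies(ingress, egress):
    """Return {packet_id: latency_ns} for packets seen at both capture points."""
    egress_by_id = dict(egress)
    return {
        pkt_id: egress_by_id[pkt_id] - ts_in
        for pkt_id, ts_in in ingress
        if pkt_id in egress_by_id
    }

def over_budget(latencies, budget_ns=LATENCY_BUDGET_NS):
    """Packets whose one-way latency exceeds the budget."""
    return {pkt: ns for pkt, ns in latencies.items() if ns > budget_ns}

if __name__ == "__main__":
    ingress = [(1, 1_000), (2, 2_000), (3, 3_000)]          # capture point A
    egress  = [(1, 501_000), (2, 1_802_000), (3, 703_000)]  # capture point B
    print(over_budget(one_way_latencies(ingress, egress)))
    # -> {2: 1800000}  (packet 2: 1.8 ms, over the 1 ms budget)
```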

Microburst Analysis

It’s unusual, even for HPC networks, to run at full capacity all the time. The network will have an average throughput and a maximum throughput that it only hits occasionally, in short bursts. Because of this, packet capture solutions also have two speeds: a sustained capture speed that they can run at indefinitely, and a “burst” speed that they can run at for up to a minute. 40/60Gbps sustained and 100Gbps burst is common for high-performance packet capture devices, so NetOps teams should make sure their chosen solution meets these standards.
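A rough back-of-the-envelope calculation shows why both figures matter: the gap between the burst rate and the sustained rate has to be absorbed in buffer memory for the duration of the burst. The numbers below are illustrative assumptions, not a vendor specification:

```python
# Rough illustration (assumed figures): buffer needed to absorb a burst
# that exceeds the sustained capture rate.
# buffer = (burst_rate - sustained_rate) * burst_duration

burst_gbps = 100       # short-term line rate during the burst
sustained_gbps = 40    # rate the device can process/write indefinitely
burst_seconds = 60     # assumed worst-case burst duration

excess_gbits = (burst_gbps - sustained_gbps) * burst_seconds
buffer_gbytes = excess_gbits / 8

print(f"Buffer required: {excess_gbits} Gbit (~{buffer_gbytes:.0f} GB)")
# -> Buffer required: 3600 Gbit (~450 GB)
```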

A more complex issue sits at the other end of the scale: traffic bursts so short (lasting only a few milliseconds) that they can slip past monitoring solutions that aren’t granular enough. If a monitoring device measures throughput every 10 milliseconds, and traffic spikes past the network’s maximum allowable load for just 2 milliseconds in between measurements, the spike won’t be detected. But during those 2 milliseconds, some packets will get dropped. In industries like finance, where a trade can be lost over a few milliseconds or a handful of dropped packets, these microbursts can have serious consequences. HPC workloads that tend to generate “bursty” traffic will need high-resolution metrics, as discussed earlier, plus the ability to analyze and determine the cause of microbursts.
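To make that concrete, here is a minimal sketch (plain Python, synthetic packet timestamps, and an assumed 10Gbps line rate) showing how the same traffic looks harmless when throughput is binned at 10 millisecond intervals yet clearly exceeds line rate when binned at 1 millisecond:

```python
# Illustrative sketch: the same packet stream binned at two measurement
# intervals. A ~2 ms spike above line rate is invisible in 10 ms bins.
from collections import defaultdict

LINE_RATE_BPS = 10e9  # hypothetical 10 Gbps link

def throughput_per_window(packets, window_ns):
    """packets: iterable of (timestamp_ns, size_bytes). Returns {window: bps}."""
    byte_bins = defaultdict(int)
    for ts_ns, size in packets:
        byte_bins[ts_ns // window_ns] += size
    return {w: b * 8 / (window_ns / 1e9) for w, b in byte_bins.items()}

def microbursts(packets, window_ns, limit_bps=LINE_RATE_BPS):
    """Windows whose average throughput exceeds the line rate."""
    return {w: bps for w, bps in throughput_per_window(packets, window_ns).items()
            if bps > limit_bps}

if __name__ == "__main__":
    # Synthetic stream: ~1.2 Gbps baseline plus a ~16 Gbps burst lasting ~2 ms.
    packets = [(t * 10_000, 1_500) for t in range(1_000)]
    packets += [(3_000_000 + i * 750, 1_500) for i in range(2_667)]
    print("10 ms bins over line rate:", microbursts(packets, 10_000_000))  # {} - missed
    print(" 1 ms bins over line rate:", microbursts(packets, 1_000_000))   # flags ms 3-4
```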

To solve these issues, IT and NetOps teams should make sure their chosen solution offers the following capabilities.

  • High-End Technical Capabilities – IT must choose monitoring hardware with the capability to process packets at the required speeds without dropping them. Solutions not built for high-speed networks won’t be up to the task.
  • Multiple Monitoring Points – Timing is essential for HPC workloads (especially use cases like high-frequency trading). Each network hop adds delay and bias to metrics such as latency or jitter. For the most accurate information, these metrics should be measured as close to the source of the traffic as possible, rather than streaming the traffic to a central point and measuring the KPIs there. This usually means using a network TAP or packet broker that can measure latency and jitter on the box (not all products offer this) and placing a number of physical or virtual boxes at important points around the network. This also makes it easier for IT to find the root cause of problems: if a certain flow has high latency and IT measures latency at three different points along that flow, they can narrow down what might be causing the delay (see the sketch after this list).
  • Precise Timestamping – The ability to timestamp packets to the nanosecond level is crucial for tracking and troubleshooting latency issues after the fact. For best results, this should be done at each capture point as explained above to prevent additional network hops from affecting the timestamps.
  • Single Pane of Glass Observability – Measuring network KPIs at multiple points quickly becomes burdensome if NetOps can only access data from one device at a time. For the full benefits of multiple capture points, NetOps should also combine all important metrics for HPC workloads (including jitter, one-way and round-trip latency, microbursts and TCP session metrics) from all network monitoring devices into a single pane of glass for maximum observability. This reduces complexity, speeds up troubleshooting and ultimately decreases downtime.
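To illustrate the multi-point and timestamping ideas together, the hypothetical sketch below takes nanosecond timestamps for a single packet at three capture points along its path and reports per-segment latency; the segment with the largest value is where troubleshooting should focus. The capture-point names and timestamps are invented for illustration:

```python
# Illustrative sketch: per-segment latency for one packet observed at three
# capture points with nanosecond timestamps. Names and data are hypothetical.

def segment_latencies(observations):
    """observations: list of (capture_point, timestamp_ns) in path order.
    Returns a list of (segment, latency_ns) for each consecutive pair."""
    return [
        (f"{a} -> {b}", tb - ta)
        for (a, ta), (b, tb) in zip(observations, observations[1:])
    ]

if __name__ == "__main__":
    obs = [("tap-core", 1_000_000_000),
           ("tap-spine", 1_000_150_000),
           ("tap-leaf", 1_002_350_000)]
    for segment, ns in segment_latencies(obs):
        print(f"{segment}: {ns / 1_000:.0f} µs")
    # -> tap-core -> tap-spine: 150 µs
    # -> tap-spine -> tap-leaf: 2200 µs  (most of the delay is on this segment)
```

In practice, per-segment figures like these are only meaningful if the clocks at the capture points are tightly synchronized (for example, via PTP), which is another reason to timestamp at the capture point rather than after additional hops.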

For success, IT and NetOps teams must make sure they can monitor the network at speeds approaching 100Gbps, measure latency to a sufficiently granular level, and analyze the smallest of microbursts. Granular, lossless network visibility will help ensure that their networks meet the high standards that HPC workloads demand.

Vince Hill is senior technical marketing manager at cPacket Networks.