Applying Cloud Techniques to Address Complexity in HPC System Integrations

Arno Kolster is Principal and Co-Founder of Providentia Worldwide.

In this video from the HPC User Forum at Argonne, Arno Kolster from Providentia Worldwide presents: Applying Cloud Techniques to Address Complexity in HPC System Integrations.

The Oak Ridge Leadership Computing Facility (OLCF) and technology consulting company Providentia Worldwide recently collaborated to develop an intelligence system that combines real-time updates from the IBM AC922 Summit supercomputer with local weather and operational data from its adjacent cooling plant, with the goal of optimizing Summit’s energy efficiency. The OLCF proposed the idea and provided facility data, and Providentia developed a scalable platform to integrate and analyze the data.

Through real-time analysis, cooling plant operators can understand the cause of temperature fluctuations in Summit’s water supply—whether due to a busy computational workload, a hot day, or a mechanical issue—and better predict the optimal settings for increasing energy efficiency.

Summit can process up to 200 quadrillion calculations per second at full numerical precision and more than a quintillion calculations per second at half precision, making it a state-of-the-art system for modeling and simulation as well as artificial intelligence. Even with this unprecedented performance, Summit is also the most energy-efficient supercomputer in its Green500 class, measured in gigaflops per watt, outranking systems one-tenth as fast.

“As we build faster and faster supercomputers, their energy efficiency becomes more and more important,” said Jim Rogers, computing and facilities director for ORNL’s National Center for Computational Sciences. “We wanted to couple Summit’s mechanical cooling system with its computational workload to optimize efficiency, which can translate to significant cost savings for a system of this size.”

On each Summit node, IBM’s baseboard management controller (OpenBMC) provides real-time readings from dozens of sensors on Summit’s Power9 processors and NVIDIA GPUs. Across the entire supercomputer, these readings total more than 460,000 metrics per second describing power consumption, temperature, and performance. Although these data streams were not designed specifically for controlling Summit’s cooling system, Rogers recognized early on that they could inform Summit’s cooling operations.

Although Summit is highly efficient, it still uses a lot of energy. Consuming up to 13 megawatts of electricity (though often much less), Summit can require enough electricity to power several thousand homes, and this energy is converted into waste heat. One of the mechanical innovations that makes Summit so energy efficient is its “warm-water” cooling system. To extract waste heat, a loop of water absorbs heat from the system and transfers it to the cooling plant.

Every minute, about 3,300 gallons of room-temperature water (about 71 degrees Fahrenheit or 22 degrees Celsius) is delivered to each computer cabinet via overhead pipes. The water flows first through a rear-door heat exchanger, then through custom cold plates on each of Summit’s 4,608 nodes, picking up waste heat along the way, and is returned to the cooling plant at 85–94 degrees Fahrenheit (29–34 degrees Celsius). At the cooling plant, this warm water is pumped through a series of plate-and-frame heat exchangers that transfer the waste heat to a cooling tower water loop. From there, the process starts over.
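As a rough sanity check on these figures, the heat carried away by the water loop can be estimated from the flow rate and the temperature rise (heat = mass flow × specific heat × temperature difference). The sketch below is only a back-of-the-envelope estimate, not ORNL's method. It assumes the 3,300 gallons per minute is the total loop flow, and the density and specific heat of water are approximate textbook values that do not come from the article.

```python
# Back-of-the-envelope estimate of the heat removed by the cooling loop.
# Assumptions (not from the article): 3,300 gal/min is the total loop flow,
# and water properties are approximate textbook values.

LITRES_PER_US_GALLON = 3.785411784   # exact definition of the US liquid gallon
WATER_DENSITY_KG_PER_L = 0.998       # approximate, near room temperature
WATER_CP_J_PER_KG_K = 4186.0         # approximate specific heat of water


def heat_removed_mw(flow_gpm: float, supply_f: float, return_f: float) -> float:
    """Return the heat picked up by the water, in megawatts."""
    # Convert gallons per minute to kilograms per second.
    mass_flow_kg_s = flow_gpm * LITRES_PER_US_GALLON * WATER_DENSITY_KG_PER_L / 60.0
    # A temperature difference in Fahrenheit converts to kelvin by a factor of 5/9.
    delta_k = (return_f - supply_f) * 5.0 / 9.0
    return mass_flow_kg_s * WATER_CP_J_PER_KG_K * delta_k / 1e6


# Supply at 71 °F, return between 85 °F and 94 °F (figures from the article).
low_mw = heat_removed_mw(3300, 71, 85)
high_mw = heat_removed_mw(3300, 71, 94)
print(f"Estimated heat removed: {low_mw:.1f} to {high_mw:.1f} MW")
```

Under these assumptions the estimate comes out at roughly 7 to 11 megawatts, the same order of magnitude as the electrical draw of up to 13 megawatts quoted above.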

The mechanical pumps and cooling tower fans use some electricity, but overall, the use of warm-water cooling saves on the energy bill.

“When the cooling plant was built, we had to make some assumptions about Summit’s energy consumption,” said David Grant, ORNL lead mechanical design engineer for the Summit cooling plant. “We knew that a water supply temperature of 71 degrees Fahrenheit at a specific flow rate would always work, but it’s not necessarily the most efficient choice. A temperature of 71 degrees Fahrenheit provides us with a bit of a safety margin to handle worst-case scenarios of load and weather. Even small increases in the supply water temperature can noticeably increase efficiency and reduce the plant’s electrical cost.”

With the right data, ORNL can safely optimize the efficiency of Summit’s cooling by adjusting the water temperature and flow. Raising the supply water temperature and lowering the water flow minimizes the use of backup chilled water.

“Knowing the extent to which we can turn these knobs is what the intelligent cooling system enables,” Grant said.

Because Summit’s energy consumption is related to its computational workload, OLCF staff members who monitor the jobs running on the system had the data they needed to optimize Summit’s cooling process—but not the capability to feed Summit data to the mechanical plant. Rogers had a vision to integrate the Summit, cooling plant, and local weather data, but the OLCF first needed a unique framework to capture, analyze, and visualize all these data streams at once.

“We needed an information hub that would allow us to integrate and analyze the data. And Providentia’s expertise is in real-time analytics systems,” said the OLCF’s Woong Shin, a high-performance computing systems engineer who managed the technical tasks in the project and worked closely with Providentia.

Providentia built a framework to pull from four main data sources: per-second sensor data (from the OpenBMC boards on each Summit node), jobs data at 15-second intervals (from Summit’s scheduler), the cooling plant’s programmable logic controller (a digital tool that controls its operations), and local weather data from the National Oceanic and Atmospheric Administration.

The raw source data is captured in real time at a rate of 460,000 or more metrics per second on a central message bus and marshaled using tools that are not commonly applied in high-performance computing.
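The article does not name the software behind the message bus, so the following is only a minimal in-process sketch of the general pattern: every source publishes events in one common shape, and consumers subscribe to the sources they care about. The `Event` and `MessageBus` classes, the source labels, and the metric names are all hypothetical, chosen for illustration.

```python
# Minimal in-process sketch of a publish/subscribe message bus.
# All class names, source labels, and metric names are hypothetical;
# the article does not describe the actual software used.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    source: str       # e.g. "openbmc", "scheduler", "plc", "weather"
    timestamp: float  # seconds since the epoch
    name: str         # metric name
    value: float      # metric value


Handler = Callable[[Event], None]


class MessageBus:
    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, source: str, handler: Handler) -> None:
        """Register a handler for one source, or "*" for every source."""
        self._handlers.setdefault(source, []).append(handler)

    def publish(self, event: Event) -> None:
        """Deliver an event to handlers for its source and to "*" handlers."""
        targets = self._handlers.get(event.source, []) + self._handlers.get("*", [])
        for handler in targets:
            handler(event)


bus = MessageBus()
latest: Dict[Tuple[str, str], float] = {}   # most recent value per (source, metric)
sensor_events: List[Event] = []             # only node sensor readings


def remember(event: Event) -> None:
    latest[(event.source, event.name)] = event.value


bus.subscribe("*", remember)
bus.subscribe("openbmc", sensor_events.append)

# One synthetic event per data source named in the article.
bus.publish(Event("openbmc", 0.0, "node0001.gpu0.temp_c", 41.5))
bus.publish(Event("scheduler", 0.0, "jobs.running", 12.0))
bus.publish(Event("plc", 0.0, "supply_water_temp_f", 71.0))
bus.publish(Event("weather", 0.0, "outdoor_temp_f", 60.0))
print(latest)
```

Keeping every source in one event shape is what lets a single consumer combine node sensors, jobs, plant, and weather data. A production system handling hundreds of thousands of events per second would use a distributed message broker rather than an in-process dictionary of callbacks.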

“We’ve developed the infrastructure architecture to scale to millions of events per second using containerized microservices and popular enterprise open-source software,” said Providentia’s Arno Kolster. “New disparate data sources can easily be added to provide an even more comprehensive systems-level view of performance and power usage.”

Facility staff can now visualize Summit behavior across all 4,608 nodes with a temperature heat map, a power consumption map, and power consumption data broken down by CPUs and GPUs.
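The views described above amount to two simple aggregations over per-node readings. The sketch below shows one possible way to compute them, using synthetic data. The grid width of 18 nodes per row, the function names, and the `(node_id, component, watts)` record layout are assumptions made for illustration, not details from the article.

```python
# Sketch of two aggregations behind such views, using synthetic data.
# The grid width, function names, and record layout are hypothetical.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

NODE_COUNT = 4608     # node count from the article
NODES_PER_ROW = 18    # hypothetical grid width, chosen only for display


def to_heat_map(node_temps: List[float],
                nodes_per_row: int = NODES_PER_ROW) -> List[List[float]]:
    """Reshape a flat, node-indexed list of temperatures into grid rows."""
    if len(node_temps) % nodes_per_row != 0:
        raise ValueError("node count must be a multiple of the row width")
    return [node_temps[i:i + nodes_per_row]
            for i in range(0, len(node_temps), nodes_per_row)]


def power_by_component(readings: Iterable[Tuple[int, str, float]]) -> Dict[str, float]:
    """Sum power in watts per component type, e.g. "cpu" and "gpu"."""
    totals: Dict[str, float] = defaultdict(float)
    for _node_id, component, watts in readings:
        totals[component] += watts
    return dict(totals)


# Synthetic temperatures in degrees Celsius, one per node.
temps = [30.0 + (n % 7) for n in range(NODE_COUNT)]
grid = to_heat_map(temps)

# Synthetic power readings as (node_id, component, watts).
readings = [(0, "cpu", 300.0), (0, "gpu", 900.0),
            (1, "cpu", 280.0), (1, "gpu", 1100.0)]
totals = power_by_component(readings)
print(len(grid), "rows;", totals)
```

The resulting grid can be handed to any plotting library as a two-dimensional array, and the per-component totals give the CPU and GPU breakdown.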

“Now we get a whole picture of Summit instantly,” Shin said.