Plug and Play Liquid Cooling for AI and HPC

This sponsored post from Asetek explores how increasing AI and HPC workloads, and the high-wattage CPUs and GPUs that run them, are heating up computing and bringing liquid cooling to the forefront.

The increasing need for HPC-style configurations in support of artificial intelligence (AI) workloads was clearly seen at this year’s supercomputing conference in Dallas, Texas (SC18).

Liquid cooling is required for the highest performance AI and HPC systems. (Photo: Asetek; Intel compute module with Asetek LAAC liquid cooler)

A key driver of this trend is the accelerating evolution of AI workloads away from traditional two-step architectures. AI leaders are replacing the prior approach, a compute-heavy training phase followed by a separate implementation phase, with a more dynamic model that includes real-time training, allowing algorithms to be further optimized while the application is in active use.

This all translates to HPC and the latest AI clusters running at 100 percent utilization for sustained periods. Complicating matters, these applications are compute-bound, so cutting-edge AI and HPC clusters require the highest-performance versions of the latest CPUs and GPUs. Along with that throughput come high heat loads: NVIDIA’s Volta V100 GPU is rated at 300 watts, while Intel’s Xeon Scalable (Skylake) processors and MIC-style Xeon Phi (Knights Mill) processors have publicly announced ratings of 205 and 320 watts, respectively.

These chip wattages translate into substantially higher power densities at both the node and rack level, not simply because of the component wattages themselves, but also because signal distances between processors, GPUs and switches must be kept as short as possible, both within and between cluster racks. These factors are driving rack power well beyond 50kW, to 80kW or higher.
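As a rough illustration of how these figures compound, the sketch below estimates node and rack power from the TDPs cited above. The node composition, overhead allowance and nodes-per-rack count are illustrative assumptions, not vendor or Asetek specifications.

# Back-of-the-envelope rack power estimate using the TDPs cited above.
# Node makeup, overhead factor and node count are assumed for illustration.

CPU_W = 205   # Intel Xeon Scalable (Skylake), per the article
GPU_W = 300   # NVIDIA Volta V100, per the article

def node_power(cpus=2, gpus=4, overhead=1.3):
    """Estimate node power: CPUs + GPUs plus ~30% for memory, NICs,
    storage, fans and power-conversion losses (assumed overhead)."""
    return (cpus * CPU_W + gpus * GPU_W) * overhead

def rack_power_kw(nodes=40):
    """Estimate rack power in kW for a dense, fully utilized rack."""
    return nodes * node_power() / 1000.0

print(f"Per node: {node_power():.0f} W")      # ~2093 W
print(f"Per rack: {rack_power_kw():.1f} kW")  # ~83.7 kW, in line with 50-80+ kW

Even with modest assumptions, a dense rack of such nodes lands in the 50kW to 80kW-plus range described above.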

Liquid cooling becomes an absolute requirement in these circumstances. Racks relying on air heat sinks struggle to remove enough heat to maintain maximum throughput, and CPUs throttle as a result of inefficient air cooling. Particularly for HPC clusters, reducing rack density instead means longer interconnect distances, greater latency and lower cluster throughput. As a result, liquid cooling is required for the highest performance AI and HPC systems.

A difficulty for operators is how to bring liquid cooling into their data centers in a managed way with minimal disruption. Because many liquid cooling approaches are one-size-fits-all solutions, it can be difficult to adopt liquid cooling on an as-needed basis. What is required is an architecture flexible enough to accommodate a variety of heat rejection scenarios.

Asetek’s Direct-to-Chip (D2C) liquid cooling provides a distributed cooling architecture to address the full range of heat rejection scenarios. It is based on low pressure, redundant pumps and sealed liquid path cooling within each server node.

Unlike centralized pumping systems, Asetek places the coolers (integrated pumps and cold plates) within each server or blade node, where they replace the CPU/GPU air heat sinks and remove heat with hot water. This distributed pumping provides flexibility on the heat-capture side because it can be adapted to different heat rejection requirements.

The Asetek architecture allows for managed incorporation of liquid cooling in the data center. Importantly, its heat rejection options provide adaptation to existing air-cooled data centers and a path to fully liquid-cooled facilities.

Adding liquid cooling with no impact on data center infrastructure can be done with Asetek’s InRackLAAC, a server-level Liquid Assisted Air Cooling (LAAC) option. With InRackLAAC, the redundant pump/cold-plate units in each server are paired with a shared HEX (radiator) in the rack, which exhausts the captured heat into the data center. The shared HEX sits in a 6kW, 2U chassis connected to a “block” of up to 12 servers, so existing data center HVAC systems continue to handle the heat.
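As a rough illustration of how the shared 6kW HEX budget constrains a block, the sketch below checks whether an assumed per-server liquid heat load fits within one chassis; the per-server wattages are hypothetical, not Asetek specifications.

# Rough sizing check for an InRackLAAC "block": a shared 6 kW HEX serving
# up to 12 servers. Per-server liquid-captured wattage is assumed.

HEX_BUDGET_W = 6000   # shared HEX capacity per 2U chassis, per the article
MAX_SERVERS = 12      # servers per block, per the article

def block_fits(servers, captured_w_per_server):
    """Return True if a block stays within both the server count
    and the shared-HEX heat budget."""
    return (servers <= MAX_SERVERS
            and servers * captured_w_per_server <= HEX_BUDGET_W)

print(block_fits(12, 450))  # True: 12 x 450 W = 5.4 kW, within the 6 kW HEX
print(block_fits(12, 600))  # False: 12 x 600 W = 7.2 kW exceeds the shared HEX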

When facilities’ water is routed to the racks, Asetek’s 80kW InRackCDU D2C can capture 60 to 80 percent of server heat into liquid. (Photo: Asetek)

Multiple compute blocks can be used in a rack. InRackLAAC allows incorporation of the highest-wattage CPUs and GPUs, and racks can contain a mix of liquid-cooled and air-cooled nodes.

When facilities’ water is routed to the racks, Asetek’s 80kW InRackCDU D2C can capture 60 to 80 percent of server heat into liquid, reducing data center cooling costs by over 50 percent and allowing 2.5x-5x increases in data center server density. Because hot water (up to 40°C) is used for cooling, the system does not require expensive HVAC capacity and can reject heat with inexpensive dry coolers.
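As a simple worked example of that capture range, the sketch below splits an assumed 80kW rack’s heat between the facilities water loop and room air; the rack wattage and the exact split are illustrative, not measured Asetek figures.

# Illustrative split of rack heat between the liquid loop and room air,
# using the 60-80 percent capture range cited above. Rack wattage assumed.

def heat_split(rack_kw, capture_fraction):
    """Return (kW captured into facilities water, kW left for room air)."""
    to_liquid = rack_kw * capture_fraction
    return to_liquid, rack_kw - to_liquid

for frac in (0.6, 0.8):
    liquid_kw, air_kw = heat_split(80.0, frac)
    print(f"{frac:.0%} capture: {liquid_kw:.0f} kW to water, {air_kw:.0f} kW to room air")
# 60% capture: 48 kW to water, 32 kW to room air
# 80% capture: 64 kW to water, 16 kW to room air

The remaining air-side heat is small enough for existing room cooling, while the bulk of the load moves to the water loop and inexpensive dry coolers.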

With InRackCDU, the collected heat is moved via a sealed liquid path to heat exchangers that transfer it into facilities water. InRackCDU mounts in the rack alongside the servers, occupying 4U, and connects to the nodes via zero-U, PDU-style manifolds.

Asetek’s distributed pumping architecture at the server, rack, cluster and site levels delivers flexibility in the areas of heat capture, coolant distribution and heat rejection.

Visit Asetek.com to learn more.