Novel Liquid Cooling Technologies for HPC

In this special guest feature, Robert Roe from Scientific Computing World writes that increasingly power-hungry and high-density processors are driving the growth of liquid and immersion cooling technology.

Asperitas AIC24 immersion cooling system

New processors and GPUs continue to demand more power than previous generations, and combining this with increasingly dense architectures requires cooling solutions such as liquid or immersion cooling. Demand for this technology is also growing with the use of GPU set-ups in both HPC and AI/ML environments, pushing cooling technologies to support denser and more powerful systems.

The last several years have seen a move from air to liquid cooling, and some users have also moved from rear-door heat exchangers (RDHx) to more direct cooling solutions, but the rise in density is pushing requirements higher and making alternative technologies economically viable for HPC and AI workloads.

Elizabeth Langer, R&D engineering manager, thermal business at CPC, comments on the rise of liquid cooling and the trends that CPC sees in the current market: “With the increase of computing power in smaller and smaller spaces, the density of the processing and intensity of the heat that it is producing is unprecedented.”

“Air cooling is no longer a viable option because it simply is not as efficient as liquid cooling. It’s physics. Water is 24 times more efficient at transferring heat than air is; for the same volume, water can hold 3,200 times more heat than air can. The demands of the computing power and cooling needs in racks require the use of liquid cooling solutions.”
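
Those figures are broadly consistent with a quick back-of-envelope check against approximate textbook property values for room-temperature water and air (the property values below are illustrative assumptions, not numbers supplied by CPC):

```python
# Rough sanity check of the water-versus-air figures quoted above,
# using approximate room-temperature property values (assumed, not from CPC).

water_k = 0.60        # thermal conductivity of water, W/(m*K)
air_k = 0.026         # thermal conductivity of air, W/(m*K)

water_cv = 1000 * 4180   # volumetric heat capacity of water, J/(m^3*K) (~1000 kg/m^3 x 4180 J/(kg*K))
air_cv = 1.2 * 1005      # volumetric heat capacity of air, J/(m^3*K)   (~1.2 kg/m^3 x 1005 J/(kg*K))

print(f"Conductivity ratio, water vs air: ~{water_k / air_k:.0f}x")            # ~23x, close to the quoted 24x
print(f"Heat held per unit volume, water vs air: ~{water_cv / air_cv:,.0f}x")  # ~3,500x, same order as the quoted 3,200x
```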

CPC specializes in the development of components and solutions for liquid cooling technology, and is seeing increased use of its technology in the HPC and AI markets.

“CPC is predicting robust growth in liquid cooling needs based upon growth trends in HPC,” added Langer. “With today’s supercomputers, it is now possible to solve previously unsolvable questions. On the non-academic side, driven by market forces to win with customers through hyper-customization, there is greater ‘invisible’ integration of modeling and predictive analysis into everyone’s lives. Demands for convenience, accuracy and speed are at the fore. So, on the two fronts of what is possible and what is being demanded, requirements for computing power will only continue to increase.”

Immersion cooling companies such as Asperitas and TMGcore are also experiencing similar demand for immersion-based solutions in HPC and AI.

Jake Mertel, chief technology officer at TMGcore, details how the company came to develop two-phase immersion cooling technology for HPC and AI. The company had originally developed a solution called ‘Everest’ for blockchain applications.

Mertel said: “During the time that we developed that product, primarily for our own internal use, we took that experience and turned it into something that we call ‘Otto’. That is a two-phase immersion cooling datacenter platform designed to provide fully modular cooling, power, communications and autonomous server management functionality in a packaged form factor. HPC is becoming more and more prevalent, and we see a lot of focus on ‘the edge’ and the use of both learning and inference at the edge. Workloads that would traditionally have existed only in hyperscale clusters, in portions of the datacenter that were very expensive to cool, maintain and operate, are now in demand very close to end-users, so GPU-oriented applications can take place closer to where people are using them, reducing latency and therefore providing a better user experience.”

Maikel Bouricius, marketing manager for Asperitas, a company that specializes in single-phase immersion cooling, has also noticed the demand for compute closer to the user: “What they demand now is a lot of compute in as small a space as possible. That is driven by different developments such as space constraints. For example, in dense urban environments like Amsterdam, space for datacenters is scarce at the moment, so the less space they can use the better.”

However, Bouricius also notes that immersion provides a hardware-agnostic platform, which is drawing new customers to explore the technology at a time of hardware diversity in HPC.

“I think a core element of our proposition is that this diversity of hardware, scenarios and applications is no longer dependent on the facility technology or the cooling technology,” he said. “You do not have to change the cooling technology anymore for different kinds of hardware components. For us it does not matter whether it is a GPU, CPU or FPGA. For us, as a solution provider, it does not matter at all, and I think that is the main driver of immersion technology compared to the other liquid-based technologies.”

Bouricius also noted that, while liquid cooling technology has been driven by cooling certain components in a server or a rack, with Asperitas, and immersion cooling in general, the advantage is flexibility for end-users while they are still experimenting with diverse types of hardware. “In that scenario immersion is ideal because it is basically a datacenter in a box,” he added.

Addressing sustainability

With increasingly powerful computing systems and power consumption rising in both new CPUs and GPUs, the sustainability of datacenter technologies remains an open question. As Langer notes, this is not necessarily the primary concern of people building a new cluster.

“The question of sustainability often takes a back seat to economics in the real world. The tipping point of datacenters using liquid cooling is happening now because it is more efficient (and subsequently cheaper) to cool with liquid cooling. Early campuses did not require liquid cooling as the heat generated by the computing simply wasn’t there,” stated Langer.

“Converting sites from air cooling to liquid cooling can cost thousands of dollars so the ROI has to be there. The infrastructure that has been built to accommodate air cooling systems is not necessarily ideal for liquid cooling. Raised floors have weight limitations, after all, so in upgrading to a liquid cooling system the weight of each element is a consideration. CPC has developed a line of lightweight fittings specifically for liquid cooling, enabling data centers to keep the rack density they are accustomed to,” Langer added.

CPC develops fittings, or quick disconnects (QDs), which are purpose-built for liquid cooling of electronics. “CPC’s robust LQ products feature a proprietary valve design that delivers optimal flow,” stated Langer.

According to Langer, any liquid cooling implementation, regardless of manufacturer, will save customers money because the systems are more efficient, but end customers should care about QD reliability and ease of use as well.

“If a site currently is not running with liquid cooling, conversion to liquid cooling is most likely on the facilities master plan.

“Research centres have expert teams evaluating capital investment and operational expenses to better understand overall TCO.”

Sustainability was a primary concern for Asperitas as it aimed to develop its technology not just for HPC but also hyperscale, which requires much higher efficiency than a typical HPC installation.

“It is a combination of different elements. We started to develop the solution for scale and that is very different from a solution for HPC and supercomputing, which is always a custom solution and always a niche project. You might do one or two projects a year as a solutions provider,” explained Bouricius.

“In our case, the end game that we had in mind was a solution that could also be used by hyperscale cloud providers. That was the extreme scale that we developed this for.”

The Asperitas team achieved increased efficiency by designing the system around natural convection of the liquid, driven by temperature differences, rather than using pumps. “This means that the cooling fluid is not moved around by pumps or any other mechanical force but by natural convection,” added Bouricius.

“This makes it a very stable and reliable solution. We do not mix the fluid, because that would result in a lower average temperature. We have a layering of temperatures where all the components that need to be cooled are at a lower temperature and in the top layer of the system the fluid temperature is quite high. This allows the solution to be rather independent of climate, because we can cool the system with a water temperature of 45 degrees and we can still optimize for other temperatures.”

Cooling the next generation

Mertel and his colleagues at TMGcore decided to embrace the high-performance potential of two-phase immersion cooling technology, focusing on highly dense solutions and trying to reduce overall system size.

“Our current generation of products focuses on high-density applications. The most powerful blade server that we have developed is 6,000 watts. It has 16 Nvidia V100 GPUs, an absurd amount of VRAM and dual Intel Xeon Scalable processors. It is a beast of a server and it is 1 OIU (one Otto Immersion Unit); it is our standard one-unit blade server form factor,” said Mertel.
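
A rough power-budget sketch shows how quickly such a blade reaches 6,000 watts, with the GPUs dominating (the per-component wattages below are assumptions based on typical published TDPs, not TMGcore figures):

```python
# Rough power-budget sketch for a 6,000 W blade of the kind described above.
# Per-component wattages are assumptions based on typical published TDPs,
# not figures supplied by TMGcore.

gpu_count, gpu_tdp_w = 16, 300       # V100 SXM modules are commonly rated around 300 W
cpu_count, cpu_tdp_w = 2, 200        # assumed ~200 W per server CPU

gpu_total = gpu_count * gpu_tdp_w                 # 4,800 W
cpu_total = cpu_count * cpu_tdp_w                 # 400 W
remainder = 6000 - gpu_total - cpu_total          # memory, voltage regulation, interconnect, etc.

print(f"GPUs: {gpu_total} W, CPUs: {cpu_total} W, everything else: ~{remainder} W")
```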

“The types of densities that can be achieved through two-phase immersion really lend the technology to servers that are able to consume large amounts of power. Today that is primarily GPU-based workloads, which is what a good number of HPC users require for their applications,” Mertel added.

Mertel also noted that the company is developing its products for more generalized CPU-based systems: “The next generation of CPUs is, in terms of power density, starting to get on the same order of watts per square centimetre (W/cm²) as a GPU. Intel has already announced a 400-watt TDP Xeon Scalable processor and there are AMD offerings today that almost hit 300 watts, which is the same as a V100 GPU.”
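
To put those wattages in rough power-density terms, a short sketch (only the 400 W and ~300 W figures come from the quote; the die areas below are approximate assumptions added for illustration):

```python
# Rough power-density comparison implied by the quote above.
# Only the TDP figures come from the article; die areas are approximate assumptions.

v100_tdp_w, v100_die_cm2 = 300, 8.15   # Nvidia GV100 die is roughly 815 mm^2
cpu_tdp_w, cpu_die_cm2 = 400, 7.0      # assumed large server-CPU die of roughly 700 mm^2

print(f"V100 GPU: ~{v100_tdp_w / v100_die_cm2:.0f} W/cm^2")   # ~37 W/cm^2
print(f"400 W CPU: ~{cpu_tdp_w / cpu_die_cm2:.0f} W/cm^2")    # ~57 W/cm^2
```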

“We know that there are going to be more powerful chips and we also know, as has been publicly announced by Nvidia, that there are going to be some architecture changes with respect to the Volta GPUs. The SXM3 interface is going to support 48V power input. Modern server architectures are designed to run at 12V; 48V is something that has become prevalent thanks mainly to the efforts going on in the Open Compute Project (OCP),” Mertel noted.

TMGcore uses 48V as its rack-level distribution voltage, which gives the company an advantage when designing these systems, while some organizations would need to redevelop their power supplies or transition boards to 48V to support these technologies.

“We have been somewhat fortunate that we had the opportunity to jump into a few of those development efforts. We have been able to see what boards that are built from the ground up and optimized for immersion cooling might look like,” Mertel said.

“We know that CPUs and GPUs are going to get denser and we have developed technologies that are available today which support a 500-watt chip the size of a V100, and we are working on the development of boiling enhancements that would allow us to go beyond that.”

Mertel also notes that the company is working with chip makers to try and get ‘boiling enhancements’ fitted to chips as they are designed, removing some of the older technology that was designed to support air and water cooling.

“Firms are building chips with integrated boiling enhancements, removing the integrated heat spreader that today would represent the primary thermal interface between the silicon and the heat sink or boiler plate. Replacing that directly with a boiling enhancement allows the chip to be submerged directly without any kind of heat sink,” Mertel concluded.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
