In this special guest feature, Tom Wilkie from Scientific Computing World marvels at the ingenuity of the engineers who are devising so many ways to keep supercomputers cool.
Cooling has become interesting – almost, in a manner of speaking, it’s cool. Hardly a week goes by now without some announcement of a specially engineered cooling system being installed in a data centre or HPC site.
At the end of June, for example, the Centre for Biological Sequence Analysis (CBS) at the Technical University of Denmark, whose HPC system is ranked 121 on the Top500, replaced traditional air cooling with liquid cooling from CoolIT Systems to provide heat not just for the adjacent buildings but also for the nearby town of Roskilde. Cray’s recently announced machine for the UK Met Office, the company’s largest sale outside the USA, will be cooled by Motivair’s ChilledDoor heat exchanger system, which fits onto the rear doors of the racks.
The range of cooling options now available is testimony to engineering ingenuity. HPC centres can choose among air, oil, dielectric fluid, and water as the heat-transfer medium. Opting for something other than air means that single- or two-phase flow becomes available, opening up the possibility of convective or evaporative cooling and thus saving the cost of pumping the fluid round the system.
Some systems were originally developed for computer gamers, to keep overclocked processors cool as they drove the game faster. Some trace their antecedents to the electrical power engineering industry. The engineering design of others has been driven by the exacting demands of HPC, but their developers hope this may open the way to an even larger market in commercial data centers for the likes of Facebook and Google.
One question: many answers
All these engineering solutions exist because there is no single, right answer to the question: “What is the best way to cool a supercomputer?” Due to differences in the price of energy, the answer will be different in Germany compared to the USA. Sometimes the answer depends on the temperature of the inlet water to the facility – if chillers have to be employed, then that will drive up the overall cost – so the answer may well be different in California compared to Sweden.
Sometimes the answer depends on another, entirely non-technical question: “Who is paying?” Often the budget for the facility – the cost of the building and its associated plant and machinery – is separate from the budget allocated to pay for the IT itself. Sometimes, capital costs may come from a different ‘pot’ than operating expenditure. Or there may be a need to install a new machine in the building that housed its predecessor, limiting the options that can be deployed because of the infrastructure that already exists.
Pumping without pumps
One of the most technologically imaginative and elegant solutions for taking heat away from the processor is that from the Belgian company Calyos. Although it was set up in 2011, its technology derives from applications developed for the power electronics industry since the 1970s. To circulate the dielectric coolant without the need for any pumps, it makes use of two principles from physics: capillary attraction and two-phase flow. A highly engineered heatsink or cold-plate sits in contact with the processor. It is made from a metal powder that has been compressed and sintered to take the form of a porous metallic foam, so it is honeycombed internally with tiny capillary ‘tubes’ through which the fluid is driven by capillary attraction. As it absorbs heat from the processor, the fluid then changes phase, and the vapor passes along a pipe to the heat exchanger at the edge of the server where the fluid condenses and the whole cycle continues.
One of the critical parameters in cooling technology is the heat transfer coefficient. For air, this parameter typically has a value between 10 and 100 W/(m²K), whereas water is more effective, with a coefficient of 500 to 10,000 W/(m²K). According to Maxime Vuckovic, sales and marketing manager for Calyos: “We can fine-tune the porous media to go from 20,000 to 70,000 W/(m²K). It is not related to the design: it is the way we are able to transform the powder that gives our product so much efficiency.” The Calyos proposition, he insisted, was not just an elegant design concept, but expertise in the manufacturing process as well.
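To put those coefficients in perspective, here is a minimal sketch applying Newton’s law of cooling, q = h × ΔT, to the ranges quoted above. The 40K chip-to-coolant temperature difference is an assumed, illustrative figure, not one supplied by Calyos.

```python
# Illustrative only: Newton's law of cooling, q = h * dT, applied to the
# coefficient ranges quoted in the article. The 40 K temperature difference
# between chip surface and coolant is an assumed figure.

heat_transfer_coeffs = {            # W/(m^2.K), mid-range values
    "air": 50,                      # quoted range: 10 - 100
    "water": 5_000,                 # quoted range: 500 - 10,000
    "Calyos porous media": 50_000,  # quoted range: 20,000 - 70,000
}

delta_t = 40.0  # K, assumed chip-to-coolant temperature difference

for medium, h in heat_transfer_coeffs.items():
    flux = h * delta_t  # heat flux in W per square metre of contact area
    print(f"{medium:>22}: {flux / 1000:8.0f} kW/m^2")
```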
Passive cooling through a closed loop cuts out the need to provide energy to pump the working fluid, but it is also ‘naturally automatic’, so there is no need for control equipment; and, because it is a completely sealed unit with no moving parts, there is no maintenance requirement either. Vuckovic pointed out that because the working fluid is a dielectric: ‘Even if someone breaks a tube, you don’t have any damage to the electronics. It is a metal pipe, which means it is more expensive than plastic, but it is more secure.’
The Calyos system takes the heat to the edge of the server, and it now has three systems for rejecting heat to the environment. Its first product was a bespoke water heat exchanger, but at ISC High Performance in Frankfurt, it will be demonstrating an air heat exchanger. This liquid-enhanced air cooling system represents an evolution in Calyos’s thinking about the role that its technology can play.
Originally, its focus was on highly customized, high-end liquid-to-liquid systems and, while it is continuing with this product line, it sees a potentially bigger market for an intermediary between pure air and pure liquid cooling systems.
Air cooling works efficiently on processors whose power rating peaks at about 130W or 150W, and it is probably too expensive to retrofit a data centre for liquid-liquid cooling at this level, Vuckovic explained. But above 200W, liquid cooling is needed. In between, air cooling will result in hot spots as some processors are not cooled as well as others. ‘In the middle you have some sort of a gap. To avoid the piping, you want to use air at the rack level but liquid inside the server only. You might be willing to use an intermediary solution. So we can handle high-power loads without the hot spots.’ Whereas the fluid-to-water system was customized to customers’ requirements, the liquid-enhanced air cooling system will be a standardized, commoditized product.
In this video, Olivier de Laet from Calyos describes the company’s innovative cooling technology.
Game on
Asetek and CoolIT Systems also offer closed loop systems, this time using water to cool the processors. The antecedents of both companies are in gaming, in providing a way to overclock processors and reject the heat to the air at the back of the PC.
As a result of its origins, CoolIT has shipped more than two million units of its cooling system, mainly for PCs, according to Geoff Lyon, the company’s CEO and CTO, and it expects to ship between 250,000 and 300,000 this year alone. This background allows the company, based in Calgary, Canada, to benefit from economies of scale and also demonstrates the reliability of the technology.
Lyon pointed out that no-one in HPC had ever disputed that liquid cooling was efficient, but he feels it is now becoming more widely accepted. Part of the reason is that early adopters in HPC now have some years’ experience of the technology. More than 30 sites worldwide have installed CoolIT’s technology, he continued: ‘The number of our installations has more than doubled this year. We’re being run off our feet.’ He expects the HPC and datacentre markets to outstrip the desktop in importance. The gaming market is highly competitive, he remarked, and margins are low.
He also said that vendor motivation is now apparent – IBM has had liquid cooling for some years, and the HP Apollo line is unique. Systems integrators are starting to specify liquid cooling in their requests for proposals – especially in Europe, as a result of the price of electricity. But the advent of Big Data is a game changer, he believes: “On the surface, HPC is fast growing, but Big Data combined with HPC is a fantastic growth opportunity.”
Some of Lyon’s analysis of customer requirements mirrors that of Calyos’s Vuckovic: for instance, the need for an intermediary solution for those facilities that could not, or did not think it cost-effective to, install an entirely water-cooled system. ‘Some people are in too big a hurry to wait for a facility water supply to be plumbed in,’ Lyon said. He also noted that ‘coordinating the facilities budget and the budget for IT was not always smooth’ and that the administrative split between them often determined what cooling solution was put in place.
The company developed one product line, the AHx, for its own test laboratory. It uses the company’s Direct Contact Liquid Cooling to dissipate heat to the surrounding environment via a liquid-to-air heat exchanger at the top of the rack. It comes in two varieties that can manage 20kW or 35kW of processor load respectively, without the need for facility water.
However, he continued, almost without exception the ambition of their customers is to upgrade to a liquid-liquid system. The company therefore offers the CHx40, which provides water-to-water heat exchange capable of absorbing up to 40kW from one rack, and it also offers a networked system, the CHx650, which distributes clean, treated coolant to and from many IT cabinets at once, accepting warm facility water at the inlet and managing 650kW of processor load per network.
In this video from ISC 2015, Patrick McGinn from CoolIT Systems describes the company’s innovative liquid cooling technology for HPC clusters.
Cheap to build and to run
For Steve Branton, senior director of marketing at Asetek, it is not just the inlet temperature that is important but the outlet temperature as well. The outlet temperature of 55C means that heat can be recovered for heating the building. He cited the case of the University of Tromso in Norway where, at the Arctic Circle, there is a demand for building heating all year round: ‘so they get double use out of their electricity by capturing the heat’. In a system installed at the US National Renewable Energy Laboratory (NREL) in Colorado in 2013, the heat is used for snow clearance and building heating in the winter.
Where the heat is not being reused within the building, the Asetek system’s use of warm input water, at 45C, means that the water can be cooled back to this temperature by dry cooling almost anywhere in the world, with evaporative cooling to the environment on demand if need be. Dry cooling cuts in half the cost of heat rejection to the environment, he said. There is a further energy reduction of about 5 to 10 per cent, depending on the configuration, because ‘we can spin the fans in the servers more slowly.’
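Part of the reason slower fans save energy is the fan affinity law, under which a fan’s power draw scales roughly with the cube of its speed. The sketch below illustrates that relationship; the baseline fan power and the speed reductions shown are assumptions for illustration, not Asetek figures.

```python
# Illustrative sketch of the fan affinity law: fan power scales roughly
# with the cube of fan speed. The 10 W baseline per fan and the speed
# reductions shown are assumptions for illustration, not Asetek data.

baseline_fan_power_w = 10.0  # assumed draw of one server fan at full speed

for speed_fraction in (1.0, 0.9, 0.8, 0.7):
    power = baseline_fan_power_w * speed_fraction ** 3
    saving = 1 - power / baseline_fan_power_w
    print(f"fan at {speed_fraction:.0%} speed: {power:4.1f} W "
          f"({saving:.0%} less fan power)")
```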
But Branton stressed that switching to the Asetek system produces savings on capital as well as operating expenditures. “Mississippi State University and the US Sandia National Laboratories chose to use our system partly because of the energy efficiency,” he said, “but also because they needed to expand the capacity of their cooling system to put in new compute. The cost of putting in dry cooling was substantially lower than putting in chillers, so they were able to buy more compute. Which is of course what matters most to the scientists – they want to get more compute for their dollar.”
Asetek is a European company; it was founded, and still has its headquarters, in Denmark. It too started in gaming, and has sold millions of units, and it too sees a bigger market now in server-based systems: ‘We have over 3,000 servers in the field cooled by our technology – all production systems not test systems,’ Branton said.
It’s very competitive: Asetek and CoolIT Systems have recently been involved in patent litigation that was resolved only in June this year, with the US District Court in San Jose, California, deciding that CoolIT Systems should pay $1,873,168 in damages to Asetek.
Asetek’s system consists of a cold plate, incorporating a water pump, that sits on top of the processor. Water flows from the cold plate to a rack-level CDU (Cooling Distribution Unit), mounted at the back of the rack in a 10.5-inch extension, which houses a liquid-to-liquid heat exchanger to transfer heat between the facility’s water and the server water. Because the cold plate is smaller, it is possible to incorporate the pump in a unit that is no bigger than the original air-cooling fins.
But direct to chip cooling removes only the heat generated by the processor itself, so Asetek has refined its range to include a hybrid liquid and air system, ISAC, that removes 100 per cent of the heat generated within a server, according to Branton. In addition to the direct cooling of the processor, ISAC also has In-Server Air Conditioners which cool the remaining, low heat-flux components without exchanging air between servers and the data centre. All the air in the server is sealed inside, and recirculates rather than mixing with the air in the data centre. The heat is extracted from the server air by a heat exchanger to be taken away by the facility’s system, along with the heat from the directly cooled processors. Again, the system is housed in a rack extension.
In this video from ISC 2015, Steve Branton from Asetek describes a series of high profile supercomputing upgrades that show the growing momentum of Asetek liquid cooling in the HPC market.
Convective cooling
Although the engineering of Asetek’s sealed server system is very different, a similar principle is being developed by the UK-based company Iceotope. Iceotope too relies on a completely sealed server to remove all the heat, but in its case the server is totally immersed in a liquid – 3M’s Novec fluid, which is inert and, in fact, often used as a fire suppressant. Iceotope provides blades that have been designed to act as convective cells, so that natural convective flow moves the fluid, and therefore the heat, away from the electronics. Although this is a single-phase operation, the convection means that this stage requires no external pumps, just as Calyos’s two-phase system eliminates pumps. However, the Iceotope system interposes a secondary loop, using a different coolant, that is pumped to heat exchangers for reuse to heat the building or rejection to the environment.
Iceotope started out with its eyes on the high-performance computing market, introducing its Petagen system at SC14 in New Orleans, but now, Peter Hopton, founder and CVO, said: ‘I can see other markets taking an interest in our technology. In terms of cost effectiveness, our technology has a place in the Web Giant/Cloud space.’ He pointed out that, although the likes of Facebook and Google have recently stressed their green credentials by putting data centres in climatically cold areas, the reality was that they needed to site their centres ‘within 50ms of latency from their customers’ – within or close to city centres.
“One of the reasons that market is good for us is that they look at the cost of everything holistically. Over the lifetime of an installation, a third of the cost is going to be the IT; a third the facility; and a third the electricity, depending on where you are. We have a slight uplift in the IT cost, but the other two are slashed in half,” he said. In HPC, in contrast, the costs are siloed, he continued: “There will be different budget holders for IT, facility, and electricity, and that can be a barrier to the adoption of more energy efficient technologies.”
Asetek’s Steve Branton had a slightly different analysis of the web server market: “What we see is the HPC market being the leader in adopting new energy efficiency technology, both because as a whole that market tends to embrace new technology more readily, and because the problem of power draw is more acute in an HPC centre. HPC tends to have very high utilisation, whereas in a Facebook or a web service organisation what you are worried about is handling the peaks. So a lot of the time you have idle capacity, which means the CPUs aren’t working that hard. We’re still able to cool the CPUs in those cases, but the percentage drops when the servers are running idle.” He sees the Asetek system as more useful for people doing HPC or highly virtualised systems: “Those are the two markets we are focused on.”
Rugged cool computers
Asetek’s ISAC system, which combines on-chip cooling with In-Server Air Conditioners, aims to remove close to 100 per cent of the heat from the server. Iceotope’s total fluid immersion system offers a highly engineered route to the same end. Interestingly, another company that was originally interested in the gaming market also offers an immersion solution to remove heat.
LiquidCool Solutions was originally called ‘Hardcore Computing’ but had to change its name in part, joked CEO Herb Zien, to get its emails and website past firewalls. But it also changed its business model, so that it now licenses its dielectric fluid cooling technology rather than trying to become a hardware provider in its own right. In addition to HPC, it sees markets in ruggedised computers for harsh environments – oil and gas, for example, or military applications where the silent running consequent on the absence of fans is a significant advantage.
“I don’t know why it has taken the world so long to come around to liquid being the right way to do this, irrespective of the power density,” Zien said. “Air never made sense. The problem has been how to do it practically and cost efficiently – and that we have solved.”
For Zien, the value proposition was in getting rid of fans. ‘Fans are the horses of the digital age,’ he said. They are very inefficient; they take a lot of space; and by circulating air over the electronics they can create oxidation problems and bring pollutants to the electronics, in his view.
LiquidCool Solutions uses a standard commercial fluid, which is therefore not expensive and is in fact very similar to the fluid used to cool transformers in the power transmission industry. The directed-flow technology brings the ‘cool’ fluid in at the bottom of the server blade – it can be as hot as 45C – directs it to a heat sink on top of the processors, and then routes it out of the heat sink so that it picks up the rest of the heat generated by the server. The result is a dramatic reduction in thermal fluctuations within the server.
The temperature difference between inlet and outlet is around 8C to 10C. ‘It’s warm enough to heat hot water, so it’s easy to recover all of the energy and use it,’ Zien said. Flow rates are a litre per minute, so relatively slow. There are no moving parts inside the server blade; the fluid is pushed round from a central pumping station. Because there is no vibration from fans, there is no fretting corrosion. Total power to cool is reduced by 98 per cent compared to air conditioning, he said.
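As a rough cross-check of those figures, the sketch below estimates the heat a litre-per-minute flow can carry away across an 8C to 10C rise, using Q = ρ·V̇·c_p·ΔT. The density and specific heat are assumed values typical of transformer-style mineral oil, not published properties of LiquidCool’s coolant.

```python
# Rough check of the heat carried away at the quoted flow rate and
# temperature rise, using Q = rho * V_dot * c_p * dT. The density and
# specific heat are assumed values typical of transformer-style mineral
# oil, not published figures for LiquidCool's fluid.

rho = 880.0     # kg/m^3, assumed coolant density
c_p = 1900.0    # J/(kg.K), assumed specific heat capacity
flow_lpm = 1.0  # litres per minute, as quoted in the article
delta_t = 9.0   # K, midpoint of the quoted 8-10 C rise

v_dot = flow_lpm / 1000 / 60           # volumetric flow in m^3/s
q_watts = rho * v_dot * c_p * delta_t  # heat carried away per blade

print(f"~{q_watts:.0f} W removed per blade at {flow_lpm} L/min")
```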
According to Zien, immersive liquid cooling makes sense even for installations with relatively low power consumption. Because the system is completely sealed and noiseless, it opens up wider markets, including baggage scanners at airports and computers in sterile areas of hospitals. Together with the integrator Dedicated Computing, LiquidCool has produced the Explorer 8 for oil and gas exploration, a ruggedized portable supercomputer equivalent to a 42U rack in a typical data centre.
“One of our biggest entry points is modular data centers, because there the infrastructure doesn’t exist yet so we can save a lot.” He cited one version that can be put inside a C120 transport plane and go anywhere in the world, “providing 200kW of power/computing capacity – pretty powerful for an emergency computing situation.”
Air has not lost its puff
LiquidCool does not immerse hard drives. Although it is possible, and the company has a patent on how to do it, in general they sit in a dry area that is not perfused with coolant. In this respect, LiquidCool is similar to most of the cooling solutions here, in that neither the power supplies nor the hard drives are directly cooled.
According to Rich Whitmore of Motivair: “We design our product to remove 100 per cent of the heat from the rack.” The new £97 million Cray supercomputer being installed at the UK Met Office will use Motivair’s ChilledDoor product to remove the heat, which can reach up to 45kW per server rack.
Whitmore pointed out that: “As efficient as on-chip cooling can be, you are left with a not-insignificant amount of heat that gets rejected to the space and then needs to be cooled by traditional air conditioning. When you have these large clusters coming out now hitting 45kW per rack, if you take 30 per cent or more of that heat and put it into the room, then it is a significant challenge to air conditioners. When we put the ChilledDoor on, we are removing 100 per cent of the heat. We create this ‘heat neutral’ environment, so that heat never leaves the server rack.” The rack cooling system is an active, rear-door heat exchanger that is mounted directly at the back of a standard server rack. It takes the heat out of the air and is capable of removing up to 75kW per rack, using cool but not chilled water.
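As a rough illustration of what those rack loads mean on the water side, the sketch below estimates the flow a rear-door heat exchanger would need, from ṁ = Q / (c_p·ΔT). The 10K water temperature rise across the door is an assumed figure, not a Motivair specification.

```python
# Back-of-envelope estimate of the water flow a rear-door heat exchanger
# needs to absorb a given rack load, from m_dot = Q / (c_p * dT). The
# 10 K water temperature rise across the door is an assumed figure, not
# a Motivair specification.

c_p_water = 4186.0  # J/(kg.K), specific heat of water
delta_t = 10.0      # K, assumed temperature rise across the door

for rack_load_kw in (45, 75):
    m_dot = rack_load_kw * 1000 / (c_p_water * delta_t)  # kg/s
    litres_per_min = m_dot * 60                          # ~1 kg per litre
    print(f"{rack_load_kw} kW rack: ~{litres_per_min:.0f} L/min of water")
```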
The technology solves two problems that, Whitmore believes, tend to be underestimated in the HPC community: the doors are ‘rack agnostic’, and they are scalable. If a university buys a cluster and decides to do on-chip cooling then, when that cluster gets refreshed, the cooling system leaves with it and the university is faced with another cooling dilemma, he pointed out. Since Motivair’s door is part of the rack system, ‘if they refresh and go with a different brand, the same cooling system can manage different server manufacturers. It’s not just rack agnostic, it’s OEM computer agnostic. That often gets overlooked and people are going to find this out if they go to on-chip cooling,’ he said.
“The scale that is coming down the line is really quite remarkable. We see some of these chips that are coming out – some of these manufacturers can provide chips where you change the chip out and add 30 per cent to the compute power, just by changing the chip. But it creates a significant challenge to the facility – and unlimited possibilities to us – on the cooling side, because somebody has to remove that heat. We see scale as a real challenge, which is why the products we are working on are designed to scale with the clients. We see great opportunity – the heat is incredible.”
Motivair’s technology does use inlet water that is colder than the other systems’, but Whitmore points out that it still saves energy compared to normal facility air conditioning, which uses 7C water: ‘We’re up at 15 to 17C. In the UK, the difference between standard air conditioning plant and our cool water is 30 per cent in energy efficiency.’ Where free cooling is feasible, it is 70 per cent more efficient than a standard air conditioning system, he said. The temperature rise across the door is such that the outlet heat can be reused within the building.
He concluded: “What we drive home to the market is you must be prepared for scale and you must be able to remove 100 per cent of the heat. If you can’t do that, you are not really solving the problem of cooling.”
In this video from SC13, Rich Whitmore from Motivair describes the company’s active cooling door system that can be fitted to the rear of any new or existing server rack.
This story appears here as part of a cross-publishing agreement with Scientific Computing World.