Zero to an ExaFLOP in Under a Second


Matthew Ziegler, Lenovo

Much like racing cars, servers run “time trials” to gauge their performance on a given workload. There are more SPEC and web benchmarks out there for servers than there are racetracks and drag strips. Perhaps the most important measure is the raw calculating throughput a system delivers: FLOPS, or Floating-Point Operations Per Second. It is the miles-per-hour of system measurements: simple, effective, and the one everybody knows. In the past this metric was solely a measure of CPU performance, but today GPUs act like turbo-boosters for system performance.

Once the petascale (10^15 FLOPS) hurdle was finally crossed back in 2008, a lot of attention and excitement turned toward guessing when and how the exascale (10^18 FLOPS) barrier would be shattered. At the time, the enormous power, cooling, datacenter floor space, overall cost, and component fallout involved posed significant roadblocks. Major advancements in technology would therefore be needed across the board, from processors to memory to interconnects to system design itself.
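
To put that jump in perspective, here is a quick back-of-the-envelope sketch; the per-device rate is an assumption chosen purely for illustration, not a reference to any specific product.

```python
# Illustrative arithmetic only: the scale of the exascale barrier, and how
# many accelerators it would take to cross it at an assumed sustained rate.
petaflop = 1e15                 # 10**15 FLOPS, the 2008 milestone
exaflop = 1e18                  # 10**18 FLOPS, the exascale barrier

flops_per_device = 20e12        # assumed ~20 TFLOPS sustained per accelerator
devices_needed = exaflop / flops_per_device

print(f"Exascale is {exaflop / petaflop:,.0f}x petascale")          # 1,000x
print(f"Devices needed at 20 TFLOPS each: {devices_needed:,.0f}")   # 50,000
```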

Concurrent with this, the emergence of Artificial Intelligence (AI) in recent years has accelerated the requirement for systems capable of handling massive real-time data ingestion, with considerable computing power to create neural networks for deep learning. But what technology exists that could process massive amounts of data and do it quickly? Enter the co-processor, the most common being the Graphics Processing Unit (GPU). GPUs got their start in gaming, offloading graphical output from the CPU, but later migrated into high-performance computing (HPC) data centers as a way of handling huge scientific datasets. GPUs were then found to be ideally suited to the lower-precision matrix multiplication used to create neural networks (efficient), whereas CPUs are designed for the broad ability to handle many different kinds of operations (flexible). It’s no coincidence that as AI has exploded, so has the GPU market, both in capabilities and in volume.
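
To make the efficient-versus-flexible point concrete, here is a minimal sketch of the same matrix multiplication run in general-purpose FP32 on a CPU and in FP16 on a GPU, the low-precision path that Tensor Cores accelerate. It assumes PyTorch and a CUDA-capable GPU are available; the sizes and precisions are illustrative only.

```python
# Minimal sketch: one matrix multiplication, two execution styles.
# Assumes PyTorch is installed; the GPU path runs only if CUDA is present.
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# Flexible: a general-purpose FP32 multiply on the CPU.
c_cpu = a @ b

# Efficient: the same multiply in half precision (FP16) on the GPU,
# the kind of low-precision math Tensor Cores are built to accelerate.
if torch.cuda.is_available():
    c_gpu = a.half().cuda() @ b.half().cuda()
```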

A traditional datacenter server is designed around the CPU and motherboard, and the peripherals are often afterthoughts. A GPU is considered a peripheral device and was never designed to be densely packed; one or two GPUs per server were sufficient. Peripheral devices were also designed to install directly onto the motherboard itself, in hard-wired PCIe slots. As demand for AI skyrocketed, so too did the demand for systems that could house four and even eight GPUs. The question quickly shifted from “What GPU does your system support?” to “How many GPUs does your system support?”

The shift toward volume packaging of GPUs presented significant design challenges. GPUs require a lot of power and are consequently difficult to cool.

This GPU density drives a 1U server to over 1 kW, with 2U servers breaching 3 kW of power consumption. This massive increase in power delivery and heat concentration required new thinking about how servers can and should be designed.
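
A rough power budget shows why the totals climb so quickly. The per-component wattages below are assumptions chosen for illustration, not the specifications of any particular server.

```python
# Back-of-the-envelope power estimate for a dense GPU server.
# All wattages are illustrative assumptions, not vendor specifications.
gpus, watts_per_gpu = 4, 400      # four high-end accelerators
cpus, watts_per_cpu = 2, 270      # two server-class processors
overhead_watts = 400              # memory, storage, NICs, fans, conversion losses

total_watts = gpus * watts_per_gpu + cpus * watts_per_cpu + overhead_watts
print(f"Estimated server power: {total_watts / 1000:.1f} kW")  # ~2.5 kW
```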

At Lenovo, we decided early on to use a two-pronged approach. The first was to leverage the longevity and deep skills in liquid cooling that made our Lenovo Neptune™ brand an industry leader. Given the thermal design points and power requirements, the heat density generated would require some form of liquid cooling to maintain common form factors. Second, we needed to innovate on how we design a system capable of housing multiple types of co-processors (GPUs, FPGAs, ASICs) in a dense package, regardless of technology or provider.

The first step was to extend our Neptune™ Direct-to-Node (DTN) liquid cooling to the GPUs. The layout was easy enough, given that our base ThinkSystem SD650 is two separate two-socket nodes on a single 1U tray. Working with our partner NVIDIA, we simply replaced one of the CPU nodes with four high-powered, board-mounted HGX A100 GPUs. The ThinkSystem SD650-N V2 uses our copper loop design to circulate warm water throughout the system, removing 95% of the heat from the CPUs (3rd Generation Intel Xeon Scalable processors), the GPUs, memory, storage, and PCIe. The system is an ideal candidate to accelerate HPC workloads through the addition of Tensor Cores alongside traditional CUDA cores, pushing performance to nearly 2 petaflops per rack.
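
As a rough sanity check on the per-rack figure, the arithmetic below uses an assumed peak FP64 Tensor Core rate per GPU and an assumed rack density; both numbers are illustrative, not measured results.

```python
# Back-of-the-envelope rack throughput. All inputs are illustrative
# assumptions (per-GPU peak rate, GPUs per tray, trays per rack).
tflops_per_gpu = 19.5      # assumed peak FP64 Tensor Core rate per GPU
gpus_per_tray = 4          # one GPU node per 1U tray in this sketch
trays_per_rack = 24        # assumed rack density

rack_pflops = tflops_per_gpu * gpus_per_tray * trays_per_rack / 1000
print(f"Approximate peak: {rack_pflops:.1f} petaflops per rack")  # ~1.9
```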

Now, we realize that not everybody can, or wants to, run plumbing into their data center. So, we set out to design a system for core AI workloads in the data center that uses liquid to augment air cooling, without running pipes anywhere. The ThinkSystem SR670 V2 features a new 3U chassis design and includes a model with four board-mounted A100s and a unique Neptune™ liquid-to-air heat exchanger: an internal liquid loop pulls heat away from the GPUs and carries it to a chassis-based heat exchanger, where it is dissipated by large, high-efficiency fans. This reduces power consumption and noise by lowering the fan speed required for cooling.

The ThinkSystem SR670 V2 also has models that house up to eight NVIDIA A100s (for machine or deep learning) or eight T4s (for inference or VDI). These are standard plug-in-card form-factor GPUs. We moved away from the traditional peripheral-on-motherboard approach (hard-wired slots) and instead created a separate compartment that uses flexible cables to connect the co-processors to the motherboard. By putting co-processors in their own dedicated compartment, we can support an incredibly wide variety of co-processors as they come to market and focus our cooling on dissipating heat from a single concentrated area rather than from random spots across the entire system.

As we move into the exascale era, customers will face a choice between air cooling, maintaining their traditional system footprint, and running the most powerful platforms: they will be able to choose two of the three. At Lenovo, we are working on designs that will allow the power of exascale-level computing to cascade down to users of any size. We call this effort “From Exascale to Everyscale.” It has already changed how we design systems, and we believe it will change data centers for the next generation.