Double-precision CPUs vs. Single-precision GPUs; HPL vs. HPL-AI HPC Benchmarks; Traditional vs. AI Supercomputers

NERSC’s “Perlmutter” AI supercomputer

If you’ve wondered why GPUs are faster than CPUs, in part it’s because GPUs are asked to do less – or, to be more precise, to be less precise.

Next question: So if GPUs are faster than CPUs, why aren’t GPUs the mainstream, baseline processor used in HPC server clusters? Again, in part it gets back to precision: in many workload types, particularly traditional HPC workloads, GPUs aren’t precise enough.

Final question: So if GPUs and AI are inextricably linked, particularly for training machine learning models, and if GPUs are less precise than CPUs, does that mean AI is imprecise? Answer: no.

Let’s sort through this more closely. Helping us is Coury Turczyn, a science writer at Oak Ridge National Laboratory who recently wrote a blog post discussing the two most prominent benchmarks used for measuring supercomputer power: High Performance Linpack (HPL) and the newer HPL-AI. It’s the degree of precision called for by each benchmark that holds the answer to our speed-vs.-precision, GPU-vs.-CPU questions.

In short: HPL scores HPC systems on their double-precision (64-bit) capabilities, as carried out by traditional, CPU-based supercomputers, while HPL-AI benchmarking is based on the lower-precision (16- or 32-bit) arithmetic used in data science and typically executed on GPUs. As Turczyn stated: “These two methods of calculating arithmetic are used for different applications in computational science, and double precision is considered the ultimate standard.”
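
For a concrete sense of what those bit counts buy you, here is a minimal NumPy sketch (my illustration, not something from the benchmarks themselves) showing how the same value survives, or doesn’t, at 64-, 32-, and 16-bit floating point:

```python
import numpy as np

# What "precision" means in practice: the same number stored at double
# (64-bit), single (32-bit), and half (16-bit) precision.
x = 1.0 + 1e-10   # a value that needs roughly 10 significant digits

for dtype in (np.float64, np.float32, np.float16):
    eps = np.finfo(dtype).eps   # smallest relative gap the format can resolve
    print(f"{dtype.__name__:>8}: eps ~ {eps:.1e}, stores 1 + 1e-10 as {dtype(x)}")

# Illustrative result: float64 keeps the 1e-10 (eps ~ 2.2e-16), while
# float32 (eps ~ 1.2e-07) and float16 (eps ~ 9.8e-04) round it away to 1.0.
```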

Double-precision HPL goes back to 1979, when it was introduced by Jack Dongarra, director of the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville. It became the main ranking method used for the twice-yearly Top500 list of the world’s most powerful supercomputers. But as GPU-driven AI has emerged over the past 10-plus years from what appears to have been its last winter, Dongarra, along with ICL colleagues Piotr Luszczek and Azzam Haidar (now with Nvidia), developed the HPL-AI benchmark.

Jack Dongarra, University of Tennessee

“Historically, HPC workloads are benchmarked at double precision, representing the accuracy requirements in computational astrophysics, computational fluid dynamics, nuclear engineering, and quantum computing,” Dongarra said. “But within the past few years, hardware vendors have started designing special-purpose units for low-precision arithmetic in response to the machine learning community’s demand for high computing power in low-precision formats.”

This doesn’t mean AI model training is imprecise; it just means that training those models doesn’t require double-precision calculations, because single precision is perfectly adequate. “Unlike high-fidelity simulations, data-science applications such as artificial intelligences or neural networks don’t always require the ultimate in 64-bit precision to accomplish their tasks effectively,” Turczyn explained. “Consequently, GPU makers have been adding the ability to conduct lower precision calculations in their products, such as the Nvidia V100 Tensor Core GPUs … or the AMD Instinct GPUs… This can result in a big speed increase for those data-driven applications.”
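
As a toy illustration of that point (my sketch, not an example from Turczyn or Nvidia), the snippet below trains the same tiny logistic-regression model in 64-bit and 32-bit NumPy arithmetic; under these assumptions the two runs end up with essentially the same accuracy, which is why the much faster 32- and 16-bit paths on GPUs are usually good enough for training.

```python
import numpy as np

def train_logistic(X, y, dtype, steps=300, lr=0.1):
    """Plain gradient-descent logistic regression run entirely at one precision."""
    X, y = X.astype(dtype), y.astype(dtype)
    w = np.zeros(X.shape[1], dtype=dtype)
    for _ in range(steps):
        z = np.clip(X @ w, -30, 30)              # keep exp() well-behaved at low precision
        p = 1.0 / (1.0 + np.exp(-z))             # predicted probabilities
        w -= dtype(lr) * (X.T @ (p - y)) / dtype(len(y))   # gradient step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X @ rng.normal(size=10) > 0).astype(np.float64)        # synthetic labels

for dtype in (np.float64, np.float32):
    w = train_logistic(X, y, dtype)
    acc = np.mean((X.astype(dtype) @ w > 0) == y)
    print(f"{dtype.__name__}: training-set accuracy = {acc:.3f}")
# Typically both precisions land within a fraction of a percent of each other.
```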

As we said, it gets down to workload types.

“In general, when you do a simulation, you’re trying to represent the world—the locations of molecules or atoms or climate currents—in the most precise way that you can. So you want all 64 bits of double precision to represent a numeric value,” Mallikarjun Shankar, head of the Advanced Technologies Section in the National Center for Computational Sciences at Oak Ridge, told Turczyn. “Now, in the world of data science, and for certain classes of operations, you’re often classifying or categorizing quantities or operating on a smaller set of quantities where you don’t need all 64 bits to represent the quantity.”
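
A minimal sketch of the contrast Shankar describes (my construction, with made-up numbers): a long-running accumulation, the kind of update a simulation time-stepper repeats millions of times, drifts visibly at 32 bits, while a classification decision, an argmax over scores, is unchanged even when the scores are rounded all the way down to 16 bits.

```python
import numpy as np

# Simulation-style accumulation: add 0.1 a million times.
total64, total32 = np.float64(0.0), np.float32(0.0)
inc64, inc32 = np.float64(0.1), np.float32(0.1)
for _ in range(1_000_000):
    total64 += inc64
    total32 += inc32
print(total64, float(total32))
# float64 stays at ~100000.0; float32 typically drifts off by hundreds (~1%),
# because round-off piles up as the running total grows.

# Data-science-style decision: the chosen class survives aggressive rounding.
scores = np.array([2.31, -0.52, 1.87, 0.05])
print(np.argmax(scores), np.argmax(scores.astype(np.float16)))   # same index either way
```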

Frontier

The HPL-AI benchmark is a significant addition to supercomputing assessment because it measures mixed-precision performance on HPC systems that incorporate both CPUs and GPUs, an architecture that is increasingly prevalent in advanced computing today. Not only do these systems need the CPU’s 64-bit double-precision capabilities for some workloads, they also rely on CPUs to run core system functions and the operating system. This is why CPUs are still used even in the emerging “AI supercomputer” category, systems that can contain thousands of GPUs.
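
“Mixed precision” also describes how such systems are used in practice: do the expensive arithmetic at reduced precision, where the GPUs are fastest, then recover full accuracy with a comparatively cheap double-precision clean-up. Below is a minimal NumPy sketch of that idea using classic iterative refinement; it is my illustration of the principle, not the actual HPL-AI benchmark code, which works at even lower precision and uses a more sophisticated refinement scheme.

```python
import numpy as np

def mixed_precision_solve(A, b, refinements=5):
    """Solve Ax = b doing the heavy lifting in float32, then refine in float64."""
    A32 = A.astype(np.float32)
    # Low-precision "heavy lifting" (a real implementation would reuse the factorization).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(refinements):
        r = b - A @ x                                     # residual computed in float64
        dx = np.linalg.solve(A32, r.astype(np.float32))   # correction, again in float32
        x += dx
    return x

rng = np.random.default_rng(1)
n = 500
A = rng.normal(size=(n, n)) + n * np.eye(n)   # a deliberately well-conditioned test matrix
x_true = rng.normal(size=n)
b = A @ x_true

x_single = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))
x_mixed = mixed_precision_solve(A, b)
print("float32-only error:   ", np.linalg.norm(x_single - x_true))
print("mixed-precision error:", np.linalg.norm(x_mixed - x_true))
# The refined answer is typically close to what a full float64 solve would give.
```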

A leading example of an AI supercomputer is the U.S. National Energy Research Scientific Computing Center’s (NERSC) Perlmutter supercomputer (see “6,000 GPUs: Perlmutter to Deliver 4 Exaflops, Top Spot in AI Supercomputing”), powered by 6,159 GPUs. In the first phase of Perlmutter’s installation, each GPU-accelerated node has four Nvidia A100 Tensor Core GPUs with a single AMD “Milan” CPU. In the second phase, each CPU node will have two AMD Milan CPUs. Perlmutter will deliver 4 exaflops of mixed-precision performance. Said Dion Harris, Nvidia senior product marketing manager, at the system unveiling in May: “That makes Perlmutter the fastest system on the planet on the 16- and 32-bit mixed-precision math AI uses.”
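
That 4-exaflop figure is straightforward to sanity-check from published peak numbers (my arithmetic, not NERSC’s): Nvidia lists the A100’s FP16 tensor-core peak at 312 TFLOPS, or 624 TFLOPS with structured sparsity, and 6,159 of them multiply out to roughly 3.8 theoretical exaflops.

```python
# Rough peak-throughput check (theoretical; real application throughput is lower).
gpus = 6159
tflops_per_gpu = 624            # A100 FP16 tensor-core peak with structured sparsity
print(gpus * tflops_per_gpu / 1e6, "exaflops")   # -> ~3.84
```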

Fugaku supercomputer at Riken

Given its 4 exaflops of mixed-precision performance, can we then say Perlmutter is an exascale-class system? No, because the system does not deliver exascale throughput on the double-precision HPL benchmark. Nor, for that matter, does the current No. 1 system on the Top500 list, Fugaku, jointly developed by Japan’s RIKEN scientific institute and Fujitsu, which, though an Arm CPU-based system, blows past the exascale milestone on HPL-AI.

The first double-precision HPL-benchmarked exascale supercomputer is expected to be Frontier, powered by AMD GPUs and CPUs and scheduled to be installed later this year at Oak Ridge. Assuming Frontier exceeds HPL-benchmarked exascale, we can only wonder what its HPL-AI result will be.

Having said that, Dongarra told Turczyn he doesn’t see HPL and HPL-AI as an either-or situation. “Dongarra said he doesn’t expect HPL-AI to supplant HPL but rather serve as a complement,” wrote Turczyn, “bridging the gap in evaluating mixed-precision performance as the technique gains more traction in computational science.”