A Look Inside the AMD-HPE Blade that Drives Frontier, the World’s First Exascale Supercomputer

Frontier supercomputer

[SPONSORED CONTENT]   The new number 1 supercomputer in the world, the AMD-powered and HPE-built Frontier, is celebrated today, Exascale Day, as the world’s first exascale (a billion billion calculations per second) HPC system. Recognized at last spring’s ISC conference in Hamburg for having exceeded the exaFLOPS barrier, Frontier drew crowds on the conference floor, where a display of its blade in HPE’s booth was a focus of attention.

We thought it would be interesting to sit down with two senior officials from AMD and HPE to talk about the Frontier blade, what’s in it, its design innovations and the anticipated, long-term impacts of the blade on leadership supercomputing and on systems used by the broader HPC industry.

We interviewed Mike Schulte, senior fellow design engineer at AMD, and Bill Mannel, vice president and general manager at HPE, both of whom played critical roles in the multi-year Frontier development effort.


insideHPC: If you were giving a tour of the blade to someone who has knowledge of HPC but is new to Frontier, what would be the first, most impressive thing you want to show them?

Mannel: I’d want to make them aware of the engineering that went into the blade. The design is impressive. Each blade has two independent compute nodes, and each node contains one optimized 3rd Gen AMD EPYC processor and four AMD Instinct™ MI250X accelerators, along with memory. Each of the four AMD GPUs has its own dedicated 200Gb/sec Slingshot NIC, giving a total of 800Gb/sec of fabric bandwidth per node. And all of this is packaged in a 1U blade, which by itself is significant. It’s also completely liquid cooled, so there are no fans.
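
For readers who want to see how those figures add up, here is a quick back-of-the-envelope sketch; the constants come straight from the numbers Mannel quotes above, and nothing here is production HPE or AMD code.

```cpp
// Back-of-the-envelope sketch of the Frontier blade topology described above.
#include <cstdio>

int main() {
    const int nodes_per_blade = 2;    // two independent compute nodes per 1U blade
    const int cpus_per_node   = 1;    // one optimized 3rd Gen AMD EPYC CPU per node
    const int gpus_per_node   = 4;    // four AMD Instinct MI250X accelerators per node
    const int nics_per_gpu    = 1;    // one dedicated Slingshot NIC per GPU
    const int nic_gbps        = 200;  // 200 Gb/s per NIC

    const int fabric_gbps_per_node  = gpus_per_node * nics_per_gpu * nic_gbps;  // 800 Gb/s
    const int fabric_gbps_per_blade = nodes_per_blade * fabric_gbps_per_node;   // 1,600 Gb/s
    const int gpus_per_blade        = nodes_per_blade * gpus_per_node;          // 8

    printf("Per node: %d CPU, %d GPUs, %d NICs\n",
           cpus_per_node, gpus_per_node, gpus_per_node * nics_per_gpu);
    printf("GPUs per blade:             %d\n", gpus_per_blade);
    printf("Fabric bandwidth per node:  %d Gb/s\n", fabric_gbps_per_node);
    printf("Fabric bandwidth per blade: %d Gb/s\n", fabric_gbps_per_blade);
    return 0;
}
```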

Schulte: If I had to pick just one thing to highlight about the blade, it would be the AMD Instinct MI250X accelerators. Each blade has eight of them, and each one provides 47.9 teraFLOPS of peak double-precision performance and more than 3.2 terabytes per second of peak memory bandwidth. That’s more than four times the double-precision performance and more than 2.5 times the memory bandwidth of AMD’s prior-generation AMD Instinct MI100 accelerators [1, 2].

The AMD Instinct MI250X accelerator uses a multi-chip module design. It features two graphics compute dies in a single package, which we connect via AMD’s high-bandwidth, low-latency Infinity Fabric. So in a single package we can have up to eight stacks of HBM2e that provide 128 gigabytes of memory capacity.

The AMD Instinct MI250X accelerator is highly optimized for both HPC and AI applications. Across the two graphics compute dies there are a total of 220 compute units, which feature specialized matrix cores that provide high-performance matrix operations on 16-bit, 32-bit and 64-bit floating-point numbers. And with these specialized matrix cores, the AMD Instinct MI250X accelerator delivers over 95 teraFLOPS of peak double-precision matrix performance.
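
To keep those accelerator numbers straight, the short sketch below tabulates the MI250X figures quoted above and recomputes the generational ratios against the MI100 figures in the endnotes. The per-stack HBM2e capacity and per-GCD compute-unit count are simple divisions of the quoted totals, not separately sourced specifications.

```cpp
// Sketch of the MI250X figures quoted above, with generational ratios vs. the MI100
// recomputed from the endnote numbers.
#include <cstdio>

int main() {
    // AMD Instinct MI250X, per accelerator (peak figures from the interview and endnotes)
    const double mi250x_fp64_vector_tflops = 47.9;    // peak FP64 vector
    const double mi250x_fp64_matrix_tflops = 95.0;    // quoted above as "over 95" peak FP64 matrix
    const double mi250x_mem_bw_tbps        = 3.2768;  // peak memory bandwidth, TB/s
    const double mi250x_hbm2e_gb           = 128.0;   // across 8 stacks of HBM2e
    const int    mi250x_gcds               = 2;       // graphics compute dies per package
    const int    mi250x_compute_units      = 220;     // across both GCDs

    // AMD Instinct MI100, per accelerator (from the endnotes)
    const double mi100_fp64_tflops = 11.54;
    const double mi100_mem_bw_tbps = 1.2288;

    printf("HBM2e per stack:        %.0f GB\n", mi250x_hbm2e_gb / 8);               // 16 GB
    printf("Compute units per GCD:  %d\n", mi250x_compute_units / mi250x_gcds);     // 110
    printf("Matrix vs. vector FP64: ~%.1fx\n",
           mi250x_fp64_matrix_tflops / mi250x_fp64_vector_tflops);                  // ~2.0x
    printf("FP64 vs. MI100:         %.1fx\n",
           mi250x_fp64_vector_tflops / mi100_fp64_tflops);                          // ~4.2x
    printf("Memory BW vs. MI100:    %.1fx\n",
           mi250x_mem_bw_tbps / mi100_mem_bw_tbps);                                 // ~2.7x
    return 0;
}
```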


insideHPC: From an overall perspective, how does the Frontier blade compare with the blades in Oak Ridge’s preceding-generation supercomputer, Summit, installed in 2018?

Frontier blade at Oak Ridge National Laboratory

Mannel: The Frontier blade has two nodes, versus the Summit blade, which, in the same size package, has only a single node. Also, a single AMD MI250X accelerator has higher performance – just one of them – than an entire Summit blade, and there are eight of them on the Frontier blade. The numbers: 47.9 teraFLOPS of double-precision performance for each AMD MI250X accelerator, versus 42 teraFLOPS for an entire Summit blade. Essentially, you’re looking at a 16x compute density capability versus the Summit blade.

The other thing about the Frontier blade that’s surprising is power consumption: the increase in power usage is not nearly as high as the increase in performance density. We’re getting about 7.4 times the performance but only two times the power consumption. For the entire system we’re looking at 1.1 exaFLOPS consuming 21.1 megawatts of power, versus 148.6 petaFLOPS for Summit consuming a little over 10 megawatts of power.
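
Spelling out that arithmetic as a quick sketch (the Summit power figure below is our assumption of roughly 10.1 megawatts, consistent with Mannel’s “a little over 10 megawatts”):

```cpp
// Rough system-level comparison using the figures quoted above.
#include <cstdio>

int main() {
    const double frontier_pflops = 1102.0;  // ~1.1 exaFLOPS
    const double frontier_mw     = 21.1;
    const double summit_pflops   = 148.6;
    const double summit_mw       = 10.1;    // assumption; "a little over 10 megawatts"

    printf("Performance ratio:   %.1fx\n", frontier_pflops / summit_pflops);  // ~7.4x
    printf("Power ratio:         %.1fx\n", frontier_mw / summit_mw);          // ~2.1x
    printf("Perf-per-watt ratio: %.1fx\n",
           (frontier_pflops / frontier_mw) / (summit_pflops / summit_mw));    // ~3.5x
    return 0;
}
```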

Looking back, say, 15 to 20 years ago, when we looked at an exaFLOP with the technology we had then, we used to talk about needing a nuclear power plant to power an exascale system. But with improvements in technology – in particular performance density – we’ve gotten an exaFLOP into just a shade over 20 megawatts.

Schulte: Along with what Bill said, if you compare with Summit, each Summit node has two IBM POWER9 CPUs and six NVIDIA V100 GPUs. Although Summit has more CPUs and GPUs per node, each node in Frontier has about four-and-a-half times more double-precision performance, about 5.3 times more memory capacity on the GPUs and about 2.4 times more memory bandwidth than each Summit node.
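
Those per-node ratios can be roughly reconstructed from the blade figures above plus public per-device numbers for Summit’s GPUs. The V100 values used below (16 GB of HBM2, about 0.9 TB/s of memory bandwidth and roughly 7 teraFLOPS of peak FP64 per GPU) are our assumptions for the sketch, not figures from the interview.

```cpp
// Rough per-node GPU comparison: Frontier node (4x MI250X) vs. Summit node (6x V100).
#include <cstdio>

int main() {
    // Frontier node: figures quoted above and in the endnotes.
    const double frontier_fp64_tflops = 4 * 47.9;    // GPUs only
    const double frontier_mem_gb      = 4 * 128.0;   // HBM2e
    const double frontier_bw_tbps     = 4 * 3.2768;

    // Summit node: assumed V100 figures (16 GB HBM2, ~0.9 TB/s, ~7 TFLOPS FP64 each).
    const double summit_fp64_tflops = 6 * 7.0;       // GPUs only
    const double summit_mem_gb      = 6 * 16.0;
    const double summit_bw_tbps     = 6 * 0.9;

    printf("FP64:       %.1fx\n", frontier_fp64_tflops / summit_fp64_tflops); // ~4.6x ("about 4.5x")
    printf("GPU memory: %.1fx\n", frontier_mem_gb / summit_mem_gb);           // ~5.3x
    printf("Memory BW:  %.1fx\n", frontier_bw_tbps / summit_bw_tbps);         // ~2.4x
    return 0;
}
```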


insideHPC: In terms of advances and innovations, what are some of the key design differences in the Frontier blade?

Mike Schulte, AMD

Schulte: With the AMD Infinity Fabric™ technology, the CPUs and GPUs are on the same fundamental data fabric. This improves the overall bandwidth and latency between the CPUs and the GPUs and it also provides coherence between the CPUs and the GPUs, which improves performance.

You might think that a chip maker is focused on throughput. But when you’re designing these types of systems — that is, both the individual accelerators as well as the node — you have to make sure you’re developing a balanced system. You’re not just improving peak performance, you’re also looking at memory bandwidth and at connectivity between the different elements.

Bill Mannel, HPE

Mannel: In a large system like this you have to optimize around fabric connectivity. You’re trying to maximize efficiency across the system, which is part of the reason – in addition to the processor technology – we were successful in getting to that 1.1 exaFLOPS within that 21-odd megawatt power budget. When we did the early analysis, we wanted to package 512 GPUs into a single HPE Cray EX cabinet, which allowed us to optimize the fabric connectivity within the system. If you do the calculation at four GPUs per node, that’s 128 nodes. To get 128 nodes into a cabinet we had to increase the power and cooling capability of the rack itself. Originally we were looking at 300 kilowatts a rack, but we had to raise that to 400 kilowatts per rack. To use the previous example, that’s roughly 3.5 times the power per square foot of the Summit system. It’s all made possible by 100 percent direct warm-water liquid cooling with a completely fan-less architecture; that’s what allows this level of power density.
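
A quick sketch of the cabinet-level packing Mannel describes; only the figures he quotes are used, and the per-GPU power line is a crude amortization of the rack budget, not an HPE specification.

```cpp
// Cabinet packing arithmetic from the figures quoted above.
#include <cstdio>

int main() {
    const int gpus_per_cabinet = 512;
    const int gpus_per_node    = 4;
    const int nodes_per_blade  = 2;
    const int rack_kw          = 400;  // raised from an original ~300 kW target

    const int nodes_per_cabinet  = gpus_per_cabinet / gpus_per_node;    // 128
    const int blades_per_cabinet = nodes_per_cabinet / nodes_per_blade; // 64

    printf("Nodes per cabinet:   %d\n", nodes_per_cabinet);
    printf("Blades per cabinet:  %d\n", blades_per_cabinet);
    // Rack power budget amortized per GPU; includes CPUs, NICs, memory and everything else.
    printf("Rack power per GPU:  ~%.2f kW\n", (double)rack_kw / gpus_per_cabinet); // ~0.78 kW
    return 0;
}
```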


insideHPC: Discuss the software and programmability aspects of Frontier and the continuity value between Summit and Frontier.

Schulte: Software and programmability are crucial in Frontier. The overriding goal here is enabling science, so you want the software to be accessible to a wide variety of users. Frontier’s software stack uses a combination of AMD’s open software – both the AMD ROCm™ open software platform and the software we provide for the CPUs. We work closely with HPE and the DOE national labs on optimizing software for Frontier, and HPE provides their standard, great programming environment as well.

Also, the Department of Energy has sponsored development of a variety of software tools geared toward running on exascale systems. There’s been collaboration not only with HPE but also with the national labs to get software up and running well on Frontier. A big thing about us providing open software is that it facilitates collaboration between AMD, HPE and the labs. We’re all able to work closely together because we all have direct access to the source and the compiler.

One of the key tools we’ve leveraged for Frontier is the HIP programming environment, the Heterogeneous-compute Interface for Portability. With HIP, we can take code developed for Summit and run it both on Summit’s GPUs and on Frontier’s GPUs. This helped provide a more seamless experience between the two systems.
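
As a concrete illustration of that portability, here is a minimal, generic HIP example; it is our own sketch rather than Frontier application code, but the same source can be compiled with hipcc for AMD GPUs through ROCm or for NVIDIA GPUs through the CUDA backend.

```cpp
// Minimal HIP sketch (not Frontier production code): the same source builds for
// AMD GPUs (ROCm backend) or NVIDIA GPUs (CUDA backend) with hipcc.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx, *dy;
    hipMalloc(&dx, n * sizeof(float));
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Triple-chevron kernel launch works on both backends.
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);
    hipDeviceSynchronize();

    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);  // expect 5.0

    hipFree(dx);
    hipFree(dy);
    return 0;
}
```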

Mannel: To make this whole thing tick, HPE and AMD made a large investment in programming models – compilers, tools, libraries and so on – so that customers could build their applications and get the performance they wanted. We extended our Cray Programming Environment (CPE) specifically for Frontier, and we added multiple implementations of the OpenMP standard that can be used with the popular compilers for C, C++ and Fortran.

We also added into CPE two implementations of AMD’s HIP programming interface, as Mike explained. CPE also includes an enhanced suite of performance-tuning tools that, for example, allow you to insert OpenMP constructs into an existing program. And then, of course, a key thing is debugging at scale, so that you’re able to find problems with your code.
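
For readers unfamiliar with the directives Mannel mentions, here is a minimal, generic sketch of an OpenMP target construct offloading a loop to a GPU; it is our own illustration of the style of construct those CPE tools help insert and tune, not CPE output or Frontier code.

```cpp
// Minimal sketch (not Frontier production code): an OpenMP target construct
// offloading a loop to an attached accelerator when one is available.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static double x[1 << 20], y[1 << 20];
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    // Offload the loop to a device; without offload support the compiler
    // falls back to host execution and the result is the same.
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = 3.0 * x[i] + y[i];
    }

    printf("y[0] = %f\n", y[0]);  // expect 5.0
    return 0;
}
```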



insideHPC: If there is a legacy from the Frontier blade impacting future HPC and cluster strategies, what would you say it might be?

Mannel: This was the first exaflop system. Some of us had been talking about the first exaflop since we did the first petaflop. Also, if you look at the TOP500 list, Frontier by itself has about 33 percent of all the FLOPS on the list, which is phenomenal.

As for Frontier’s impact on the broader HPC industry, past leadership systems tended to be very custom, very unique, very one-off. But with Frontier, you can buy all of its components today and use them in a system of 10 nodes. This was not a unicorn; this is technology that can benefit the entire industry. That by itself is one of the worthy objectives – performance, programmability, performance per watt, all of these things can be broadly leveraged across HPC.

Schulte: The Frontier blade is historic both in terms of performance and power efficiency. It’s also important to remember that the blade was used in the top four supercomputers on the Green500 list. We put a huge emphasis not only on very high overall performance but also on making sure the design was extremely power efficient. It’s going to be very important for future clusters to continue to emphasize efficiency, because we need to continue to scale performance under challenging power constraints.

Also, within the Frontier blade the CPUs and the GPUs are co-designed to work extremely well together on HPC workloads, AI workloads and hybrid HPC-AI workloads. I believe we’ll continue to see close coupling of CPUs and GPUs, along with making sure they’re optimized for traditional HPC applications as well as AI, data analytics and hybrid workloads.

Endnotes:

  1. Measurements conducted by AMD Performance Labs as of Sep 10, 2021, on the AMD Instinct™ MI250X accelerator designed with AMD CDNA™ 2 6nm FinFET process technology with a 1,700 MHz engine clock resulted in 47.9 TFLOPS peak double-precision (FP64) floating-point and 383.0 TFLOPS peak Bfloat16 (BF16) floating-point performance. The results calculated for the AMD Instinct™ MI100 GPU designed with AMD CDNA 7nm FinFET process technology with a 1,502 MHz engine clock resulted in 11.54 TFLOPS peak double-precision (FP64) floating-point and 92.28 TFLOPS peak Bfloat16 (BF16) performance. MI200-05
  2. Calculations conducted by AMD Performance Labs as of Oct 18, 2021, for the AMD Instinct™ MI250X and MI250 accelerators (OAM) designed with CDNA™ 2 6nm FinFET process technology at a 1,600 MHz peak memory clock resulted in 128GB HBM2e memory capacity and 3.2768 TB/s peak theoretical memory bandwidth performance. The MI250X/MI250 memory bus interface is 8,192 bits and the memory data rate is up to 3.20 Gbps, for a total memory bandwidth of 3.2768 TB/s. Calculations by AMD Performance Labs as of Oct 18, 2021, for the AMD Instinct™ MI100 accelerator designed with AMD CDNA 7nm FinFET process technology at a 1,200 MHz peak memory clock resulted in 32GB HBM2 memory capacity and 1.2288 TB/s peak theoretical memory bandwidth performance. The MI100 memory bus interface is 4,096 bits and the memory data rate is up to 2.40 Gbps, for a total memory bandwidth of 1.2288 TB/s. MI200-30