Hamburg — This morning, AMD’s long comeback from trampled HPC also-ran – a comeback that began in 2017 when company executives told skeptical press and industry analysts to expect price/performance chip superiority over Intel – reached a high point (not to say an end point) with the news that the U.S. Department of Energy’s Frontier supercomputer, an HPE-Cray EX system powered by AMD CPUs and GPUs, has not only been named the world’s most powerful supercomputer, it also is the first system to exceed the exascale (10^18 calculations/second) milestone.
This may not come as a surprise to many in the HPC community: the Frontier news, under embargo until 9 am CET today by the TOP500 organization that maintains the list of the world’s most powerful HPC systems, had been leaking from various sources for several days.
On Saturday, a post appeared on a Reddit forum devoted to AMD-related technology and stock discussions: “Frontier, powered by AMD Epyc CPU and Instinct GPU is the new #1 on TOP500,” including “AMD Epyc now powers 94 of TOP500 machines, up from 73 in 2021-Nov list or 49 in 2021-Jun list, AMD Instinct MI250X (GPU) powers 7 machines including #1 Frontier, #3 Lumi and #10 Adastra, Yes, all 3 new (first time ranked) TOP10 machines are powered by AMD Epyc and Instinct, AMD Epyc powers 20 of the 39 new machines,” and so the comments go on.
The Frontier Test & Development system was named the top system on the Green500 list. It’s an AMD rout.
The announcement from the TOP500, made today in Hamburg at the ISC 2022 conference, had been a hot topic of speculation coming into this week. It was discussed last week on insideHPC’s panel discussion “ISC 2022: HPC Experts Share Thoughts on Making the Most of Next Week’s Conference,” including a comment from HPC industry analyst Addison Snell of Intersect360 that if information on Frontier’s performance had not been submitted to the TOP500 in time for the new list, it could be an indication that the system was running into delays.
But officials from Oak Ridge National Laboratory, where Frontier is housed, and HPE, Frontier’s system integrator, stated plainly at a news briefing yesterday that the system is functioning well. There had been rumors of problems with Frontier focused on its Cray Slingshot interconnect technology, which apparently had bedeviled HPE and Oak Ridge technicians pulling Frontier into final form. But Oak Ridge and HPE said Slingshot is working well and that they’re confident Frontier will be ready for “full user operations” by Jan. 1, 2023.
In addition, at a TOP500 media briefing today, TOP500 co-author Jack Dongarra said that while there was plenty of 11th-hour work behind the submission of Frontier’s performance numbers to the TOP500, the system — all of it — performed as HPE and Oak Ridge had hoped.
“Standing up any new system has issues, with problems in the hardware as well as problems in the software, and the network and everything else along the way,” Dongarra told us. “It’s a new system being put together at scale. And we’re running an application on it that stresses all the components simultaneously — that’s, that’s what causes the instability. So there was instability in the Frontier system, but … it’s a natural thing. …It’s a normal thing in the course of bringing it up.”
Initially, in fact, Frontier’s HPL number came in below exascale, but system tuning continued.
“They had a lot of people intensively working on it, they knew it was a bug and they knew they would eventually find it,” Dongarra said. “So it’s a successful machine, it’s running at scale, and it’s now, I would feel, ready to run applications.”
In its announcement today, the TOP500 organization said Frontier is “the first true exascale machine.” Some may quibble with the declaration of “first” – China by many accounts has two exascale systems up and running (see “Report: China ‘Seizing’ HPC High Ground, Plans 10 Exascale Systems by 2025”), one of them for more than a year – but neither system was submitted to the TOP500 for the new edition of the twice-yearly list. However, it is unequivocally true that Frontier is the first system to be reviewed by the TOP500 and deemed to deliver sustained performance at or above exascale – to be precise, a High Performance LINPACK (HPL) score of 1.102 Exaflop/s.
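For readers less familiar with the benchmark: HPL measures the sustained rate at which a system solves a single large, dense linear system, with the score computed as the conventional operation count for that solve divided by wall-clock time. A minimal back-of-envelope sketch in Python, using illustrative problem-size and runtime values rather than the official Frontier run parameters:

```python
# Back-of-envelope HPL math (illustrative values, not the official run parameters).
# HPL solves a dense n x n linear system; the conventional operation count is
# (2/3)*n^3 + 2*n^2 floating-point operations.

def hpl_flops(n: int) -> float:
    """Nominal floating-point operation count for an HPL run of problem size n."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

# Hypothetical problem size and runtime, chosen only so the result lands near
# Frontier's reported 1.102 Exaflop/s.
n = 24_000_000          # matrix dimension (assumed)
runtime_s = 8_400.0     # wall-clock time in seconds (assumed, roughly 2.3 hours)

rmax = hpl_flops(n) / runtime_s
print(f"Sustained rate: {rmax / 1e18:.3f} Exaflop/s")
```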
The announcement is no doubt welcome news at the U.S. Department of Energy, whose exascale development program, begun in 2016 under the auspices of the Exascale Computing Project, has encountered a stumble or two along the way (see “Exascale Exasperation: Why DOE Gave Intel a 2nd Chance; Can Nvidia GPUs Ride to Aurora’s Rescue?”, August 2020). In fact, Frontier was not originally planned to be the first U.S. exascale system; that distinction was to have gone to Aurora, the Intel-built and -powered system that was to have been installed at Argonne National Laboratory last year. But Intel encountered delays with its new GPU, code-named “Ponte Vecchio,” along with less lengthy delays to the Sapphire Rapids CPU, with the result that Aurora is not expected to arrive at Argonne until later this year.
Another positive development, assuming the challenges involving the Slingshot interconnect have been completely ironed out, is that all three of DOE’s first exascale-class systems will use the HPE-Cray EX architecture – a continuation of Slingshot problems might have delayed the entire U.S. exascale effort.
In its announcement this morning, the TOP500 said Frontier delivered a High Performance LINPACK score of 1.102 Exaflop/s. The system is powered by AMD EPYC 64C 2 GHz processors with 8.7 million cores, has a power efficiency rating of 52.2 gigaflops/watt and relies on HPE’s Ethernet-based Slingshot-11 interconnect for data transfer.
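Taken together, those two figures imply the approximate power the machine drew while running the benchmark. A quick sketch of the arithmetic, assuming the reported efficiency is the average over the HPL run:

```python
# Rough implied power draw during the HPL run, derived from the two reported
# figures above (assumes 52.2 gigaflops/watt is the HPL-run average efficiency).
hpl_score_flops = 1.102e18           # 1.102 Exaflop/s
efficiency_flops_per_watt = 52.2e9   # 52.2 gigaflops/watt

power_watts = hpl_score_flops / efficiency_flops_per_watt
print(f"Implied power draw: {power_watts / 1e6:.1f} MW")  # roughly 21 MW
```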
The top position had been held for the previous two years by the Fugaku system at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan. With its previous HPL benchmark score of 442 Pflop/s, Fugaku has now dropped to No. 2. Given that Fugaku’s theoretical peak is above the 1 exaflop barrier, there is cause to call it an exascale machine as well, the TOP500 said. However, Frontier is the only system able to demonstrate this on the HPL benchmark test.
Another change within the TOP10 is the introduction of the LUMI system at EuroHPC/CSC in Finland. Now occupying the No. 3 spot, this new system has 1,110,144 cores and an HPL benchmark score of nearly 152 Pflop/s. LUMI is also noteworthy as the largest system in Europe.
Finally, another change within the TOP10 occurred at the No. 10 spot with the new addition of the Adastra system at GENCI-CINES in France. It achieved an HPL benchmark score of 46.1 Pflop/s and is the second most powerful machine in Europe, behind LUMI.
Here is a summary of the systems in the TOP10:
Frontier is the new No. 1 system in the TOP500. This HPE Cray EX system is the first US system with a peak performance exceeding one Exaflop/s. It is currently being integrated and tested at ORNL in Tennessee, USA, where it will be operated by the Department of Energy (DOE). It has achieved 1.102 Exaflop/s using 8,730,112 cores. The new HPE Cray EX architecture combines 3rd Gen AMD EPYC™ CPUs optimized for HPC and AI with AMD Instinct™ MI250X accelerators and a Slingshot-11 interconnect.
Fugaku, now the No. 2 system, is installed at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan. It has 7,630,848 cores, which allowed it to achieve an HPL benchmark score of 442 Pflop/s – nearly 3x ahead of the No. 3 system on the list.
The new LUMI system, another HPE Cray EX system installed at the EuroHPC center at CSC in Finland, is the new No. 3 with a performance of 151.9 Pflop/s, just ahead of No. 4. The European High-Performance Computing Joint Undertaking (EuroHPC JU) is pooling European resources to develop top-of-the-range exascale supercomputers for processing big data. One of the pan-European pre-exascale supercomputers, LUMI, is located in CSC’s data center in Kajaani, Finland.
Summit, an IBM-built system at ORNL in Tennessee, USA, is now listed at the No. 4 spot worldwide with a performance of 148.8 Pflop/s on the HPL benchmark, which is used to rank the TOP500 list. Summit has 4,356 nodes, each housing two Power9 CPUs with 22 cores and six NVIDIA Tesla V100 GPUs, each with 80 streaming multiprocessors (SMs). The nodes are linked together with a Mellanox dual-rail EDR InfiniBand network.
Sierra, a system at Lawrence Livermore National Laboratory in California, USA, is at No. 5. Its architecture is very similar to that of the No. 4 system, Summit. It is built from 4,320 nodes, each with two Power9 CPUs and four NVIDIA Tesla V100 GPUs. Sierra achieved 94.6 Pflop/s.
Sunway TaihuLight, a system developed by China’s National Research Center of Parallel Computer Engineering & Technology (NRCPC) and installed at the National Supercomputing Center in Wuxi, in China’s Jiangsu province, is listed at the No. 6 position with 93 Pflop/s.
Perlmutter, at No. 7, is based on the HPE Cray “Shasta” platform and is a heterogeneous system with AMD EPYC-based nodes and 1,536 NVIDIA A100-accelerated nodes. Perlmutter achieved 64.6 Pflop/s.
Now at No. 8, Selene is an NVIDIA DGX A100 SuperPOD installed in-house at NVIDIA in the USA. The system is based on AMD EPYC processors with NVIDIA A100 GPUs for acceleration and a Mellanox HDR InfiniBand network; it achieved 63.4 Pflop/s.
Tianhe-2A (Milky Way-2A), a system developed by China’s National University of Defense Technology (NUDT) and deployed at the National Supercomputer Center in Guangzhou, China, is now listed as the No. 9 system with 61.4 Pflop/s.
The Adastra system installed at GENCI-CINES is new to the list at No. 10. It is the third new HPE Cray EX system and the second fastest system in Europe. It achieved 46.1 Pflop/s.