Getting to Exascale Day 2022 with an Exascale System Wasn’t Easy

Frontier supercomputer and Justin Whitt of the Oak Ridge Leadership Computing Facility

Finally, after 15-plus years of intellectual strain (planning), bureaucratic wrangling (budgeting), technical toil (system building) and, probably, some tears, the HPC community has arrived at an Exascale Day, October 18 (10/18, as in 10^18), on which we actually have a certified exaFLOPS supercomputer: Frontier, at Oak Ridge National Laboratory. Exascale is no longer in the future; it's here, it's now, and it arrived on time.

It’s an unlikely story. The push to get the HPE Cray-built and AMD-powered Frontier system past the exascale milestone last spring is one of the high dramas in recent HPC memory, capping an effort that overcame a host of obstacles during the previous two years, including a worldwide pandemic and the meltdown of global supply chains.

Asked about the significance of Exascale Day 2022, industry analyst Addison Snell, CEO of Intersect360 Research, said, “Looking at the last two years, I think it underlines the success of the United States Department of Energy’s exascale program, including the Exascale Computing Project, where from hardware to software to facilities, they’ve taken a holistic approach to the achievement of exascale.”

Holistic Exascale

By holistic, Snell was referring to DOE’s “capable exascale” strategy, which took the project “beyond being a number on a benchmark,” he said. “What are the interconnect implications? What are the power consumption implications? What are the software ecosystem implications that are going to enable applications at exascale? And the openness with which all that was done … I think really shows how this can be done very successfully.

“Certainly, the Chinese agencies are well documented as having achieved exascale on applications right now,” Snell continued. “But in contrast, that effort has been quite closed… We don’t really know as much about their interconnect, the software, and the power consumption as we do with the American effort, as to which one supports the broader scientific community better.”

That said, this Exascale Day has some ironies. For one, the focus of leadership-class systems strategists has already moved on to the next big thing in supercomputing, and the U.S. Department of Energy’s plan to get there embraces a different, more modular and more workload-specific approach. Rather than move toward bigger and more powerful monolithic systems, DOE means to combine shared systems across multiple locations that, in aggregate, will constitute the post-exascale era.

Which means Frontier and its exascale ilk in future years may be looked back upon as among the last of the great, stand-alone, general-purpose clusters, the culmination (at the high end) of the Beowulf movement begun nearly 30 years ago by Thomas Sterling and Donald Becker, the last of a breed.

But what a breed! The excitement generated by Frontier, a system that will take on problems of greater scale and complexity than any previous HPC system, can be seen in the acclaim cast its way on the Exascale Computing Project’s (ECP) website by scientists, national lab officials, senior managers at technology companies that built the system, industry analysts and computer scientists at academic institutions.

We reached out to several people intimately involved in the exascale project, some of whom we talked with two Exascale Days ago (“Getting to Exascale: Nothing Is Easy”) about the biggest challenges they anticipated in the coming months and years.

Frontier’s Moonshot

One of them is Justin Whitt, Program and OLCF-5 Project Director at the Oak Ridge Leadership Computing Facility, Frontier’s home. Few people have been more immersed in the Frontier project; Whitt has lived it day in and day out for years. His account of “the exascale moment” last May might remind some of the moment when the Apollo 11 lunar module touched down on the moon in 1969 (Mission Control to the astronauts: “You got a bunch of guys about to turn blue. We’re breathing again. Thanks a lot.”)

After Frontier was delivered to Oak Ridge last November, a challenge for Whitt, his OLCF team and technical staffers at HPE and AMD was tuning the HPE Cray Slingshot fabric and other system components so that all of Frontier’s 74 cabinets (9,400 computing nodes) were engaged as the system ran the HPL (High-Performance LINPACK) benchmark, used by the TOP500 organization to measure a supercomputer’s throughput.
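For readers curious how that throughput figure is derived, HPL reports performance as the benchmark’s fixed floating-point operation count divided by wall-clock run time. The short Python sketch below illustrates the arithmetic; it is not ORNL’s tooling, and the problem size and run time shown are hypothetical, not Frontier’s actual figures.

    # Illustrative sketch (not ORNL's tooling): how an HPL score relates the
    # benchmark's problem size N to wall-clock time. HPL credits roughly
    # (2/3)*N**3 floating-point operations for an N x N solve; lower-order
    # terms are negligible at this scale.

    def hpl_eflops(n: int, seconds: float) -> float:
        """Approximate HPL performance, in exaFLOPS, for an N x N problem solved in `seconds`."""
        flops = (2.0 / 3.0) * n ** 3   # dominant term of the HPL operation count
        return flops / seconds / 1e18  # 1 exaFLOPS = 1e18 floating-point operations per second

    # Hypothetical numbers: a 24-million-row problem finishing in 2.5 hours
    # works out to just over one exaFLOPS.
    if __name__ == "__main__":
        n, t = 24_000_000, 2.5 * 3600
        print(f"N = {n:,}, time = {t / 3600:.1f} h  ->  {hpl_eflops(n, t):.2f} exaFLOPS")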

Describing the atmosphere during those final weeks leading up to the TOP500 submission deadline, Whitt said, “I would even use the word intense. We went into months-long, 24/7 operations. It was really a team effort where we had HPE engineers and technicians on site, and they were replacing parts or adjusting tunings on the system. AMD had engineers on site as well, helping us root-cause issues that we found. And then at the same time, the ORNL application teams were just hammering the system and helping root out those early-life failures so we could get the system up quickly and … break that exascale barrier, which was truly exciting.”

Whitt told us that the people involved in the project fell into the habit of watching power profiles showing how much power Frontier was drawing as it ran the HPL benchmark.

“When you’re running HPL on a big system it has a very distinct power profile where it ramps up and then it stays at that kind of steady state, and then it tails off in some manner,” Whitt said. “A lot of times, how it tails is really where you can get your performance gains…”

Around the world, engineers involved in the Frontier project “would stay up at night watching these power profiles. It was a little bit like a rollercoaster ride, you may see a dip in the performance or the power use and you never knew if something went wrong and if the entire HPL run failed or if it was just a momentary blip and would tick back up again…”
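As a loose illustration of the kind of watching Whitt describes, the sketch below scans a stream of power readings and flags samples that fall well below the trailing average, the sort of dip that kept engineers guessing. It is a hypothetical example, not OLCF’s monitoring code; the function names, threshold and sample values are all invented for illustration.

    # Minimal illustrative sketch (not OLCF's tooling): flag dips in a stream of
    # facility power readings during a long benchmark run. Thresholds and sample
    # values below are hypothetical.
    from statistics import mean

    def flag_power_dips(samples_mw, window=10, dip_fraction=0.15):
        """Yield (index, value) for readings that fall well below the trailing average.

        samples_mw   -- iterable of power readings in megawatts
        window       -- number of trailing samples used as the baseline
        dip_fraction -- how far below the baseline counts as a dip (15% by default)
        """
        recent = []
        for i, p in enumerate(samples_mw):
            if len(recent) == window and p < (1.0 - dip_fraction) * mean(recent):
                yield i, p  # could be a failed run -- or just a momentary blip
            recent.append(p)
            if len(recent) > window:
                recent.pop(0)

    # Hypothetical trace: steady-state draw around 21 MW with one brief dip.
    trace = [21.0, 21.2, 21.1, 21.3, 21.0, 21.2, 21.1, 21.2, 21.0, 21.1, 15.5, 21.2]
    for idx, value in flag_power_dips(trace):
        print(f"sample {idx}: {value} MW -- possible dip")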

In the Whitt household, the suspense spread family-wide.

“I even had my wife and kids watching with me on our television…,” Whitt said. “While we ate dinner we would watch the HPL run and the power profiles. It really was down to the wire.”

Then, early one morning close to the TOP500 deadline, the team had an HPL run that came closer to the exaFLOPS milestone but again fell short, which meant figuring out whether more tuning was needed or whether there’d been a more serious problem requiring hardware replacements and a system restart.

At that point, Whitt said, “I think we were a little down, and I guess it’s always darkest before the light. It was 6 a.m., about sunrise. We had one that was running and it completed. And that was the TOP500 result that broke the exascale barrier. Our Slack channels immediately just erupted with all the people that were still up watching these things and congratulating each other on, really, a history-making moment.”

Supply Chain Meltdown

A wonderful moment for all involved, yes, but the drama – the angst – in arriving there was caused in part by the supply chain breakdown brought on by the COVID-19 pandemic. When those involved were asked about the biggest surprise and the biggest barrier to exascale encountered over the past two years, the consistent answer was supply chain problems.

Overcoming it meant hundreds of staff hours working the phones with suppliers, a tedious but necessary task to secure the 60 million parts, spanning hundreds of distinct part types, that make up Frontier.

The problem often came up not with high-end parts, such as GPUs and CPUs, but with more mundane parts, such as voltage regulators, Whitt said.

“That was a surprise to everyone… And it was almost exactly when we were ready to build the computer that the constraints really started to … affect the supply chain.”

In all, supply chain problems threw the Frontier project about three months off schedule. That it wasn’t worse, Whitt said, is a credit to “our vendor partners, I have them to thank for that. I really haven’t been able to say enough about the true hero efforts by HPE and AMD in going out and getting the parts that we needed … and building the system and getting it here as quickly as they could.”

Whitt said that at one point, tracking down parts was the full-time job of several people – including from competitors good-hearted enough to help out. “It took us a couple of months to get the parts, and when the last parts arrived they went right into the cabinets and those things went out the door to us. That’s really how tight that part of the schedule was at that point.”

Gelsinger Works the Phones

Jeff McVeigh of Intel

Jeff McVeigh, VP and GM of the Super Compute Group at Intel, is working on delivery of the Aurora exascale system to Argonne National Laboratory. That system is scheduled to be delivered by the end of this year or the first part of next, and as of now, Intel is delivering blades to Argonne powered by its 4th Gen Xeon “Sapphire Rapids” CPUs and its first data center GPU, “Ponte Vecchio.”

McVeigh and his team are dealing with the same supply chain issues that plagued the Frontier team. It’s a measure of the “all hands on deck” nature of a project like Aurora that the determination to break through supply chain bottlenecks goes all the way to the top at Intel, to CEO Pat Gelsinger.

“He’s been a huge advocate, he totally gets the importance of this,” McVeigh said, adding that Gelsinger says “’Hey, we do these things at the high end so that we can benefit across the rest of the roadmap.’ And he really understands that and is honestly pushing us to do more. ‘If we’re going to do this in the high end, make sure these are the highest end products that we can do in our roadmap.’ So that’s a very encouraging and supportive environment.”

Gelsinger returned to Intel early in 2021 after a hiatus at VMware. The previous year, Intel had announced that Aurora would be delayed past its original 2021 delivery date.

“Aurora obviously had been in flight well before (Gelsinger) came back,” McVeigh said, “but he’s making sure that this is delivered. Obviously, he’s been having lots of engagements with the government, the U.S. government and (he wants to show) proof that the government can stand behind Intel and that we’re going to deliver, so he wants to make sure that this is done to the right level of quality and performance.”

McVeigh said Gelsinger’s commitment to Aurora extends to pitching in when supply chain problems crop up.

“He’s with me on calls with CEOs from suppliers, making sure we get the parts we need,” McVeigh said, “so up and down, he’s involved. Now we don’t call him in on every call…, but if there’s a real challenge he absolutely will get on the call with us and we talk it through. And he helps significantly.”

Slingshot Comes in for Praise

Along with supply chain problems, the exascale effort has encountered technical challenges, such as the Slingshot fabric that enables Frontier blades to share processing, move data and communicate. But Snell said that of all the technical components of Frontier, Slingshot impresses him the most.

Addison Snell, CEO of Intersect360 Research

“The Cray Slingshot interconnect is one that a lot of people like to take snipes at,” he said. “But you know, that’s something where you need to put together an interconnect that’s going to be robust at that level of scale, it’s not easy. Other people haven’t been doing that. Sometimes you can tell who the pioneers are because they’re the ones shot full of arrows. That’s what we’ve got with Slingshot, and the amount of complexity they had to overcome, particularly during the supply chain issues around the pandemic, to deliver that, to get it running at exascale, that’s a real achievement… That’s one of the hardest pieces, and if it’s not working then it’s easy to complain about.”

Bob Sorensen, senior vice president of research at industry analyst firm Hyperion Research, shared the view that it’s easy – and often unfair – to criticize a project as technically complex as an exascale system. He drew the analogy to criticism of NASA when a space flight is delayed.

“Of course they had a delay of flight; they’re interested in safety,” Sorensen said. “They’re making some of the most complex machines in the world and putting them through some pretty rigorous environments. That is to be expected. To me, the only people who should be surprised are people who are outside the sector and don’t know. But this is all part of the continuum of deploying an HPC system.”

Challenges will come up because building systems at this scale is so demanding.

‘Not for the Faint of Heart’

“If every one of these (supercomputer) installations goes as smooth as glass and life is great, then I have to question how aggressively the designers are pushing the state of the art,” Sorensen said. “If you want to build a COTS system, go build a COTS system. But if you want to build something that advances the state of the art in a number of vectors – aggressive chips, aggressive network, aggressive programming – then you’ve got to expect some hiccups.”

Hyperion Research’s Bob Sorensen

Sorensen also said he’s impressed that the vendors building the first three U.S. systems didn’t use supply chain problems as an excuse for delivery delays.

“The idea that that system (Frontier) was assembled during a major pandemic with supply chain issues, where it would have been much easier just to say, ‘We can’t get the parts.’ And I’m sure, somewhere buried in some contract, there was a ‘Get Out of Jail Free’ clause for these kinds of things. The fact that they stuck to the schedule, and really worked to hit it, despite the fact that the supply chain issues were pretty onerous, to me is a testament to the tenacity of getting this thing done.”

Whitt expressed the same sentiment.

“The excitement in leadership computing within HPC is that you’re serial number one,” he said. “And for example, for the network interface cards we were first off the manufacturing line, and that’s not an endeavor for the faint of heart, particularly at this scale. There are going to be problems and you will have to work through those. So the hope is that we’ve benefited the nation and the larger worldwide HPC community by working through a lot of these problems with HPE and AMD, helping them solve these problems for future deployments. These systems are popping up around the world … with these exact technologies. So it’s always satisfying to go through a lengthy co-design process and work with these companies with an eye on what’s best for the DOE user community, and then for that to be broadly accepted and valued and picked up by other groups or users around the world. It’s a very satisfying experience.”