Frontier Testing and Tuning Problems Downplayed by Oak Ridge

OLCF’s Justin Whitt

Everything about the Frontier supercomputer, the world’s first exascale system residing at Oak Ridge National Laboratory, is outsized: its power, its scale and the attention it draws. In HPC circles, the attention and talk have increasingly focused on performance problems as the lab gets the system ready for “full user operations” by this coming January.

We’ve reported on problems with Frontier’s HPE Cray Slingshot fabric late last year and into the spring of this year, problems that the lab and HPE worked to overcome before Frontier was certified to have crossed the exaFLOP milestone on the HPL (High-Performance LINPACK) benchmark last May by the TOP500 organization. The current problems appear to center on Frontier’s stability when executing highly demanding workloads, with some of the problems focused on AMD’s Instinct GPU accelerators, which carry most of the system’s processing workload and are paired with AMD EPYC CPUs within the system’s blades.

In interviews with us this week, Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), confirmed that he and his staff have run into issues, but he emphasized that they are typical of those he has dealt with in his decade-plus of testing and tuning leadership-class supercomputers at the lab.

“It’s mostly issues of scale coupled with the breadth of applications, so the issues we’re encountering mostly relate to running very, very large jobs using the entire system … and getting all the hardware to work in concert to do that,” Whitt said. “That’s kind of the final exam for supercomputers. It’s the hardest part to reach. And those are the kinds of issues we’re seeing at this point, having the tuning be general enough that it benefits a wide breadth of applications.”

He said that running the HPL benchmark is different from running scientific applications on the system “without a hardware failure, without a hiccup in the network, and getting everything tuned.”

Whitt declined to go into details on Frontier’s “hiccups,” but said he and his team are working on improving Frontier’s current mean time to failure.

“We are working through issues in hardware and making sure that we understand (what they are),” he said, “because you’re going to have failures at this scale. Mean time between failure on a system this size is hours, it’s not days. So you need to make sure you understand what those failures are and that there’s no pattern to those failures that you need to be concerned with. And then it’s about tuning the programming environment so you’re … getting maximum performance on the applications.”
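Whitt's point that a machine this size fails in hours rather than days follows from simple arithmetic, which the sketch below illustrates. It is not OLCF's model: it assumes identical components that fail independently at a constant rate, so the system's failure rate is the sum of the component rates. The function names, the component count and the per-component figure are all hypothetical values chosen for illustration, not numbers reported for Frontier.

```python
import math


def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of a system that stops whenever any one component fails.

    Assumes n identical, independent components with constant failure
    rates, so the rates add and the system MTBF is the component MTBF
    divided by the number of components.
    """
    return component_mtbf_hours / n_components


def prob_run_survives(run_hours: float, mtbf_hours: float) -> float:
    """Probability that a run of the given length sees no failure,
    under the same constant-failure-rate (exponential) assumption."""
    return math.exp(-run_hours / mtbf_hours)


if __name__ == "__main__":
    # Hypothetical inputs: 40,000 components, each averaging
    # 200,000 hours (roughly 23 years) between failures.
    mtbf = system_mtbf_hours(component_mtbf_hours=200_000, n_components=40_000)
    print(f"system MTBF: {mtbf:.1f} hours")
    print(f"P(24-hour run with no failure): {prob_run_survives(24, mtbf):.4f}")
```

Under these assumed numbers, components that individually last decades still produce a system-wide failure every few hours, and a full-system run lasting a day would rarely finish without one. That is why large jobs generally depend on checkpointing and why the failure interval matters so much to users.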

The goal, Whitt said, is to enable users to be productive in their scientific research, which varies by application. A day-long run without a system failure “would be outstanding,” he said. “Our goal is still hours,” though longer than Frontier’s current interval between failures, he said, adding that “we’re not super far off our goal.”

Whitt declined to blame most of Frontier’s current challenges on the functioning of the Instinct GPUs. “The issues span lots of different categories, the GPUs are just one.”

“A lot of challenges are focused around those, but that’s not the majority of the challenges that we’re seeing,” he said. “It’s a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products. We’re dealing with a lot of the early-life kind of things we’ve seen with other machines that we’ve deployed, so it’s nothing too out of the ordinary.”

Still, Whitt acknowledged that the problems presented by Frontier have “been a little bit harder” because of the scale of the system, which is built from 685 different types of parts, 60 million parts in total.

Added to the pressure of finalizing the system by January are pandemic-related supply chain problems that delayed delivery of Frontier by about three months. This, in turn, delayed the start of testing and tuning.

“We’re nearing the end of the process and we’re largely on track,” Whitt said. “When we put together the plan for Frontier back in 2019, even late 2018, we said we’ll be ready for user programs on January 1 of 2023. And that’s where we’re still at, we’re on track to be ready for user programs then. We did take the quarter hit with the supply chain issues… We’re getting very close to the end of that part of the schedule.”