OLCF’s Doug Kothe on Pushing Frontier Across the Exascale Line and the Future of Leadership Supercomputers


Doug Kothe of Oak Ridge National Laboratory

Everyone involved in the Frontier supercomputer project got a taste of what a moonshot is like. Granted, lives were not on the line with Frontier as they were when Armstrong and Aldrin went to the moon in 1969. But in other ways there are parallels between the space mission and standing up Frontier, the world’s first exascale HPC system.

Both were decade-plus-long efforts involving thousands of people across the public and private sectors, requiring vision, coordination, determination and technical brilliance. Both involved large sums of money, both pushed technology barriers, both were highly public efforts with hard deadlines, and both put everyone involved under immense pressure.

And both succeeded – though both ran into last-minute panics. For the moon mission, the lunar module nearly ran out of fuel during the final descent. For Frontier, there was a scramble to push system performance past the exascale milestone in time for the TOP500 supercomputer ranking announcement at last spring’s ISC conference.

In the middle of the Frontier storm was Doug Kothe, director of the Department of Energy’s Exascale Computing Project and associate laboratory director for the Computing and Computational Sciences Directorate (CCSD) at Oak Ridge National Laboratory, Frontier’s home. Two weeks before Frontier’s TOP500 triumph, Kothe was named to the latter title, replacing Jeff Nichols, who has retired.

We sat down with Kothe on the day of the recent Frontier ribbon cutting ceremony at the lab, an event surely viewed by him and his colleagues with pride and relief.

“We don’t cookie cutter these (supercomputers),” Kothe said, “they’re all unique and, basically, serial number one. It’s a unique system with its own unique personality…

“I’m not sure people appreciate what it takes to deliver these systems,” he said. “They’re formal projects, and the corporate knowledge for executing a formal project in a way that maintains cost, schedule and performance resides here at Oak Ridge. It’s unique to see system after system come in and the deployments and procurements are ‘project-ized,’ and the staff here has been through many iterations. The workforce here is the best, in my view, to be able to do this. To have the knowledge to anticipate what might go wrong, to risk manage and work closely with vendors – we have veterans.”

Kothe was named director of the Exascale Computing Project in 2017 and has more than 37 years of experience working in DOE national laboratories. He joined ORNL in 2006 as director of science for CCSD’s National Center for Computational Sciences, and from 2010 to 2015, he directed the Consortium for Advanced Simulation of Light Water Reactors, a DOE Energy Innovation Hub. The Computing and Computational Sciences Directorate that Kothe now oversees houses the Oak Ridge Leadership Computing Facility, which means that Frontier will never be far away.

Of course, for senior OLCF managers like Kothe, the pressure comes in waves but never lets up completely. Noting that OLCF continues to operate Summit, the former No. 1 supercomputer delivered in 2018, Kothe said, “It seems to be we have three systems in flight at one time – the one we’re operating, the one we’re deploying and the one we’re thinking about next. It’s this constant treadmill of these three phases.”

By “deploying,” Kothe means the final testing and tuning of Frontier, getting it ready for, in the lab’s phrase, “full user operations” by early next year.

For Frontier, this process got a jumpstart from the May 30 TOP500 submission deadline: expectations were high that the system would ring up an exascale number on the HPL (High Performance LINPACK) benchmark. HPL is designed to engage and stress the entire supercomputer, a particular challenge for the HPE-built Frontier, which comprises 74 cabinets, each holding 64 blades of two nodes apiece, for a total of nearly 9,500 AMD EPYC CPUs and 37,888 AMD Instinct MI250X GPUs. All that technology would, in theory, deliver exascale performance, but getting it to work in concert was an enormous challenge.
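
Those component counts hang together arithmetically – the one-CPU, four-GPU node layout in the sketch below is inferred from the published totals rather than something spelled out in our conversation:

```python
# Back-of-the-envelope check on Frontier's published component counts.
# The per-node CPU/GPU split is inferred from the totals, not quoted directly.
cabinets = 74
blades_per_cabinet = 64
nodes_per_blade = 2

nodes = cabinets * blades_per_cabinet * nodes_per_blade  # 9,472 nodes
cpus = nodes * 1   # one EPYC CPU per node -> "nearly 9,500"
gpus = nodes * 4   # four MI250X GPUs per node -> 37,888

print(f"nodes={nodes:,}  cpus={cpus:,}  gpus={gpus:,}")
# nodes=9,472  cpus=9,472  gpus=37,888
```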

Networking among all those Frontier nodes is the HPE Cray Slingshot fabric, which took several months of tinkering to work correctly, right up until the TOP500 performance submission deadline. Stories abound of Frontier engineers in the final days pulling all-nighters to enable Frontier to run the HPL benchmark at peak performance.

“The HPL benchmark has withstood the test of time as an important science benchmark,” Kothe said. “It really shakes down a system, it’s very much legit in terms of stress and fabric… We got bandwidth measurements to indicate performance right at the spec. In terms of stressing the interconnect, it’s high bandwidth, or lots and lots of small messages that are latency-dependent… That’s why HPL is the beginning of acceptance.”
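
For context, HPL rates a system by timing the solution of a dense N-by-N linear system and crediting it with a fixed operation count; the problem size and runtime in the snippet below are placeholders for illustration, not Frontier’s actual submission values:

```python
# HPL credits a run with (2/3 * N**3 + 2 * N**2) floating-point operations
# for an N x N dense solve; sustained performance is that count over wall time.
def hpl_flops(n: int) -> float:
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def hpl_exaflops(n: int, runtime_seconds: float) -> float:
    return hpl_flops(n) / runtime_seconds / 1e18

# Placeholder inputs only -- not Frontier's actual HPL problem size or runtime.
print(f"{hpl_exaflops(n=24_000_000, runtime_seconds=8_000):.3f} EF/s")
```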

Now Oak Ridge, HPE and AMD are moving the system through the formal acceptance process, involving a broad workload of applications.

“We test three areas: functionality, everything has to work and get the right answer; stability, which is the toughest; and performance,” Kothe said. “So we go through these three phases and stability is generally the toughest; it’s where we basically use a mock workload of actual general availability production. And the mock workload is based on real apps that we anticipate are going to consume a decent amount of cycles on Frontier.”
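
A toy sketch of that three-phase structure is below; the phase names come from Kothe’s description, while the application names, figures of merit and pass criteria are invented for illustration:

```python
# Illustrative only: a minimal harness mirroring the functionality /
# performance / stability phases Kothe describes. Apps and thresholds are made up.
from dataclasses import dataclass

@dataclass
class AppRun:
    name: str
    correct_answer: bool     # functionality: ran and produced the right answer
    achieved_fom: float      # performance: measured figure of merit
    target_fom: float        # performance: target figure of merit

def functionality(runs): return all(r.correct_answer for r in runs)
def performance(runs):   return all(r.achieved_fom >= r.target_fom for r in runs)
def stability(runs, repeats=3):
    # Stand-in for running a mock production workload back-to-back over days.
    return all(functionality(runs) and performance(runs) for _ in range(repeats))

workload = [AppRun("app_a", True, 52.0, 50.0), AppRun("app_b", True, 61.5, 50.0)]
print(functionality(workload), performance(workload), stability(workload))
```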

A major aspect of the Exascale Computing Project’s work involved developing 24 scientific applications to run on Frontier. Ensuring the system runs those applications at target performance and efficiency levels is a key part of readying Frontier for general availability.

“The early science period for Frontier is going to be as intense as I can ever remember,” Kothe said. “Because of ECP, we’re going to load up with potentially a lot more applications and hit it pretty hard. A system of this size and complexity will always have challenges in production, but I’m confident it’s going to be rock solid and ready to go.”

With Frontier nearing full user operations, OLCF and DOE have embarked on the next system, issuing in late June a request for information (RFI) discussing its strategy for upcoming leadership-class supercomputers extending to 2030. The document calls for “the development of an approach that moves away from monolithic acquisitions toward a model for enabling more rapid upgrade cycles of deployed systems, to enable faster innovation on hardware and software.”

Which is to say: a move away from the traditional Summit/Frontier model.

“One possible strategy would include increased reuse of existing infrastructure so that the upgrades are modular,” DOE said. “A goal would be to reimagine systems architecture and an efficient acquisition process that allows continuous injection of technological advances to a facility (e.g., every 12–24 months rather than every 4–5 years).”

This is something new and centers on DOE’s notion of an Advanced Computing Ecosystem (ACE), which in turn points to the broadening HPC/AI workload scope for leadership supercomputers, their greater heterogeneity, and a premium placed on HPC resource sharing among the national labs. This includes “A capable software stack (that) will meet the requirements of a broad spectrum of applications and workloads, including large-scale computational science campaigns in modeling and simulation, machine intelligence, and integrated data analysis.”

As vendors prepare their responses to the RFI, Kothe and DOE colleagues will continue to refine their next-gen thinking. A point of emphasis in our conversation was workflows.

“One of the movements in the future is really to have what I would call a coupled facility workflow where, potentially, we’re autonomously driving experiments and operating facilities remotely,” Kothe said.

“So think about a light source where we’re gathering data in real time and processing it in real time,” he said. “Some processing is happening back on the big iron, but a lot may be local, on the edge. So that whole workflow – call it a coupled-facilities workflow – is one that I think is really going to inform what we need for the next system.”
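
A minimal sketch of the routing decision such a coupled-facility workflow implies is below; the facility pieces, size threshold and function names are all hypothetical:

```python
# Hypothetical sketch: detector frames from a light source are triaged at the
# edge, and only heavyweight work is shipped to the leadership system.
# The 64 MB cutoff and all names are invented for illustration.
from typing import Iterable, List

EDGE_LIMIT_BYTES = 64 * 1024 * 1024

def process_at_edge(frame: bytes) -> dict:
    return {"where": "edge", "bytes": len(frame)}

def submit_to_big_iron(frame: bytes) -> dict:
    # Stand-in for queueing analysis on the leadership system.
    return {"where": "hpc", "bytes": len(frame)}

def triage(frames: Iterable[bytes]) -> List[dict]:
    return [process_at_edge(f) if len(f) <= EDGE_LIMIT_BYTES
            else submit_to_big_iron(f) for f in frames]

print(triage([b"\x00" * 1024, b"\x00" * (128 * 1024 * 1024)]))
```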

Another hypothetical scenario:

“Let’s say from my office I want to turn on a scanning tunneling electron microscope remotely, and I want to drive it and I want to get data back in real time,” he said. “So there’s communication protocols at the software level implied here, it has to know that I’m Doug and Doug wants to turn the system on and it’s okay. So that’s of interest – self driving experiments, whether it be a chemistry lab or a STEM lab, it could be an additive manufacturing lab. We’re looking at how to make these workflows more efficient, more autonomous. And I think that is going to inform these next procurements as well. This sort of desktop analysis will inform not if but how we scale it up into literally coupled facility workflows.”
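
The identity check Kothe alludes to (“it has to know that I’m Doug”) might look something like the sketch below; the instrument class, shared-secret signing scheme and command set are invented stand-ins for whatever federated identity layer a facility would actually use:

```python
# Hypothetical sketch of an authenticated remote-instrument request.
# The shared-secret HMAC scheme is a stand-in for a real facility identity layer.
import hmac, hashlib

SHARED_SECRET = b"facility-issued-secret"

def sign(user: str, command: str) -> str:
    return hmac.new(SHARED_SECRET, f"{user}:{command}".encode(), hashlib.sha256).hexdigest()

class Instrument:
    def __init__(self, name: str, authorized_users: set):
        self.name, self.authorized, self.powered = name, authorized_users, False

    def handle(self, user: str, command: str, signature: str) -> str:
        # Verify identity and authorization before acting on the command.
        if not hmac.compare_digest(signature, sign(user, command)):
            return "rejected: bad signature"
        if user not in self.authorized:
            return "rejected: user not authorized"
        if command == "power_on":
            self.powered = True
            return f"{self.name} powered on for {user}"
        return "rejected: unknown command"

stem = Instrument("remote-STEM", authorized_users={"doug"})
print(stem.handle("doug", "power_on", sign("doug", "power_on")))
```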

At this stage of development, questions of how to execute this strategy naturally remain.

“As to how that maps into software and hardware remains to be seen,” Kothe said. “But I do think you’re going to see more and more specialized accelerators, or hardware for particular aspects of the workflow, whether it’s data analytics, or training or inference, I think you’re probably going to see an eclectic combination.”