Getting to Exascale: Nothing Is Easy

In the weeks leading up to today’s Exascale Day (10^18) observance, we set ourselves the task of asking supercomputing experts about the unique challenges, the particularly vexing problems, of building a computer capable of 10,000,000,000,000,000,000 calculations per second. Readers of this publication might guess, given Intel’s trouble producing the 7nm “Ponte Vecchio” GPU for its delayed Aurora system for Argonne National Laboratory, that compute is the toughest exascale nut to crack.

But according to the people we interviewed, the difficulties of engineering exascale-class supercomputing run the systems gamut. As we listened to exascale’s daunting litany of technology difficulties, it occurred to us: Could we instead focus on what’s easy about exascale?

But the answer, of course, is nothing. Nothing is easy about exascale.

Let’s begin with a man in the eye of the exascale storm, HPE-Cray Fellow and Chief Technologist for HPC Nic Dube, who is leading the company’s strategy and execution for all three exascale systems scheduled for delivery starting next year: Frontier, at Oak Ridge National Laboratory; El Capitan, at Lawrence Livermore National Laboratory; and Aurora, for which Intel is the prime contractor and to which HPE will contribute the Cray Shasta HPC software platform.

Adding to the exascale challenge, Dube said, is that DOE and the Exascale Computing Project (ECP) have made it part of the core mission of the U.S. exascale effort that these not be stunt machines tricked out just to hit the exascale benchmark.

HPE’s Nic Dube

“It’s power, it’s software, interconnect, I/O infrastructure, it’s on everything, because what we’re building are really very capable exascale systems. And we’re building multiple systems. It’s not just about exascale being measured as a Linpack run. Ultimately, it’s really all about delivering a 50X application performance improvement over the 20 petaFLOP-era systems. So it’s really that acceleration of real workloads that matters.”
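
For the record, the arithmetic behind that target is simple; a quick check using the figures Dube cites (a 20-petaFLOPS-era baseline and a 50X application speedup):

    # Back-of-the-envelope check of the "50X over the 20-petaFLOPS era" goal.
    baseline_flops = 20e15    # ~20 petaFLOPS, the prior leadership-class baseline Dube cites
    speedup_target = 50       # the 50X application-performance goal
    print(f"Target: {baseline_flops * speedup_target:.0e} FLOPS")   # 1e+18, i.e. one exaFLOPS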

Citing the critical role supercomputers have played this year in developing treatments for COVID-19, he said the goal is development of functional systems that run advanced applications within a larger exascale ecosystem.

“Building an exascale supercomputer that’s going to do an exascale Linpack is one thing, but building an exascale supercomputer that’s also very capable on a breadth of workloads means that you need to have a very capable interconnect, a very capable software stack and a very capable IO subsystem. I think those are the big challenges of exascale, as I see it, so that those systems have a big impact to the world we live in.”

Then there are the power requirements.

“You’re looking at 30 to 40 megawatt systems, I mean that’s a lot of power to deliver,” Dube said. “We need to build a high voltage AC infrastructure, 480 VAC, and big pipes to deliver all of that water to the rack. We needed to design 400 kilowatt racks. Most colos are still in the 12 to 15 kilowatt range. So in something the size of a large fridge, we’re going to pipe 480 VAC into it, and it’s going to drive 400 kilowatts when it’s running flat out. And there’s going to be a 2.5-inch pipe of water coming in and going out. And oh, by the way, we’re going to use like 100 of these to build a system that’s all going to compute together. This by itself, in terms of the power delivery and the cooling you need to extract that heat, is already a massive engineering challenge and, I believe, an engineering accomplishment, because we’re tracking pretty well on that.”
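
Dube’s rack-level figures pencil out to the system totals he cites; a quick back-of-the-envelope check (the rack count is his approximate “like 100,” not a published spec):

    # Rough power-delivery arithmetic from the figures Dube cites.
    rack_power_kw = 400    # ~400 kW per exascale rack
    rack_count = 100       # "like 100 of these" (approximate)
    colo_rack_kw = 15      # typical colocation rack, per Dube: 12 to 15 kW

    system_mw = rack_power_kw * rack_count / 1000
    print(f"System power: ~{system_mw:.0f} MW")   # ~40 MW, matching the 30-40 MW range
    print(f"One exascale rack ~= {rack_power_kw // colo_rack_kw} typical colo racks")   # ~26x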

He compared exascale’s complexity to the relative simplicity of the problems dealt with by hyperscale companies.

“Here we’re talking about nodes that have multiple GPUs, and you’re looking at the scale of 10,000 of those nodes. So you already have hundreds of thousands of threads and massive concurrency per node, and all of that needs to clock in and synchronize. And so building up from the infrastructure, you need the infrastructure to keep all of that stuff cool and operating at the right power level because if they don’t, if you have outliers, right, and then, say, a processor or an accelerator starts getting hotter, it’s going to clock down…, the whole system will slow down while waiting…. And then you need to have an interconnect that’s able to fly packets between those units without congestion. That’s what the Slingshot interconnect brings, it’s quite innovative in many ways for congestion control so you’re not waiting for the straggler packet. That’s key to get the code to the next barrier where you need everybody to synchronize.

“Then you need a software stack on top of that, that’s also not going to jitter any of the nodes and create imbalance as the code is executing,” he said, “so that everything is kind of optimal, the computer is free to flow…, that’s key. There’s not just one but all of those things need to come in together as a system to deliver exascale. For me, as a system architect, this is the ultimate challenge, it all needs to come together and all the pieces are super important.”
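
Dube’s point about outliers is easy to see in miniature: in a bulk-synchronous code, every node waits at the barrier for the slowest participant, so a single thermally throttled processor sets the pace for the whole machine. A toy sketch with hypothetical timings (not measurements from any real system):

    import random

    # Toy bulk-synchronous timestep: each "node" computes, then all meet at a barrier.
    # The barrier completes only when the slowest node arrives, so one throttled
    # outlier drags the entire step down to its pace.
    random.seed(0)
    n_nodes = 10_000
    step_times = [random.uniform(0.98, 1.02) for _ in range(n_nodes)]   # near-uniform nodes

    print(f"Balanced step time:      {max(step_times):.3f}")   # ~1.02x nominal

    step_times[42] *= 1.30    # one hot node clocks down and runs 30 percent slower
    print(f"With one straggler node: {max(step_times):.3f}")   # ~1.3x nominal: everyone waits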

Robert Wisniewski of Intel

Robert Wisniewski, Intel’s chief architect for HPC and the Aurora technical lead and PI, uses “heterogeneity” to describe the problem Dube discussed.

“I think what’s coming up today is heterogeneity on a couple different fronts,” he said. “When we talk about heterogeneity, oftentimes the first thing that people go to is GPUs, and that’s absolutely part of it. But it’s broader than GPUs. If you take a look at Intel’s portfolio, we’ve made investments in the FPGA space, we’ve made investments in the matrix, i.e. the AI space, through some of our acquisitions that will be coming into play. But it’s even broader than that. In terms of heterogeneity, we are now taking a look at mixed precision to get a lot of value out of algorithms. And we’re also taking a look at a software stack where we can start looking at variable-precision calculation and lower precision, and then getting to the accuracy we need at the end of the algorithm.

“So heterogeneity as a broad topic is the word that I would key on. It induces challenges in both the hardware as well as in the software space. In the hardware space, the question is how are we going to pull all of these different pieces together? I mean, if you just kind of glue them together in a Frankenstein-like model, while the raw piece might indeed be there, you’re not going to be able to utilize it. So Intel has come out with technologies like CXL (interconnect). And then in the software space… it’s also about how are we going to program these, right? So Intel is driving oneAPI, which is a common paradigm from a developer’s point of view, to be able to leverage this layer of heterogeneity without undue burden.”
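
One widely used pattern behind the variable-precision idea Wisniewski describes is iterative refinement: do the expensive work in low precision, then recover full accuracy with cheap high-precision corrections at the end. A generic NumPy sketch of the technique (an illustration of the general approach, not Intel’s implementation):

    import numpy as np

    # Mixed-precision iterative refinement: solve in float32, correct with float64 residuals.
    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test system
    b = rng.standard_normal(n)

    x = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)
    for _ in range(3):                                # a few cheap refinement steps
        r = b - A @ x                                 # residual computed in float64
        x += np.linalg.solve(A.astype(np.float32), r.astype(np.float32))

    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # close to float64-level accuracy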

Brad McCredie from AMD

We also spoke with Brad McCredie, corporate VP, GPU Platforms, at AMD, whose EPYC CPUs and Radeon Instinct GPUs will work as an integrated tandem in Frontier and El Capitan. Until early this year, McCredie had been a corporate VP at IBM, which built the world’s second and third most powerful supercomputers, Summit (at Oak Ridge National Lab) and Sierra (at Lawrence Livermore National Lab), which pair IBM POWER9 CPUs with Nvidia GPUs.

McCredie said AMD offers the advantage of tighter CPU-GPU integration because the processors are built “in the same design shop,” which addresses what he said is the biggest exascale challenge: performance density and, closely related, memory management between the two processors.

“Managing the memory of an accelerated computer,” he said, “you have to copy the data from the CPU to the GPU and back again, that’s a lot of lines of code, it’s a lot of time spent moving data back and forth. Now one of the things that we can do in designing both the CPU and GPU is we can enable the hardware to manage that memory movement versus the programmer having to do it explicitly. It’s called coherency, and we are putting that into the hardware and that enables the CPU and the GPU to work better together, it gives you more performance density and makes the computer easier to use for a broad range of users…. A lot of time is spent in some of these accelerated computing programs moving the data back and forth and launching accelerators. Now you can do that faster with higher speed buses between the elements, and that provides a lot of value.”
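
A rough estimate shows why that data movement matters. The link bandwidth and accelerator peak below are generic, assumed round numbers (roughly a PCIe-4.0-x16-class bus and a notional 20-teraFLOPS GPU), not specs for any Frontier or El Capitan part:

    # How much compute an accelerator could have done in the time spent copying data to it.
    data_gb = 16            # assumed working set shipped host -> device
    link_gb_per_s = 32      # assumed PCIe-4.0-x16-class transfer rate
    gpu_tflops = 20         # assumed accelerator peak

    copy_time_s = data_gb / link_gb_per_s
    forgone_flops = copy_time_s * gpu_tflops * 1e12
    print(f"Copy time: {copy_time_s:.2f} s, forgone compute: {forgone_flops:.0e} FLOPs")
    # ~0.5 s and ~1e13 operations spent waiting on the bus; this is the overhead that
    # hardware-managed coherency and faster buses aim to shrink.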

The objective, McCredie said, is achieving a balance between processing and memory.

“One comment you often hear in the industry is, ‘It’s about the memory, stupid.’” McCredie said. “You need to bring focus on the memory because often we get unbalanced systems where you get so much processing power but not enough memory bandwidth, memory capacity, memory capability, then it turns out the memory throttles your performance… It’s any element where you can start moving the memory around (to achieve) the performance objectives you have versus being compute-gated. A really well designed system is largely compute-gated, so there’s a lot of value in getting the memory out of the way, from a performance perspective. That’s something that we are investing in with our partner (HPE-Cray).”
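
McCredie’s compute-gated versus memory-gated distinction is often expressed with the roofline model: attainable performance is the lesser of peak compute and arithmetic intensity times memory bandwidth. A generic sketch with assumed peak numbers (not those of any particular machine):

    # Roofline estimate: a kernel is memory-bound when AI * bandwidth < peak compute.
    peak_tflops = 25     # assumed node peak compute, TFLOPS
    mem_bw_tb_s = 1.6    # assumed memory bandwidth, TB/s

    def attainable_tflops(ai_flops_per_byte):
        return min(peak_tflops, ai_flops_per_byte * mem_bw_tb_s)

    for ai in (1, 4, 16, 64):   # FLOPs performed per byte moved
        bound = "memory-bound" if ai * mem_bw_tb_s < peak_tflops else "compute-bound"
        print(f"AI={ai:>2} flop/byte -> {attainable_tflops(ai):5.1f} TFLOPS ({bound})")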

Al Geist, ORNL

For Al Geist and Justin Whitt at Oak Ridge National Laboratory, next year’s scheduled arrival of the HPE-Cray-AMD Frontier system could be especially rewarding, considering the lab has been wrestling with exascale challenges for nearly 10 years.

“When we first started having these workshops on exascale back in 2009,” said Geist, corporate fellow at ORNL’s Computer Science and Mathematics Division, “the reason we felt exascale was this insurmountable mountain … really came down to three big challenges.”

They were: power requirements, system resilience and programmability.

“The estimates were … that (exascale) would take at least 100 megawatts to run a one exaFLOPS computer,” Geist said. “And some of the studies said it was as much as a gigawatt, and that was just impossible, that’ll never work unless we can somehow get the vendors to figure out a way to produce that amount of computing with a lot less electricity, because just the electric costs for a 100 megawatt computer would be $100 million each year. So that would just be untenable.”
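
Geist’s $100 million figure is essentially a utility-bill calculation; at an assumed industrial rate of roughly $0.10 per kilowatt-hour:

    # Annual electricity cost of a 100 MW machine at an assumed ~$0.10/kWh rate.
    power_kw = 100_000        # 100 megawatts, in kilowatts
    hours_per_year = 24 * 365
    rate_usd_per_kwh = 0.10   # assumed industrial electricity rate
    annual_cost = power_kw * hours_per_year * rate_usd_per_kwh
    print(f"~${annual_cost / 1e6:.0f} million per year")   # roughly $88M, on the order of $100M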

The second major issue: system resilience.

“With a system that was so big with so many parts, the chances of something being broken at any particular time were very high,” said Geist. “So the machine would have to, in some sense, adapt on the fly …, and before you could fix it, something else would break. And so there was this need for a much more flexible and resilient system than what we were used to back in 2009.”

The third: programming.

“The next question was: could you actually program a computer like this?” he said, “because it was pretty easy to do a back-of-the-envelope calculation that the speed of the processors, if they are run in an energy efficient way, was in the one gigahertz range. Well, to get to an exaflop, that meant that you needed to have a billion of those processors running. And so we were doing big scale calculations with 100,000 processors being used to solve a problem, we just shook our heads when we thought how can we possibly within just a decade suddenly be able to do billion-way parallelism. Can the application codes actually have that much parallelism in them?”
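
Geist’s back-of-the-envelope calculation is easy to reproduce: at energy-efficient clock rates near one gigahertz and roughly one operation per cycle per unit, an exaFLOPS implies on the order of a billion concurrently working units:

    # Geist's back-of-the-envelope: how many ~1 GHz units does one exaFLOPS require?
    target_flops = 1e18
    clock_hz = 1e9        # energy-efficient clock rate, ~1 GHz
    ops_per_cycle = 1     # assume one operation per cycle per unit
    print(f"{target_flops / (clock_hz * ops_per_cycle):.0e} concurrent units")   # 1e+09: billion-way parallelism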

“We looked at all this and said, ‘Wow, that’s going to be really, really hard,’” said Whitt, project director for DOE’s Oak Ridge Leadership Computing Facility (OLCF). “And for the next 10 years, experts and scientists from across the DOE complex engaged with these public-private partnerships with technology vendors and people in academia to work on these problems and to get us to where we’re at today, where exascale systems are possible to deploy.”

The three solutions? To summarize:

On electrical power, Frontier will have more power-efficient processors and other gear (more on this below). But having said that, Frontier will still require between 30 and 40 megawatts, “an incredible amount of power to supply to a single data center,” Whitt said.

ORNL’s Justin Whitt

This has meant “a tremendous amount of site prep that has to go on behind the scenes ahead of these computers…,” Whitt said. “That means new power lines, that means large voltage and medium voltage transformers, new switches. And if we have 40 megawatts of power going into the system, we’ve got to get 40 megawatts of heat out of that system. These systems tend to be water cooled because of their density. So that means new mechanical plants, new cooling towers, that type of infrastructure to get that heat back out of the system.”

For better power efficiency, the system also will build on the heterogeneous accelerated-node strategy, combining CPUs and GPUs, first used in the Titan supercomputer, installed in 2012, and carried forward in ORNL’s IBM-Nvidia Summit supercomputer.

On the problem of system resiliency, Geist said the lab worked with “commercial entities to try to develop the system software that would itself be very adaptable. We started to see that in Titan. And then, by the time we got to Summit we got to a point where the software really expects the system to be failing.

“Resilience hasn’t been solved,” he explained, “the appearance of the resilience has been solved. So the user doesn’t see the machine failing all the time, but underneath the covers, it actually is, the system is adapting to those changes on the fly, thanks to a lot of investments that the Department of Energy and other organizations have put into trying to build more reliable, resilient system software.”

For the third problem, programmability:

“When we had Titan, we had about 100,000 nodes,” said Geist. “But when we went to Summit, it only has like 4000 nodes, it’s these very big nodes with lots of GPUs and processors on them. The users, instead of seeing it as … several million-way computing, they see it as, ‘Well, here’s a GPU, and I will program that GPU to do this math job.’ They don’t think about it as, ‘Oh, that GPU has hundreds of computing units in it that are all working together on this.’ We’ve abstracted away the concept of this very high-scaled programming to a smaller number. It’s still a huge number, tens of thousands and sometimes hundreds of thousands of units. But … each individual unit has 100 or 1,000 compute elements inside of it, (and) they don’t think about the individual parts.”
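
The abstraction Geist describes is really a product of per-level counts. With illustrative, rounded numbers for a Summit-class machine (not exact specifications), the parallelism multiplies back out to the hundreds of millions even though the programmer reasons about a few thousand nodes and a handful of GPUs per node:

    # Illustrative parallelism hierarchy for a Summit-class system (rounded, not exact specs).
    nodes = 4_600               # "like 4000 nodes" in Geist's description
    gpus_per_node = 6           # accelerators per node
    units_per_gpu = 80          # streaming-multiprocessor-class units per GPU (illustrative)
    lanes_per_unit = 64         # SIMT lanes per unit (illustrative)

    print(f"Programmer-visible accelerators: {nodes * gpus_per_node:,}")   # ~27,600 GPUs
    total = nodes * gpus_per_node * units_per_gpu * lanes_per_unit
    print(f"Underlying parallel lanes:       {total:,}")   # ~141 million lanes beneath the abstraction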

One long-time HPC strategist, who requested anonymity, told us he questions the value delivered by high-end supercomputers because, he said, other than for the occasional “hero run” their extreme power is usually divided up by multiple users running multiple applications simultaneously.

“People overlook how these machines are used in practice at these different institutions. Going back to the 1990s, it’s always been to break up the machine into virtual pieces and make allocations of nodes and other resources. And when that person is done, that stuff goes back to a pool. And it gets reallocated by a scheduling system right away. As you get to systems of this size, the likelihood of using the machine at scale for any prolonged period of time is very low.”

Hyperion’s Bob Sorensen

Bob Sorensen, senior vice president of research at industry analyst firm Hyperion Research, touched on this theme when he told us in an email that cost and using the system to derive maximum value are major exascale challenges.

“Because building an exascale system currently involves considerable budgets that can exceed $500 million,” he said, “the process of determining access to such systems, especially for jobs that require a significant portion of the overall system resources, considerable wall clock time, or some combination of both, is an increasingly important, and perhaps undervalued, consideration in the overall utility of an exascale machine over the course of its operational lifetime. To that end, the procedure for determining which jobs run, for how long, and with what percentage of the overall system resources is one that needs careful deliberation to ensure that critical applications receive the attention that they deserve. The goal of any such effective operational plan is not simply to keep an exascale system busy all the time, but rather to ensure that it is available to the jobs that can benefit most, not just from a computational perspective, but from a larger end use mission-oriented perspective.”

Echoing ECP’s capable exascale mission, he also warned of problems that can come from attaining exascale for its own sake.

“Care must be taken to ensure that (design) decisions – driven to a great extent by the goal of achieving exascale performance – do not inadvertently leave important algorithms or applications by the wayside simply because they may not be best suited to a particular exascale architecture. For example, the current emphasis on GPUs to provide computational heft could in fact have a detrimental effect on key areas of research that rely primarily on CPU-only implementations; likewise with applications that may have more stringent byte/flop requirements than is generally supported today. Ultimately, there needs to be careful consideration that any new architecture that favors a specific suite of hardware/software elements may indeed open up new vistas of performance in some applications, but that it may also be closing the door on others.”

Comments

  1. Can we have an exascale mobile phone in at least 40 years?

  2. Regarding the anonymous comment about multiple jobs carving up big machines, DOE Leadership Computing at both the Oak Ridge Leadership Computing Facility (OLCF) and the Argonne Leadership Computing Facility (ALCF) targets large, difficult problems that require a large amount of memory in addition to the compute resources. You cannot look at the number of jobs by node counts, because we allow larger jobs to run longer than smaller jobs. You have to look at total node-hours by job size. At OLCF, “capability” (i.e., large-scale) jobs account for over half of our node-hours each year.