Developing a Software Stack for Exascale

In this special guest feature, Rajeev Thakur from Argonne describes why exascale computing would be a daunting software challenge even if we had the hardware today.

Rajeev Thakur, Director of Software Technology for the Exascale Computing Project

Without the invisible infrastructure called the software stack, even the world’s fastest computer wouldn’t compute much of anything. Sitting between the hardware and the applications that users interact with, the software stack is the computer’s plumbing, power grid, and communications network combined. It makes the whole system usable. “Without it,” said Rajeev Thakur, “you literally have nothing.”

Thakur is the director of the Software Technology focus area for the Exascale Computing Project (ECP), an effort by two US Department of Energy organizations to develop computing systems that are at least 50 times faster than the most powerful supercomputers in use today. He is overseeing the creation of the software stack that will undergird the wide range of applications that will run on the new systems. It is a monumental task. Just five months into the effort, it already encompasses several dozen projects involving hundreds of researchers at research and academic organizations throughout the country.

The brain of a desktop computer is a single CPU; an exascale system will have millions of processor cores. To run on such a system, applications must be written in parallel; that is, problems must be divided into millions of pieces that are solved separately. Then the resulting data must be exchanged among the millions of cores to produce the final result. The software stack enables developers to write or adapt their applications to run efficiently on those millions of cores and orchestrates communication among them, while minimizing energy consumption and ensuring that the system can recover from any failures that occur. It also provides the means for application developers to write and read their data, to analyze the data that is produced, and to visualize the results.
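
To make that concrete, here is a minimal sketch of the divide-and-combine pattern described above, assuming an application written in C against MPI, the message-passing library widely used on Department of Energy supercomputers. Each process sums its own slice of a large problem, and the partial results are exchanged and combined into a single answer. The problem size and the "work" loop are placeholders for illustration, not anything from ECP itself.

```c
/* Minimal sketch: divide a problem among processes, then combine the
 * partial results. Illustrative only; not ECP code. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000  /* total problem size (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes?  */

    /* Each process works on its own contiguous chunk of the problem. */
    long chunk = N / size;
    long start = rank * chunk;
    long end   = (rank == size - 1) ? N : start + chunk;

    double local_sum = 0.0;
    for (long i = start; i < end; i++)
        local_sum += (double)i;            /* stand-in for real work */

    /* Exchange results: combine every partial sum on process 0. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Compiled with an MPI wrapper such as mpicc and launched with mpiexec, the same program runs unchanged with a handful of processes on a laptop or many thousands on a supercomputer; scaling it efficiently to millions of cores is where the software stack earns its keep.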

The ECP software technology effort is developing all of this, some of it from scratch and some by modifying and enhancing existing software, such as large libraries of mathematical computations. All of it must be usable by any application, from a program that helps design a nuclear reactor to one that models the life cycle of a star.

The software stack developers work closely with the applications developers, trying to accommodate their sometimes disparate requirements. Applications aimed at machine learning, for example, are extremely data intensive. Others might use an atypical programming language or a distinctive format for their data. “That’s one of the challenges,” Thakur said. “We have dozens of applications right now. They use different languages and have different ways of expressing their problem in a way that can run across millions of processors.” The underpinning software must be designed so that as many of the applications as possible work optimally on the new system.

Speed is a paramount concern. “The problems are not so simple that you can just divide them among the million processors and they will merrily do their own thing,” Thakur said. “And programs tend to run slower unless you really pay attention to what’s going on.” The processors need to ask one another for information. And as they do, other processors may be doing the supercomputing equivalent of “twiddling their thumbs.” One processor waiting 10 seconds for an answer being computed by another sets up a chain reaction that might slow the whole computation by an hour—not exactly high-performance computing. So devising a software version of an orchestra conductor to manage communication among the processors efficiently is a crucial part of the software stack project.
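
A standard trick for keeping processors from twiddling their thumbs is to overlap communication with computation. The sketch below, again a hypothetical MPI example rather than ECP code, posts non-blocking sends and receives around a ring of processes, keeps doing independent work while the messages are in flight, and waits only when the incoming data is actually needed; the buffer size and the filler computation are assumptions made for the example.

```c
/* Sketch of overlapping communication with computation so a process
 * does not sit idle while waiting on a neighbor. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define M 4096  /* message size in doubles (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;           /* neighbor to send to      */
    int left  = (rank - 1 + size) % size;    /* neighbor to receive from */

    double send_buf[M], recv_buf[M];
    for (int i = 0; i < M; i++)
        send_buf[i] = rank;

    MPI_Request reqs[2];

    /* Post the receive and send without blocking... */
    MPI_Irecv(recv_buf, M, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, M, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ...and keep computing on data already in hand while the
     * messages are in flight, rather than twiddling thumbs. */
    double local_work = 0.0;
    for (int i = 0; i < M; i++)
        local_work += i * 0.5;

    /* Wait only when the incoming data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("received data from rank %d\n", (int)recv_buf[0]);

    MPI_Finalize();
    return 0;
}
```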

Managing the jobs of concurrent users is also a priority. As with any supercomputer, time on the exascale system will be shared by many users, some of whom will want to run their applications for days or weeks at a time. They will have to partition their problems into chunks of a few hours each so that their jobs can run alongside those of other users. Job scheduling software built into the system allocates groups of processors and blocks of time, but it, too, will have to be adapted for the massively larger scale. So will the tools that diagnose and repair problems.
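
One common way applications live within those few-hour chunks is checkpoint/restart: save the state of the computation to disk before the allocation expires, and pick up from that file in the next job. The sketch below illustrates the idea in plain C; the file name, time limit, and simulation "state" are invented for illustration and are not part of any ECP design.

```c
/* Hypothetical checkpoint/restart sketch: fit a long computation into
 * time-limited scheduler allocations. Illustrative only. */
#include <stdio.h>
#include <time.h>

#define TIME_LIMIT_SECONDS (3 * 3600 - 300)  /* stop 5 min before 3 h */
#define TOTAL_STEPS        1000000

int main(void)
{
    long step = 0;
    double state = 0.0;

    /* Resume from a previous job's checkpoint if one exists. */
    FILE *f = fopen("checkpoint.dat", "rb");
    if (f) {
        fread(&step, sizeof step, 1, f);
        fread(&state, sizeof state, 1, f);
        fclose(f);
    }

    time_t start = time(NULL);

    for (; step < TOTAL_STEPS; step++) {
        /* Before the allocation runs out, save progress and quit;
         * the job is then resubmitted to continue where it left off. */
        if (difftime(time(NULL), start) > TIME_LIMIT_SECONDS) {
            f = fopen("checkpoint.dat", "wb");
            fwrite(&step, sizeof step, 1, f);
            fwrite(&state, sizeof state, 1, f);
            fclose(f);
            return 0;
        }

        state += 0.001;                      /* stand-in for real work */
    }

    printf("final state = %f\n", state);
    return 0;
}
```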

“All these things have to be sorted out,” Thakur said. “The scale makes it complicated. And we don’t have a system that large to test things on right now.” Indeed, no such system exists yet, the hardware is still evolving, and the vendor, or possibly multiple vendors, that will build the first exascale systems have not yet been selected.

“The computer vendors share their roadmaps so we know their plans for the future,” Thakur said, “but we have to do things in anticipation of what may be coming. So right now we are writing software that can be used for any of these potential systems. It gets very challenging.”

But if all goes according to plan, by 2021 exascale computing should be a reality, with these superfast systems running applications in critical research areas such as oil and gas exploration, aerospace engineering, pharmaceutical design, and basic science. And underneath, making it all work, will be a vast software infrastructure: the software stack.
