I first met Rick Stevens at SC96, although I doubt he’d have cause to remember that meeting. He was a judge in the category of the HPC Challenge that I had entered (heterogeneous computing). We only spent a few minutes together as judge and contestant before he moved on to the other teams, but his comments and questions were perceptive; the meeting made an impression on me.
Today Rick Stevens is the Associate Director for Computing, Environment, and Life Sciences at the DOE’s Argonne National Laboratory. He is also a professor of computer science at the University of Chicago, a senior fellow of the Argonne/University of Chicago Computation Institute, and he heads the Argonne/Chicago Futures Lab (the group that developed the Access Grid collaboration system). And that just scratches the surface (click through for a more detailed look at his bio).
He is widely engaged in thinking about, developing, and teaching a host of technologies, from computer architecture and parallel computing to collaboration technology and virtual reality. All of which come together in his passion for reaching the next milestone in computing history: the exaflops computer.
Argonne is gearing up for their next-generation system from IBM. They aren’t ready to talk too much about that system just yet (look for more news closer to SC09), but Stevens would say that they expect the system to be in the 10-20 PFLOPS range. ANL has about half a PFLOPS now, and Stevens says they are looking to this new system as a stepping stone into the exascale regime. “At that size we are within a factor of 50 or 100 of an exaflops, and the systems start to look like what an exascale system will look like.”
Mind the gap
Stevens is well aware of the challenges in software and architecture that stand between him and a useful PFLOPS, and he is involved in leading a variety of efforts to help bridge that gap. For example, ANL is building a consortium around their new machine to address the algorithmic and application challenges of scaling out applications into the petascale. This effort builds on the success ANL had with a similar consortium they built around their early Blue Gene in 2004. Stevens says the new consortium will address “how we get application developers and the broader academic community to have momentum addressing the software challenges in exascale,” using the new IBM system as a testbed.
Exascale systems, which Stevens is hoping we’ll attain by the end of the next decade, are likely to have on the order of billion-way concurrency: a factor of 1,000 or so within the nodes themselves, and about 1,000,000 nodes per system. According to Stevens, the recently announced Hybrid Multicore Consortium headed up by Oak Ridge will focus within a node, while the ANL consortium will be looking specifically at how to build applications that can use 1,000,000 nodes, attacking the problem from both the algorithmic and application technology perspectives.
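The arithmetic behind that billion-way figure is simple enough to sketch. A back-of-the-envelope illustration, using only the round numbers Stevens cites (these are rough targets, not the spec of any real machine):

```python
# Back-of-the-envelope concurrency for a hypothetical exascale system,
# using the rough figures from the article -- not a real machine spec.
per_node_concurrency = 1_000   # ~1,000-way parallelism within one node
nodes = 1_000_000              # ~1,000,000 nodes per system

total_concurrency = per_node_concurrency * nodes
print(f"{total_concurrency:,} concurrent streams")  # 1,000,000,000
```

Multiplying the two factors is exactly how you arrive at billion-way concurrency, and it makes clear that both levels — in-node parallelism and node count — have to scale for the total to get there.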
How do we get there? Clearly, a new way of thinking about algorithms is needed in which there aren’t any inherent bottlenecks to scaling arbitrarily large (or small, for that matter) — Stevens likes the term “scale invariant algorithms” for this approach. “The revolution that I’d like to see is that thinking about problems at scale is easier than thinking about them sequentially,” he says.
One approach that may hold promise is to reformulate problems from the perspective of a single entity — a grid point, molecule, or whatever. “We see this kind of design with the 10^14 entities in the human body, and that works just fine.” It occurs to me while he’s making this point that programming this way actually has a deep physical analog. One of the side effects of Einstein’s general theory of relativity was that people were able to demonstrate that physics (like politics) is local. Bodies don’t need universal knowledge of all the masses in their vicinity in order to know what kind of path to travel: they simply follow the straightest possible path in their curved space, like the ant walking on an apple that traces out a geodesic without doing anything other than putting one foot straight in front of the other. It took physicists a long time to get to this kind of elegance, and it is both satisfying and humbling to see its potential in computing as well and to wonder where else we’ve missed the boat in our modeling of the real world.
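That entity-level, purely local style can be sketched in a few lines. This is my toy illustration of the idea, not ANL code: each grid point in a 1-D diffusion problem updates using only its immediate neighbors, with no global coordination, so the same rule works unchanged at any problem size — the flavor of what Stevens calls a scale-invariant algorithm:

```python
# Toy illustration of entity-local computation: every grid point
# updates from its two neighbors only. No entity needs global
# knowledge, so the rule is the same at 16 points or 10^14.

def step(u):
    """One explicit diffusion step on a periodic 1-D grid."""
    n = len(u)
    return [u[i] + 0.25 * (u[(i - 1) % n] - 2 * u[i] + u[(i + 1) % n])
            for i in range(n)]

grid = [0.0] * 16
grid[8] = 1.0            # a single hot spot
for _ in range(10):
    grid = step(grid)

# Heat spreads out, yet total heat is conserved -- global behavior
# emerges from entities acting locally, like the ant on the apple.
print(round(sum(grid), 6))  # 1.0
```

Nothing in `step` refers to the size of the grid except through its neighbors, which is precisely the property that lets such a formulation scale arbitrarily up or down.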
Facing a constrained reality
Developing programs at the entity level may also help address the expected challenges of hardware and software failure in exascale systems. Stevens draws a parallel with current systems that work this way today, like Google and the Internet. “The Internet is always up as a whole,” he says, “but parts of it are always down. Our current hardware model is crystalline, everything has to work all the time in order for anything to work. We need to move from this way of thinking to a more biological approach.”
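One simple version of that “biological” tolerance can be sketched as replication. This is a toy illustration of the principle only — not how Google or any particular system actually does it: run the same task on several replicas, and the job succeeds as long as any one replica survives, so individual node failures never take the whole computation down:

```python
import random

# Toy sketch of failure-tolerant execution: replicate a task and
# accept the first surviving replica's answer. The "system" stays up
# even though individual "nodes" routinely fail.

def run_with_replicas(task, replicas=3, failure_rate=0.3, rng=None):
    rng = rng or random.Random(0)      # seeded for a reproducible demo
    for _ in range(replicas):
        if rng.random() > failure_rate:   # this replica's node stayed up
            return task()
    raise RuntimeError("all replicas failed")

result = run_with_replicas(lambda: 2 + 2)
print(result)
```

With a 30 percent per-node failure rate and three replicas, the chance that all replicas die is about 2.7 percent — the crystalline model, by contrast, fails whenever any single part does.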
Part of the design for this way of building at scale may include over-provisioning at many different levels, and with many different kinds of resources. “Systems may have all kinds of resources that programmers can choose to use in their applications,” he notes, “but they can’t use them all at once. Our current notions of efficiency may not make sense when we have to optimize our resource use to stay within a fixed power budget.” This is a different way of thinking than most of us have about our computers today, but it is the mode of thinking that dominates the natural world. “For example, we see this in the plant world,” Stevens explains. “Plants have much more potential for growth than they ever achieve because they have a fixed energy budget available to them that they have to allocate among all the processes of life.”
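A hedged sketch of that budgeting idea — the resource names, wattages, and “usefulness” scores below are invented for illustration: a runtime ranks the over-provisioned resources by usefulness per watt and enables only the subset that fits within the fixed power budget, leaving the rest dark, much as a plant allocates a fixed energy budget:

```python
# Hypothetical on-node resources: name -> (power_watts, usefulness).
# All names and numbers are made up to illustrate the idea.
resources = {
    "vector_units": (40, 10.0),
    "extra_cores":  (60, 8.0),
    "fast_memory":  (30, 6.0),
    "accelerator":  (80, 12.0),
}

def pick_within_budget(resources, budget_watts):
    """Greedy selection: best usefulness-per-watt first,
    skipping anything that would exceed the power budget."""
    chosen, used = [], 0
    ranked = sorted(resources.items(),
                    key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    for name, (watts, _) in ranked:
        if used + watts <= budget_watts:
            chosen.append(name)
            used += watts
    return chosen, used

chosen, used = pick_within_budget(resources, budget_watts=150)
print(chosen, used)  # not everything fits at once
```

At a 150 W budget this sketch powers the vector units, fast memory, and accelerator while leaving the extra cores dark — a concrete case where raw utilization is a poor metric and usefulness per watt is the one that matters.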
“We’ll definitely have MPI, at least in the early exascale systems,” Stevens says when I ask how we’ll program the very large-scale systems he spends so much of his time thinking about. “A new programming model takes about a decade to take hold, so whatever we have today is what we are going to have for these first exascale systems.” Is this a problem? “The jury is still somewhat out, but I don’t believe that the barrier to achieving exascale computing is the programming model.”
The rest of the picture
Stevens is part of several other efforts that, when taken together, form a holistic approach to reasoning about the design of exascale systems. ANL and Stevens are part of the broader community effort to look at the system software needed for exascale computers, the International Exascale Software Project (the website for this group is quite nice, by the way). The IESP has held two international meetings already, with a third planned for Japan on Oct 19-21. Where the ANL-based consortium is focused on applications, the IESP is more focused on system tools, operating systems, compilers, and the like, Stevens explains. The IESP is interesting because it is a grassroots attempt to coordinate the research efforts of organizations around the globe so that we can all get to effective exascale computing faster than if any one nation tried to do all of the research on its own.
Stevens is also part of a cross-lab effort in the DOE to build the science case for exascale — what good will all of this computing power do once we finally have it in place? Stevens co-chairs this effort with Andy White; it is aimed at building the science case and producing a technical roadmap for the DOE. Over the last year the group has held a series of workshops, called the Scientific Grand Challenges Workshop Series, with practitioner communities in national security, biology, basic energy science, climate science, high energy physics, and other areas central to the DOE mission (for a full list see the web site). Each of the meetings will ultimately result in a report that documents the impact of large-scale computational modeling and simulation in the domain area. The presentations and materials from the workshops are already available online, and the reports will be available to the public once they are complete.
Getting there sooner rather than later
When you look at all of these efforts, it is clear that there is a lot of federal government leadership going into meeting the exascale goal by 2020. Is it necessary? “The vendor community won’t get to exascale by 2020 without significant federal investment,” says Stevens. “It will take at least 20 years for industry to get to the exascale on its own.” Why does that matter? Not because we just want to cross an arbitrary computational boundary by some arbitrary date. “Computing at that scale will give us a new way of thinking about doing science,” he says, “and that’s what I find so exciting about this work.”