An Interview with Jack Wells, Oak Ridge National Laboratory


Oak Ridge National Laboratory (ORNL) has been to the winner's circle on numerous occasions. But the Jaguar supercomputer has, to date, been one of its greatest victories, not only for ORNL but also for Cray, AMD, and, at a much larger level, U.S. technology competitiveness.

Less than a month ago, ORNL was finally able to announce a significant award to Cray for the next stage of Jaguar, growing the system into the ten to twenty petaFLOPS range with the addition of GPU accelerators from NVIDIA.

The new system will be called Titan.

We caught up with Jack Wells, Director of Science for the National Center for Computational Sciences, a U.S. Department of Energy (DOE) Office of Science user facility at ORNL, to talk about Titan.

The Exascale Report: How many individual users or organizations do you see accessing Titan in 2012 – 2013?

WELLS: I think that we would be able to maintain a population of projects similar to what we have now. Through the INCITE [Innovative and Novel Computational Impact on Theory and Experiment] program at our leadership computing facility at Oak Ridge, we are supporting and collaborating with 32 projects in 2011, and the available resources are oversubscribed. We also support nine projects through the ASCR Leadership Computing Challenge (ALCC) program and a larger number of smaller projects through our Director's Discretionary Access Program. I expect the demand and our ability to support that number of projects will be sustained.

TER: Do you have any sense of the number of individual users?

WELLS: We do track the number of individual users. Our current number on the Jaguar supercomputer is approximately 800, and we expect this to remain fairly constant. Our users work on grand challenges in science and engineering. They engage in simulations to understand the molecular basis of disease, the intricacies of climate change, and the chemistry of a car battery that can last for 500 miles. They may be working on a biofuel that is economically viable or a fusion reactor that may someday provide clean, abundant energy. They have one thing in common—they use computing to solve some of the planet’s biggest problems. These individual users are participants in projects that can have, say, a dozen users, so the number of awards through INCITE, ALCC, and Director’s Discretion is small enough to keep the time allocations large. In the 2011 calendar year, for example, the average INCITE allocation was 27 million processor hours. One project received 110 million hours.

TER: How will the user access to the systems and their ability to develop and test applications change from Jaguar to Titan?

WELLS: We are working with Cray, PGI, CAPS, Allinea, Vampir, and the scientific library teams to have optimized tools for the users. We are also working now to get a core set of projects ready for Titan. One of the main questions that arose two years ago when we proposed the hybrid architecture focused on the usability of the future computer: Would researchers be able to squeeze results out of a GPU-accelerated machine without too much programming pain? So we have focused on a set of six applications that are representative of our workload, and are working with the architects of the codes and some of the main users to get these codes ready—and it’s actually become part of the project. We have also developed a training curriculum, consisting of conferences, workshops, tutorials, case studies, and lessons learned, that covers tools and techniques for realizing the benefits of hybrid architecture.

TER: And could you clarify for us what those six applications are?

WELLS: One is S3D (Direct Numerical Simulation of Turbulent Combustion), a chemical combustion code out of Sandia used by Principal Investigator Jackie Chen to simulate burning fuels. Another is the Wang-Landau LSMS code (Wang-Landau Linear Scaling Multiple Scattering), a materials first-principles code that's been developed here at Oak Ridge National Laboratory. Another is LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator), a molecular dynamics code authored by colleagues at Sandia and used here to simulate lignocellulose. Another is PFLOTRAN (Modeling Multiscale-Multiphase-Multicomponent Subsurface Flows), a subsurface transport code that Peter Lichtner from Los Alamos uses to study carbon sequestration and underground transport of contaminants. Another is CAM-SE (a scalable, spectral element dynamical core for the Community Atmosphere Model), a community atmospheric code from the climate community. And the sixth is DENOVO, a neutron transport code. That's part of the workload within the Consortium for Advanced Simulation of Light Water Reactors, the DOE [Department of Energy] nuclear energy modeling and simulation hub led here at Oak Ridge.

TER: So Jack, when you say there is work being done now, what do the users – the application developers – the programmers – the scientists – what do they actually do now in order to get ready for Titan?

WELLS: When we started the work on these codes, there were few tools available to help. The teams have spent most of their time restructuring codes to expose more levels of parallelism and promote data locality. These are exactly the types of transformations that make codes run better on both accelerators and multicore CPUs. Some of the codes have kernels written in CUDA, others are using directives with the compilers, and still others are using optimized libraries.
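
The interview does not include any of the project code, but the kind of restructuring Wells describes can be sketched generically. The minimal C example below is illustrative only, not drawn from S3D, LAMMPS, or the other readiness applications: it moves particle data from an array-of-structures layout to a structure-of-arrays layout and writes the update as a flat loop with no dependences between iterations, the sort of transformation that benefits both multicore CPUs and GPU accelerators.

```c
/* Illustrative sketch only: not code from any of the six Titan readiness
 * applications. It shows the generic restructuring Wells describes:
 * converting an array-of-structures layout to a structure-of-arrays layout
 * so that an update loop becomes a long, independent, unit-stride loop that
 * a compiler can vectorize on a multicore CPU or offload to an accelerator. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Original layout: one struct per particle. Locality is poor when a loop
 * touches only one or two of the fields. */
typedef struct { double x, y, z, charge; } ParticleAoS;

/* Restructured layout: one contiguous array per field, giving unit-stride
 * access that maps cleanly to vector registers or GPU memory. */
typedef struct { double *x, *y, *z, *charge; } ParticleSoA;

/* Update written as a single flat loop with no dependences between
 * iterations; each i can be handled by a different thread. */
static void scale_positions_soa(ParticleSoA *p, double s, int n)
{
    for (int i = 0; i < n; ++i) {
        p->x[i] *= s;
        p->y[i] *= s;
        p->z[i] *= s;
    }
}

int main(void)
{
    ParticleSoA p;
    p.x = malloc(N * sizeof(double));
    p.y = malloc(N * sizeof(double));
    p.z = malloc(N * sizeof(double));
    p.charge = malloc(N * sizeof(double));
    for (int i = 0; i < N; ++i) {
        p.x[i] = p.y[i] = p.z[i] = 1.0;
        p.charge[i] = -1.0;
    }

    scale_positions_soa(&p, 1.5, N);
    printf("x[0] after scaling: %f\n", p.x[0]);

    free(p.x); free(p.y); free(p.z); free(p.charge);
    return 0;
}
```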

TER: Would you say there is a lot of manual work at this point?

WELLS: Things are improving in that regard as compiler directives advance. Many of the compiler companies are coming out with common compiler directives, and that helps a great deal.
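
Wells does not name a specific directive standard here; presumably he is referring to the common accelerator directives that Cray, PGI, CAPS, and NVIDIA later standardized as OpenACC. As a hedged sketch under that assumption, the same kind of flat loop shown earlier can be offloaded with a single OpenACC-style pragma instead of a hand-written CUDA kernel:

```c
/* Sketch of the directive-based approach, assuming OpenACC-style pragmas.
 * With a directive-aware compiler (for example, PGI: `pgcc -acc`), the loop
 * is offloaded to the GPU along with the data movement implied by the copy
 * clause; an ordinary C compiler simply ignores the pragma and runs the
 * loop on the CPU. */
#include <stdio.h>
#include <stdlib.h>

void scale(double *restrict x, double s, int n)
{
    /* One directive replaces a hand-written CUDA kernel plus the explicit
     * host/device memory transfers. */
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}

int main(void)
{
    int n = 1000000;
    double *x = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i)
        x[i] = 1.0;

    scale(x, 1.5, n);
    printf("x[0] = %f\n", x[0]);

    free(x);
    return 0;
}
```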

TER: So you have a unique perspective on this – you have experience on the user side and on the leadership side. From the early days of being a user on very large, advanced computing systems to today, would you say things are easier for the user or more difficult?

WELLS: If you take the 15- or 20-year view, I’m not sure. There are many new challenges, for example, the increasingly massive parallelism. But, I think the challenge of utilizing parallel computing in the early days was similar in character. If you think about message passing hardware before MPI, there was a lot of diversity. But then standards developed and our investments in software had a longer life. I think that’s what we’ll see here. Some people may tell you it’s a really big change—and it certainly is—but from my point of view, the fundamental issue of revealing new levels of parallelism reminds me of attempts to use the hierarchical memory and processor structure within the Intel Paragon supercomputers that we used in our laboratory during the 1990s.

Now the dramatic difference is the number of processing threads, and this has increased rather steadily. There was a time when 100-way parallelism was a big deal, and then 1,000 and 10,000, and now we have on the order of 100,000 processing cores, and the counts keep increasing. So this is a big challenge. What we see is that many of the applications that deal well with very large core counts, making progress using a large fraction of the full capability, have taken on bigger, more realistic problems: problems with higher degrees of fidelity that embrace more physical phenomena, stochastic behavior, and ensemble behavior. These are the applications that are advancing science and engineering using the full capability of the machine.

TER: A number of people still question why we need to continue to push the technology envelope. They don't get it in terms of the importance. So Jack, what gets you the most excited when you think about how this processing capability may impact science and discovery? How do you see life changing when 20 petaFLOPS becomes the bottom of the production and performance scale?

WELLS: This is a very exciting question. Having such broad access to predictive simulation capability would dramatically advance many areas of science and engineering. I believe that we would dramatically accelerate the invention of new energy technologies, such as a 500-mile battery for transportation or very efficient solar cells, or the safe life extension of nuclear reactors. For scientists and engineers, this would revolutionize the practice of research. Simulation would not replace experiment's role in the scientific method or in engineering design principles. But it would allow us to be very focused and efficient about which experiments to perform, which conceptual designs to pursue, and which prototypes to construct.

TER: And in terms of an access convention, do you see cloud playing a role at all?

WELLS: Oh yes, I do. I think that is part of the story of computational science and engineering. There’s no doubt that this will be a cost-effective solution for some, maybe many, users. Scientists will take advantage of these resources—where it makes sense for them—and that is the way it should be.

TER: And on an international scale, do you see the U.S., because of efforts like Titan and what will be coming after that, maintaining some level of technology leadership?

WELLS: My understanding of U.S. science and engineering policy is that there's a broad consensus around the role of the federal government in advancing supercomputing and then using supercomputing as a tool for advancing science, technology, and U.S. competitiveness in general. I think there's a strong bipartisan consensus there.

Of course, other countries also perceive the value of supercomputing, and they are investing as well. I think leadership should be defined in terms of impact, and that's measured over an extended period of time. We're committed to leadership in impact. Now certainly Europe and Asia are making significant investments in supercomputing. The European investments that stand out to me are those in application software. There are strong research teams and computational science programs in Europe that are funding integrated code teams to advance particular areas of engineering and science in which they feel they have leadership. And I think that's significant.

TER: What steps are you taking at ORNL and the Leadership Computing Facility to ensure long-term program success and to keep Oak Ridge and all your suppliers and partners working in harmony moving forward?

WELLS: We’ve touched on several points that are relevant to this question, but for long-term program success I’d emphasize the importance of a sustained policy on the role of supercomputing in science and engineering. And I think we have that. If that continues, it’s a strong foundation on which to build the future.

Relative to the current inflection point we have in technology, the energy consumption of these large machines necessitates energy-aware computing as we move forward. The broader industry is already doing this with its investments in mobile computing and game hardware, and the broader IT industry is making significant investments in these accelerator processors. So the technology risks of our platforms, going forward, are really quite modest. There's always a risk in putting these machines together at scale, but that's something we have a track record of managing through years of deploying leadership-class computing systems. And we have tremendous confidence in our vendor. Certainly the biggest perceived risk over the last two and one-half years as we've planned for Titan has been its usability by science and engineering teams. As I tried to emphasize previously, we are playing a role, in concert with our vendor partners, in making sure the machine is demonstrably usable. Computer science researchers will also have a role in generalizing or extracting the lessons learned throughout the Titan project and making them more readily available to a broader community.

For related stories, visit The Exascale Report Archives.