The Argonne Training Program on Extreme-Scale Computing starts next month. Its organizer, Paul Messina, calls on more institutions to offer such training but also to influence university curricula.
Carrying out computational science and engineering (CS&E) research on very powerful computers – such as the top 50 in the semi-annual TOP500 rankings – requires knowledge of a broad spectrum of topics and many skills. Today’s high-end computers are not easy to use: they have tens of thousands to millions of cores, and the architectures of both the individual processors and the overall system are complex. Achieving top performance on these systems can be quite difficult. Furthermore, every three to five years HPC architectures change enough that different optimizations or approaches may be needed to use the new systems efficiently.
But the characteristics of high-end computer systems are by no means the only source of difficulty. The scientific problems tackled on such systems are typically quite complex, involve more than one phenomenon and different spatial and time scales, and are often at the leading edge of the scientific or engineering domain. Consequently, CS&E projects are usually carried out by teams of from several to dozens of researchers with expertise in different aspects of the science, mathematical models, numerical algorithms, performance optimization, visualization, and programming models and languages. Developing software with a large team instead of one or a few collaborators brings its own challenges and requires expertise in software engineering, which must also be represented on the team.
Reflecting on the CS&E landscape described above, I was motivated to organize the Argonne Training Program on Extreme-Scale Computing (ATPESC) – an intense, two-week program that covers most of the topics and skills needed to conduct computational science and engineering research on today’s and tomorrow’s high-end computers. The program has three goals. First, to provide the participants with in-depth knowledge of several of the topics, especially programming techniques and numerical algorithms that are effective on leading-edge HPC systems. Second, to make them aware of available software and techniques for all the topics, so that when their research requires a certain skill or software tool, they know where to find it instead of reinventing tools or methodologies. And third, through exposure to the trends in HPC architectures and software, to indicate approaches that are likely to provide performance portability over the next decade or more.
Performance portability is important because applications often have lifetimes that span several generations of computer architectures. It is tempting to write software tailored to the platform one is using, exploiting its special features at all levels of the code. Researchers in the early stages of their careers may not be aware that by taking that approach they expose themselves to having to rethink their algorithms and rewrite their software repeatedly. Some rewriting is inevitable, but by designing the software architecture so that the use of features specific to a given platform is confined to a low level and easily identifiable, one greatly reduces the effort of transitioning to future architectures. In addition, using software components written by world experts who track the evolution of supercomputers and optimize their packages for new systems also reduces the effort needed to achieve performance portability. However, one first has to know that such components exist.
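As a hypothetical sketch of that layering principle in C (the function name and the `USE_VENDOR_INTRINSICS` switch are illustrative, not from any particular package): portable callers see a single interface, and any platform-specific variant lives behind one easily identifiable compile-time switch.

```c
#include <stddef.h>

/* Portable interface: callers never see platform details. */
void axpy(size_t n, double a, const double *x, double *y);

#if defined(USE_VENDOR_INTRINSICS)
/* A platform-specific variant (e.g. using vector intrinsics) would go
 * here, isolated so that only this spot changes with the architecture. */
#else
/* Portable fallback: y = a*x + y, element by element. */
void axpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
#endif
```

When a new architecture arrives, only the code behind the switch needs to change; everything that calls `axpy` is untouched.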
At Argonne National Laboratory we have Mira, an IBM Blue Gene/Q system with nearly a million cores, so we have first-hand experience with the challenges described above. Mira is the current flagship system in the Argonne Leadership Computing Facility (ALCF), which is funded by the Office of Science of the US Department of Energy. The ALCF and the companion Oak Ridge Leadership Computing Facility were established by the DOE to support computationally intensive, large-scale research projects with the potential to significantly advance key areas in science and engineering. Systems like Mira can enable breakthroughs in science, but using them productively requires significant expertise in computer architectures, parallel programming, algorithms and mathematical software, data management and analysis, debugging and performance analysis tools and techniques, software engineering, approaches for working in teams on large multi-purpose codes, and so on. Our training program exposes the participants to all those topics and provides hands-on exercises for experimenting with most of them.
The ATPESC was offered for the first time in the summer of 2013. It will be offered again this year, from 3 to 15 August, in suburban Chicago. The 64 participants, selected from 150 applicants, are doctoral students, postdocs, and computational scientists who have used at least one HPC system for a reasonably complex application and are engaged in, or planning to conduct, computational science and engineering research on large-scale computers. Their research interests span the disciplines that benefit from HPC, such as physics, chemistry, materials science, computational fluid dynamics, climate modeling, and biology.
In other words, this is not a program for beginners. Many institutions worldwide offer introductory courses, and some offer advanced training programs in scientific computing, but those cover fewer topics, usually in less depth. The strong interest in our program indicates that we are filling a gap. For example, not many university graduate programs in the sciences cover software engineering or community codes. PhD students in CS&E are understandably instructed to work mostly on their own in implementing the codes for their dissertation research; yet when they begin working in research laboratories or industry, they will almost always work as part of a team, enhancing existing software.
Some research laboratories are planning to offer similar training programs. We welcome the proliferation, since there is a growing need for knowledgeable computational scientists and engineers as the value of HPC is recognized in many fields.
The ATPESC curriculum is organized around seven major areas:
- Hardware architectures
- Programming models and languages
- Numerical algorithms and software
- Toolkits and frameworks
- Visualization and data analysis
- Data intensive computing and I/O
- Community codes and software engineering.
For most of these topics, multiple options are covered. For example, the programming models presented include advanced MPI and OpenMP, OpenACC, UPC, Chapel, Charm++, hybrid programming, and accelerator programming. The agenda and detailed information about the program can be found on the ATPESC website.
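Of the models listed, OpenMP is among the quickest to try. A minimal, illustrative sketch (assuming a compiler with OpenMP support; without it the pragma is simply ignored and the loop runs serially, producing the same result):

```c
/* Sum an array; the OpenMP pragma parallelizes the loop across
 * threads, with the reduction clause combining the partial sums.
 * Compilers without OpenMP ignore the pragma harmlessly. */
double parallel_sum(const double *v, int n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}
```

The appeal of this model, covered in depth in the program, is that the same source compiles and runs correctly with or without the parallelism enabled.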
Although we are likely to continue organizing the material around those seven areas in the future, we will review the contents and introduce new material as appropriate. For example, this year the hardware architectures track will focus much more on explaining how architectural features such as SIMD units and memory hierarchies affect the performance of scientific codes.
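As an illustration of the kind of effect that track covers (a generic sketch, not taken from the course material): the same reduction over a row-major matrix can run at very different speeds depending on access stride, because unit-stride loops are what caches and SIMD units favour.

```c
#include <stddef.h>

/* Sum an n-by-n row-major matrix with unit-stride access:
 * consecutive iterations touch consecutive addresses, which is
 * cache-friendly and easy for the compiler to vectorize. */
double sum_rowwise(const double *m, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += m[i * n + j];
    return s;
}

/* Same result, but stride-n access: each iteration jumps a full
 * row ahead in memory, defeating the cache on large matrices. */
double sum_colwise(const double *m, size_t n) {
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += m[i * n + j];
    return s;
}
```

Both functions return the same value; the point is that on large matrices the column-wise version is typically far slower, a gap that grows as SIMD widths and memory hierarchies deepen.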
Each day there are seven hours of lectures and at least three hours of hands-on exercises on high-end computers. In most cases, the lecturers are global experts on the topics they cover. We also have dinner speakers on topics such as new applications or emerging architectures.
We will record all the lectures and post them online a few weeks after the program takes place, so that interested CS&E researchers who could not participate will be able to benefit from part of the program. Viewing the lectures is no substitute for participation – missing are the hands-on sessions and the opportunity to ask questions of the lecturers and to interact with world experts and the other participants – but we anticipate that many will find the recordings a valuable resource.
Our program is not the solution to training scientists and engineers to use high-end computing for their research, but it makes a small contribution, and perhaps it will motivate more institutions to offer such training and influence university curricula to add some of these topics and treat them in greater depth than we are able to in the two weeks of the program.
Having been around scientific computing for over 40 years, I have had the good fortune to observe, through my colleagues, many aspects of the scientific computing ecosystem. I hope the ATPESC will give a few of the next generation of computational scientists and engineers a similarly broad view of the many facets of computing, so that as their research evolves they will have a rough idea of where to look for ways to tackle new problems.
Paul Messina is the director of science for the Argonne Leadership Computing Facility at Argonne National Laboratory.