Evolution or Revolution: two views on the path to exaFLOPS

Over the past 60 years HPC has changed in dramatic spurts punctuated by fairly long periods of stability. During that time we’ve been through five computing epochs in HPC, including the sequential execution, sequential issue, vector, and SIMD models of computation. And along the way there have been excursions into dataflow, systolic, and global shared memory models.

Each of the phase changes from one model to another was precipitated by a prior change in hardware design and system architecture. Often our programming models were a consequence of the hardware designer’s art, not the result of a collaborative effort between software and hardware designers to develop the most sensible system.

Today we stand firmly in the epoch of communicating sequential processes, enabled primarily by MPI. As we look forward to achieving the goal of a sustained exaFLOPS system by the end of this decade, we asked two leading experts in the exascale community this question: are we on the cusp of a new phase change, a new revolution, in HPC, or can we extend and adapt today’s programming model to get us there?

Revolution as Catalyst for the Exascale Phase Change

contributed by Thomas Sterling
Arnaud and Edwards Professor, LSU
Distinguished Visiting Scientist, ORNL
CSRI Fellow, Sandia National Laboratories

What has driven HPC phase changes in past epochs has been the recurring opportunities offered by constantly advancing enabling technologies, and the consequent challenges of fulfilling their inherent performance potential. Performance has been achieved through technology’s ability over time to deliver increased switching speeds, manifest as processor clock rates and memory access times, and to concentrate exponentially more logic, storage, and communication devices per socket.

But as these trends continued through technology categories (e.g., vacuum tubes, transistors, integrated circuits) and their generational species (e.g., SSI, MSI, VLSI), the optimal hierarchy and organization of components continuously changed, demanding periodic revolution in architecture, programming models, and supporting system software. The underlying essential requirement has always been the effective exploitation of parallelism, in hardware form and software function, to take advantage of both device speed and component structure complexity. As device capacity has increased, it has enabled new optimal balance points for both with respect to the fundamental normalizing factors of time, energy, and cost.

Again the question: are supercomputer technology and system design at a revolutionary point of punctuated equilibrium (to borrow from biology), demanding a sharp transformation of architecture and programming methodologies in accordance with a paradigm shift in the governing foundational execution model?

The old songs don’t sound the same

Key trends that have emerged in recent years, distinguishing future directions from those of the previous two decades, strongly imply a pending revolutionary phase change in HPC.

Scalability, efficiency, energy effectiveness, reliability, and programmability are all challenged by these trends, demanding innovative responses in form, function, and foundation.

Among the most critical is the end of two major sources of performance gain: increasing clock rates and the exploitation of ever greater design complexity. Because power consumption has hit critical practical limits (power is crucial at both ends of the scale spectrum, from high-end supercomputing to digital mobile communications), clock rates have largely flat-lined, with little expectation of performance improvements being realized directly from clock-rate increases any time soon.

The classical reliance on increasing processor design complexity as a second source of performance gain through Moore’s Law has also run its course and is no longer a significant factor. Designers have exhausted instruction-level parallelism and speculative execution (a major power hog), along with a host of other architectural tricks, as sources of performance improvement. Anticipating dramatic gains in per-core performance through increased design complexity is not viable.

Moore’s Law, however, does continue, at least down to 22 nanometers and, if specific fabrication problems can be addressed, probably down to eight nanometers in the next decade. As a result, a dramatic change in system architecture has occurred with the adoption of heterogeneous multicore organizations, placing an ever-increasing number of cores on the same socket (sometimes involving more than one die) and integrating specialized GPU accelerators into the nodes. These changes are very significant, although some would still refer to them as evolutionary.

The demand for a revolution

Parallelism remains as the means of continued exponential performance gain.

The past reliance on fine-grain processor core ILP (both deterministic and speculative) and coarse-grain process concurrency is stretched thin for today’s hundred-thousand-core petaFLOPS systems and will prove wholly inadequate for exaFLOPS at decade’s end. Parallelism in that era will be required not only for sustained exaFLOPS but also for latency hiding and for sufficient over-subscription to balance load and respond to non-uniformities of workload.

It is estimated that total parallelism required on a sustained basis will be greater than 10 billion-way. Conventional practices and models for exposing and exploiting parallelism are incapable of satisfying this challenging requirement through incremental and evolutionary improvements over the next ten years.

Within this new class of heterogeneous multicore structure, additional important trends impact growth in supercomputing architecture and programming.

Socket pin counts and their associated bandwidth are growing slowly at best, while demand for inter-socket communication grows in proportion to the number of cores. Effective memory access latency is growing and becoming increasingly non-uniform even within SMP nodes, while system-wide latency is reaching tens of thousands of cycles and may approach a hundred thousand cycles by decade’s end. Energy efficiency requires at least two orders of magnitude improvement if exaFLOPS performance is to be realized. Single-point-failure MTBFs of minutes must be tolerated if systems at scale are to be practical. Temporal and resource utilization efficiencies, which fell to single-digit percentages in the last phase of HPC, must be recaptured through superior runtime strategies, algorithms, and management mechanisms.

What the revolution will look like

A new revolutionary strategy for high performance computing is required to exploit new device technologies in order to achieve exaFLOPS performance by 2020 or earlier. Outlined here are potential elements of such a strategy and their value in meeting the challenges discussed above. Key ideas that will alter future HPC system structure and operation from those of the current generation include (but, of course, are not limited to):

Active Global Address Space (AGAS) — provides a system-wide virtual name space for logical access to all first-class objects in the system, such as variables, processes, threads, and local control objects. A critical distinction between AGAS and experimental PGAS-based methods is that named elements may migrate across the system’s physical nodes without having to change their virtual names. While not cache coherent, AGAS is necessary for dynamic data distribution and task migration while providing lower overhead for the lightweight access essential to high efficiency. Architecture and operating system design changes will be required to support this important facility. Programming models will be devised that can employ a global name space for unified application specification and portability.
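
To make the idea concrete, here is a minimal, single-process sketch of the naming behavior described above, written in plain C++ rather than any proposed AGAS implementation; the NameService, GlobalId, and Placement types are hypothetical illustrations, and the point is only that a global name survives migration while the mapping behind it changes.

```cpp
// Minimal single-process sketch of the AGAS idea: a global id resolves to a
// (locality, local address) pair, and migration updates the mapping without
// changing the id that application code holds.  All names are illustrative.
#include <cstdint>
#include <iostream>
#include <unordered_map>

struct GlobalId { std::uint64_t value; };                 // name held by applications
struct Placement { int locality; std::uintptr_t local_addr; };

class NameService {
public:
    GlobalId bind(int locality, std::uintptr_t addr) {
        GlobalId id{next_++};
        table_[id.value] = Placement{locality, addr};
        return id;
    }
    Placement resolve(GlobalId id) const { return table_.at(id.value); }
    // Migration: the object moves, the name does not.
    void migrate(GlobalId id, int new_locality, std::uintptr_t new_addr) {
        table_[id.value] = Placement{new_locality, new_addr};
    }
private:
    std::uint64_t next_ = 1;
    std::unordered_map<std::uint64_t, Placement> table_;
};

int main() {
    NameService agas;
    GlobalId x = agas.bind(/*locality=*/0, /*addr=*/0x1000);
    std::cout << "x lives on locality " << agas.resolve(x).locality << "\n";
    agas.migrate(x, /*new_locality=*/3, /*new_addr=*/0x2000);   // same name, new home
    std::cout << "x lives on locality " << agas.resolve(x).locality << "\n";
}
```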

Parallel Processes — provide semantic contexts for parallel activities and data distributed across multiple localities (e.g., nodes). Parallel processes are ephemeral, are dynamically allocated to physical resources, permit scoping of the accessible name space through the vertical hierarchy of parent-child processes, and provide object-oriented protected access to sibling or cousin processes. Parallel processes are essential for portability across machine classes, scales, and generations, and they make dynamic resource management possible. Architecture and runtime support are necessary for high efficiency, and programming models must be devised to give users access to this powerful medium of computational organization, abstraction, modularity, and interoperability. I/O access and file management are abstracted through inherited processes at the top level, unifying access protocols among distinct system architectures.

Parallel Threads — are the primary abstraction of user-defined actions. Each thread is defined within its host process. While parallel processes may span multiple nodes, each thread executes within a single locality at a time (although in principle it may migrate, like any first-class object, among localities). A parallel thread is itself a first-class object and can be manipulated by other threads. As an abstraction it incorporates a hybrid control model, combining global mutable state requiring synchronization with a static dataflow control model for private intermediate values (single-assignment semantics). By eliminating many anti-dependencies, this permits maximum flexibility in the fine-grain parallelism of the processor core architecture in support of heterogeneous computing, including the automatic use of GPU accelerators.

Changes to architecture, runtime system software, and programming models, with compiler support, will be necessary to achieve this additional level of parallelism. Threads can be invoked directly by parent threads within the same node and process, or instantiated indirectly on a different node through the use of parcels. It is this close relationship between threads and parcels (see below) that will provide exascale machines with symmetry of semantics in the presence of asynchrony. Threads are critical both to latency hiding and to delivering sufficient parallelism to achieve exascale.
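
As a rough illustration of the single-assignment idea for private intermediate values, the following sketch uses standard C++ futures as a stand-in for the proposed thread and synchronization machinery; each intermediate is produced exactly once and consumed when ready, so no locks or anti-dependency ordering are needed for those values.

```cpp
// Sketch of single-assignment intermediate values: each intermediate is
// written once by the task that produces it and read when ready, eliminating
// anti-dependencies.  Standard C++ futures stand in for the proposed model.
#include <future>
#include <iostream>

int main() {
    // Two independent intermediate values, produced by parallel tasks.
    std::future<double> a = std::async(std::launch::async, [] { return 3.0 * 7.0; });
    std::future<double> b = std::async(std::launch::async, [] { return 10.0 / 4.0; });

    // A consumer task blocks only on the values it actually needs.
    std::future<double> c = std::async(std::launch::async,
        [&a, &b] { return a.get() + b.get(); });

    std::cout << "c = " << c.get() << "\n";   // prints 23.5
}
```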

Message-driven Computation via parcels — parcels, an improved class of active messages, enable the movement of work to the data anywhere in the system for lower communication time and energy, contribute to system-wide latency hiding, and permit the migration of flow control (a major change from computation anchored to a fixed program counter). Architecture and runtime system support for parcels will be critical to effective lightweight messaging, which will contribute to exploiting the resulting increased parallelism and therefore to the scalability necessary for reaching exascale performance. Parcels are needed to achieve a major change: maintaining symmetric semantics between local action and the invocation of remote action, which is needed for both portability and dynamic resource and task management.
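
The following single-process sketch, with purely illustrative types, mimics the control flow of message-driven computation: a parcel carries a target name, an action, and its arguments, and the action executes wherever the target data currently resides rather than pulling the data to the caller.

```cpp
// Single-process sketch of parcels: work is routed to the "locality" that
// owns the target object and executed there.  Types and names are
// illustrative only, not a proposed parcel format.
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <queue>
#include <utility>
#include <vector>

struct Parcel {
    std::uint64_t target;                   // global id of the target object
    std::function<void(double&)> action;    // continuation to run at the data
};

int main() {
    const int num_localities = 2;
    // Each object: global id -> (owning locality, value).
    std::map<std::uint64_t, std::pair<int, double>> objects = {
        {1, {0, 10.0}},   // object 1 lives on locality 0
        {2, {1, 40.0}},   // object 2 lives on locality 1
    };
    std::vector<std::queue<Parcel>> inbox(num_localities);

    // "Send" a parcel: route it to the locality that owns the target.
    auto send = [&](Parcel p) { inbox[objects[p.target].first].push(std::move(p)); };

    send({1, [](double& v) { v += 1.0; }});   // increment object 1 in place
    send({2, [](double& v) { v *= 2.0; }});   // double object 2 in place

    // Each locality drains its inbox, applying actions to locally owned data.
    for (int loc = 0; loc < num_localities; ++loc) {
        while (!inbox[loc].empty()) {
            Parcel p = std::move(inbox[loc].front());
            inbox[loc].pop();
            p.action(objects[p.target].second);
        }
    }
    std::cout << "object 1 = " << objects[1].second      // 11
              << ", object 2 = " << objects[2].second    // 80
              << "\n";
}
```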

Local Control Objects (LCO) — are object-oriented constructs used for lightweight synchronization with diverse and rich semantics. They are essential for the elimination of global barriers and the BSP model widely employed today. While they can be used to realize conventional synchronization structures like semaphores and mutexes, far more important are the advanced means they provide in the form of producer-consumer control, dataflow multivariable action dispatch, futures from the actors model, and others.

They can easily support lazy, eager, and strict computation and provide coordination among multiple actions incident on the same structures without resorting to critical sections, which impose significant serialization. LCOs in distributed ensembles can be used to realize distributed control operations such as managing widely separated copies, inserting or deleting vertices within dynamic irregular graphs, or supporting AGAS. LCOs will be used heavily by runtime systems, employed implicitly by advanced programming models, and supported by hardware mechanisms in new processor core architectures. Compilers will take advantage of them to synthesize higher-level control methodologies such as data-directed computing for traversing graphs, event-driven computation, or non-deterministic relaxation algorithms. Ultimately they are critical for exascale in releasing diverse forms and sizes of parallelism not available on conventional platforms.
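
A hedged sketch of one such construct, a dataflow-style LCO that fires a continuation when its last input arrives, is shown below; the class is an illustration of the concept in ordinary C++, not a design for the hardware-supported LCOs described above.

```cpp
// Sketch of a dataflow-style LCO: it collects a fixed number of input values
// and fires a continuation exactly once when the last one arrives, with no
// global barrier.  Purely illustrative.
#include <cstddef>
#include <functional>
#include <iostream>
#include <mutex>
#include <vector>

class DataflowLCO {
public:
    DataflowLCO(std::size_t arity, std::function<void(const std::vector<double>&)> fire)
        : inputs_(arity), remaining_(arity), fire_(std::move(fire)) {}

    // Called by producers, possibly from different threads.
    void set(std::size_t slot, double value) {
        bool ready = false;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            inputs_[slot] = value;
            ready = (--remaining_ == 0);
        }
        if (ready) fire_(inputs_);   // fire once, when the last input lands
    }

private:
    std::vector<double> inputs_;
    std::size_t remaining_;
    std::function<void(const std::vector<double>&)> fire_;
    std::mutex mutex_;
};

int main() {
    DataflowLCO sum(3, [](const std::vector<double>& v) {
        std::cout << "fired: " << v[0] + v[1] + v[2] << "\n";   // prints 6
    });
    sum.set(0, 1.0);
    sum.set(2, 3.0);
    sum.set(1, 2.0);   // third arrival triggers the continuation
}
```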

Memory Accelerators — add a modality of processing to those available conventionally. Typical processors depend on the ability to exploit temporal and spatial locality, often committing large parts of the die to multiple layers of caches and their control logic. These are power hungry, spatially wasteful, and can perform poorly when little or no temporal locality is available, as in a wide range of graph problems. Lightweight Embedded Memory Processors (EMPs) placed close to the memory banks can minimize access latency and maximize access bandwidth, providing the most efficient operation in both energy and time. EMPs are small, low power, and multithreaded, with an otherwise simple structure, no speculative execution, and a low clock rate. Nonetheless, they can greatly improve scalability and performance when combined with numerically intensive core families such as those found in today’s GPUs.

Runtime system software — will emerge as an important new layer of future HPC systems, bringing critical time-dependent functionality to highly parallel application execution. This software layer will take over many of the responsibilities conventionally accorded to operating systems but, operating within user space, will be much more efficient and adaptive to the requirements of its dedicated user application. Unlike operating system software, which is persistent, the runtime system is ephemeral, existing only as long as the application instantiation it supports. The runtime system will be closely tied to both the architecture and the operating system to balance resource availability with application demand.

Fault tolerance — which today is reserved for specialty systems or facilitated through checkpoint/restart techniques, will be achieved through innovative approaches such as the proposed compute-validate-commit cycle using micro-checkpointing in memory. This will require close cooperation among the architecture (for fault detection and reconfiguration), the operating system (for configuration management and interrupt handling), the runtime system (for managing the cycle), and the compiler (for delineating where micro-checkpointing is to occur in the execution trace, as well as for using inverse functions, when available, for reinforced quality control).
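
As an application-level illustration of the control flow only (not of the proposed in-memory micro-checkpointing machinery), the sketch below computes on a scratch copy of the committed state, validates the result with a simple predicate, and either commits or retries; the validation rule and the injected fault are invented for the example.

```cpp
// Application-level sketch of a compute-validate-commit cycle: work is done
// on a scratch copy of the committed state; if validation fails (simulated
// here by an injected fault), the step is retried from the last commit.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> committed(8, 1.0);   // last known-good micro-checkpoint
    bool inject_fault = true;                // simulate one transient fault

    for (int step = 0; step < 3; ++step) {
        bool valid = false;
        while (!valid) {
            std::vector<double> scratch = committed;            // compute on a copy
            for (double& x : scratch) x = std::sqrt(x + 1.0);   // the "work"

            if (inject_fault) {                                 // corrupt one value once
                scratch[3] = -1.0;
                inject_fault = false;
            }

            // Validate: all values must stay finite and positive for this kernel.
            valid = true;
            for (double x : scratch)
                if (!(x > 0.0) || !std::isfinite(x)) valid = false;

            if (valid) committed = scratch;                     // commit
            else std::cout << "step " << step << ": validation failed, retrying\n";
        }
    }
    std::cout << "final value: " << committed[0] << "\n";
}
```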

Energy brokering — will treat energy as a finite resource, one that may be less than all physical resources could consume at peak in unit time. Which parts of the system run at any particular instant will be determined by the priority of the actions they are to perform and the energy cost of those actions. The architecture and operating system, in conjunction with a compiler-guided runtime, will arbitrate among energy demands and seek to use this resource, on average, within practical constraints of sustained power and cooling. However, brief episodic bursts of activity can exploit the system’s thermal mass and capacitive storage to accelerate short spans of highly sequential work, minimizing Amdahl effects and improving scalability.
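
A toy sketch of the brokering idea follows: requests carry a priority and an estimated energy cost, and a broker grants them against a fixed per-interval budget, highest priority first. The numbers and the greedy policy are illustrative assumptions, not a proposed design.

```cpp
// Toy sketch of energy brokering: grant prioritized requests against a fixed
// per-interval energy budget; lower-priority work waits for a later interval.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Request {
    std::string task;
    int priority;     // larger means more important
    double joules;    // estimated energy cost of running this interval
};

int main() {
    double budget = 100.0;   // joules available this interval (illustrative)
    std::vector<Request> requests = {
        {"background analysis", 1, 60.0},
        {"critical solver step", 5, 70.0},
        {"I/O flush", 3, 25.0},
    };

    // Grant by priority until the budget is exhausted; the rest are deferred.
    std::sort(requests.begin(), requests.end(),
              [](const Request& a, const Request& b) { return a.priority > b.priority; });
    for (const Request& r : requests) {
        if (r.joules <= budget) {
            budget -= r.joules;
            std::cout << "run:   " << r.task << " (" << r.joules << " J)\n";
        } else {
            std::cout << "defer: " << r.task << "\n";
        }
    }
    std::cout << "unused budget: " << budget << " J\n";
}
```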

(Still) No free lunches

These and other revolutionary concepts, some of them represented by experimental prior art spanning the past two to three decades, will be essential if sustained exaFLOPS performance is to be possible for real-world applications. But achieving this will require major accomplishments through extensive research and development over the next ten years.

The first and foremost requirement is to establish a new foundation for future systems through the development of a new model of computation incorporating the mechanisms, semantics, and policies described above. Research must be conducted to determine the distribution of roles, responsibilities, functionalities, interoperability, and protocols among the layers of the system. This includes fundamental work on runtime systems and the development of a new class of EMP core architectures.

While low level programming tools can be devised and deployed to support the required research with application kernels, further research will be needed over time to create new programming languages and support environments.

Finally, we need a new class of operating system based on lightweight kernels, with a new theory of self-aware functional capabilities to ensure optimal resource utilization, reliability, power usage, and even security. Fortunately, both DARPA and DOE are undertaking fundamental research in hardware architecture, system software, programming models and tools, and applications and algorithms in preparation for exascale by 2020.

Almost 20 years ago the community came together to develop both MPI and Linux, two of the most widely used software environments in high performance computing. Now the community, comprising all contributors, developers, sponsors, and stakeholders, both nationally and internationally, and spanning industry, academia, and government, must once again join in a common shared enterprise to create the revolutionary exascale systems of the future.

Realizing the revolution

We are in a rare period of innovation and excitement with a multitude of concepts and approaches. Real research of an exploratory nature must be conducted because we do not know the answers.

Nor do we have the time or the resources to let a thousand flowers blossom. What is needed is a diligent, multi-path program that pursues a few of the most likely options through interdisciplinary studies, combining applications, programming, system software, and computer architecture experts under government and industry sponsorship. Community-led forums must play a critical role, especially early on, to encourage the sharing of knowledge and perspectives and the gestation of new ideas stimulated by cross-fertilization among a multitude of approaches. The revolution, and the paradigm shift that emerges from it, will catalyze the new class of systems and methods that will drive the field forward to a new generation of machines and the applications they enable.

Getting to Exascale does not Demand a Revolution

contributed by Pete Beckman
Division Director, Argonne Leadership Computing Facility

Scientists love grand challenges, thought experiments, and rewriting conventional wisdom.

For decades, the mammoth investment in chip and networking technology, fueled by consumer electronics and the Internet, has provided scientific computing with ever-faster systems. However, just as excitement about exascale computing has started to build within the scientific computing community, so has the concern that to achieve exascale we may need to start over from a clean design, and that everything from scientific application frameworks to system software and hardware architecture will need to be reinvented.

Sizing up the future while riding exponential technology improvements is daunting, and frankly, we have not always been too good at seeing more than a few milliseconds into the future. I would guess that most HPC geeks still have several t-shirts and coffee mugs from supercomputer companies that no longer exist. So as we look into our exascale destiny, three orders of magnitude into the future, where are the grand challenges? What needs a more clever revolutionary solution, and where will some hard work on an evolutionary path get the job done?

While there are many problems that need to be solved, and I love disruptive revolutionary ideas, there are some evolutionary paths that, with hard work and innovation, will lead the community to exascale.

Parallelism has changed everything, but we have always resisted fully embracing it. The community resisted the move away from simple-to-program vector machines with a handful of vector units to massively parallel systems. Adding parallelism, it seems, has always been done reluctantly.

Fewer and faster

Everyone knows the computer architect’s rule of thumb, “fewer and faster is better”.

Programmers would rather have half the number of CPUs running twice as fast than double the number of CPUs running at half the speed. So over the decades, the community has tried many abstractions and invented a myriad of tools to hide or simplify the parallelism seen by the programmer — yet the need to add parallelism has never abated, with each generation of top-end system more parallel than the previous.

As a simple example, consider the scalable IBM Blue Gene architecture, which packs custom nodes into each rack. In the first generation, Blue Gene/L, programmers saw 2,048 cores per rack, usually programmed as MPI processes. Intrepid, the Blue Gene/P at Argonne National Laboratory that serves the DOE INCITE community, presents the user with 4,096 cores per rack. The next-generation system, Blue Gene/Q, will present the user with 16,384 to 65,536 parallel threads of control per rack, depending on the programming model.

The accelerating growth of parallelism

Technology is pushing parallelism faster than ever before. There are three key reasons for this change.

The first and primary reason for the shift is electrical power. While in the past computer architects tried to adhere to the rule of thumb “fewer and faster is better,” when it comes to managing power, “more and slower is cheaper”. It is significantly cheaper to have twice the number of CPU cores running at half the clock frequency. Sadly, this seems to be a universal rule for most macroscopic phenomena, including how to save on gas and improve vehicle mileage (drive slower). And while electrical power consumption on current systems is already high (on the order of a handful of megawatts), exascale systems could be many times more costly unless low-power technologies are vigorously pursued. So as we chase more flops per watt in scaling up large systems, we must also embrace parallelism.
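
The arithmetic behind the rule of thumb can be sketched with the standard first-order model of dynamic power, roughly proportional to capacitance times voltage squared times frequency, together with the idealized assumption that supply voltage can be scaled down in proportion to frequency; under those assumptions, doubling the cores at half the frequency holds aggregate throughput constant while sharply cutting dynamic power. The numbers below are illustrative only, not measurements of any real processor.

```cpp
// First-order illustration of "more and slower is cheaper": dynamic power per
// core is modeled as P ~ C * V^2 * f, with the simplifying assumption that
// voltage scales down in proportion to frequency.
#include <iostream>

int main() {
    const double C = 1.0;   // effective switched capacitance (arbitrary units)

    // Configuration A: N cores at frequency f, voltage v.
    double n_a = 1000.0, f_a = 2.0, v_a = 1.0;
    // Configuration B: twice the cores at half the frequency and lower voltage.
    double n_b = 2000.0, f_b = 1.0, v_b = 0.5 * v_a;   // idealized voltage scaling

    double power_a = n_a * C * v_a * v_a * f_a;
    double power_b = n_b * C * v_b * v_b * f_b;
    double throughput_a = n_a * f_a;   // proportional to aggregate instruction rate
    double throughput_b = n_b * f_b;

    std::cout << "A: throughput " << throughput_a << ", power " << power_a << "\n";
    std::cout << "B: throughput " << throughput_b << ", power " << power_b << "\n";
    // Same aggregate throughput, but B uses a quarter of the dynamic power
    // under these idealized assumptions.
}
```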

Another important reason for the explosion of parallelism on the path toward exascale is packaging.

For chip designers, “closer is cheaper”. Pushing data back and forth across intra-chip links takes significant power and is limited by speed. A key chip trend is toward fusion — bringing together several architectural components onto one die. Over the decades we have seen more functionality fused together, from the cache to the floating-point unit, to the memory controller.

Most chip and electronics vendors have products or plans for even higher integration, leveraging new packaging technologies. For example, the Apple iPad and iPhone 4 both use a custom-built chip, the A4, which integrates memory and a specialized graphics processor with the CPU. This packaging trend will lead us to consider building exascale systems with next-generation, extremely dense stacking technology. Parallelism will increase.

Finally, the rapid explosion of parallelism has also been fueled by the architectural trend toward multi-core. And while the drivers toward multi-core are numerous and have been well studied, the impact could hardly be overstated — parallelism multiplied.

The exascale hurdles

For exascale to be practical, it must be affordable ($100M to $200M per system), and not much harder to use than our current extreme-scale systems.

Of course, some people would say that current platforms require too much specialized knowledge, are too hard to program, and achieving good performance is impossibly difficult. However, I disagree.

At this extreme scale, problems are difficult and hardware and software architectures are complex. Maybe we could say, “programming a supercomputer should be as simple as possible to achieve good performance, but no simpler.”

We can do far better, and innovation and investment are required. However, scientific discovery via advanced simulation and modeling is a wonderfully grand challenge and cannot be reduced to a couple of lines of a Python script. I don’t think we should expect it to become easier as we peer deeper and deeper into the fundamental workings of our universe.

Furthermore, the HPC community is growing, and demand for supercomputing cycles has never been higher. At Argonne, we are planning to deploy an IBM Blue Gene/Q in 2012. To help bootstrap the community, we announced the Early Science Program at SC09 last year, which would give roughly 15 projects early access to Blue Gene/Q to speed porting, scaling, and performance tuning. We had over 43 science teams apply!

Yes, extreme-scale systems are complex, but the number of skilled and competent teams that can use them continues to grow. Maintaining this success as we move toward exascale will require key investments in software to adapt our current models and practices to embrace even higher levels of parallelism. If the past is any indication, increasing parallelism at all levels is very challenging indeed.

Getting there: a conceptual approach

Architecturally, we can expect the trends described earlier to result in systems that support billions of threads of concurrent execution. The decades of investment in systems software — everything from the operating system to the messaging layer and math libraries — provide the community with a fantastic base for exascale computing. However, all of those layers must be reworked to support extreme parallelism.

In the past, we have scaled our software rather slowly. In fact, for many years the parallelism within the largest systems did not really grow substantially; only the clock speed increased. From the programmer’s perspective, clock speed gains, while they lasted, were absolutely divine. A programmer could unpack a bundle of code, do a “configure; make; make install”, and watch the newest processor zip through the old code. Every new system just felt faster, from the compiler to the data analysis framework.

However, as we move toward new architectures, increasing parallelism in the software stack and the programming model will become the key technical shift that the community must embrace. Extracting parallelism will not come cheaply: it will require significant investment, since almost every layer of the software must be adapted to be more parallel. Consider for a moment how many code segments loop over the number of cores, threads, or nodes, or allocate a small chunk of local memory for each node in the system. As we move to billion-way system concurrency and hundreds of cores, we must begin to adopt scale-invariant programming techniques.
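
As a small illustration of what scale-invariant means in practice, the following sketch (using standard MPI calls, with details chosen for the example) contrasts a pattern whose per-rank memory grows with the machine against one whose per-rank state stays constant no matter how many ranks exist.

```cpp
// Computing a global sum two ways: MPI_Allgather needs O(P) memory on every
// rank and grows with the machine, while MPI_Allreduce keeps per-rank state
// constant regardless of scale.  Build with an MPI compiler wrapper (mpicxx).
#include <mpi.h>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = rank + 1.0;   // each rank's contribution

    // Pattern that does not scale: every rank stores one value per rank.
    std::vector<double> everyone(nprocs);
    MPI_Allgather(&local, 1, MPI_DOUBLE, everyone.data(), 1, MPI_DOUBLE, MPI_COMM_WORLD);
    double sum_gathered = 0.0;
    for (double v : everyone) sum_gathered += v;

    // Scale-invariant pattern: per-rank memory is constant regardless of nprocs.
    double sum_reduced = 0.0;
    MPI_Allreduce(&local, &sum_reduced, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        std::cout << "gathered sum " << sum_gathered
                  << ", reduced sum " << sum_reduced << "\n";
    MPI_Finalize();
}
```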

What has to evolve to enable exascale?

The programming model for applications is due for a change.

For more than a decade, most codes have used an MPI-everywhere model. There are tremendous advantages to this model: MPI programs are relatively easy to reason about and debug, and they have demonstrated that they are very scalable. Other models, including shared-memory threads, have had difficulty scaling and are prone to race conditions and correctness errors. And while scientists have been exploring the fundamental limits of MPI implementations, many would agree that parallelism within a node might be best exploited with a different model. Unfortunately, no common intra-node programming model that is both easy to use and scalable has gained widespread adoption. The community needs to address this key area quickly.
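
One commonly discussed direction, sketched below with standard MPI plus C++ threads as a stand-in for whatever intra-node model eventually wins out, keeps MPI between nodes and expresses on-node parallelism separately; the data sizes and thread partitioning here are illustrative.

```cpp
// Sketch of a hybrid model: MPI ranks across nodes, intra-node parallelism
// expressed with threads.  Each rank sums its slice in parallel locally, then
// a single MPI_Allreduce combines the ranks.
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> slice(1 << 20, 1.0);   // this rank's share of the data
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;

    // Intra-node parallelism: each thread sums a contiguous chunk of the slice.
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t chunk = slice.size() / nthreads;
            const std::size_t begin = t * chunk;
            const std::size_t end = (t + 1 == nthreads) ? slice.size() : begin + chunk;
            partial[t] = std::accumulate(slice.begin() + begin, slice.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();
    double local = std::accumulate(partial.begin(), partial.end(), 0.0);

    // Inter-node parallelism: one collective, issued from the main thread.
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) std::cout << "global sum = " << global << "\n";
    MPI_Finalize();
}
```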

Another important area that will require significant breakthroughs is I/O and data storage. The weak point of many current systems is the parallel storage system. Users who receive system status emails from their local supercomputer are all too familiar with filesystem-related outages.

Disk speeds are not increasing rapidly. Flash or phase-change memory could eventually replace disk, but current designs are optimized not for I/O bandwidth but for capacity. As persistent storage becomes more affordable, it will be woven into the fabric of the compute nodes. Storage architectures will radically change, and the system software that pushes and pulls mountains of data through the system will need significant innovation.

Finally, innovation in fault tolerance and resilience will be important for exascale computing.

There have been many scientists sounding the alarm, and they are often too shrill and too pessimistic. Developing improved fault management technology is important as we build larger systems, but predictions that exascale systems will have a failing component that could be seen by user code every few seconds or minutes are not realistic.

At the Argonne Leadership Computing Facility we are often asked about our mean time between failures (MTBF). We usually respond by answering the question they intended to ask: “What is your mean time to interrupt?” On a system as large as Intrepid, components fail daily. However, disks are arranged in RAID configurations, power supplies are redundant, disk adapters can fail over, network cards often have multiple links, network switches route around failed line cards, and memory self-corrects many errors.

None of these failures causes the system to crash or a job to terminate. Sometimes performance degrades for a period of time, but on the whole our supercomputer is very stable — even in the presence of failures. The question for exascale computing is what technologies and innovations are needed to continue checking and correcting before the user notices a problem. The community needs to explore what technology can be employed to respond to faults before they impact the user, as well as what low-level chip designs could improve resiliency. User-level checkpointing is of course the current answer, but a fresh look at ways user codes can respond to faults is needed.
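
For reference, user-level checkpointing in its simplest form looks like the sketch below: the application periodically writes its own state to a file and, when relaunched after a failure, resumes from the last checkpoint it finds. The file name, format, and interval are arbitrary choices made for the example.

```cpp
// Minimal sketch of user-level checkpoint/restart: write application state to
// a file every few iterations; on relaunch, resume from the last checkpoint.
#include <fstream>
#include <iostream>
#include <vector>

const char* kCheckpointFile = "state.chk";   // hypothetical checkpoint path

bool load_checkpoint(int& step, std::vector<double>& state) {
    std::ifstream in(kCheckpointFile, std::ios::binary);
    if (!in) return false;
    in.read(reinterpret_cast<char*>(&step), sizeof(step));
    in.read(reinterpret_cast<char*>(state.data()), state.size() * sizeof(double));
    return static_cast<bool>(in);
}

void save_checkpoint(int step, const std::vector<double>& state) {
    std::ofstream out(kCheckpointFile, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(&step), sizeof(step));
    out.write(reinterpret_cast<const char*>(state.data()), state.size() * sizeof(double));
}

int main() {
    std::vector<double> state(1024, 0.0);
    int step = 0;
    if (load_checkpoint(step, state))
        std::cout << "restarting from step " << step << "\n";

    for (; step < 100; ++step) {
        for (double& x : state) x += 0.5;              // the "work" for this step
        if (step % 10 == 0) save_checkpoint(step + 1, state);
    }
    std::cout << "done; state[0] = " << state[0] << "\n";
}
```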

Getting there

Collaboration will be the key to getting to exascale by 2020 or earlier.

There are two places where the community is already beginning to work closely together. The first is the International Exascale Software Project (IESP) funded by the NSF and DOE.

The IESP was launched at SC08 and held its first meeting in Santa Fe, New Mexico in the spring of 2009. The goal of the project has been to bring together scientists from around the world to identify the key software challenges for exascale computing. There have been four meetings, the most recent in Oxford, UK last April. Collectively, the community has created a software roadmap for exascale computing. It includes the issues described above, as well as models for how the open source software community and vendors can work closely together. Providing a roadmap for the key areas where innovation and investment are required helps broaden the effort to include scientists from around the world, as well as focus the work on the most important components. As our platforms and software become more complex, we need improved models of collaboration and coordination.

The other key nexus for collaboration among industry, academia, and government is the set of recently announced co-design centers that the DOE will begin funding by the end of the year. The goal for a co-design center is to take an application domain, such as nuclear energy, and work with the vendor to evolve and scale the current application codes together with the design of the next-generation platform. In this way, the simulation and modeling codes can affect the design of the next platform, and the platform can be designed to run the application codes.

Naturally, significant give-and-take will be required. There will be architectural tradeoffs that could significantly benefit application codes, provided they switch to a different algorithm or restructure their parallelism. The IBM Blue Gene architecture, the result of collaboration among Argonne National Laboratory, Lawrence Livermore National Laboratory, and IBM, was designed in a partnership that balanced power, performance, cost, and usability. Co-design centers will broaden this concept to include a larger community, universities, and more applications. I expect that this will be the new model for all future platforms.

The future is exciting, and as we move toward exascale computing, we will take on new and interesting grand challenges and rewrite conventional wisdom, building on our existing software and decades of experience.