On the Practicality of an ExaFLOPS Computer System by 2020

The view from Steve Wallach, Co-Founder and Chief Scientist, Convey Computer

The minimum and maximum boundaries of what it will take to get an exascale system are pretty well established. For example, there are well-defined limits on the power consumption and space of an exascale datacenter. And we know what a petaFLOPs system looks like (because there are several on the Top500 list). So it should be relatively easy to explore the probabilities (literally) of developing technologies to take us from petaFLOPs to exaFLOPs. In fact, let’s go one better: let’s explore what it takes to deploy a system that sustains one exaFLOPs (EFLOPs) on the NxN Linpack. (After all, everyone knows that the system peak performance doesn’t matter — it’s Linpack performance that makes it real!)

Any thoughts on the hardware architecture of an exascale system must be complemented by thoughts on the programming model. In DOE Exascale Initiative parlance, the hardware/software environment must be co-designed. Thus, our exploration of exascale realms includes considerations of the programming model and what it will take to deploy applications.

Application-specific computing: it’s not just a good idea, it’s the bottom line

Arguably, at the hardware level the architecture of an individual node is the most important component of an EFLOPs system. Sure, we’re going to need extreme resiliency. (Someone once said that the reason an exascale system must be so fast is that it has to solve a problem before it breaks!) And a fast interconnect. But at the heart of the discussion is the ability to get 1,000 times the performance of today’s systems with ~10 times the power.

The only way to do that is to make better use of the transistors used to solve a given algorithm. A string of general-purpose instructions necessarily wastes time, heat, and real estate—and that’s never going to get us to 1,000 times the per node performance.

Several years ago computer science researchers at U.C. Berkeley came up with the idea of a “computational motif” — a set of essential computational patterns found in almost all HPC applications. The point is that one size doesn’t fit all. Put another way, an application-specific computer system designed to tackle an individual motif will by definition be more efficient than a general-purpose engine.

In our case, we’re talking about a system that gets us an EFLOPs on the NxN Linpack, so the application is essentially “matrix multiply.” How do we build a Linpack-specific node such that we “only” need around 16,000 of them to get us to exascale? First of all, the system must be much more efficient than today’s systems. For example, in 1996, the leader in the NxN Linpack (the NEC SX-4/20, a vector machine) used 95 percent of the system’s peak performance. Last year’s Top500 leader (the ORNL Jaguar system) uses 75 percent of peak to attain 1.76 PFLOPs. This year’s leader, the Chinese Tianhe-1A, only sustains 53 percent of peak. This is going in the wrong direction.
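
To make the trend concrete, here is a minimal sketch (in Python, using only the percentages quoted above) of what sustained efficiency implies at exascale: the lower the fraction of peak you can sustain on Linpack, the more peak hardware, and therefore power, you have to deploy to reach one EFLOPs.

```python
# Minimal sketch: peak hardware needed to sustain 1 EFLOPs on the NxN
# Linpack at the efficiencies quoted above. The percentages come from
# the text; everything else is illustrative.

TARGET_SUSTAINED_EFLOPS = 1.0

efficiencies = {
    "NEC SX-4/20 class (~95% of peak)": 0.95,
    "Jaguar class (~75% of peak)":      0.75,
    "Tianhe-1A class (~53% of peak)":   0.53,
}

for label, eff in efficiencies.items():
    peak_needed = TARGET_SUSTAINED_EFLOPS / eff
    print(f"{label}: ~{peak_needed:.2f} EFLOPs of peak required")
```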

Bottom line: A heterogeneous architecture, tuned to the problem being solved, is virtually a requirement for a practical exascale system.

Moore’s Law, FPGAs, and PGAS, oh my!

Let’s take a look at the individual components of a hypothetical node, and the technology involved in attaining the performance needed:

There will be ~8x the logic gates available in a node compared with today’s systems, at roughly the same clock rate as today’s hardware. Moore’s law gives us the gates; however, the “power wall” keeps clock rates level. A node will draw around 2 kilowatts and occupy 3-4U of rack space.

The computational architecture of a node will be application-specific. In the case of Linpack, the node will be oriented around matrix manipulation. (In fact, at Convey we have already developed a general-purpose matrix/arithmetic personality for our reconfigurable systems.) The heterogeneity of a node is almost a certainty.
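
For concreteness, the kernel a matrix-oriented node is built around looks like the blocked matrix multiply sketched below, which dominates the work in Linpack’s LU factorization. This is an illustrative Python/NumPy sketch of the access pattern only, not Convey’s design; the block size is an arbitrary stand-in for whatever tile fits a node’s local memory.

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """C = A @ B computed tile by tile: the dense linear-algebra pattern
    an application-specific, matrix-oriented node would implement in
    hardware. Assumes square matrices of matching size."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, block):
        for k in range(0, n, block):
            for j in range(0, n, block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
                )
    return C
```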

The base components of the system will be reconfigurable, implemented in FPGAs. This includes the memory subsystem as well as the computational units. Structured grids, graph traversal, sparse matrices, and other motifs each achieve optimal performance with a different type of memory system. The use of FPGAs is likely (especially if FPGAs have hard IP floating-point gates), although custom ASICs or GPGPUs are alternatives.
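
As a rough illustration of why one memory system does not fit all motifs, compare the access patterns of a dense matrix-vector product (unit-stride streaming) with a sparse matrix-vector product in the standard CSR layout (indexed gathers). The Python below is only a sketch of the two patterns, not a statement about any particular hardware.

```python
import numpy as np

def dense_mv(A, x):
    # Streams A with unit stride: rewards wide, sequential memory bandwidth.
    return A @ x

def csr_mv(values, col_idx, row_ptr, x):
    # Gathers x through col_idx: rewards a memory system built for
    # scatter/gather access rather than long sequential bursts.
    y = np.zeros(len(row_ptr) - 1)
    for row in range(len(y)):
        start, end = row_ptr[row], row_ptr[row + 1]
        y[row] = np.dot(values[start:end], x[col_idx[start:end]])
    return y
```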

The system software and development tools will understand the “exa” nature of the architecture. For example, the compilers and the underlying hardware’s global address space will be PGAS (partitioned global address space) ready, which means they are designed to support languages like UPC and Co-Array Fortran. While attaining an EFLOPs on Linpack doesn’t necessarily require a PGAS programming model, commercial acceptance of exascale technologies definitely demands it.
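
The core idea behind PGAS is simple enough to sketch: one global index space, physically partitioned across nodes, so the compiler can distinguish local references from remote ones. The node count and block size below are assumptions for illustration (the node count anticipates the estimate in the next paragraph).

```python
# Sketch of a partitioned global address space (PGAS) under a simple
# block distribution. NODES and BLOCK are illustrative assumptions.

NODES = 16_000        # hypothetical node count (see the estimate below)
BLOCK = 1 << 20       # elements owned contiguously by each node

def owner(global_index):
    """Map a global index to (owning node, local offset)."""
    return global_index // BLOCK, global_index % BLOCK

# A PGAS compiler (UPC, Co-Array Fortran) turns a reference that lands on
# a remote node into a one-sided get/put to (node, offset); references
# that land locally compile to plain loads and stores.
node, offset = owner(5_000_000_000)
```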

Given these components, if everything lines up, we will get ~800 times the performance of today’s ~80 GFLOPs node, or 64 TFLOPs per node. With a high-performance interconnect (probably optical), “only” 16,000 nodes will be required to reach 1.024 EFLOPs. Such a system would require 32 megawatts of power in 1,142 racks (although we’re not including I/O and interconnect power overhead in this calculation).
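
The arithmetic behind those figures is worth laying out explicitly. The sketch below simply recomputes the estimates in the preceding paragraph and, as noted there, ignores I/O and interconnect power.

```python
# Recomputing the node- and system-level estimates above. All inputs are
# the estimates from the text; I/O and interconnect power are excluded.

NODE_GFLOPS_TODAY = 80                       # today's ~80 GFLOPs node
PER_NODE_SPEEDUP = 800                       # ~800x per-node improvement
NODE_TFLOPS = NODE_GFLOPS_TODAY * PER_NODE_SPEEDUP / 1_000   # 64 TFLOPs

NODES = 16_000
SYSTEM_EFLOPS = NODES * NODE_TFLOPS / 1_000_000              # 1.024 EFLOPs

NODE_KILOWATTS = 2
SYSTEM_MEGAWATTS = NODES * NODE_KILOWATTS / 1_000            # 32 MW

RACKS = 1_142
NODES_PER_RACK = NODES / RACKS               # ~14 nodes of 3-4U each per rack

print(NODE_TFLOPS, SYSTEM_EFLOPS, SYSTEM_MEGAWATTS, NODES_PER_RACK)
```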

As upside probabilities, advances in several areas could reduce the power requirements. For example, 3D semiconductor packaging holds promise for reducing chip-to-chip interface power overhead. More floating-point functional units per node would reduce the number of nodes required. And, depending on the “other” uses of the system, the amount of physical memory could be reduced (effectively reducing the bytes/FLOPs ratio).

In short, using application-specific, reconfigurable computing technologies, combined with advances in semiconductor and FPGA technologies, there is a high probability of constructing a practical EFLOPs system. It will just be a small matter of time, money, software development, and a place to put it.