The US Department of Energy used its massive budget to push supercomputers to gigaflops, teraflops, and petaflops in the prior three decades and it is being tasked to put the pedal to the exaflops metal before the end of this decade.
To get there, the DOE has to fund primary research at IT vendors who might otherwise not get around to it until it suited their own commercial needs. It has to also foster collaboration across vendors who might otherwise rather not share ideas, because no one vendor is going to be able to solve the exascale problem by itself.
The main vehicle for funding exascale computing is called the Extreme-Scale Computing Research and Development program, which is being funded by both halves of the DOE. That would be the Office of Science, which funds scientific research in the nuke labs, and the National Nuclear Security Administration, which runs simulations to make sure the US military’s nuclear warheads work since Uncle Sam can’t set one off thanks to the Nuclear Test Ban Treaty. There is talk that the supers at the DOE labs aren’t just making sure existing nukes work, but also helping to redesign them.
The first phase of the DOE’s exascale system funding is called FastForward, which is being administered by Lawrence Livermore National Laboratory in conjunction with the six other primary DOE nuke labs (some of which dislike being called nuke labs even though they do nuclear physics research).
Those other DOE labs, along with LLNL, are the name brands in high performance computing in the United States: Argonne National Laboratory, Lawrence Berkeley National Laboratory, Los Alamos National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and Sandia National Laboratories.
The FastForward exascale research program issued its request for proposals on March 29, and asked that they be submitted by May 11. The program seeks to fund basic research in exascale computing as it relates to three areas: Memory, processors, and storage and I/O.
It has an explicit goal of trying to solicit cooperation across multiple companies, much like the US Defense Advanced Research Project Agency’s Ubiquitous High Performance Computing program. In a way, the UHPC program at DARPA is the trailblazer for the FastForward program at DOE.
DARPA always first to fight
The UHPC program was announced in March 2010 with the goal of creating an HPC system that by 2018 can do 50 gigaflops per watt (BlueGene/Q, the current top performer and most efficient super in the world, can do a little more than 2 gigaflops per watt) and pack 10 petabytes of storage and do around 3 petaflops of number crunching into a slight larger server rack than is standard and within a 57 kilowatt power budget.
Building an exascale system would seem easier, by comparison, since there is, in theory, no limit on the size of the machine or its power budget. But in reality, there are big-time power limits on exascale supers because no one is going to build a 20 megawatt nuclear or coal power station to keep one fed and cooled.
In August 2010, two teams were awarded UHPC ExtremeScale contracts with a total of $74m: one lead by Nvidia and the other Intel. Nvidia got a $25m grant and has teamed up with Cray, Oak Ridge National Lab, and six universities. Intel teamed up with three universities, SGI, Lockheed Martin, Cray, Reservoir Labs, and ET International to take down a $49m grant.
In three related UHPC grants, Sandia National Lab has teamed up LexusNexus and two universities, MIT has its own grant, and so does Georgia Tech, apparently. Total funding for the UHPC effort is said to be on the order of $100m, but DARPA has never confirmed that figure.
Three steps to DOE-sponsored exascale computing
With the FastForward program, the DOE is setting a cap of $20m on any proposals to try to encourage focused work on specific problems, and said at the get-go that what it was looking for was more like two $10m proposals in each of the three areas of primary research.
It is not clear how many awards have been made yet – the vendors are not notified of who was bidding and who won, but rather that they won. At the moment, AMD has been awarded a FastForward contract for processor and memory research and Whamcloud has one contract for storage and I/O research. There could be – and probably will be – others getting grants. Uncle Sam likes to hedge its HPC bets.
Once the primary research on possible exascale technologies is completed over the next two years, DOE will be looking at funding vendors to put together prototypes – this is tentatively called the system design phase – and then, by 2020, to build full exascale systems based on those prototypes – known as the system build phase at the moment. DOE will no doubt come up with other names later on.
According to the statement of work (PDF) for the FastForward contract, the issues that vendors face on the exascale challenge are daunting.
On a current petaflops-class system today, it costs somewhere between $5m and $10m to power and cool the machine today, and extrapolating to an exascale machine using current technology, even with efficiency improvements, you would be in for $2.5bn a year just to power an exascale beast and you would need something on the order of 1,000 megawatts to power it up. That’s 50 nuclear reactors, more or less. The DOE has set a target of a top juice consumption at 20 megawatts for an exascale system.
Using DDR3 main memory today, a 2 petaflop machine with 2PB of main memory burns about 1.25 megawatts, and assuming that we can get to DDR5 main memory by 2020, we’re talking about needing 260 megawatts just for the memory subsystem in an exascale box. Even if you cut the memory-to-flops ratio by a factor of five, which many people don’t think is a good idea, and you are above 50 megawatts just for the memory subsystems across a cluster.
In addition to power consumption, memory components are not getting as cheap as CPU components, and memory bandwidth is not keeping up with the ever-increasing core count on processors and thus memory latencies are increasing.
There are resiliency issues with all of the components in an exascale system, which will have large numbers of components frying all the time. And then you are going to have billions of compute elements, and there has to be a hierarchy of memory and interconnects to keep them all fed and communicating with each other as simulations run.
Worse still, programming these petaflops machines is a complete bitch, and an exaflops system will be in the range of old battle-axe mother-in-law. Beyond that, you are programming against Death.
On the processor front, during the FastForward phase, the DOE is looking to better measure and control the power use in processors and integration with memory, network, and optics from the CPU or hybrid CPU-GPU chip, as the case may be. On its wish list, the DOE wants automatic rollback after faults or synchronization errors and better fault detection and correction.
Boosting the movement of data onto and off of the chip is also key, as is handling collective operations across compute elements, and software-controlled placement of data on the chip and its memory hierarchy is also penciled in. Putting network interfaces on the processor is a requirement, and so is boosting the concurrency across many cores and many threads on the cores.
The compute elements of the FastForward potion of the project have to provide 50 gigaflops per watt at scale – that’s the same level of performance per watt that DARPA is looking for its ExtremeScale UHPC project. The system has to have a mean time between application failure of six days or larger.
This doesn’t sound so great until you realize the system will have trillions of components and that today, with petaflops-class machines, it is on the order of one to five days and, without check-pointing or other resilience mechanisms will drop to about six hours by 2020.
DOE would like to have compute nodes with more than 10 teraflops of double-precision number-crunching performance, 4TB/sec of aggregate memory bandwidth and more than 100GB of main memory; something on the order of 32GB to 640GB is preferred. Total bandwidth between a node and the interconnect that lashes them together should be in excess of 400GB/sec.
The burden of memory
On the memory front, DRAM failure rates are higher than expected and density improvements in memory chips are not coming fast enough. So DOE wants researchers to explore the use of in-memory processing – literally putting tiny compute elements in the memory to do vector math or scatter/gather operations – as well as the integration of various forms of non-volatile storage into exascale systems.
The nuke labs are thinking that 500GB of NVRAM Of some sort per socket will do the trick. While 4TB/sec of bandwidth is a baseline, DOE really wants 10TB/sec.
Parallel storage subsystems generally hold up better than compute nodes on exascale systems these days, with the DOE estimating that the meantime between application failure due to a storage issue being around 20 days. Without any substantial changes to storage architectures, that will drop to 14 days by 2020. Disk capacity is increasing at a decent clip, but disk performance is not. Solid state drives are fast, but they ain’t cheap.
If availability is not as big of an issue for exabyte-class storage, then scale surely is. That exascale system in 2020 will have between 100,000 and 1 million nodes, and will have somewhere between 100 million and 1 billion computing elements, with somewhere between 30PB and 60PB of memory, and across which some sort of concurrency will have to be provided to run applications.
This behemoth will require from 600PB to 3,000PB of disk capacity. In effect, the disk array for an exascale compute farm will be an exascale system in its own right, with peak I/O burst rates on the order of 200TB/sec and metadata transaction rates on the order of 100MB/sec.
For the FastForward storage research projects, DOE wants a storage system that can keep the fully running exascale system fed, without crashing, for 30 days or more, and the mean time between unrecoverable data loss should be 120 days or higher – and do so with the storage array crammed to 80 per cent of capacity and performing full memory dumps from the system every hour.
Data integrity algorithms for storage can impose no more than 10 per cent overhead on the metadata servers at the heart of the storage array. Metadata insert rates are expected to be on the order of 1 million to 100 million per second, and lookup and retrievals are expected to be on the order of 100,000 to 10 million per second out of the metadata servers.
During peak system writing and reading operations, the metadata servers can’t take any more than a 25 per cent performance degradation hit, and DOE would really like to be 10 per cent.
No big deal, right?
So, good luck, AMD, Whamcloud, and friends.
AMD received research grants under the FastForward portion of the DOE Extreme-Scale Computing program for both processing and memory research, and according to Alan Lee, corporate vice president for advanced research and development at the chip maker, the reason is because the two are interrelated.
Lee was not able to elaborate much on the research plans AMD has put together, but he did confirm to El Reg that AMD would be focusing on research to push its hybrid CPU-GPU processors, what the company calls its Accelerated Processing Units or APUs. On the memory side, AMD is looking a different types of memory, different structures and hierarchies of memory, and different relationships between these memories and the APUs, and that this will, of course, necessarily involve system interconnect work.
“Moving data around to feed the beast is critical for exascale,” explained Lee, adding that the SeaMicro acquisition earlier this year was not done for this DOE work, but the interconnect expertise that AMD gained through that acquisition would be put to good use.
AMD researchers have already identified a subset of key memory technologies that they think will be applicable to exascale-class systems, and this is what the research will focus on. AMD is not throwing the whole kitchen sink of possible volatile and non-volatile memories into the mix.
Lee was not at liberty to say what memory technologies AMD was looking at – that would be helping its inevitable competition. AMD has received a grant of $3m for the memory research and $12.6m for the processor research. It is interesting that AMD was able to bag these contracts all by its lonesome specifically after the DOE said that it wanted multiple companies cooperating on the work.
On the storage front, Whamcloud, the company that was formed in July 2010 to support and extend the open source Lustre file system, is the leading contractor and is soliciting help from a bunch of others.
Whamcloud is managing the project and lending its Lustre file system expertise and is relying on HDF Group for application I/O expertise, EMC for system I/O and I/O aggregation skills, and Cray for scale-out testing of the storage systems. This exascale storage system will have a mix of flash and disk drives.
The word on the street is that Whamcloud received around $8m for its FastForward grant. ®