Cray Lands $188m Blue Waters NCSA Contract, Breaking 10 Petaflops with Opteron-GPU Tag Team

By Timothy Prickett Morgan

If there was a reason that Cray CEO Peter Ungaro, who formerly ran IBM’s high performance computing biz, was a little extra perky when the company announced its third quarter financials two weeks ago, it was not just because the SC11 supercomputing trade show was coming to Cray’s hometown of Seattle this week. Or that Advanced Micro Devices was once again late with an Opteron launch and had hurt its numbers.

It was because Ungaro knew that the National Science Foundation had ordered the budget for the “Blue Waters” petaflops-scale super to be rejigged and sent out for rebid – and Cray was in the hunt to win what turns out to be a $188m deal.

Bill Kramer, deputy project director of the Blue Waters project at the National Center for Supercomputing Applications at the University of Illinois, tells El Reg that Blue Waters was not a specific system, but rather a complete set of infrastructure, including a data center, plus computation, networking, and storage and, most importantly given the software goals of the NCSA, code that scales to real-world petaflops performance.

“We wanted to change the computational elements, and we got approval after a peer review,” says Kramer.

So out goes IBM’s super-dense Power 775 cluster version of Blue Waters, which Big Blue pulled the plug on back in August because the Power7-based machine was more expensive to manufacture than the company thought when it won the competitive bidding for the project back in 2007.

In comes the largest system that Cray has built in its history – at least until the 10 to 20 petaflops “Titan” ceepie-geepie hybrid that Cray is building for Oak Ridge National Laboratory is installed, and then only if Titan is fully extended.

Like Titan, the Blue Waters system that Cray is building for NCSA will be a mix of standard XE6 Opteron blade server nodes and XK6 mixed CPU-GPU nodes, all linked together in a single network using the “Gemini” XE interconnect created by Cray. The XE6 blades will be equipped with eight sockets of 16-core “Interlagos” Opteron 6200 processors and the XK6 blades will have four Opteron sockets, each paired with one Nvidia GPU coprocessor.

As with Titan, the Blue Waters system that Cray has pitched to NCSA will be based on Nvidia’s next-generation “Kepler” GPUs, which were expected by the end of this year when Nvidia outed its roadmap unexpectedly in September 2010 but which are clearly coming sometime in 2012, not in time for Christmas shopping this year.

The Kepler GPUs will offer a significant performance bump over the current 512-core “Fermi” graphics engines that are used in Nvidia’s GeForce and Quadro video cards and Tesla coprocessors. They will be fabbed by Taiwan Semiconductor Manufacturing Corp using its 28 nanometer processes, which should allow Nvidia to put a lot more cores into a GPU as well as boost the clock speeds a bit. Nvidia hasn’t said much about the Kepler GPU design, but it looks like notebook makers are going to get first stab at them early next year and that performance will be more than double what the Fermi GPUs do today. The top-end Fermi GPUs with all 512 cores running at 1.3GHz can deliver 665 gigaflops of double-precision floating point performance.
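The arithmetic behind a top-end Fermi's double-precision peak can be sketched in a few lines of Python. This is a back-of-envelope check, not Nvidia's own math: it assumes each core retires one fused multiply-add (two floating point operations) per clock in single precision and that double precision runs at half that rate, and the function name is purely illustrative.

```python
def fermi_peak_dp_gflops(cores=512, clock_ghz=1.3,
                         flops_per_core_per_clock=2, dp_ratio=0.5):
    """Estimate peak double-precision gigaflops for a Fermi-class GPU.

    flops_per_core_per_clock and dp_ratio are assumptions (one FMA per
    clock per core, double precision at half the single-precision rate).
    """
    single_precision = cores * clock_ghz * flops_per_core_per_clock
    return single_precision * dp_ratio


dp_gflops = fermi_peak_dp_gflops()
print(f"{dp_gflops:.1f} gigaflops double precision")
```

Under those assumptions the result lands in the hundreds of gigaflops per GPU, which is the right order of magnitude for a single coprocessor.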

The specifications of the rebooted Blue Waters system are still a bit in flux, but here’s what NCSA is getting for that $188m. The machine will have at least 235 XE6 cabinets, more than 30 XK6 cabinets, and more than 30 storage and I/O server cabinets. The resulting machine will have more than 49,000 Opteron 6200 processors and more than 380,000 cores, with another 3,000 Nvidia GPU coprocessors packing a hell of a floating point wallop as well. (If the Keplers offer twice the double-precision floating point performance of the Fermis, then those 3,000 GPU coprocessors will account for about 4 petaflops of aggregate oomph.) The plan is to put 4GB of main CPU memory per core into the machine, for a total of more than 1.5PB.
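For readers who want to check the parenthetical math, here is a rough sketch using only the round numbers quoted above. The 2x Kepler multiplier is the article's "if", not a published spec, the per-GPU Fermi figure is taken as 665 gigaflops, and decimal units are assumed throughout.

```python
GPUS = 3_000               # Nvidia coprocessors in the XK6 cabinets
FERMI_DP_GFLOPS = 665      # per-GPU double-precision peak for top-end Fermi
KEPLER_SPEEDUP = 2         # assumption: Kepler doubles Fermi's DP rate
CORES = 380_000            # "more than 380,000 cores"
GB_PER_CORE = 4            # planned main memory per core

# 1 petaflops = 1e6 gigaflops; 1 PB = 1e6 GB (decimal units)
gpu_petaflops = GPUS * FERMI_DP_GFLOPS * KEPLER_SPEEDUP / 1e6
memory_pb = CORES * GB_PER_CORE / 1e6

print(f"GPU share of peak: about {gpu_petaflops:.2f} petaflops")
print(f"Main memory: about {memory_pb:.2f} PB")
```

Both figures are floors or estimates rather than final specs, since the counts are quoted as "more than" and the Kepler performance is not yet known.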

The XE6 and XK6 nodes will all be linked to each other using the XE interconnect through a 3D torus, which will require over 9,000 wires to link it all together. (That’s around 4,500 kilometers of wire if you strung it all out.) The Blue Waters machine is expected to have 11.5 petaflops of aggregate peak number-crunching performance.
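To make the 3D torus idea concrete, here is a minimal sketch of how neighbors wrap around in such a topology. The 4x4x4 dimensions and the function name are hypothetical and chosen only for illustration; the actual dimensions of the Blue Waters torus are not given here, and this says nothing about how the Gemini routers are actually wired.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of coord in a 3D torus of size dims.

    Each axis wraps around, so a node on one face of the box is directly
    linked to the node on the opposite face.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


DIMS = (4, 4, 4)  # hypothetical torus size, not the real machine's
corner = torus_neighbors((0, 0, 0), DIMS)
print(corner)
```

The wraparound is what distinguishes a torus from a plain mesh: even a corner node has six links, which keeps worst-case hop counts down as the machine grows.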

Kramer says that the plan is to run the 16-core Opteron 6200 processors with half the cores sleeping a lot of the time, giving the remaining eight cores full access to the floating point units that are shared by the two integer cores in each Bulldozer module. By doing this, NCSA will be able to run the cores in a Turbo Core mode that can add anywhere from 600MHz to 1.3GHz of extra clocks to the chip, depending on the Opteron 6200 processor that NCSA chooses for the machine.

“We studied this quite a bit, but for a lot of our applications, it makes sense to run it like an eight-core with dual 128-bit floating point performance,” says Kramer.
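The trade-off Kramer describes can be sketched per clock cycle. The model below assumes each Bulldozer module has two 128-bit fused multiply-add pipes (two doubles per pipe, two operations per FMA), shared evenly by whichever cores in the module are awake; it ignores clock speed and Turbo Core entirely, and the names are illustrative only.

```python
MODULES_PER_SOCKET = 8     # a 16-core Opteron 6200 has eight two-core modules
FMA_PIPES_PER_MODULE = 2   # assumption: two 128-bit FMA pipes per module
DOUBLES_PER_PIPE = 2       # 128 bits / 64-bit double
FLOPS_PER_FMA = 2          # one multiply plus one add


def dp_flops_per_cycle_per_core(active_cores_per_module):
    """Peak double-precision flops per cycle available to each awake core."""
    per_module = FMA_PIPES_PER_MODULE * DOUBLES_PER_PIPE * FLOPS_PER_FMA
    return per_module / active_cores_per_module


sixteen_core_mode = dp_flops_per_cycle_per_core(2)
eight_core_mode = dp_flops_per_cycle_per_core(1)
socket_peak = MODULES_PER_SOCKET * FMA_PIPES_PER_MODULE * DOUBLES_PER_PIPE * FLOPS_PER_FMA

print(sixteen_core_mode, eight_core_mode, socket_peak)
```

In this simplified model the socket's peak per cycle is the same either way; what changes is that each awake core gets the whole floating point unit to itself, and any Turbo Core clock boost comes on top of that.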

Cray is also building a storage subsystem for the Blue Waters machine, which will run the Lustre file system and deliver 25PB of usable disk capacity. It is not clear how that local storage will be linked into the system, but that 25PB of capacity will be accessible through more than 1TB/sec of aggregate bandwidth from the cluster. There is an additional 500PB of nearline storage also being added to the machine, which will have 100GB/sec of bandwidth into the cluster. The external network coming into the Blue Waters machine will have 100Gb/sec of bandwidth at first, and will eventually scale up to 300Gb/sec.
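To put those storage and network figures in perspective, here is a quick sketch of the unit arithmetic. It uses decimal units, treats the quoted bandwidths as exact rather than "more than", and ignores all protocol and file system overhead, so the results are idealized bounds and not measured numbers.

```python
DISK_CAPACITY_PB = 25        # usable Lustre capacity
DISK_BANDWIDTH_TB_S = 1      # aggregate bandwidth from the cluster
EXTERNAL_LINK_GBIT_S = 100   # initial external network, in gigabits/sec

# 1 PB = 1,000 TB, so streaming the whole file system once takes:
full_scan_seconds = DISK_CAPACITY_PB * 1_000 / DISK_BANDWIDTH_TB_S
full_scan_hours = full_scan_seconds / 3_600

# Gigabits to gigabytes: divide by 8
external_gbytes_s = EXTERNAL_LINK_GBIT_S / 8

print(f"Full 25PB scan at 1TB/sec: about {full_scan_hours:.1f} hours")
print(f"External link: {external_gbytes_s} GB/sec")
```

Note the capital-B versus lowercase-b distinction in the figures: the nearline and disk bandwidths are quoted in bytes, while the external network is quoted in bits.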

The Blue Waters machine will run the Cray Linux Environment, the company’s tweak on SUSE Linux 11 that also has an Ethernet network compatibility mode that allows applications compiled for Linux machines clustered using Ethernet to run unmodified and in emulation mode on top of the XE interconnect.

The Blue Waters machine will be delivered by Cray in phases over the next six to nine months, with the initial delivery early next year and early program deployment by the middle of 2012. The plan is to have the full machine operational by the end of 2012, which is more or less the same timing that NCSA expected with the IBM Power7 Blue Waters behemoth. ®

This article originally appeared in The Register. It appears here in its entirety as part of a cross-publishing agreement.


Comments

  1. This sentence “The XE6 blades will be equipped with eight sockets of 16-core “Interlagos” Opteron 6200 processors and the XK6 blades will have four Opteron sockets and one Nvidia GPU coprocessor per blade.” sounds weird to me.

    I believe it is “one GPU per socket”, not “per blade” (-:

    F.

  2. 512 x 1.3GHz Fermi cores surely deliver 665 GIGAflops, not TERAflops eh?