Is the US Back on the Road to Exascale?

Tom Wilkie, Scientific Computing World

After two years of pessimism, the US Supercomputing Conference, SC14, held this year in New Orleans in late November, was suffused with confidence about the future. The change of mood was triggered by the announcement, on the Friday before the event opened, that the US Government was to spend $325m on two new supercomputers, and a further $100m on technology development, to put the USA back on the road to exascale computing.

The significance of the announcement goes far beyond the specialism of high-performance computing (HPC) into enterprise computing, where the technologies being developed for HPC could transform this much wider and financially more important sector of the economy, according to the members of the winning partnership of IBM, Nvidia, and Mellanox. ‘There are game-changing elements to what we are doing,’ Ken King, general manager of OpenPower Alliances at IBM, told Scientific Computing World.

The decision by IBM to open up its Power architecture as a way of fast-tracking technological innovation – collaboratively, rather than by one company going it alone – also raises an interesting question: if a company the size of IBM feels it cannot develop exascale technology by itself, then can any other computer companies offer credible exascale development paths unaided?

IBM briefings during the week of SC14 understandably had an air not just of the cat having got the cream, but of it having been handed the keys to the whole dairy. Together with Nvidia’s Volta GPU and Mellanox’s interconnect technologies, IBM’s Power architecture won the contracts to supply the next-generation supercomputers for the US Oak Ridge National Laboratory and the US Lawrence Livermore National Laboratory.

Although $325m is now coming the consortium’s way, Ken King stressed: ‘From our perspective, more important than the money is the validation of our strategy – that’s what’s getting us excited.’ As Sumit Gupta, general manager of accelerated computing at Nvidia, put it in an interview: ‘IBM is back. They have a solid HPC roadmap.’

The decision marks a turnaround in IBM’s standing in high-performance computing: its reputation was tarnished when, after four years of trying, it pulled out of the contract to build the Blue Waters system at the US National Center for Supercomputing Applications (NCSA) at the University of Illinois in 2011. Originally awarded in 2007, the contract was reassigned to Cray, which fulfilled the order at a cost of around $188m.

In the corridors of SC14, the consensus was that the announcement was an endorsement of IBM’s decision to open up its Power architecture to members of the OpenPower Foundation and thus build a broad ‘ecosystem’ to support the technology. Gupta pointed out that IBM could have tried to go it alone, but decided to partner with Nvidia and Mellanox via the OpenPower Foundation, and work with them on the bid. ‘Opening the Power architecture – this is the new roadmap and validates what we have done together. When given a fair choice, this is the preferred architecture.’

The fact that both Oak Ridge and Livermore chose the same architecture was widely seen as a powerful endorsement of this technology development path, particularly as the two laboratories were free to choose different systems because they are funded from different portions of the US Department of Energy (DoE) budget – Oak Ridge from the Office of Science and Livermore from the National Nuclear Security Administration.

David Turek, vice president of Technical Computing OpenPower at IBM, pointed out that Livermore has no accelerator-based applications but is nonetheless choosing a heterogeneous architecture; he claimed it was the application engineers at Oak Ridge who pressed most strongly for the system. ‘The jump to IBM is significant,’ he said.

The third member of the Collaboration of Oak Ridge, Argonne, and Lawrence Livermore (CORAL) project, Argonne National Laboratory, is also funded by the Office of Science within DoE and is therefore constrained to choose a different system from Oak Ridge’s. The Argonne announcement has been deferred into the New Year.

The delay has inevitably prompted speculation that Argonne too would have preferred the Power-based solution. After all, Argonne’s current machine is an IBM Blue Gene/Q – called ‘Mira’ – that already uses 16-core PowerPC A2 processors. But the laboratory is constrained by the purchasing rules to make a different choice.

Cray has publicly announced that it was not participating in the CORAL bidding process, so it is not clear to which alternative provider Argonne can turn. However, Paul Messina, director of science for the Argonne Leadership Computing Facility, said in an interview: ‘There were more than enough proposals to choose from.’ The Argonne machine will use a different architecture from the combined CPU–GPU approach and will almost certainly resemble Argonne’s current IBM machine, which networks together many small but identical processors, an approach that has proved popular for biological simulations.

While the CORAL systems will perform at about 100 to 200 petaflops, Messina thought that their successors were unlikely to stop at 500 petaflops; rather, a true exascale machine would be delivered by 2022, although full production-level computing might start later than that.

Gupta’s view that opening up the Power architecture was the new roadmap was echoed by IBM’s David Turek. He said: ‘We could not have bid for CORAL without OpenPower. It would have cost hundreds of millions of dollars and taken us years. Why waste time and money if we could leverage OpenPower to within 5 per cent of its performance peak? We have lopped years off our plan.’ And in that accelerated development pathway, OpenPower ‘is critical to us’.

As an example, he cited the tie-up with Mellanox: although IBM has smart people in networking, he said, by itself it did not command enough expertise. Mellanox had unveiled its EDR 100Gb/s InfiniBand interconnect in June this year at the ISC’14 supercomputing conference in Leipzig, and it will have a central role in the new CORAL systems. However, Brian Sparks from Mellanox pointed out that the company intends to have an even faster interconnect than EDR available for CORAL: ‘200G by 2017 is on our roadmap,’ he said.

IBM announced the ‘OpenPower Consortium’ in August 2013 and said that it would: open up the technology surrounding its Power Architecture offerings, such as processor specifications, firmware, and software; offer these on a liberal licence; and use a collaborative development model with its partners. However, Turek continued, IBM had not outsourced innovation to OpenPower: ‘The bulk of innovation is organic to IBM.’

This is the first in a series of articles by Tom Wilkie, prompted by last week’s SC14 supercomputing conference and exhibition in New Orleans.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.