New Approaches to Energy Efficient Exascale

In this feature, Tom Wilkie from Scientific Computing World reports on approaches to energy efficiency on display at ISC’14.

ARM-based processors or Intel Phi co-processors could form the heart of energy-efficient architectures paving the way to exascale machines, and both types of technology were on display at the International Supercomputing Conference (ISC’14) held in Leipzig at the end of June.

AppliedMicro Circuits Corporation, a California-based data centre semiconductor company, caught the attention of many delegates with its X-Gene, the first ARMv8 64-bit-based Server on a Chip solution. Sharing the same stand, SoftIron, a company based in Southampton, UK, was demonstrating how it has used the AppliedMicro X-Gene to create its enterprise-grade 64-0800 server motherboard.

A different approach, the European Dynamical Exascale Entry Platform (DEEP) project, will go live later this year at the Juelich Supercomputer Centre in Germany. It employs more ‘conventional’ hardware from Intel – both Xeon CPUs and Phi co-processors – but does so in an innovative architecture. As displayed at ISC’14, DEEP combines a standard InfiniBand cluster of Intel Xeon nodes with a new, highly scalable ‘booster’ consisting of Phi co-processors and a high-performance 3D torus network from Extoll, the German interconnect company spun out of the University of Heidelberg. The creators of DEEP believe that using the Phi co-processors in this way will deliver outstanding energy efficiency. But they have not spurned the plumbing approach to cooling either: they have partnered with the Italian company Eurotech to design high-precision cold plates for the booster nodes, with the liquid coolant connections passing through the backplane.

In this video from ISC’14, the DEEP and DEEP-ER Project teams describe their prototype hardware and software.

Interestingly, AppliedMicro’s ARM-based technology has been taken up by two Italian computer companies. Eurotech, which provides both embedded and supercomputing technologies (and is involved in the DEEP project, as noted above), has paired the X-Gene with NVIDIA’s Tesla GPU to create a novel HPC system architecture that combines extreme density and energy efficiency. AppliedMicro has also joined forces with E4 Computer Engineering to design the EK003, a low-power solution that is part of E4’s ARKA series targeting HPC and big data workloads. This also combines the X-Gene CPU with NVIDIA’s Tesla K20 GPU.

Meanwhile, Norman Fraser, CEO of SoftIron, said: “Our server motherboard delivers high performance 64-bit ARM enabled computing with full virtualization to the enterprise space for the first time. Until now, all ARM based servers have been microservers; our server is the first ARM ‘macroserver’ which is a genuine alternative to mainstream x86 servers for a wide range of scale deployment scenarios.” He believes that the new motherboard will offer up to twice the performance-per-watt of more traditional x86 designs and has the potential to be deployed at twice the rack density.

According to Dr Paramesh Gopi, president and CEO of AppliedMicro, the company first licensed the 64-bit architecture in 2009 and high-performance computing was, he felt, the most exciting development of the technology. He noted that AppliedMicro, by partnering with Nvidia for the GPU accelerator aspect and with Mellanox for the interconnects, had ‘got all the pieces together for a production-ready solution’: high-performance processors, memory, and interconnects. The X-Gene offers integrated support for remote direct memory access (RDMA), he said, and thus conforms to Mellanox’s standards.

Eurotech has already adapted its ‘Aurora Brick’ concept to accommodate the ARM architecture, according to the company’s Giovanbattista Mattiussi. The Aurora Brick is a new, highly modular system, based on technology developed in the QPACE 2 project with the University of Regensburg in Germany, that puts a premium on the efficient use of all three dimensions, packing in the electronics while also accommodating their cooling so that no space is left unused, he said. QPACE is an acronym for QCD Parallel Computing on the Cell Broadband Engine, and the project is developing massively parallel and scalable supercomputers for applications in lattice quantum chromodynamics. According to the project leader, Tilo Wettig, professor of theoretical physics in the particle theory group at Regensburg, QPACE 2 is currently in the final debugging phase and is expected to go live later in summer or in early autumn. In the QPACE project, the underlying processor architecture is Intel’s but, Mattiussi said: ‘then we can take it to the next level, and so switch to Nvidia/ARM infrastructure, using standard components.’

The DEEP project also focuses on hardware, albeit more conventional components assembled in a novel architecture, according to Dr Estela Suarez, from the Institute for Advanced Simulation at Juelich in Germany. “We are using accelerators autonomously,” she said. The advantage of clustering the Phi co-processors in the ‘booster’ is that the booster can communicate directly through the network and thus “we can offload much larger parts of the code into the booster and get around the bottleneck that currently exists with standard accelerators.”
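To make the offload idea concrete, the sketch below shows, in purely illustrative form, how a large, highly parallel loop might be pushed onto an accelerator while the host keeps the rest of the application. It uses standard OpenMP 4.x target directives rather than the extended OmpSs offload machinery that DEEP itself employs, so the directives and the trivial kernel here are assumptions for illustration only.

    /* Illustrative sketch only: DEEP offloads whole code parts to its
     * booster via an extended OmpSs model; the standard OpenMP "target"
     * directive below merely illustrates the idea of moving a large,
     * highly parallel region onto an accelerator while the host
     * (cluster side) keeps the less scalable parts of the code. */
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];

        /* Set up data on the host (cluster side). */
        for (int i = 0; i < N; i++) {
            a[i] = (double)i;
            b[i] = 2.0 * i;
        }

        /* Offload the highly scalable kernel to the accelerator
         * ("booster" side); without offload support it simply runs
         * on the host. */
        #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* The host continues with the result. */
        printf("c[42] = %f\n", c[42]);
        return 0;
    }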

Even before the prototype goes fully live at the end of the year in Juelich, the project has been given ‘Extended Reach’ – thus becoming DEEP-ER – to address resiliency and also efficient parallel I/O, both of which represent major challenges to the very large-scale systems that will be deployed in the future. Any exascale system will be so highly parallel, and will contain so many components, that some of them will inevitably fail, and this has to be accepted as part of routine operation. The question for exascale machines is not whether component failure can be avoided, but how the system as a whole can cope with the failure of individual components without significant degradation of performance.

According to Dr Suarez, “We are combining two approaches – the more traditional application-based check-pointing but also task-based resiliency using OmpSs.” The OmpSs programming model extends OpenMP with new directives to support asynchronous parallelism and heterogeneity. This task-based checkpoint/recovery approach will yield a more fine-grained resilient architecture. The system will, moreover, be designed to detect, isolate, clean up and restart tasks that have been offloaded to the booster and that have failed, but without the need for a full application recovery.
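As a rough illustration of how the two layers can complement each other, the sketch below pairs a coarse, application-level checkpoint with tasks that retry on their own when they fail. The run_block() and save_checkpoint() helpers, the retry loop and the use of plain OpenMP tasks are all hypothetical stand-ins for illustration, not the OmpSs-based mechanism actually developed in DEEP-ER.

    /* Illustrative sketch only: a failed task is retried on its own,
     * while a coarse checkpoint of the whole state is still taken in
     * the traditional, application-level way. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NBLOCKS     8
    #define MAX_RETRIES 3

    /* Hypothetical compute kernel: returns false to signal a failed task. */
    static bool run_block(int block, double *result)
    {
        *result = 2.0 * block;   /* stand-in for real work */
        return true;             /* pretend the work succeeded */
    }

    /* Hypothetical coarse-grained, application-level checkpoint. */
    static void save_checkpoint(const double *results, int nblocks)
    {
        FILE *f = fopen("checkpoint.dat", "wb");
        if (f != NULL) {
            fwrite(results, sizeof(double), (size_t)nblocks, f);
            fclose(f);
        }
    }

    int main(void)
    {
        double results[NBLOCKS] = {0};

        #pragma omp parallel
        #pragma omp single
        {
            for (int b = 0; b < NBLOCKS; b++) {
                /* Each block becomes an independent task that can be
                 * retried on its own if it fails, without restarting
                 * the whole application. */
                #pragma omp task firstprivate(b) shared(results)
                {
                    for (int attempt = 0; attempt < MAX_RETRIES; attempt++)
                        if (run_block(b, &results[b]))
                            break;
                }
            }
            #pragma omp taskwait
        }

        /* Traditional application-level checkpoint of the full state. */
        save_checkpoint(results, NBLOCKS);
        return 0;
    }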

The I/O features of the DEEP-ER project will have the added benefit of helping resiliency. DEEP-ER uses the Fraunhofer parallel file system BeeGFS, formerly known as FhGFS. The design of the file system is intended to avoid as much communication as possible with the centralised parallel storage servers and to integrate fast new storage techniques such as non-volatile memory.

According to Dr Suarez, the advantage of DEEP and DEEP-ER is that, because the systems use traditional processors, albeit in a novel design, “coding does not have to be changed.” The difficulty is to identify which parts of a program would run best on the booster, so the team is using performance analysis tools to make the task easier. Overall, she believes that it will be less complicated to port code to DEEP than to rewrite it in CUDA for GPU acceleration. “In the end, you have to concentrate on optimizing the code for the booster – the rest of the code you can keep as it is,” she said.

The diversity of approaches suggests that no one can yet discern in detail how an energy-efficient exascale machine will look. There are unhappy precedents already, with the effective demise of Calxeda, an early entrant into the race to provide computers based on the ARM architecture. Its system was based on 32-bit rather than 64-bit technology, which may well have contributed to the lack of take-up. But one thing is clear: the pace of change in hardware far outstrips software development. Many application codes used in science and technology have been developed not just over years but over decades, with some still having Fortran elements hidden away inside them. The argument for exascale is that it will facilitate the wider diffusion of petascale computing to smaller engineering and science-based companies – that the development of exascale technology will allow companies to compete because they can compute. This will not happen if it is difficult or costly to port existing legacy codes. Novel architectures and novel hardware will have to confront the issue of making it easy to run very traditional software.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.