How the QPACE 2 Supercomputer is Solving Quantum Physics with Intel Xeon Phi

In this special guest feature from Scientific Computing World, Tilo Wettig from the University of Regensburg in Germany describes the unusual design of a supercomputer dedicated to solving some of the most arcane issues in quantum physics.

QPACE 2 prototype at the University of Regensburg (Image courtesy of Tilo Wettig)

A supercomputer where all the calculations are done on Intel Xeon Phi co-processors has just gone into operation at the University of Regensburg in Germany. The machine was specially designed to solve problems in Quantum Chromodynamics – part of the standard model of particle physics.

Quantum Chromodynamics (QCD) is one of the fundamental theories of nature. It describes how particles such as the proton, previously thought to be elementary, are in fact made up of smaller constituents: quarks and gluons. High-precision QCD calculations are needed to analyse the data collected at particle-accelerator experiments such as the LHC at CERN, and to distinguish between ‘old physics’ and new discoveries. Such calculations have to be done on supercomputers, running a discretised version of the theory known as Lattice QCD.

The Lattice QCD community has a long history of designing special-purpose supercomputers for these calculations, going back to the 1980s. Some of these machines influenced the design of large-scale commercial systems: for example, the QCDOC machine developed at Columbia University can be viewed as a prototype for IBM’s BlueGene/L. The Regensburg system is the latest QCD machine. Called QPACE 2, where QPACE stands for QCD Parallel Computing Engine, it was designed and manufactured in Europe by a collaboration of scientists at the University of Regensburg in Germany and Eurotech HPC in Italy.

From the hardware point of view, QPACE 2 is a collection of identical nodes connected via InfiniBand. Each node consists of four Intel Xeon Phi co-processors (model 7120X, codenamed Knights Corner); a dual-port FDR InfiniBand card (Mellanox Connect-IB); and a low-power Intel Xeon CPU (E3-1230L v3) on a PCIe card. These components are connected via a PCIe switch. One of the distinguishing features of QPACE 2 is that all calculations are done on the Phi co-processors; the CPU only performs system-management services. This approach significantly reduces the cost and energy consumption of the machine and simplifies programming.
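
In practice this corresponds to the Xeon Phi’s ‘native’ execution model, in which the application runs entirely on the co-processor rather than being offloaded from the host. As a minimal sketch of what such a program could look like, here is a hybrid MPI/OpenMP skeleton with one MPI rank per co-processor. The MPI/OpenMP combination is an assumption for illustration; this is not the QPACE 2 production code:

```c
/* Minimal sketch of native (non-offload) execution on a co-processor:
 * one MPI rank per Xeon Phi, OpenMP threads across its cores.
 * Illustrative only -- not the QPACE 2 production code.
 * Build (assumed toolchain): mpicc -fopenmp native_hello.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* one rank per co-processor */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* All compute-intensive work lives here, on the Phi itself;
         * the host CPU only boots the card and runs system services. */
        if (omp_get_thread_num() == 0)
            printf("rank %d of %d: %d threads\n",
                   rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```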

Within a node, communication between the co-processors proceeds via PCIe and thus does not need expensive networking components. Communication between nodes is facilitated by connecting the two ports of each node to FDR InfiniBand switches that are arranged in a two-dimensional hyper-crossbar topology. The advantage of a hyper-crossbar over a (fat) tree is lower cost. The advantage over a torus is full connectivity in every single dimension with one switch hop, and all-to-all connectivity by going through at most one intermediate node.
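
The routing property is easy to see if one labels each node by its coordinates in the two switch dimensions; the small sketch below counts the intermediate nodes needed between any pair. The wiring detail (one port to the switch of each dimension) is an assumption made for illustration:

```c
/* Hedged sketch: hop counting in a 2D hyper-crossbar.
 * Nodes sit at coordinates (x, y). One port of each node is assumed to
 * connect to the crossbar switch of its x-dimension, the other to the
 * switch of its y-dimension. */
#include <stdio.h>

typedef struct { int x, y; } node;

/* Number of intermediate nodes needed to reach b from a. */
static int intermediates(node a, node b)
{
    if (a.x == b.x && a.y == b.y) return 0;  /* same node              */
    if (a.x == b.x || a.y == b.y) return 0;  /* shared switch: one hop */
    return 1;  /* route via (a.x, b.y) or (b.x, a.y): one extra node   */
}

int main(void)
{
    node a = {0, 0}, b = {3, 5};
    printf("intermediate nodes: %d\n", intermediates(a, b));
    /* Never more than 1 -- the all-to-all property cited above. */
    return 0;
}
```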

QPACE 2 employs a novel liquid-cooling concept developed by Eurotech. Water flows through very thin plates attached to the compute cards. The cooling-water temperature can be up to 45°C. This approach allows for free cooling all year round and thus cuts the cooling costs significantly. It also allows for very dense system integration: eight nodes fit in three height units of a 19-inch rack. Since each node has a double-precision peak performance of 4.8 TFlop/s, this translates to 12.8 TFlop/s per height unit (8 × 4.8 TFlop/s over three height units).

Tilo Wettig, University of Regensburg

Developing high-performance code for a new architecture, in this case the wide SIMD units of the Knights Corner co-processors, poses some challenges. In general, one should first select the algorithm that is most appropriate for the given hardware architecture. Then one should find the optimal data layout for that algorithm and architecture. Finally, one should go through the usual implementation and optimisation steps such as vectorisation, threading, parallelisation, and communication-latency hiding. A detailed description of this approach is given here. Our code achieves strong scaling to at least 1,024 Knights Corner co-processors for problems of a size typical of Lattice QCD.
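
To give a flavour of the data-layout step, the toy kernel below stores complex lattice data as a structure-of-arrays (separate real and imaginary arrays), so that a 512-bit SIMD unit can process eight doubles per instruction with unit-stride loads. The array size N, the field names, and the kernel itself are illustrative assumptions, not the QPACE 2 Lattice QCD code:

```c
/* Hedged sketch of a SIMD-friendly structure-of-arrays layout.
 * Toy kernel, not the QPACE 2 production code. */
#include <stdio.h>

#define N 1024  /* assumed multiple of the SIMD width */

/* Structure-of-arrays: unit-stride, vectorizable accesses. */
typedef struct {
    double re[N];
    double im[N];
} cfield;

/* c = a * b, elementwise complex multiply. With the SoA layout the
 * compiler can vectorize this loop across the wide SIMD unit. */
static void cmul(cfield *c, const cfield *a, const cfield *b)
{
    #pragma omp simd
    for (int i = 0; i < N; i++) {
        c->re[i] = a->re[i] * b->re[i] - a->im[i] * b->im[i];
        c->im[i] = a->re[i] * b->im[i] + a->im[i] * b->re[i];
    }
}

int main(void)
{
    static cfield a, b, c;
    for (int i = 0; i < N; i++) {
        a.re[i] = 1.0; a.im[i] = 2.0;
        b.re[i] = 3.0; b.im[i] = 4.0;
    }
    cmul(&c, &a, &b);
    printf("c[0] = %g + %gi\n", c.re[0], c.im[0]);  /* -5 + 10i */
    return 0;
}
```

By contrast, an array-of-structures layout (interleaved re/im pairs) would force strided or gather accesses and waste much of the SIMD width.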

The prototype machine in Regensburg consists of 64 nodes, i.e., 256 Knights Corner co-processors with a total peak performance of 310 TFlop/s in double precision. It is currently being used for Lattice QCD calculations, but the architecture is also suitable for a variety of other applications that can be mapped to the SIMD architecture of the Xeon Phi.

Eurotech HPC is selling this type of architecture under its ‘Aurora Hive’ product line. The Aurora Hive can be shipped not only with Knights Corner co-processors but also with Nvidia K40 or K80 GPGPU accelerators, with a density of up to 128 nodes per rack. The Hive extends the versatile architecture used in QPACE 2 for use in computational biology and chemistry, seismic processing, rendering, deep learning, data analytics, and CAE.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
