Dr. David Rohr is a postdoctoral scholar at the Frankfurt Institute for Advanced Studies (FIAS). He’s currently working on several projects related to high energy physics and high performance computing – mostly in cooperation with CERN (the European Organization for Nuclear Research) and GSI (GSI Helmholtz Center for Heavy Ion Research). Dr. Rohr is in charge of the GPU-based real-time track reconstruction for the ALICE experiment at CERN.
insideHPC: What is your current project?
David Rohr: I am currently working on our implementation of the Linpack benchmark, which we call HPL-GPU. The Linpack benchmark solves a dense system of linear equations and is typically used to rank the fastest supercomputers in the world. In particular, I optimized the Linpack implementation for graphics processing units (GPUs), mostly AMD graphics solutions, and for heterogeneous systems.
As part of the work at FIAS, I helped to design and supervised the installation and commissioning of several HPC clusters, among them the SANAM cluster (placed number two on the Green500 List in Nov. 2012) and the Lattice-CSC (achieved the number one position in Green500 in Nov. 2014). SANAM and Lattice-CSC were installed at GSI in Darmstadt, Germany.
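To make concrete what "solving a dense system of linear equations" means for Linpack, here is a minimal sketch in plain Python. It uses Gaussian elimination with partial pivoting on a tiny system. The function names `solve_dense` and `leading_flops` are hypothetical and chosen for illustration only; this is not HPL-GPU code, which runs blocked and distributed variants of the same idea on GPUs.

```python
def solve_dense(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data stays untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the diagonal in this column.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x


def leading_flops(n):
    """Leading-order operation count of the elimination, about (2/3) * n^3."""
    return 2.0 * n ** 3 / 3.0


A = [[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]]
B = [8.0, -11.0, -3.0]
X = solve_dense(A, B)
```

Dividing the operation count by the measured run time gives the FLOPS figure a benchmark run reports. Because the work grows with the cube of the matrix size, nearly all of the time goes into dense matrix multiplication, which is the DGEMM routine.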
insideHPC: What was the Lattice-CSC cluster designed to do?
David Rohr: The Lattice-CSC is a general-purpose multi-GPU cluster built from off-the-shelf components, which can execute virtually any scientific application that can make use of GPUs. The main purpose of Lattice-CSC, however, is to run simulations in the field of quantum chromodynamics (QCD). QCD is the physical theory describing the strong force, one of the four fundamental forces in the universe. GSI is currently building a new particle accelerator with experiments for heavy ion studies, called FAIR (Facility for Antiproton and Ion Research). QCD calculations are particularly important for heavy ion physics simulations. Therefore, GSI and FAIR need significant resources for QCD computations.
The only general approach that can compute QCD properties from first principles is Lattice-QCD (LQCD), which discretizes the problem onto a four-dimensional space-time grid. Lattice-CSC uses the Lattice-QCD approach for its QCD simulations, and it employs a GPU-accelerated, OpenCL™-based LQCD application, which was developed by a collaboration involving FIAS. Lattice-QCD is the most important application on Lattice-CSC, hence the name. OpenCL™ is an open, royalty-free industry standard for parallel programming across CPUs, GPUs, and other accelerators.
The most compute-intensive part of LQCD simulations is the inversion of the so-called Dirac operator, which requires a large sparse matrix-vector multiplication. This matrix-vector multiplication is usually called D-Slash. It is the computational hotspot of LQCD simulations, and it requires extreme memory bandwidth.
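To illustrate why a kernel of this shape is limited by memory bandwidth, here is a heavily simplified sketch of a nearest-neighbour stencil on a periodic four-dimensional lattice, written in plain Python. It is an assumption-laden toy and not the actual D-Slash operator: the real kernel also carries SU(3) gauge links and spinor components, which are omitted here. The function name `hopping_term` and the dictionary layout are hypothetical.

```python
from itertools import product


def hopping_term(phi, dims):
    """Apply a nearest-neighbour stencil on a periodic 4D lattice.

    phi maps each site (a 4-tuple of coordinates) to a number; the result
    at a site is the sum over its 8 neighbours (forward and backward in
    each of the 4 directions). Viewed as a matrix, this operator is
    sparse: every row has only 8 non-zero entries.
    """
    out = {}
    for site in product(*(range(d) for d in dims)):
        total = 0.0
        for mu in range(4):
            fwd = list(site)
            bwd = list(site)
            fwd[mu] = (site[mu] + 1) % dims[mu]  # periodic wrap-around
            bwd[mu] = (site[mu] - 1) % dims[mu]
            total += phi[tuple(fwd)] + phi[tuple(bwd)]
        out[site] = total
    return out


DIMS = (4, 4, 4, 4)
SITES = list(product(*(range(d) for d in DIMS)))

# A constant field: every site sees 8 neighbours with value 1.
ones = {s: 1.0 for s in SITES}
result_ones = hopping_term(ones, DIMS)

# A point source at the origin spreads to exactly its 8 neighbours.
point = {s: 0.0 for s in SITES}
point[(0, 0, 0, 0)] = 1.0
result_point = hopping_term(point, DIMS)
```

Each output site needs eight neighbour reads but only a handful of additions, so the ratio of arithmetic to memory traffic is low. That is consistent with the bandwidth-bound behaviour described above, although the real kernel performs more arithmetic per site than this toy does.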
insideHPC: What hardware requirements were you looking for, and what made you choose AMD FirePro™ GPUs over consumer GPUs or a competitive offering?
David Rohr: With LQCD being the main focus of Lattice-CSC, and the D-Slash kernel being the computational hotspot, we wanted to employ hardware that provides the best D-Slash performance for the fixed budget we had. This means we needed hardware with high memory bandwidth. What matters to us is not the acquisition cost but the total cost of ownership, and with rising energy prices, power efficiency has become a very important aspect. Fortunately for us, GPUs in general provide both high power efficiency and high memory bandwidth. Since our application is based on OpenCL™, we of course wanted a GPU with good OpenCL™ support. In addition, maintainability and stability play an important role for a cluster the size of Lattice-CSC. We decided that the AMD FirePro™ S9150 GPU was the best choice, being a professional card that offers very high memory bandwidth and good energy efficiency.
Besides good memory bandwidth and power efficiency, the AMD FirePro S9150 also offers the highest available double-precision compute performance of a single-GPU card. Even though this aspect is not necessary for our memory-bandwidth-limited LQCD application, it allows us to run other applications with great performance on Lattice-CSC as well, and it is the basis for the great power efficiency we demonstrated by achieving first place in the Green500 list.
insideHPC: Can you tell us more about the cluster specifications?
David Rohr: The cluster has 160 compute nodes, each node equipped with 4 GPUs, 2 CPUs, and 256 GB of memory. A single compute node achieves more than 10 TFLOPS double-precision performance, and the total system has a theoretical peak performance of 1.7 PFLOPS. The aggregate memory bandwidth of the four S9150 GPUs of one node is 1280 GB/s, and the aggregate GPU memory per node is 64 GB.
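The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the interview and derives the per-GPU and system-wide values from them. The function name `cluster_summary` is hypothetical, and the derived values are simple divisions and products rather than vendor specifications.

```python
def cluster_summary(nodes=160, gpus_per_node=4, node_gpu_bandwidth_gbs=1280.0,
                    node_gpu_memory_gb=64.0, system_peak_pflops=1.7):
    """Derive per-GPU and system-wide figures from the quoted node figures."""
    return {
        "total_gpus": nodes * gpus_per_node,
        "bandwidth_per_gpu_gbs": node_gpu_bandwidth_gbs / gpus_per_node,
        "memory_per_gpu_gb": node_gpu_memory_gb / gpus_per_node,
        "total_gpu_memory_gb": nodes * node_gpu_memory_gb,
        # Peak per node implied by the system-wide peak (1 PFLOPS = 1000 TFLOPS).
        "peak_per_node_tflops": system_peak_pflops * 1000.0 / nodes,
    }


summary = cluster_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
```

The implied per-node peak of roughly 10.6 TFLOPS agrees with the statement that a single node achieves more than 10 TFLOPS.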
Achieving the Green500 result of 5.27 GFLOPS/W required a lot of work and involved optimizations of both software and hardware. We used our own DGEMM (matrix-multiplication) and HPL-GPU (Linpack benchmark) software, which I developed at FIAS.
The key aspects to achieve this energy-efficiency are:
- Support from AMD, who provided an OpenCL™ DGEMM kernel with excellent efficiency.
- Software we designed that can dynamically distribute the workload among all available GPUs and CPUs. Here our software offers two schemes. To achieve the best performance, the workload should be distributed such that all available compute devices, both GPUs and CPUs, are used to their full extent. In contrast, to achieve the best efficiency, each workload has to be executed on the compute device where it runs most efficiently. For instance, this means that some of the time we intentionally leave the CPU idle and execute more of the workload on the GPU. This reduces the performance slightly but results in better net efficiency.
- We use dynamic frequency and voltage scaling to run at optimal CPU and GPU voltage and frequency at every point in time.
- Our software dynamically adapts certain internal parameters during the Linpack benchmark runs. This ensures that at every point in time, we run with the optimal set of parameters.
- By optimizing both hardware and software, we were able to achieve 5.27 GFLOPS/W using AMD FirePro S9150 GPUs, an 18% performance/watt advantage over the June 2014 Green500 List’s winner.
- We were able to use custom open-source DGEMM/HPL software based on OpenCL™, and we dynamically distributed the workload between the CPUs and GPUs.
- With dynamic parameter adaptation, we could get the optimal performance or the optimal efficiency at every point in time.
- AMD was kind enough to assist us in optimizing and troubleshooting our OpenCL™ code.
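The difference between the two workload-distribution schemes in the list above can be sketched with a toy model. Everything here is an assumption made for illustration: the `Device` class, the function names, and all throughput and power numbers are hypothetical and are not measurements from Lattice-CSC or part of HPL-GPU. The model simply asks which set of active devices gives the most GFLOPS, and which gives the most GFLOPS per watt once idle power is counted.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Device:
    name: str
    gflops: float      # throughput while busy (hypothetical)
    busy_watts: float  # power draw while busy (hypothetical)
    idle_watts: float  # power draw while idle (hypothetical)


def evaluate(devices, active):
    """Return (GFLOPS, watts) when only the devices named in `active` work."""
    perf = sum(d.gflops for d in devices if d.name in active)
    power = sum(d.busy_watts if d.name in active else d.idle_watts
                for d in devices)
    return perf, power


def performance_scheme(devices):
    """Use every device: this maximises throughput."""
    return frozenset(d.name for d in devices)


def efficiency_scheme(devices):
    """Brute-force the non-empty subset with the best GFLOPS per watt."""
    names = [d.name for d in devices]
    best, best_eff = None, -1.0
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            perf, power = evaluate(devices, frozenset(subset))
            if perf / power > best_eff:
                best, best_eff = frozenset(subset), perf / power
    return best


# One hypothetical node: 4 GPUs and 2 CPUs with made-up characteristics.
NODE = (
    [Device(f"gpu{i}", 2000.0, 250.0, 30.0) for i in range(4)]
    + [Device(f"cpu{i}", 200.0, 120.0, 40.0) for i in range(2)]
)

for label, scheme in (("performance", performance_scheme),
                      ("efficiency", efficiency_scheme)):
    chosen = scheme(NODE)
    perf, power = evaluate(NODE, chosen)
    print(f"{label}: {sorted(chosen)} {perf:.0f} GFLOPS, "
          f"{perf / power:.2f} GFLOPS/W")
```

With these made-up numbers the efficiency scheme leaves both CPUs idle. It delivers slightly fewer GFLOPS than using every device but more GFLOPS per watt, which mirrors the trade-off described above. A real scheduler would make this decision dynamically per phase of the benchmark rather than once per run.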
insideHPC: What is your experience in working with OpenCL? How does it compare to other solutions?
David Rohr: As a research institute, we have to try to avoid vendor lock-in wherever possible. The great thing about OpenCL™ is that it is vendor and platform independent and that it is an open standard. At the beginning, it was lagging a bit behind competing APIs. But with the new OpenCL™ 2.0 specification, OpenCL™ finally provides most of the features we would like to have. One important aspect that is still missing, in my opinion, is support for C++ in GPU kernels, at least in a limited form. A competing solution has this feature and OpenCL™ should have it too. AMD is providing an extension that allows limited C++ in GPU kernels, which is in my opinion the right way to go. However, in order for this to become platform independent, it belongs in the standard. We are currently using these OpenCL™ C++ extensions from AMD in our online compute farm at CERN in Geneva with great success.
insideHPC: Where do you think these architectures are headed?
David Rohr: I’m seeing a convergence of CPU and GPU architectures. Some years ago, CPUs were complex, single-core serial processors, and GPUs were parallel many-core chips with very simple cores. At that time, this prohibited the execution of general-purpose applications on GPUs. Today, we see an influence from the GPU world in CPUs: we have multi-core CPUs with more and more elaborate vector instructions. GPUs, in turn, have taken over some aspects of CPUs, such as the cache hierarchy, which allows them to execute general-purpose applications.
A large disadvantage of GPUs is the limited GPU memory, and the fact that the GPU has to go through the PCI Express® interface to access the system memory. I think in the future, we will see heterogeneous solutions, which combine both fast serial CPU cores and many parallel GPU cores in one chip with unified memory.
I see one problem with the current approaches. One large advantage of GPUs is their high memory bandwidth, which comes from the fast GDDR5 memory. The current APUs use the system memory, which is DDR3 and has significantly less bandwidth. We need an approach that combines the good aspects of both: the heterogeneous nature of the APU and the high memory bandwidth of current discrete GPUs.
In addition, we need the right programming models to program these new chips, and these models should be open standards. Both recent developments in competing solutions and OpenCL™ 2.0 have laid a good foundation here, but only OpenCL™ is free and vendor-independent.
insideHPC: What’s the next step for you?
David Rohr: To optimize Lattice-QCD performance and power efficiency further, a logical step is to plug even more GPUs into a server. We are currently experimenting with eight-GPU servers. We have to find the best trade-off, because at some point the limited PCI Express bandwidth will degrade the performance.
GSI is currently building a new particle accelerator in the scope of the FAIR project. Lattice-CSC has laid a good foundation for QCD simulations, but FAIR will require several times the compute resources currently available for the event reconstruction of the FAIR experiments. We are currently building a new data center at GSI, which will be expandable to up to 15 MW of electrical power. This data center is also very power efficient. We expect a cooling overhead in the range of 5%, i.e. a PUE of 1.05. In the coming years we will install a large cluster for data processing for the FAIR experiments there. Traditional processors cannot deliver sufficient performance for this system. Therefore, GPUs or other accelerators will play an important role.
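The link between a 5% cooling overhead and a PUE of 1.05 follows from the definition of PUE (power usage effectiveness) as total facility power divided by the power drawn by the IT equipment. The short sketch below works this through. The function names are hypothetical, and the 10 MW IT load is an arbitrary example value; the interview does not say whether the 15 MW figure refers to IT power or to total facility power.

```python
def pue_from_overhead(overhead_fraction):
    """PUE = (IT power + overhead power) / IT power = 1 + overhead fraction."""
    return 1.0 + overhead_fraction


def facility_power_mw(it_power_mw, pue):
    """Total facility power needed for a given IT load at a given PUE."""
    return it_power_mw * pue


pue = pue_from_overhead(0.05)           # 5% overhead on top of the IT load
example_it_load_mw = 10.0               # hypothetical IT load, for illustration
total_mw = facility_power_mw(example_it_load_mw, pue)
print(f"PUE {pue:.2f}: {example_it_load_mw} MW of IT load needs "
      f"{total_mw:.1f} MW in total")
```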
For more information on AMD FirePro™ Server GPUs, the L-CSC cluster and the Green500 List win, please visit http://www.fireprographics.com/hpc.