In the high performance computing (HPC) world, there’s rarely such a thing as a “free lunch” when it comes to coaxing top-end performance out of HPC technology. But that doesn’t make upgrade cycles any less exciting for labs. With an imminent switchover to a new Cray system with next-generation Intel® Xeon Phi™ processors (codenamed Knights Landing) planned for October, the Academic Center for Computing and Media Studies (ACCMS) at Kyoto University is a case in point; the ACCMS team is eagerly looking forward to a potential two-fold application performance improvement from its new system. But the lab is also well aware that significant recoding work lies ahead before the promise of the new manycore technology can be realized.
A cross-disciplinary mandate
Kyoto University, which spans three campuses, is one of the oldest and most prestigious universities in Japan. Over its storied history, it has produced ten Nobel Prize laureates, two Fields Medalists and a Gauss Prize winner. The ACCMS at Kyoto University is an educational and research institute formed in April 2002. It comprises several departments, including the Department of Computing Research and its Supercomputing Research Laboratory, which provide services to faculty and staff at Kyoto University as well as to the wider Japanese research community.
As part of a leading Japanese research institution, the Supercomputing Research Laboratory must support an array of major research projects and a wide spectrum of applications across multiple disciplines, including fluid and structural analysis, molecular dynamics and biochemistry, quantum and plasma physics, economics, and applied mathematics. That means workloads range from applications analyzing soil-tool interaction using large-scale discrete element method (DEM) simulations, to studies on highly parallel simulations of space-plasma particles, to finding fractal structures in the integration of industries and human population and in the scale and spatial distribution of cities.
Keeping up with researchers’ demands
The ACCMS supercomputing lab has been supporting users with a system of four supercomputers in the 200 to 600 teraflops class, with an aggregate peak performance of more than 1.5 petaflops. The systems included the following:
- Cray XE6 system with 940 dual AMD Opteron “Abu Dhabi” nodes, serving as a massively parallel processor (MPP) system, delivering a peak performance of 300.8 teraflops and a 59 TB memory capacity
- Appro GreenBlade 8000 with 601 dual Intel Sandy Bridge nodes, serving as a support cluster and delivering a peak performance of 242.5 teraflops and a 38 TB memory capacity
- Cray XC30 system with 416 dual Intel Haswell nodes, serving as an MPP system and delivering a peak performance of 428.6 teraflops and a 26 TB memory capacity
- Cray XC30 system with 482 Intel Ivy Bridge and Intel Xeon Phi nodes (connected through a high-speed network), serving as an MPP system and delivering a peak performance of 583.6 teraflops (96.4 teraflops for the processor and 487.2 teraflops for the coprocessor) and a 15 TB memory capacity for the processor and 3.8 TB for the coprocessor
Professor Hiroshi Nakashima, Chair of ACCMS’s Supercomputing Service Committee, explained that the systems have served ACCMS well, but research demands were naturally beginning to surpass what the aging systems could efficiently support. “Obviously, the XE6 Abu Dhabi and GreenBlade Sandy Bridge systems are now somewhat out-of-date, with their floating point performance of about 160 gigaflops lagging behind the ever-increasing compute demands of our users. And although the Intel Xeon Phi coprocessor’s 1 teraflops peak performance could be a solution for the demand, its limited SIMD [single instruction, multiple data]-vectorized computation capability and the ‘hosted’ configuration make it extremely tough to achieve good sustained performance for our wide spectrum of applications,” said Nakashima.
As the ACCMS team began considering new systems, it was particularly interested in taking advantage of advanced SIMD capabilities. Nakashima explained: “For more than ten years, we have chosen scalar-type supercomputers, and in our two most recent procurements (2008 and 2012) they were x86-based without graphics processing units (GPUs). This means we have let our users improve their applications incrementally, at first with Message Passing Interface (MPI) or OpenMP and then with the combination of these two very standard parallelization frameworks. Although such improvements have worked well in allowing our users to enjoy large-scale parallel computing with many thousands of cores, we are also well aware of high-performance computing trends, including 256- or 512-bit SIMD-vectorized computations. Our belief is that improving an application to exploit the wide SIMD mechanism will be consistent with the past improvements with MPI and OpenMP, because the SIMD mechanism allows us to make gradual improvements without discontinuous programming changes, such as with CUDA or OpenACC.”
Getting ready for an Intel Xeon Phi processor-based system
The ACCMS plans to bring its new supercomputing system, which will feature a Cray XC40 with Intel Xeon Phi processors (“Knights Landing”) and a high-performance DDN SFA14K large-scale storage system, online in early October. There will be a total of three systems in the new configuration, including a new Cray CS400 2820XT and one of the existing Cray XC30 systems (see bullet four in the above section for the details).
The XC40 will include 1,800 nodes and use the Cray Dragonfly direct network topology to obtain a projected peak performance of 5.48 petaflops and a memory capacity of 196.9 TB. The CS400 will include 850 Intel Xeon E5-2695 v4 processors connected with Intel Omni-Path Architecture—which, along with the Intel Xeon Phi processor, is a component of the Intel Scalable System Framework—with a peak performance of 1.03 petaflops and a 106.3 TB memory capacity.
Nakashima explained that the ACCMS has had a good working experience with Cray over the last four years, so it was happy to continue the relationship with the purchase of the new systems. He said that while the lab is particularly excited to get started using the next-generation Intel Xeon Phi processors, he also recognizes it will take time to start seeing the full benefits of the new system.
“The upgrade from the Intel Xeon Phi coprocessor to the manycore Intel Xeon Phi processor should give us a ‘free’ two-fold or so performance improvement by increasing the number of floating point units [not the SIMD width]. From this baseline improvement, we then will need to exploit the enhancement of the SIMD mechanism by widening its width. In other words, we need to transition from the SSE4 [Streaming SIMD Extensions 4] instruction set of the AMD Opteron to AVX2 [Intel® Advanced Vector Extensions 2] of Broadwell, and from AVX2 of Haswell to AVX-512 of the Intel Xeon Phi processor, in a gradual manner to achieve a sustained performance improvement proportional to that of peak performance,” said Nakashima.
Nakashima also expects that the increase in core counts and memory bandwidth in the dual Broadwell cluster should nearly double the performance of applications that formerly ran on the XE6 and GreenBlade 8000 systems. The ACCMS will use the Intel Omni-Path Architecture (Intel OPA) in its dual Broadwell cluster to optimize bandwidth and latency. “We know that Intel OPA is a good solution with respect to price and performance, and we were happy with the performance numbers of our benchmarks for the new system,” said Nakashima.
Ultimately, Nakashima is simply looking forward to seeing what ACCMS can do with its new system. “Based on our past experience with Cray systems employing Intel Xeon processors, we have a lot of confidence in the potential of the new Intel Xeon Phi processor system,” said Nakashima. “And although we are well aware that exploiting its high peak performance is not going to be a free lunch, we believe we can leverage previous work and experience to be a winner in the game of wide SIMD,” he concluded.