Optimizing the HPCG Benchmark on GPUs

Everett Phillips


Over at the Parallel for All Blog, Everett Phillips and Massimiliano Fatica write that GPUs offer good acceleration on the new HPCG benchmark, which was designed to augment Linpack as a measure of performance for the TOP500 list. Their GPU porting strategy focused on parallelizing the symmetric Gauss-Seidel smoother (SYMGS), which accounts for approximately two-thirds of the benchmark's flops.
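For readers unfamiliar with SYMGS, here is a minimal serial sketch of what one sweep computes on a sparse matrix stored in CSR (compressed sparse row) form. It is an illustration only, not the authors' GPU code: the helper names `symgs`, `laplacian_1d` and `spmv` and the small 1-D Laplacian test matrix are hypothetical choices for this example.

```python
from typing import List


def symgs(row_ptr: List[int], col_idx: List[int], vals: List[float],
          b: List[float], x: List[float]) -> None:
    """One symmetric Gauss-Seidel sweep, in place: a forward pass over
    rows 0..n-1 followed by a backward pass over rows n-1..0."""
    n = len(b)

    def relax(i: int) -> None:
        diag = 0.0
        s = b[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]
            else:
                # x[j] may already have been updated earlier in this pass:
                # each row depends on its predecessors, so a naive sweep is serial.
                s -= vals[k] * x[j]
        x[i] = s / diag

    for i in range(n):            # forward sweep
        relax(i)
    for i in reversed(range(n)):  # backward sweep
        relax(i)


def laplacian_1d(n: int):
    """Hypothetical test matrix: tridiagonal [-1, 2, -1] in CSR form."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


def spmv(row_ptr, col_idx, vals, x):
    """y = A x for a CSR matrix."""
    return [sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


if __name__ == "__main__":
    rp, ci, v = laplacian_1d(5)
    b = spmv(rp, ci, v, [1.0] * 5)  # right-hand side whose exact solution is all ones
    x = [0.0] * 5
    for _ in range(100):
        symgs(rp, ci, v, b, x)
    print(x)
```

Used as a smoother, SYMGS is normally applied only a few times per multigrid level; iterating it to convergence here simply shows that the sweep is correct.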

GPU-accelerated supercomputers have proven very effective at accelerating compute-intensive applications like HPL, especially in terms of power efficiency. Obtaining good GPU acceleration on the HPCG benchmark is more challenging because of the limited parallelism and the memory access patterns of its computational kernels. In this post we present the steps taken to obtain high HPCG performance on GPU-accelerated clusters, and demonstrate that our GPU-accelerated HPCG results are the fastest per-processor results reported to date.
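The "limited parallelism" comes largely from Gauss-Seidel itself: each row's update reads values that earlier rows just wrote. One widely used way to expose parallelism is multicoloring, sketched below; treat it as an assumption, not as a description of the authors' exact method. Rows that share no nonzero get the same color, so all rows of one color can be updated at once (on a GPU, one thread per row). The function names and the 1-D test matrix are hypothetical.

```python
from typing import List


def greedy_color(row_ptr: List[int], col_idx: List[int]) -> List[int]:
    """Assign each row the smallest color not already used by a neighbor.
    Assumes a structurally symmetric matrix, so row i's off-diagonal
    columns are exactly its neighbors in the matrix graph."""
    n = len(row_ptr) - 1
    color = [-1] * n
    for i in range(n):
        taken = {color[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1])
                 if col_idx[k] != i and color[col_idx[k]] >= 0}
        c = 0
        while c in taken:
            c += 1
        color[i] = c
    return color


def multicolor_symgs(row_ptr, col_idx, vals, b, x, color) -> None:
    """Symmetric sweep that visits colors 0..C-1 and then C-1..0."""
    ncolors = max(color) + 1
    for c in list(range(ncolors)) + list(reversed(range(ncolors))):
        rows = [i for i in range(len(b)) if color[i] == c]
        # Compute every update for this color before writing any of them,
        # mimicking one parallel step; this matches a serial pass because
        # rows of the same color never read each other's values.
        updates = []
        for i in rows:
            diag, s = 0.0, b[i]
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    s -= vals[k] * x[j]
            updates.append(s / diag)
        for i, xi in zip(rows, updates):
            x[i] = xi


def laplacian_1d(n: int):
    """Hypothetical test matrix: tridiagonal [-1, 2, -1] in CSR form."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals
```

For the 1-D matrix the greedy pass yields the classic red-black ordering (two colors). The trade-off is that a different ordering can change how quickly the smoother converges, and rows of one color are scattered in memory, which ties back to the memory access patterns mentioned above.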

The first HPCG list was published at ISC14 and included 15 supercomputers. Instead of looking at the peak flops of these machines, we evaluate efficiency as the ratio of the per-processor HPCG result to the per-processor memory bandwidth. The following table shows the results of the four highest-ranked systems that submitted optimized results.

HPCG rank | Machine | HPCG (GFLOP/s) | Processors | Processor type | HPCG per processor (GFLOP/s) | Bandwidth per processor (GB/s) | Efficiency (flops/byte)
1 | Tianhe-2 | 580,109 | 46,080 | Xeon Phi 31S1P | 12.59 | 320 | 0.039
2 | K | 426,972 | 82,944 | SPARC64 VIIIfx | 5.15 | 64 | 0.080
3 | Titan | 322,321 | 18,648 | Tesla K20X (ECC enabled) | 17.28 | 250 | 0.069
5 | Piz Daint | 98,979 | 5,208 | Tesla K20X (ECC enabled) | 19.01 | 250 | 0.076
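The last two columns of the table follow from the others: divide each system's HPCG result by its processor count, then divide that by the per-processor memory bandwidth (GFLOP/s divided by GB/s gives flops per byte). A short Python check using only the table's numbers; the function name is a hypothetical choice for this sketch.

```python
# (machine, HPCG GFLOP/s, processor count, memory bandwidth per processor in GB/s)
# Values copied from the table above.
SYSTEMS = [
    ("Tianhe-2", 580_109, 46_080, 320),
    ("K", 426_972, 82_944, 64),
    ("Titan", 322_321, 18_648, 250),
    ("Piz Daint", 98_979, 5_208, 250),
]


def per_proc_and_efficiency(total_gflops: float, procs: int, bw_gbs: float):
    """Return (GFLOP/s per processor, flops per byte of memory bandwidth)."""
    per_proc = total_gflops / procs
    # GFLOP/s divided by GB/s: the 1e9 factors and the seconds cancel.
    return per_proc, per_proc / bw_gbs


results = {name: per_proc_and_efficiency(total, procs, bw)
           for name, total, procs, bw in SYSTEMS}

for name, (pp, eff) in results.items():
    print(f"{name:<10} {pp:6.2f} GFLOP/s per processor  {eff:.3f} flops/byte")
```

Rounded to the table's precision, this reproduces the per-processor and efficiency columns. By this measure K and the two K20X-based systems extract roughly twice as much HPCG work per byte of bandwidth as Tianhe-2.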

 

If you'd like to learn more about this work on HPCG, be sure to attend Everett Phillips' talk in the NVIDIA Booth #1727 at Supercomputing 2014 on Tuesday, November 18 at 10:30am.

Read the Full Story.
