This is the second article in a two-part series about the challenges facing the HPC community in training people to write code and develop algorithms for current and future massively parallel, massive-scale HPC systems.
One of the most difficult design decisions facing HPC algorithm and software developers is how to partition data across numerous distributed computational devices. For example, a modern leadership-class supercomputer can contain tens of thousands of Intel Xeon Phi coprocessors. Unlike the SIMD programming model, which can efficiently map arithmetic operations to both vector and streaming processor architectures, no common framework exists to efficiently map algorithms to distributed hardware. Succinctly, communications latency caused by speed-of-light delays and communications software stack overhead represents an unavoidable, performance-limiting obstacle for MPI-based applications. Similarly, data movement between devices imposes PCIe latency and bandwidth limitations on offload-mode applications running on devices like Intel Xeon Phi and GPUs. The best current approach to scalable data partitioning is to gain a deep understanding of how data is used by the application, strive to exploit data locality as much as possible, and peruse technical papers and books such as “High Performance Parallelism Pearls” volumes one and two, or “GPU Computing Gems”, to find similar use cases where scalable solutions have been found.
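To make the idea concrete, the following is a minimal sketch (not drawn from any particular application) of one common partitioning strategy: a one-dimensional block decomposition in which each MPI rank owns a contiguous slice of a global array and exchanges only thin halo (ghost-cell) regions with its nearest neighbors. All names and sizes are illustrative.

```c
/* Minimal sketch: 1-D block decomposition of a global array across MPI ranks,
 * with a halo (ghost-cell) exchange so each rank communicates only with its
 * nearest neighbors. Illustrative only; names are not from any real code. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long nGlobal = 1L << 20;           /* global problem size */
    const long nLocal  = nGlobal / nranks;   /* cells owned by this rank */

    /* Two extra cells hold copies of the neighbors' boundary values. */
    double *u = calloc(nLocal + 2, sizeof(double));

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nranks - 1) ? MPI_PROC_NULL : rank + 1;

    /* Exchange halos: send my first owned cell left while receiving my right
     * halo, then send my last owned cell right while receiving my left halo. */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                 &u[nLocal + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[nLocal],     1, MPI_DOUBLE, right, 1,
                 &u[0],          1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... compute on u[1..nLocal] using only local data plus the halos ... */

    free(u);
    MPI_Finalize();
    return 0;
}
```

The point of the sketch is that each rank touches only its own slice plus a constant-sized halo, so communication volume stays bounded as the machine grows; that property, not the particular stencil, is what makes a partitioning scheme scale.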
The importance of data partitioning
The importance of data partitioning cannot be overstated, as this single design decision determines the scalability of the application algorithms and ultimately the usefulness of the application on current and future machines. Due to the failure of Dennard scaling, hardware manufacturers can no longer increase clock rates, leaving parallelism as the only route to significantly increased performance. Applications that cannot scale are doomed to run at roughly their current performance levels on modern and future hardware architectures. The adage that “a supercomputer is an expensive device that turns a compute-bound problem into an IO-bound problem” reflects both the truth and the pain of scaling limitations.
Good hardware design can greatly help both software engineers and software development efforts. For example, the new Intel Knights Landing (KNL) based Intel Xeon Phi systems contain three key features that can enable new algorithm development and breathe new life, performance, and scalability into legacy applications: (1) the inclusion of high-bandwidth 3D stacked memory, (2) the ability to run self-hosted – meaning that the PCIe bus no longer acts as a bottleneck to the parallel device – and (3) the inclusion of an on-chip Intel Omni-Path network controller. GPU vendors such as AMD and NVIDIA also recognize the necessity of these features, as 3D stacked memory and the ability to minimize data transfer overhead are on their product roadmaps as well.
Most HPC applications are memory-bandwidth limited, which is why 3D stacked memory (referred to as MCDRAM on the new Knights Landing family of devices) is so important: it can deliver roughly 5x the bandwidth of current DDR4 DIMM-based solutions. First, this faster memory can accelerate current memory-bandwidth-limited applications – conceivably by as much as 5x. In the HPC world, a 5x performance increase is a stellar improvement. Second, and less obvious, prefetching and other latency-hiding techniques can take advantage of the greater memory bandwidth to asynchronously move data without affecting processor performance. In fact, the KNL family of Intel Xeon Phi devices also contains a separate, abundant yet slower DDR4 memory subsystem, addressable alongside the MCDRAM, that can be further leveraged for latency hiding without impacting the high-performance MCDRAM memory subsystem.
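For codes that want explicit control over which data lands in the fast memory, one option (assuming MCDRAM is configured as separately addressable, i.e. flat mode) is the open-source memkind library and its hbwmalloc interface. The sketch below is illustrative only: the array name and size are made up, and a production code would tune which structures deserve the high-bandwidth memory.

```c
/* Minimal sketch: place a bandwidth-critical array in KNL MCDRAM via the
 * memkind library's hbwmalloc interface (flat-mode MCDRAM assumed), falling
 * back to ordinary DDR if no high-bandwidth memory is present.
 * Link against memkind (e.g., -lmemkind). Names and sizes are illustrative. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1UL << 26;        /* illustrative array size */
    int have_hbw = (hbw_check_available() == 0);   /* 0 => MCDRAM available */
    double *field;

    if (have_hbw)
        field = hbw_malloc(n * sizeof(double));    /* allocate in MCDRAM */
    else
        field = malloc(n * sizeof(double));        /* fall back to DDR   */

    if (!field) { perror("allocation failed"); return 1; }

    /* ... bandwidth-bound kernel operating on field[] ... */

    if (have_hbw) hbw_free(field);
    else          free(field);
    return 0;
}
```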
All data movement is expensive
The advantage of running self-hosted should be self-evident, as all data movement is expensive both in terms of energy consumption and in its negative impact on application performance. For these reasons, all good HPC application designs – and HPC educational programs – stress the importance of data locality, because locality enables high cache and on-chip resource utilization. Data locality translates directly into increased application performance. In addition, applications that exhibit high data reuse tend to scale well in distributed computing environments.
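A textbook illustration of this principle is loop blocking (tiling), where a computation is restructured to work on cache-sized tiles so that each value fetched from memory is reused many times before being evicted. The sketch below shows a blocked matrix multiply; the tile size is illustrative and would normally be tuned to the target cache.

```c
/* Minimal sketch: blocked (tiled) matrix multiply, C += A * B.
 * Working on BLOCK x BLOCK tiles keeps the operands resident in cache and
 * raises data reuse compared with the naive triple loop.
 * Assumes row-major n x n matrices and that C is zero-initialized (or holds
 * the values to accumulate into). Sizes and tile size are illustrative. */
#include <stddef.h>

#define BLOCK 64   /* tile edge; tune to the target cache in practice */

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                /* Multiply one tile of A by one tile of B into a tile of C. */
                for (size_t i = ii; i < ii + BLOCK && i < n; ++i)
                    for (size_t k = kk; k < kk + BLOCK && k < n; ++k) {
                        const double a = A[i * n + k];   /* reused across j */
                        for (size_t j = jj; j < jj + BLOCK && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```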
Even when the target supercomputer provides a self-hosted runtime environment (as the Trinity supercomputer does), it is best to teach and write HPC codes using the offload computational model. Experience has shown that successful HPC applications have very long lifespans, measured in decades. Yes, writing software for offload mode adds an extra burden (and challenge) for the programmer, but offloading can be disabled via a compiler switch, and on self-hosted systems pointer swapping can eliminate the data movement.
If it is believed that the application will be of interest to many users, then it is best to build in the ability to utilize the memory of a remote device during the design and implementation phase because: (a) offload-capable applications are very portable, which greatly increases the potential ROI (Return On Investment) of the software development effort – especially when open-standard offload models such as OpenMP 4.0 and OpenACC are used (see the sketch following this paragraph) – as the application can simply be recompiled to run on new, discrete devices as they become available; (b) many supercomputer procurements include the option to add additional discrete devices at a later time, which means offload-based applications can quickly capitalize on these relatively low-cost yet high-performance upgrades; (c) offload-mode applications can avoid serial processing bottlenecks, as on-chip heat and space limitations mean that the sequential processing performance of Intel Xeon Phi coprocessors and other massively parallel devices has to be much lower than that of a high-end Xeon or other processor; and (d) code that is designed to use the memory space of external devices can potentially be adapted to exploit future network and cloud-based environments – although this last characteristic is more of a future-proofing argument than a tangible current benefit.
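As a concrete, deliberately minimal example of an open-standard offload model, the sketch below uses OpenMP 4.x target directives. The function and variable names are illustrative; the point is that the map clauses make the host-to-device data movement explicit, and the same code falls back to running entirely on the host when no coprocessor is present or when offload is disabled at compile time.

```c
/* Minimal sketch: an open-standard OpenMP 4.x "target" offload region.
 * The map clauses describe data movement to and from the device; if no
 * device is available (or offload is disabled), the loop runs on the host.
 * Names and sizes are illustrative. */
#include <stdlib.h>

void scale_add(size_t n, const double *x, double *y, double alpha)
{
    /* Copy x to the device, copy y both ways, run the loop in parallel there. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

int main(void)
{
    const size_t n = 1UL << 20;
    double *x = malloc(n * sizeof(double));
    double *y = calloc(n, sizeof(double));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0;

    scale_add(n, x, y, 2.0);   /* offloads if a device is available */

    free(x); free(y);
    return 0;
}
```

Because the data movement is expressed declaratively rather than hard-wired to a particular bus or device, this style of code is what allows the "recompile for the next discrete device" portability argued for in item (a) above.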
The value of an on-chip network controller
Finally, the inclusion of an on-chip network controller such as Intel’s Omni-Path can reduce both latency and redundant data movement. Given that many HPC applications are network-latency bound, even the smallest reduction in network processing or data transfer overhead can result in significant application performance improvements. In addition, on-chip controllers save space and give system designers the ability to build smaller, more closely integrated supercomputers (a benefit in terms of cabling and the latency incurred by routers and other electronics required to move the network data), or to add an extra network card (when space permits), which can be used to double network bandwidth via bonded interfaces or potentially to provide multiple, application-specific optimized network topologies within the same supercomputer.
While good hardware design is important, code modernization is easily the most beneficial, significant, and long-lasting investment the HPC community can make to capitalize on current and future hardware investments. A huge number of legacy HPC applications were designed when running with two or four threads represented advanced software design, when the then-current leadership-class supercomputers contained a few tens or possibly hundreds of MPI ranks, and when a terabyte of storage was an enormous amount of data. Modern technology far surpasses those long-gone, pioneering efforts at high performance computing. A pointed joke in the HPC community that highlights the need for software modernization observes that we spend hundreds of millions of dollars building machines to run legacy applications written by students and professors decades ago. The recent $500M combined DOE CORAL and Trinity procurements emphasize the amount of real money being invested today in massively parallel leadership-class supercomputer hardware. The extent of this investment is underscored by a quote from HPC luminary Gary Grider in his Supercomputing 2014 talk, Preparing Applications for Next Generation IO/Storage, when, as part of his economic analysis, he observes that plotting “$M in Log Scale means you have hit the big time!”
Time for the software community to step up
In short, the advances in computational hardware and computing models over the past seven years – spurred by the move to massive parallelism forced by the failure of Dennard scaling – represent a disruptive change in the HPC community. The significant investment, rapid evolution, and widespread adoption of massively parallel computing devices like Intel Xeon Phi and GPUs clearly demonstrate that these devices are the logical next step in HPC hardware. The onus is now on the HPC software community, as many HPC applications are not currently using, or are simply unable to use, these devices.
The choice is clear: the HPC community needs to invest in code modernization to capitalize on the latest procurements, which will certainly act as a bellwether for future HPC systems and architectures. The long-term implications are very real, as stagnation in the HPC community translates to a loss of competitiveness in a variety of military, industrial, and economic arenas, not to mention the future impact on enterprise computing, as HPC tends to be the proving ground for the next generation of commercial data centers. Let us hope that both software modernization and HPC education will give scientists and engineers the ability to capitalize on the latest generation of multi-teraflop-per-second hardware. The performance is there; we just need modern software to exploit it.
One effort to address this is the Code Modernization program introduced by Intel’s software group, which includes a code modernization library of technical solutions.
In case you missed it, you can read Part 1 of this feature story.
Rob Farber is a recognized author and consultant with an extensive background in HPC and a long history of teaching and working with national labs and numerous corporations engaged in HPC worldwide. Rob has authored/edited several books on Intel Xeon Phi and GPU computing. He is the CEO/Publisher of TechEnablement.com.