This is the first article in a two-part series by Rob Farber about the challenges facing the HPC community in training people to write code and develop algorithms for current and future, massively-parallel, massive-scale HPC systems.
Numerous Gordon Bell Prize winners and a variety of HPC success stories clearly demonstrate that some programmers are able to exploit the full capabilities of today’s petascale supercomputers. The challenge is that most HPC software efforts (and programmers) tend to have a checkered history as developers confront new machine architectures containing vector processing units, non-uniform memory architectures, distributed memory accelerators, and hardware execution models utilizing hundreds to thousands of threads, plus an explosion of distributed computational elements now counted in the tens to hundreds of thousands of devices in large clusters.
Conventional computer science education that focuses on Turing machines and Java programming fails when students attempt to write performant and scalable applications to run in modern HPC environments. Even experienced HPC programmers struggle because there are simply too many scattered and obtuse bits of information that must be discovered and twiddled to achieve high performance on individual HPC clusters. In short, modern software development environments simply contain too many performance degrading rough edges that are either not obvious or are confusing. Even so, success stories such as TBB, Cilk+, OpenCL, and OpenMP show that it is possible to provide most programmers with consistent software development environments that do provide efficient and scalable massively-parallel execution capabilities.
“Just enough and no more”
The key to high performance HPC programming for the masses is “just enough and no more” – meaning just enough structure is provided so that the programmer is minimally restricted yet their programs – written according to the programming model – have the ability to run efficiently on today’s systems and the potential to scale to the exascale and beyond on future systems. One example that has achieved long-term success is vector programming according to a SIMD (Single Instruction Multiple Data) model.
SIMD is subset of the general-purpose MIMD (Multiple Instruction Multiple Data) programming taxonomy proposed by Flynn. SIMT is modeled after SIMD model. Applications based on these technologies tend to run efficiently on both vector hardware and on accelerator devices. In other words, algorithms and applications that exploit SIMD and SIMT parallelism run extremely well on all current HPC hardware architectures and represent a very, very safe bet that they will run extremely efficiently on future hardware architectures as well. The success of SIMT based GPGPU programming Language in HPC and the inclusion of #pragma SIMD in OpenMP 4.0 indicates the general utility and widespread adoption of the vector programming model exploiting SIMD parallelism.
From a training perspective and in the spirit of “just enough and no more”, students and application programmers who focus first on programming according to the SIMD programming model will generate efficient codes for most HPC application areas. Only when absolutely required should programmers venture into MIMD (Multiple Instructions Multiple Data) programming because of the difficulties in mapping general MIMD programs to vector hardware and accelerators.
Thread parallelism is the counterpart of SIMD and vector programming as thread-level parallelism is as essential to performance as the efficient use of the hardware vector and arithmetic capabilities. James Reinders (Chief Evangelist, Intel Corp.) captured this relationship between arithmetic efficiency and parallelism relative to performance in the following graphic:
To utilize very large numbers of threads (counted in the thousands and even millions), most programmers immediately grasp the potential of the TLP (Thread Level Parallelism) programming model. Succinctly, TLP programs place an efficiency bet by attempting to utilize as many threads as possible. The programmer wins the efficiency bet when the hardware can manage large numbers of threads to keep itself fully occupied. The result is high utilization of all the cores in a CPU or massively parallel device.
The beauty and the drawback of the TLP model
The beauty in the TLP model is that programmers need to focus on only one thing: use as many threads as possible. The drawback to this approach is that performance critical on-chip resources such as registers can be oversubscribed, thus causing a performance degradation. To remedy this, programmers can write their code according to the ILP (Instruction Level Parallelism) programming model, which strives to choreograph the data movement and computations inside one thread so that a minimal number of threads can be utilized to keep the parallel hardware busy. The minimalist nature of ILP programming maximizes the per-thread amount of performance critical on-chip resources which maximizes performance and facilitates scaling to large numbers of on-device computational units. The challenge with ILP programming is that the programmer must have fairly detailed knowledge of how the underlying hardware works to correctly choreograph the computation without losing other advantages provided by TLP such as latency hiding. (Most modern HPC programs are memory bandwidth limited, so hiding memory access latency is critical to achieving high application performance.)
Higher-level programming abstractions such as OpenMP 4.0, OpenACC, TBB, Cilk+, and others give programmers the ability to focus on TLP programming yet potentially accrue the benefits of ILP resource utilization. How well this potential is fulfilled depends on the compiler, which can analyze a parallel loop and optimize the code generation for high parallelism, high arithmetic (e.g. SIMD) efficiency, and the best per-thread resource utilization possible via ILP.
A challenge with modern HPC hardware is that many massively-parallel devices (e.g. Intel Xeon Phi and GPUs) are becoming so powerful that a single computational kernel cannot fully utilize the device. A kernel is a parallel region of the code that can run locally or on an accelerator. Current devices can now deliver 3-7 TF/s (Trillion Floating-point operations per second) of single or double-precision performance, yet many computational kernels don’t require such large floating-point capability. The result is poor device utilization.
Task level parallelism
The solution is task level parallelism where multiple kernels run concurrently on the highly parallel hardware. In this way, the programmer can give even very powerful devices enough work to fully utilize all the processing capabilities. Happily, most programming models make the creation and execution of multiple concurrent tasks relatively straight-forward.
Currently programmers have to manually analyze the data dependencies amongst their computational kernels to determine which kernels can run concurrently. These dependencies can be expressed as a dependency graph. However, the automatic creation and analysis of dependency graphs is an active area of research that holds much promise for the automation of both massively-parallel and distributed computation.
Finally, minimizing data movement in a target environment using either an offload mode or via MPI (Message Passing Interface) is a key challenge in HPC that will be discussed in the next article. In short, a variety of approaches can be taken to distribute computation across large numbers of nodes, but most approaches are problem dependent. As a result, current tribal knowledge garnered from technology papers and books such as “High Performance Parallel Pearls” volumes one and two, or “GPU Computing Gems” are necessary resources that can provide valuable usage cases.
In combination, SIMD computation and TLP programming are high performance robust concepts that have been proven over many generations of supercomputer architectures. These approaches help programmers structure their codes to avoid many performance traps so their applications can run efficiently on many different HPC architectures. The choice of how to express these concepts in code, be it OpenMP 4.0, OpenCL, OpenACC, TBB, Cilk+ or other methods lies with the programmer. It is well worth the time spent to evaluate each of these frameworks in terms of portability and the help they provide to avoid common performance pitfalls such as allocating vector aligned memory and managing thread placement. In this way the HPC programmer can minimize the number of external, system specific, and potentially obtuse bits of information that must be “twiddled” to achieve high or peak performance.
Next: We discuss the challenges with partitioning data in a distributed environment.
Rob Farber is recognized author and consultant with an extensive background in HPC and a long history of teaching and working with national labs and numerous corporations engaged in HPC worldwide. Rob has authored/edited several books on Intel Xeon Phi and GPU computing. He is the CEO/Publisher of TechEnablement.com.
Hello Rob,
You said that when it comes to the TLP model, programmers only have to focus on one thing : use as many threads as possible.
While in the context of a GPU this is indeed rather easy and efficient, I have to say that i’m currently a bit dubious about how this will hold for the future multiprocessors / CPUs.
The number of cores (and threads) on multiprocessors keeps increasing, and we have to wonder about how the scalability of our applications will evolve, and if it will be able to leverage to power of our, let’s say, 20 or 40 or 100 threads.
For example, many people I know are experiencing difficulties when it comes to take advantage of the roughly 60 cores of the Xeon Phi architecture.
As the number of cores/threads is increasing, we are indeed exerting more and more pressure on our NUMA architecture and caches. Cache-coherency, as well as control on how & where our physical threads spawn, are scheduled and act is becoming a bit more difficult.
While parallelizing loops with pragma languages (such as OpenMP) is an easy task to do, the scalability curve of our thread-parallelized loops will flatten quickly if we don’t have enough work to feed them. More work is likely to mean more iterations / more data, which brings us to think about how clever we’ll have to be with our caches and NUMA architecture to have efficient memory access.
More threads means more thread-spawning, and maybe more collisions & synchronizations in code section that are not embarassingly parallel.
* Do you think leveraging the power of all our threads on our multiprocessors or some accelerators will quickly reach its limits , without resorting to some form task-parallelism or MIMD?
* Do you know what steps are currently being taken to adress this point?
Thank you for the article,
Valentin