On the Road to Exascale: The Challenges of Portable Heterogeneous Programming

We heard some very good reviews of a talk given by Doug Miles of The Portland Group at the bi-annual Clouds, Clusters and Data for Scientific Computing technical meeting outside of Lyon, France in mid-September. Most of the talks from that meeting are available online at the CCDSC 2012 website, but the PGI talk did not include any slides. PGI has provided the Exascale Report with a copy of the transcript from the talk, which we have reproduced here with a few minor edits.

Most of today’s CPU-only large-scale systems have a similar look-and-feel: many homogeneous nodes communicating via MPI, where each node has a few identical processor chips, each chip has multiple identical cores, and each core has some SIMD processing capability. Programming one such system is very much like programming any other, regardless of chip vendor, total number of cores, number of cores per node, SIMD width or interconnect fabric.

Setting aside accelerator-enabled systems for a minute, how did we get here? How did we reach this level of homogeneity from such heterogeneous HPC roots? Twenty-five or thirty years ago we had vector machines, VLIW machines, SMP machines, massively parallel SIMD machines, and literally scores of different instruction set architectures. How did systems become so homogeneous?

Well, there are at least four pillars on which today’s HPC systems are built, each of which has shaped, and been shaped by, programming languages and models:

  • Instruction-level parallelism (ILP)
  • Commodity microprocessor technology
  • SMP node-level parallelism
  • Scalable or system-level parallelism

As we have refined each of these technologies, compilers and programming models have evolved based on what we can successfully hide from the programmer, and what we must expose to enable high performance programming. They also have determined what we can successfully virtualize, and what we must make explicit in our programming models and languages. By exploring this evolution, we can learn some important lessons about how programming of the coming set of heterogeneous architectures will likely develop and evolve on the road to Exascale computing.

So, what can we hide?

Examples in today’s commodity CPU-based systems include assembly code, superscalar out-of-order execution, interconnect technologies, interconnect topologies, and SIMD or vector register lengths. These are not exposed in HPC programming languages or models today, and you don’t need to know much about them to extract performance from HPC systems. Now, there always have been and always will be library developers who go to the lowest levels of system programming to extract maximum performance on algorithms that are so widely used that the tuning effort is worthwhile. These details are not totally inaccessible to library developers, but I’d argue they are effectively hidden from most HPC programmers.

Two examples where we have failed to hide underlying technologies include VLIW and distributed memory communication. VLIW tries to solve exactly the same problem as superscalar technology, relying on compiler technology instead of hardware. Some may argue otherwise, but the inability of compilers to either completely hide VLIW, or successfully virtualize and expose it, seems to have led to failure of VLIW in general-purpose CPUs. The reliance on compilers to perform very difficult analysis and transformations for all-or-nothing optimizations resulted in too many cases where VLIW CPUs weren’t competitive, or at least weren’t cost competitive, with the other available alternatives.

All attempts to hide distributed memory communication have failed similarly. Many valiant attempts to virtualize and expose it − Linda, CRAFT, HPF, Distributed OpenMP − have either failed outright or have been only marginally successful. As a result, distributed memory communication is exposed and explicit in nearly every HPC system today, and MPI is so successful that it is used even on systems like the large SMP machines from SGI that don’t necessarily require it.

The lesson here is we should hide what we can, but if hiding a feature or technology requires global all-or-nothing compiler analysis and transformations – like auto-parallelization or distributed memory, and apparently even VLIW – then hiding is doomed to failure. Don’t sit back and wait for a solution that is not going to happen. Translated into today’s terms, you shouldn’t be sitting back and waiting for compilers to hide and automatically use GPUs and accelerators. If you do, you’ll be left by the wayside on the road to Exascale.

What can we expose and virtualize?

So, hiding is the best of all possible worlds, but if we can’t hide a feature the next best thing is to expose it and virtualize it. A feature or technology is exposed if you know it’s there, and you must pay attention to it in order to extract performance from the system. In this context I mean it’s “virtualized” if you know it’s there, but it’s abstracted in the programming model into a form that is high-level and portable – vectors and SMP multi-processing have been successfully virtualized in today’s programming languages and models. There is always a performance penalty for this virtualization, but it prevails when the portability and productivity benefits outweigh that penalty.

Vectors had to be exposed early on for programmers to successfully adapt applications for high performance on vector machines. Vector registers and register lengths were successfully hidden, but the existence and concept of vectors was exposed. In what was really a brilliant move (maybe accidental, but still brilliant), Cray exposed vectors in the form of feedback from the compiler to the programmer.

The concept of vectors virtualized nicely into existing Fortran and C iterative constructs and array data structures, and the Cray compiler indicated when loops vectorized. More importantly, it indicated when they did not, and why not. Because the payoff for vectorization was so high, HPC developers paid attention to this feedback and adapted their programs to maximize automatic vectorization.
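
For illustration, here is a minimal C sketch of the kind of loops and compiler feedback described above; the functions and array names are hypothetical, and the comments paraphrase the sort of diagnostics a vectorizing compiler emits, not actual Cray listing output.

    /* Independent iterations with no loop-carried dependence: the compiler
       can report that this loop vectorized. */
    void add_arrays(int n, const float *a, const float *b, float *c)
    {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    /* A recurrence: each iteration depends on the previous one, so the
       compiler reports that the loop did not vectorize, and why, and the
       programmer can decide whether to restructure the algorithm. */
    void prefix_sum(int n, float *x)
    {
        for (int i = 1; i < n; ++i)
            x[i] = x[i] + x[i-1];
    }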

This recoding effort was all the more successful because not only did vectorized programs achieve high performance on a Cray, but the performance benefits were portable to other vector machines from NEC, Fujitsu, Hitachi, Convex, and on and on and on.

We hoped early on that auto-vectorization might lead to successful auto-parallelization for SMP systems, but compiler technology came up short there. The global analysis required for auto-parallelization was (and is) just too difficult, and is impossible if you start out with a bad algorithm or a poorly written program. As a result, to successfully extract performance from SMP systems we had to expose SMP threads to the programmer in the form of Cray autotasking directives, Sequent directives, SGI directives, etc. These directives were so successful that HPC end-users demanded portability and eventually forced consolidation of a wide variety of vendor-specific directives into the standard OpenMP directives we all use today.
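
As a reminder of what that directive style looks like, here is a minimal OpenMP sketch in C; the routine and names are hypothetical, but the pattern is the one those vendor-specific directives converged on.

    /* The directive exposes the existence of SMP threads while virtualizing
       their creation, scheduling and synchronization; each thread handles a
       chunk of the iteration space. */
    void scale_in_place(int n, double a, double *x)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            x[i] *= a;
    }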

Keep in mind this successful formula: vectors and SMP parallelism were successfully virtualized and exposed in the dominant existing programming languages of the day. There is another lesson there.

In general, virtualization makes it far less tedious to write programs – it improves productivity – and in the best examples like vectorization and OpenMP it results in both functional and performance portability as well.

What must we expose and make explicit?

If we can’t hide a technology, and we can’t successfully virtualize and expose it, the only option left is to expose it and make it explicit.

Obviously distributed memory communication is exposed. Not only is it exposed, but we’ve had to make the concept of it explicit in the form of MPI to enable high performance. MPI virtualizes the interconnect, but requires that all inter-process communication be explicitly coded and managed by the programmer. Seemingly the only way to get HPC programmers to think hard enough about the communication in their programs − to understand it well enough to make it efficient when it occurs, and to keep it to an absolute minimum − is to force them to write it all out manually. All attempts to hide it or virtualize it have come up short so far.
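
To make concrete what “explicitly coded and managed” means, here is a hedged C sketch of a one-dimensional halo exchange; the decomposition and names are hypothetical, but every transfer is spelled out in the source.

    #include <mpi.h>

    /* Exchange one boundary value with each neighbor in a 1-D decomposition.
       The programmer decides what is sent, to whom, and when. */
    void halo_exchange(double *u, int n, int rank, int size)
    {
        MPI_Status st;
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send my last interior value right, receive my left halo cell. */
        MPI_Sendrecv(&u[n-2], 1, MPI_DOUBLE, right, 0,
                     &u[0],   1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, &st);

        /* Send my first interior value left, receive my right halo cell. */
        MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  1,
                     &u[n-1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, &st);
    }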

We tried to virtualize communication in the form of HPF and several other languages and programming models, but most have failed and none has been remotely as successful as MPI. Why? Because the penalty for getting it wrong is so high and the global analysis required to hide or virtualize it is so difficult. Even small steps forward like Co-Array Fortran and UPC have been only marginally successful versus MPI.

One lesson here is that when we expose a feature or technology and make it explicit − as with MPI communication − the payoff for a total re-write has to be huge. It has to be measured in large factors, not just percentages. Another key lesson from MPI is that programming languages and models are only successful if they are standardized. True believers will argue until they are blue in the face about the superiority of co-arrays and UPC versus MPI, but early standardization and widespread adoption of MPI seem to have marginalized any inherent advantages of those models.

Who are the winners and losers so far, and why?

In the area of ILP, we have tried and vetted vector processing, VLIW processing, superscalar processing, hyper-threading and other styles of multi-threading. The two big winners here are vectors, which are used in every mainstream CPU today in the form of SIMD instructions, and superscalar technology, which has been a staple of every commodity CPU since the late 1990s. VLIW appears to be out of the running, largely because of shortcomings in compilers and programming models. The jury is still out on hyper-threading.

In the ISA wars, the great lesson is to co-opt successful technologies developed in HPC, and to have a huge commodity market into which you can sell. Superscalar Crays and DEC Alpha chips are the forebears of today’s commodity CPUs, vectors became SIMD extensions, and SMP begat multi-core CPUs. The importance of commoditization can’t be overstated. Clusters of workstations wiped out integrated MPPs and classic vector machines, and clusters of PCs wiped out clusters of workstations, due to this commodity effect. Commodity GPUs are now posing a threat to commodity CPUs as the HPC workhorse, precisely because they have 10x – 20x the raw compute power and their development is supported by a huge commodity market outside of HPC.

I would argue that any hardware technologies that aren’t rooted in or destined for a commodity market may be attractive in the near term but are probably doomed to failure in the long run. Be mindful of this as you decide where to invest your time and money with respect to new hardware technologies. If it’s not already obvious, when your hardware suppliers pitch new technology at you, ask them where and how it will find a place in a commodity market.

So, will clusters of GPUs or clusters of cell phones wipe out clusters of today’s commodity CPUs? It seems unlikely, but it does seem very likely that the road to Exascale will include a lot of great competition and an eventual shakeout. In the last 30 or 40 years we have distilled out the technologies and programming models that are proven to work. We’ve ended up with a canonical HPC architecture that, until the last few years, was becoming more and more homogeneous.

Now, with the advent of compute-capable GPUs, reactions by the entrenched processor vendors, and attempts by mobile CPUs to climb up the food chain, the potential space of architectures is once again opening very wide.

Exascale HPC systems solution space?

So, what does the Exascale HPC systems solution space look like today? It looks like a coming onslaught of heterogeneous HPC architectures, and we have to figure out how to program them.

Intel is promoting Intel 64 CPUs coupled (or not) via PCI to purpose-built x86-heritage MIC accelerators. They propose to program these systems with MPI, OpenMP, Intel’s Language Extensions for Offload (LEO) directives, C vector syntax extensions, Cilk+, Threading Building Blocks and eventually OpenMP extensions for co-processors.

NVIDIA is promoting Intel or AMD CPUs (for now) coupled via PCI to increasingly general-purpose GPUs. They are working to integrate ARM CPUs into the mix as part of Project Denver, but so far have played it very close to the vest on whether the ARM replaces the x86 or just makes the GPU easier to program. They propose to program these systems with MPI, CUDA C, CUDA Fortran, OpenACC directives, and eventually OpenMP extensions for accelerators.

AMD is promoting AMD CPUs coupled via PCI to general-purpose GPUs for HPC servers in the near term, and fully integrated APUs with a multi-core x86 and a GPU sharing the same silicon die in the medium- to long-term. They propose to program these systems with MPI, OpenMP, OpenCL, OpenACC and eventually OpenMP extensions for accelerators.

IBM seems to be sticking to its guns with Blue Gene, making it as scalable and energy efficient as possible, and is a leader in the current Top 500 and Green 500 lists with non-accelerator-based machines. That said, they are happy to sell you any of the above if that is what you really want.

ARM Ltd and its partners are promoting future 64-bit ARM CPUs, possibly coupled to accelerators of one sort or another. In this case, there is an all-new crop of GPUs from the embedded space that are likely to appear on ARM server SoCs whether you want them or not − the Imagination Rogue GPUs and ARM’s Vithar GPUs being the most likely, but Qualcomm and other large players have their own proprietary GPUs. Also, there are non-GPU accelerators from other large embedded players which could be incorporated as well.

What we’ve learned from the past is that portability and performance portability of software across the various types of HPC systems are requirements; productivity is highly desirable, but not if it compromises either of the other two to a significant degree. Any language or programming model that is not portable across most or all types of HPC systems and is not reasonably future-proof has little chance of surviving. We all need to keep this in mind as we develop, adopt and promote the wide variety of programming models in play today as candidates for use in programming tomorrow’s Exascale systems.

PGI on the Road to Exascale

With respect to programming models, the goal is to virtualize architectural concepts where possible, and to expose what you must so that skilled programmers can grasp a canonical machine model from which they can extract performance, and for which they can write applications that are both functionally and performance portable. This is what we achieved with vectorization, OpenMP and the virtualization of interconnects by MPI, and it is why today’s CPU-only architectures all look so similar.

For PGI, MPI and OpenMP are givens. We have to support them and continue to evolve our products to keep pace with those standards. They may eventually be displaced, but PGI won’t be investing in any efforts specifically intended to displace them.

In our view, OpenACC meets the basic criteria for long-term success. It virtualizes, within existing languages, the concept of an accelerator attached to a general-purpose CPU, where the two have potentially separate memories. It exposes and gives the programmer control over the biggest performance bottleneck – data movement over a PCI bus – and it is designed to be portable and performance portable across all of the major CPUs and accelerators. One important feature of OpenACC is that it anticipates a day when CPU and accelerator memories are no longer separate. When that occurs, you won’t have to make any changes to an OpenACC program − it will just run faster, because the compiler will no longer insert the implicit communication required to keep separate memories coherent.
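
As a hedged illustration, this is roughly how that control over data movement looks in C with OpenACC directives; the routine, arrays and clauses here are a hypothetical example, not a prescription.

    /* The data region makes movement over the PCI bus explicit and visible:
       x and y are copied to the accelerator once, and y is copied back once
       at the end of the region. */
    void saxpy_acc(int n, float a, const float *x, float *y)
    {
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }
    }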

OpenACC also virtualizes the mapping of parallelism from the program onto the hardware. We believe this will be successful because it has been done before, in the form of vectorization and OpenMP; in OpenACC it is just being applied in a different way. OpenACC enables you to express any legal CUDA or OpenCL kernel launch schedule in a directive-based syntax, which means all the potential parallel performance of an accelerator designed to run those low-level explicit models is accessible through OpenACC. Our challenge is to use compiler feedback to teach developers to express kernels in a way that is optimal and reasonably performance portable. In addition, most of the viable CPU and accelerator targets we see on the horizon have a similar form − a MIMD parallel dimension, a SIMD/SIMT parallel dimension, and a memory hierarchy that is exposed to a greater or lesser degree and must be accommodated by the programmer or compiler.
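
A minimal sketch of that loop-to-hardware mapping in C; the gang and vector_length choices below are hypothetical illustrations of expressing a launch shape, not tuning advice.

    /* gang maps to the MIMD dimension (for example, CUDA thread blocks) and
       vector to the SIMD/SIMT dimension (threads within a block), so the
       clauses mirror the shape of an explicit kernel launch. */
    void scale_2d(int n, int m, const float *a, float *b)
    {
        #pragma acc parallel loop gang vector_length(128)
        for (int j = 0; j < m; ++j) {
            #pragma acc loop vector
            for (int i = 0; i < n; ++i)
                b[j*n + i] = 2.0f * a[j*n + i];
        }
    }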

So, from our standpoint OpenACC is a given as well. PGI intends to support it on any accelerators or co-processors where it makes business sense for us. In particular, we’ll support it on any accelerators those of you in the HPC market choose to buy in significant volume.

We also see a need for a portable lower-level explicit accelerator programming model for library developers and power users. We think OpenCL is too low-level for the HPC market, and that CUDA is a better model – CUDA Fortran in particular because with just 3 or 4 minor extensions to existing languages it abstracts most CUDA API calls into a form that is equally efficient and much more readable. What you’ll see from us over the coming year or so is a proposal and implementation of a CUDA-like lower-level explicit model that can be effectively implemented on multi-core CPUs, GPUs and custom-built accelerators and is fully interoperable with OpenACC directives.

Finally, you can assume that PGI will have the full range of PGI compilers and tools technologies for ARM processor-based systems well before ARM becomes a significant player in HPC. Our goal is to be first there, and to bring to bear all the technology we’ve developed over the last 25 years to help enable the successful introduction of ARM CPUs into the HPC and server market.

For related stories, visit The Exascale Report Archives.