Phil Pokorny – Chief Technology Officer for Penguin Computing
Arend Dittmer – Director of HPC Product Management for Penguin Computing
In the never-ending quest for better performance, the HPC industry has
gone through many paradigm shifts, each triggered by disruptive
technologies. The current move towards streaming multiprocessor
accelerators, such as GPUs from Nvidia and AMD as well as Intel’s
upcoming MIC (Many Integrated Cores) architecture, is in many ways no
different from previous disruptive technologies that have changed the
HPC landscape. What does make the current ‘revolution’ different is
that these technologies are harder to leverage: software architects and
developers must not only embrace a new programming model but also
understand the underlying hardware architecture of each accelerator
platform to produce efficient code.
The saying “everything old is new again” comes to mind. We have seen
the transition from vector processor-based systems and ‘big iron’
SMP and NUMA architectures to clusters of interconnected commodity
systems running Linux. When heat dissipation prevented further
performance gains through higher clock frequencies, multi-core
processors emerged. After memory controllers were integrated with
processors on the same die, NUMA systems made a comeback. With each of
these architectural changes, programmers had to adopt new programming
models, but they rarely had to concern themselves with the actual
hardware implementation. Development tools and standardized compiler
directives and APIs such as OpenMP, MPI, and POSIX threads provided a
layer of abstraction that allowed for portable code.
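OpenMP illustrates this kind of abstraction. The minimal sketch below,
our own illustration rather than code from any particular application,
parallelizes a loop with a single directive; the same source runs on any
multi-core system, with no knowledge of core counts or cache hierarchy
baked into the code.

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {      /* initialize inputs */
            a[i] = (double)i;
            b[i] = (double)(N - i);
        }

        /* One directive expresses the parallelism; the compiler and
           runtime map the loop onto whatever cores are available. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[0]=%.1f c[N-1]=%.1f\n", c[0], c[N - 1]);
        return 0;
    }

Compiled with an OpenMP flag (for example -fopenmp with GCC), the loop
runs across all available cores; compiled without it, the pragma is
simply ignored and the program still produces correct results. That is
precisely the hardware-agnostic portability the directive model offers.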
Now, driven by the need for higher levels of efficiency and performance
per watt, a hybrid model for HPC has emerged that combines traditional
latency-optimized CPUs with high-throughput streaming multiprocessor
accelerators. These accelerators, which are characterized by hardware
multithreading, a multitude of relatively simple processing units, and a
SIMD (Single Instruction Multiple Data) execution model, are effectively
bringing vector processing back to supercomputing. So far this trend is
most prominent in the largest HPC deployments: three of the five systems
leading the TOP500 list are hybrid architectures, and the number of
accelerator-powered systems on the TOP500 has grown from three in
mid-2008 to nineteen by mid-2011. Still, these numbers hardly support
the notion of rapid adoption of streaming multiprocessor accelerators,
and adoption by the HPC mainstream has been slower yet.
Why is this technology being adopted more slowly than previous
disruptive technologies were? The introduction of hybrid architectures
poses software challenges similar to those of previous paradigm shifts.
There is the obvious and inevitable issue that some algorithms are
simply not well suited to the SIMD execution model that streaming
multiprocessor accelerators are optimized for. Even for algorithms that
lend themselves well to being ported, the learning curve is steep,
expertise is scarce, and legacy code is often not well understood.
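A hypothetical sketch of the first problem: in the C fragment below
(path_a and path_b are stand-ins of our own, not taken from any real
code), the branch taken depends on the data. Streaming multiprocessors
execute groups of threads in lockstep, so when lanes within one group
take different branches the hardware serializes both paths and much of
the SIMD width is wasted.

    #include <math.h>

    /* Stand-in ‘expensive’ computations (illustrative only). */
    static float path_a(float x) { return sqrtf(x + 1.0f); }
    static float path_b(float x) { return x * x; }

    /* Data-dependent branching: trivially parallel on a CPU, but a
       poor fit for SIMD execution. When neighboring elements take
       different branches, a lockstep SIMD group must execute both
       path_a and path_b, masking out the inactive lanes each time. */
    void process(const float *in, float *out, int n)
    {
        for (int i = 0; i < n; i++) {
            if (in[i] > 0.0f)
                out[i] = path_a(in[i]);
            else
                out[i] = path_b(in[i]);
        }
    }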
What is different this time is that programmers can no longer stay
hardware-agnostic. They have to develop for a specific accelerator type
and be acutely aware of its hardware architecture. There are no
software tools and programming models that let developers write code
that can be easily ported across accelerator platforms from multiple
vendors. As a consequence, there is also a lack of commercial
applications supporting these accelerators. Software vendors have not
yet fully committed to this technology, as no clear winner has emerged
and developing for every accelerator is cost-prohibitive. A quote by
Intel’s Tim Mattson makes the point: “We stand at the threshold of a
many-core world. The hardware community is ready to cross this
threshold. The parallel software community is not.”
Providing standardized syntax and semantics, OpenCL is a step towards
portability. However, OpenCL is a low-level language, and OpenCL code
tuned for one architecture will most likely run with poor performance,
or not at all, on another.
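The kernel below, a deliberately simple vector addition of our own
written for illustration, shows the standardized OpenCL C syntax. The
source itself is portable; what is not portable is performance, which
hinges on architecture-specific choices made outside the kernel, such
as work-group geometry and the use of local memory.

    /* A minimal OpenCL C kernel: one work-item per output element. */
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *c,
                       const unsigned int n)
    {
        size_t i = get_global_id(0);
        if (i < n)                 /* guard against padded global size */
            c[i] = a[i] + b[i];
    }

The same kernel can be dispatched with very different work-group sizes;
a geometry that saturates one vendor’s device can leave another’s
largely idle.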
Another approach to a common programming model for streaming
multiprocessor accelerators is based on standardized compiler
directives. The OpenMP Architecture Review Board is currently
considering the addition of directives and library routines that
support streaming multiprocessor accelerators in version four of the
OpenMP API specification. Integrating this new programming model with
existing compiler technology built around traditional latency-optimized
processors will be a daunting task for compiler vendors.
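To make the directive-based approach concrete, here is a hypothetical
sketch of what offload directives might look like; the spelling of the
pragmas is illustrative only, not the Review Board’s final syntax.

    /* Hypothetical accelerator-offload directives (illustrative
       syntax). The map clauses describe which arrays move to and
       from the accelerator; the loop itself stays ordinary C. */
    void vadd(const float *a, const float *b, float *c, int n)
    {
        #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

The appeal of this model is that, as with OpenMP today, the directives
can be ignored by a compiler that does not support them, leaving a
correct serial program behind.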
The issue of memory organization illustrates the point. Each vendor’s
accelerator architecture implements a different memory management
model, which can have a significant impact on application performance.
There are currently three different implementations a programmer has to
be aware of:
1. An accelerator physical address space that is separate from the host
system’s physical address space, with no support for a transparent
unified virtual memory space. With this approach, explicit data copy
operations to and from the accelerator are required, and two address
spaces have to be managed (see the sketch after this list).
2. A virtual address space that spans two distinct physical areas of
memory, with data migrating as needed via page faults. This transparent
implementation makes it easier for developers to write or port
accelerator-enabled code, but to optimize performance a programmer will
still have to provide guidance for data copy operations based on
physical memory locality.
3. A unified address space for system and accelerator, achieved by
physically integrating the accelerator and processor on a single die
behind a single memory controller, but with separate caches for
accelerator and processor. With this model, a developer has to be
concerned with the locality of data in the accelerator’s cache.
Interestingly, this approach was initially driven by the low-power
requirements of the mobile market. Working with AMD and an
industry-leading HPC site, Penguin Computing has been prototyping
systems that use this technology to evaluate its viability for
large-scale and Exascale computing.
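As a sketch of the first model, the C fragment below uses the OpenCL
host API to manage the two address spaces explicitly. It assumes a
context, command queue, and a kernel taking (input buffer, output
buffer) arguments have already been created; error checking is omitted
for brevity, and the function name and parameters are our own.

    #include <CL/cl.h>

    /* Memory model 1: separate physical address spaces, explicit
       copies. The programmer owns both sides of every transfer. */
    void run_with_explicit_copies(cl_context ctx, cl_command_queue q,
                                  cl_kernel kernel,
                                  const float *host_in, float *host_out,
                                  size_t n)
    {
        size_t bytes = n * sizeof(float);

        /* Allocate buffers in the accelerator’s address space. */
        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes,
                                      NULL, NULL);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes,
                                      NULL, NULL);

        /* Explicitly copy the input from host memory to the device. */
        clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, bytes, host_in,
                             0, NULL, NULL);

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &n, NULL,
                               0, NULL, NULL);

        /* Explicitly copy the result back to host memory. */
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, bytes, host_out,
                            0, NULL, NULL);

        clReleaseMemObject(d_in);
        clReleaseMemObject(d_out);
    }

In the second model the copy calls disappear and the runtime migrates
pages on demand; in the third the device buffers vanish altogether, and
the programmer’s attention shifts from copies to cache locality.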
Looking at the big picture, the performance benefits and the attractive
performance per watt that streaming multiprocessor accelerators offer
are so compelling that they will be adopted on a much larger scale.
Users whose needs are met by readily available generic and
special-purpose libraries that have already been ported can easily
leverage the benefits of streaming multiprocessor accelerators without
having to worry much about code portability. The speed of further
adoption will depend on the availability of a programming model that
enables developers to exploit each architecture’s performance benefits
through portable code that can be tuned for specific architectures with
reasonable effort.