Computing With MPI in Heterogeneous Environments

Over the past 20 years, the Message Passing Interface (MPI) has become an indispensable set of tools that lets developers take advantage of increasing hardware capabilities without having to understand the hardware layer. With MPI in place, developers can create applications that use multiple sockets and servers to speed up complex workloads. However, over the past two decades the underlying systems have grown much more complex. What was once a single CPU in a single socket connected to a flat network is now a node with multiple multi-core processors, additional many-core coprocessors, and often a fat-tree network. The number of communication channels in a modern cluster is roughly two orders of magnitude greater than it once was.

In many cases, applications have been modernized to use both MPI and OpenMP, as well as offloading through MPI to the coprocessor. Hybridizing an application in this way makes the most efficient use of the available computing resources. A modern HPC system spanning numerous racks, with multiple host CPUs and multiple coprocessors such as the Intel Xeon Phi coprocessor per node, can be tuned for maximum application performance through intelligent placement of ranks and threads. A pure MPI approach is impractical for large HPC systems, which involve massive amounts of communication between processes.
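To make the hybrid pattern concrete, here is a minimal sketch (not taken from any TACC application) of the usual structure: one MPI rank per socket or coprocessor, with OpenMP threads filling the cores inside it. It requests MPI_THREAD_FUNNELED so that only the master thread makes MPI calls.

/* Minimal hybrid MPI + OpenMP sketch: one MPI rank per socket or
 * coprocessor, OpenMP threads working within it. Illustrative only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the master thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    /* Threads share this rank's portion of the work. */
    #pragma omp parallel reduction(+:local)
    {
        int tid = omp_get_thread_num();
        local += (double)(rank * omp_get_num_threads() + tid);
    }

    double global = 0.0;
    /* Only the master thread communicates, honoring the FUNNELED contract. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d  global sum=%f\n", nranks, global);

    MPI_Finalize();
    return 0;
}

Built with an MPI compiler wrapper and OpenMP enabled (for example, mpicc -fopenmp), this runs with a handful of ranks per node and threads pinned within each rank's cores.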

A set of MPI providers for the Direct Access Programming Library (DAPL) can be configured to cover a wide range of uses. Different providers play different roles and can be chosen for small messages, large messages within a node, and large messages between nodes. On the Stampede system at the Texas Advanced Computing Center (TACC), the first provider is set by the Intel MPI implementation and uses the local InfiniBand card on the node. The second provider uses shared memory by default for communication within the same system. The third provider overcomes the point-to-point local-read bottleneck of the Intel Xeon E5 processor: the Intel MPI library introduces a design, also known as an InfiniBand proxy, that delivers better performance for MPI messaging across hosts and coprocessors, with bandwidth of up to 7 GB/sec between the CPU and a PCI device.
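The effect of a provider choice is easiest to see with a simple ping-pong bandwidth test run under different settings (Intel MPI exposes DAPL provider selection through environment variables such as I_MPI_DAPL_PROVIDER_LIST; consult the reference manual for the exact names in your release). The sketch below is a generic microbenchmark, not TACC's tooling: it times round trips between rank 0 and rank 1 at several message sizes and reports bandwidth.

/* Hedged sketch of a two-rank ping-pong bandwidth test. Run it with
 * different DAPL provider settings to compare large-message bandwidth
 * between a host rank and a coprocessor rank. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100;
    for (int bytes = 1024; bytes <= 16 << 20; bytes *= 4) {
        char *buf = malloc((size_t)bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double dt = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%10d bytes  %8.2f MB/s\n",
                   bytes, 2.0 * iters * bytes / dt / 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

Launching one rank on the host and one on the coprocessor, then repeating the run with and without the proxy provider enabled, shows where each provider pays off as message size grows.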

A sample application, the lattice Boltzmann method (LBM), was run on the TACC Stampede system to show the performance improvement of the proxy provider over a non-proxy configuration. At large message sizes, the proxy version showed about a 5x improvement in raw bandwidth, and about a 31% performance improvement over a coprocessor-only implementation.
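The large messages in such a solver typically come from halo exchanges between neighboring subdomains. The sketch below is illustrative only (the TACC code is not shown here, and the lattice dimensions and 19-velocity layout are assumptions): it shows the 1-D domain-decomposed ghost-slice exchange that generates the kind of large point-to-point messages the proxy provider accelerates.

/* Illustrative sketch (not the TACC code): halo exchange in a 1-D
 * domain-decomposed LBM solver. Each rank owns NX_LOCAL lattice slices
 * plus one ghost slice on either side; after each step the boundary
 * slices are swapped with the neighboring ranks. */
#include <mpi.h>
#include <stdlib.h>

#define NX_LOCAL 64              /* slices owned by this rank (assumed) */
#define SLICE    (64 * 64 * 19)  /* doubles per slice: ny*nz*19 distributions (assumed) */

static void exchange_halos(double *f, int rank, int nranks)
{
    int left  = (rank - 1 + nranks) % nranks;   /* periodic neighbors */
    int right = (rank + 1) % nranks;

    /* Send the rightmost owned slice right, receive the left ghost slice. */
    MPI_Sendrecv(&f[(size_t)NX_LOCAL * SLICE], SLICE, MPI_DOUBLE, right, 0,
                 &f[0],                        SLICE, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send the leftmost owned slice left, receive the right ghost slice. */
    MPI_Sendrecv(&f[(size_t)1 * SLICE],              SLICE, MPI_DOUBLE, left,  1,
                 &f[(size_t)(NX_LOCAL + 1) * SLICE], SLICE, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Owned slices plus two ghost slices. */
    double *f = calloc((size_t)(NX_LOCAL + 2) * SLICE, sizeof *f);

    for (int step = 0; step < 10; step++) {
        /* ... collide and stream over the owned slices ... */
        exchange_halos(f, rank, nranks);
    }

    free(f);
    MPI_Finalize();
    return 0;
}

Each exchanged slice here is a multi-megabyte message, which is exactly the regime where the reported bandwidth gains from the proxy provider matter.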

As nodes and clusters have become more complex, designating the appropriate provider for large MPI applications is critical to taking advantage of all of the compute power available.

Source: TACC, USA
