Internode Programming With MPI and Intel Xeon Phi Processor

Sponsored Post

While many application tuning guides focus on understanding the performance of an application on a single node, a highly scalable application will use multiple nodes to achieve very high performance. Parallelism is, of course, achieved within a node through message passing, threading, and vectorization, but also between nodes. To develop an application that executes on a number of nodes, a paradigm is needed that allows messages to be passed between the nodes, along with a comprehensive API that lets developers take advantage of the many nodes that make up a large-scale system.

The Message Passing Interface (MPI) has been available as a standard since 1993. The current version (as of August 2017) is MPI 3.1. MPI is a portable message-passing API that allows for very large-scale application development and can be used from popular programming languages such as Fortran, C, and C++. MPI is not sanctioned by any major standards body, but it has become widely used in many applications and domains due to its efficiency and continual improvement.

As the number of cores and the amount of memory on each node have grown over the past several years, OpenMP has become a popular way for applications to take advantage of the power of each node. Many applications today use both MPI for internode communication and OpenMP for intranode parallelism and the sharing of resources.


The key concept in MPI is the rank, or what some call the worker number. As a copy of the executable is launched on each node, the rank of that process must be kept track of. For example, one part of the application is usually designated rank 0, and the other processes are designated rank 1, rank 2, rank 3, and so on. Different parts of the overall workload are then assigned to different ranks, in MPI speak. At various times during the application's execution, results from each rank can be communicated back to the master rank, or data can be sent from one rank directly to another.
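As a minimal sketch of how ranks are used (the per-rank work shown here is just a placeholder value), the following C program queries its own rank and the total number of ranks, then combines the partial results on rank 0:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

    /* Each rank works on its own share of the problem
       (a placeholder value here, standing in for real work). */
    double partial = (double)rank;

    /* Results from every rank are combined on rank 0, the master rank. */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of contributions from ranks 0..%d = %.0f\n",
               size - 1, total);

    MPI_Finalize();
    return 0;
}
```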

Each MPI rank can create a number of threads up to the number of cores in the system, multiplied by the number of hardware threads per core when hyperthreading is enabled. In many cases, maximizing the number of threads per rank for computation and creating fewer ranks benefits overall performance, since there is less communication between systems and fewer resources are needed for MPI's internal buffers.
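A minimal sketch of this hybrid approach is shown below, assuming the OpenMP thread count per rank is set in the usual way (for example via the OMP_NUM_THREADS environment variable) to match the cores and hardware threads available to each rank:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Request an MPI library that tolerates threads; FUNNELED means only
       the main thread of each rank makes MPI calls. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank spawns its own OpenMP thread team for intranode work. */
    #pragma omp parallel
    {
        #pragma omp single
        printf("Rank %d is running %d OpenMP threads\n",
               rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```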

When it is known that direct rank-to-rank communication would be useful, the one-sided MPI_Put and MPI_Get operations are available. These can speed up the overall performance of an application, as overall synchronization can be reduced.
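As a minimal sketch of one-sided communication (assuming the program is launched with at least two ranks), rank 0 below writes a value directly into a memory window exposed by rank 1, with no matching receive posted on the target side:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one integer through an RMA window. */
    int local = -1;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);            /* open an access epoch */
    if (rank == 0) {
        int value = 42;
        /* Rank 0 writes directly into rank 1's window. */
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */,
                0 /* displacement */, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);            /* close the epoch; data is now visible */

    if (rank == 1)
        printf("Rank 1 received %d via MPI_Put\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```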

While MPI was originally developed for general-purpose CPUs and is widely used in the HPC space in this capacity, MPI applications can also be developed and deployed on the Intel Xeon Phi processor. With an understanding of the algorithms used in a specific application, tremendous performance can be achieved by combining OpenMP and MPI. Running applications that use MPI, understanding the balance of work, and then tuning these applications will be discussed in future articles.

Download your free 30-day trial of Intel® Parallel Studio XE 2018