Concurrent Kernel Offloading


The combination of a host CPU such as an Intel Xeon with a dedicated coprocessor such as the Intel Xeon Phi has been shown in many cases to significantly improve application performance. When the data sets are large enough, it makes sense to offload as much of the workload as possible. But is this still the case when the potential offload data sets are not large?

Usage scenarios for a single coprocessor device fall into the following cases:

  • A single host process offloads to a dedicated coprocessor
  • Multiple host processes offload to a shared coprocessor
  • Offloading to a shared, remote coprocessor

The Intel Xeon Phi coprocessor runs Linux and schedules work across its CPU cores. Both the OpenMP and MPI programming models are supported; however, experience has shown that OpenMP is the most popular method for offloading computations. Specific thread placement is available through environment variables such as KMP_AFFINITY and OMP_PLACES, and similar options exist for MPI runs.
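As a minimal sketch of this offload model (assuming the Intel compiler's offload pragmas for the Xeon Phi), a host process can push an OpenMP region to the coprocessor, while the MIC_-prefixed environment variables steer thread placement on the device side:

```c
#include <stdlib.h>

/* Minimal offload sketch (Intel compiler, LEO pragmas). Launch with, e.g.:
 *   MIC_ENV_PREFIX=MIC MIC_KMP_AFFINITY=balanced ./a.out
 * so the affinity setting applies only to the coprocessor-side runtime. */
int main(void) {
    enum { N = 1 << 20 };
    double *x = malloc(N * sizeof *x);
    for (int i = 0; i < N; i++) x[i] = i;

    #pragma offload target(mic:0) inout(x : length(N))
    {
        /* This parallel region runs on the coprocessor's cores. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            x[i] = 2.0 * x[i];
    }

    free(x);
    return 0;
}
```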

By offloading multiple kernels concurrently from the host, significant performance improvements can be achieved, provided two points are addressed (a sketch of the pattern follows the list):

  • Thread affinity – use the thread placement environment variables, such as MIC_KMP_AFFINITY and MIC_OMP_PLACES, for OpenMP applications. For MPI-based applications, explicit thread placement requires setting MIC_KMP_PLACE_THREADS.
  • Data transfers – with a single data channel between the host and the coprocessor, transfers can become serialized when the data packages are small. This contention was not observed under MPI, where each process offloads through its own context. If frequent transfers of small packages are needed, use a hybrid OpenMP and MPI environment.
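A sketch of the concurrent-offload pattern itself, again assuming the Intel compiler's offload pragmas: each host OpenMP thread issues its own offload to the shared device, and each offload spawns its own, smaller, device-side team. The kernel and the partition sizes here are hypothetical placeholders:

```c
#include <omp.h>
#include <stdlib.h>

enum { NUM_OFFLOADS = 4, CHUNK = 1 << 18 };   /* hypothetical sizes */

int main(void) {
    double *data = malloc(NUM_OFFLOADS * CHUNK * sizeof *data);
    for (int i = 0; i < NUM_OFFLOADS * CHUNK; i++) data[i] = i;

    /* One host thread per concurrent offload; all target the same device. */
    #pragma omp parallel num_threads(NUM_OFFLOADS)
    {
        double *part = data + omp_get_thread_num() * CHUNK;

        #pragma offload target(mic:0) inout(part : length(CHUNK))
        {
            /* Device-side team for this offload only; keep the total
               thread count within what the coprocessor can schedule. */
            #pragma omp parallel for num_threads(60)
            for (int i = 0; i < CHUNK; i++)
                part[i] *= 2.0;
        }
    }

    free(data);
    return 0;
}
```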

Two examples demonstrate these performance improvements.

DGEMM – performance was observed to scale almost linearly with the number of concurrent offloads, as long as the coprocessor could handle the total thread count. With an OpenMP version, however, the data transfers between successive offloads can degrade performance; this was not observed with the MPI model. For workloads that use several DGEMM-like operations, programmers can partition a coprocessor for the multiplication of small or medium-sized matrices.
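As an illustrative sketch (not the study's actual code), the partitioning idea can be expressed with MKL's DGEMM inside each offload region; MKL is available on the coprocessor side, and the per-offload matrix size M here is a hypothetical choice:

```c
#include <mkl.h>
#include <omp.h>

enum { M = 512, NUM_OFFLOADS = 4 };   /* hypothetical sizes */

/* A, B, C: per-offload matrix sets, each M*M, allocated by the caller. */
void concurrent_dgemm(double *A[], double *B[], double *C[]) {
    #pragma omp parallel num_threads(NUM_OFFLOADS)
    {
        int t = omp_get_thread_num();
        double *a = A[t], *b = B[t], *c = C[t];

        /* Each host thread offloads one small DGEMM; together the
           offloads partition the coprocessor instead of leaving it
           underutilized by a single small multiplication. */
        #pragma offload target(mic:0) \
            in(a, b : length(M * M)) inout(c : length(M * M))
        {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        M, M, M, 1.0, a, M, b, M, 0.0, c, M);
        }
    }
}
```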

Particle Dynamics (PD) – a somewhat more complex case involves offloading a PD application. The PD compute kernel was found not to scale to the thread counts the coprocessor needs for full utilization. With 8 concurrent offloads of 30 threads each, there was roughly a 25% gain over a single offload using 240 threads.
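The reported 8 × 30 configuration maps onto the same concurrent-offload pattern; in this hypothetical sketch, partition[], partition_size[], and update_particle() stand in for the application's actual data structures and kernel:

```c
#include <omp.h>

enum { NUM_OFFLOADS = 8, THREADS_PER_OFFLOAD = 30 };

/* Placeholder kernel; code called inside an offload region must be
   compiled for the coprocessor as well, hence the target attribute. */
__attribute__((target(mic)))
static void update_particle(double *p) { *p += 0.1; }

void pd_step(double *partition[], int partition_size[]) {
    #pragma omp parallel num_threads(NUM_OFFLOADS)
    {
        int p = omp_get_thread_num();
        double *part = partition[p];
        int n = partition_size[p];

        #pragma offload target(mic:0) inout(part : length(n))
        {
            /* 8 concurrent offloads of 30 threads each, rather than
               one offload with 240 threads. */
            #pragma omp parallel for num_threads(THREADS_PER_OFFLOAD)
            for (int i = 0; i < n; i++)
                update_particle(&part[i]);
        }
    }
}
```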

Source: Zuse Institute, Germany and Intel, Germany
