OpenMP is the preferred model for programming applications that can take advantage of more than one thread on a shared-memory system. Although parallelization through compiler directives can be largely automatic (or at least a considerable help) and can increase application performance, the directives must still be applied with care. The recently released OpenMP 4.0 standard adds support for offloading portions of an application to many-core accelerators such as the Intel Xeon Phi coprocessor.
There are three basic programming models for taking advantage of coprocessors:
- Native – just run the application on the coprocessor, on its own operating system.
- Symmetric – use both the host CPU and the coprocessor together, simultaneously.
- Offload – run the application on the host and offload selected portions of the computation to the coprocessor.
The “Offload” mode is the most challenging from a programming perspective, but it can deliver the largest performance gains, and it lets developers benefit from future generations of the coprocessor. The challenge is to structure the code so that it takes advantage of both the host CPUs and the coprocessor’s architecture. Using OpenMP 4.0’s new offload directives, substantial performance gains can be realized while load balancing between the CPUs and any number of coprocessors.
As an example, an N-Body kernel will be examined. Starting from a basic N-Body kernel, OpenMP directives were used to parallelize the various loops, and the OpenMP “simd” directive was used to vectorize the two main loops. Strong scaling was observed on both the host CPU and the Intel Xeon Phi coprocessor. Even with this small amount of work, a respectable speedup of about 3.2X was achieved, though this was still only about a quarter of the peak speed of the coprocessor.
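A minimal sketch of what such a kernel might look like with these directives applied (the structure-of-arrays layout, the array names, and the softening constant EPS2 are illustrative assumptions, not ICHEC’s actual code):

```c
#include <math.h>

#define N    65536
#define EPS2 1e-9f   /* illustrative softening term; avoids division by zero */

/* Structure-of-arrays layout keeps the inner loop vectorizable. */
static float x[N], y[N], z[N], m[N];
static float fx[N], fy[N], fz[N];

void compute_forces(void)
{
    /* Outer loop parallelized across threads. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        float axi = 0.0f, ayi = 0.0f, azi = 0.0f;
        /* Inner loop vectorized with the OpenMP 4.0 simd directive. */
        #pragma omp simd reduction(+:axi,ayi,azi)
        for (int j = 0; j < N; j++) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float r2    = dx*dx + dy*dy + dz*dz + EPS2;
            float inv_r = 1.0f / sqrtf(r2);
            float s     = m[j] * inv_r * inv_r * inv_r;
            axi += dx * s;
            ayi += dy * s;
            azi += dz * s;
        }
        fx[i] = axi;  fy[i] = ayi;  fz[i] = azi;
    }
}
```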
The next step is to offload the entire calculation. The “omp target” directive offloads the computation to the Intel Xeon Phi coprocessor, and “omp target data” is used to send all of the data to the coprocessor. Speedups of about 3X over using the main CPU alone were observed.
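A sketch of this offload step, assuming a variant of the kernel that takes its arrays as parameters (here called compute_forces_n(), a hypothetical name) and is compiled for the device with “declare target”:

```c
/* Same body as compute_forces() above, but operating on its arguments.
   Enclosing the declaration and definition in "declare target" makes the
   compiler build a coprocessor version of the function as well. */
#pragma omp declare target
void compute_forces_n(int n, const float *x, const float *y, const float *z,
                      const float *m, float *fx, float *fy, float *fz);
#pragma omp end declare target

void run(int n, int nsteps, float *x, float *y, float *z, float *m,
         float *fx, float *fy, float *fz)
{
    /* "omp target data" maps the arrays once, so they stay resident on
       the coprocessor across all time steps instead of moving each step. */
    #pragma omp target data map(to: x[0:n], y[0:n], z[0:n], m[0:n]) \
                            map(from: fx[0:n], fy[0:n], fz[0:n])
    {
        for (int step = 0; step < nsteps; step++) {
            /* Each "omp target" region executes on the coprocessor; the
               maps find the data already present, so nothing is re-copied. */
            #pragma omp target map(to: x[0:n], y[0:n], z[0:n], m[0:n]) \
                               map(from: fx[0:n], fy[0:n], fz[0:n])
            compute_forces_n(n, x, y, z, m, fx, fy, fz);
        }
    }
}
```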
Another optimization is to combine the host-processor and coprocessor versions. By offloading some of the work to the Intel Xeon Phi coprocessor while retaining the rest on the Intel Xeon processor, additional speedups can be realized. It is important to send the right amount of work to each device, so that the host and the coprocessor genuinely work in parallel. To keep the computation running in parallel on the host system, OpenMP nested parallelism is used. Next, it is necessary to distribute the work across the host AND the device: by slightly modifying the Newton function to limit its work to a given range, the application can control how much work is allocated to the device, as in the sketch below.
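One way the split might look, using nested parallelism so that one host thread drives the coprocessor while another keeps the host busy; compute_forces_range() is a hypothetical stand-in for the modified Newton function, with its outer loop restricted to bodies in [begin, end):

```c
#include <omp.h>

/* Hypothetical range-limited variant of the force (Newton) function:
   same body as the first sketch, but the outer loop runs over
   [begin, end) only. Also compiled for the device. */
#pragma omp declare target
void compute_forces_range(int n, int begin, int end,
                          const float *x, const float *y, const float *z,
                          const float *m, float *fx, float *fy, float *fz);
#pragma omp end declare target

void step_split(int n, float ratio,   /* fraction of bodies kept on the host */
                float *x, float *y, float *z, float *m,
                float *fx, float *fy, float *fz)
{
    int split = (int)(ratio * n);     /* host: [0, split), device: [split, n) */

    omp_set_nested(1);                /* allow a nested team on the host side */

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            /* Thread 0 blocks in the offload while thread 1 computes. */
            #pragma omp target map(to: x[0:n], y[0:n], z[0:n], m[0:n]) \
                map(from: fx[split:n-split], fy[split:n-split], fz[split:n-split])
            compute_forces_range(n, split, n, x, y, z, m, fx, fy, fz);
        } else {
            /* Thread 1 spawns the nested host team inside the function. */
            compute_forces_range(n, 0, split, x, y, z, m, fx, fy, fz);
        }
    }
}
```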
An interesting observation is that the host and the device each perform the computation at a measurable speed, which can be expressed as iterations per second for a given loop. The amount of work assigned to the host and the device can then be derived from these rates and adjusted based on the previous iteration: starting from an initial host-to-device ratio, the value is refined as the program runs.
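One way such an adjustment might be written (the rebalancing rule below is an assumption for illustration, not ICHEC’s exact formula): time each side with omp_get_wtime(), convert the measured times to iterations per second, and give the host the share matching its observed speed.

```c
/* Recompute the host's share of the work from the rates observed on the
   previous iteration. t_host and t_device are the measured times, in
   seconds, for the host and device portions of the loop. */
float rebalance(float ratio, int n, double t_host, double t_device)
{
    int    host_iters   = (int)(ratio * n);
    int    device_iters = n - host_iters;
    double host_rate    = host_iters   / t_host;    /* iterations/second */
    double device_rate  = device_iters / t_device;

    /* New host share: its fraction of the combined observed throughput. */
    return (float)(host_rate / (host_rate + device_rate));
}
```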
Once these changes were made, a speedup of up to 4X was observed when using both the host and the device concurrently. As the number of coprocessors increases, additional OpenMP directives can be used to partition the work; the number of available devices can be queried with omp_get_num_devices(). The workload can then be balanced across the host and the multiple devices, using the observed ratios of the work that each can complete. In this latest iteration of the optimizations, using two Intel Xeon Phi coprocessors showed about a 6.5X improvement over the host alone.
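A sketch of how the slices might be distributed once several coprocessors are present; the share[] array (one observed work fraction per device) and the range helper from the earlier sketch are illustrative assumptions:

```c
#include <omp.h>

void step_multi(int n, const float *share,  /* share[d]: device d's fraction */
                float *x, float *y, float *z, float *m,
                float *fx, float *fy, float *fz)
{
    int ndev = omp_get_num_devices();
    int begin[ndev + 2];              /* C99 VLA: slice boundaries */

    /* Slice d goes to coprocessor d; the host takes whatever remains. */
    begin[0] = 0;
    for (int d = 0; d < ndev; d++)
        begin[d + 1] = begin[d] + (int)(share[d] * n);
    begin[ndev + 1] = n;

    omp_set_nested(1);
    #pragma omp parallel num_threads(ndev + 1)
    {
        int slot = omp_get_thread_num();
        int lo = begin[slot], hi = begin[slot + 1];
        if (slot < ndev) {
            /* Offload this slice to coprocessor number "slot". */
            #pragma omp target device(slot) \
                map(to: x[0:n], y[0:n], z[0:n], m[0:n]) \
                map(from: fx[lo:hi-lo], fy[lo:hi-lo], fz[lo:hi-lo])
            compute_forces_range(n, lo, hi, x, y, z, m, fx, fy, fz);
        } else {
            /* The last slot computes its slice on the host. */
            compute_forces_range(n, lo, hi, x, y, z, m, fx, fy, fz);
        }
    }
}
```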
Source: ICHEC, Ireland