In a heterogeneous system that combines an Intel Xeon CPU with an Intel Xeon Phi coprocessor, several options are available for optimizing applications. Whether one approach has an advantage over another depends largely on the application being run. The two methods can be compared directly, provided the algorithm lends itself to taking advantage of either OpenMP or OpenCL.
By setting up an environment with an Intel Xeon E5 processor and an Intel Xeon Phi coprocessor, a range of benchmarks can be run to establish baseline values and then measured again after optimizations have been applied. Tools such as Intel VTune Amplifier help to understand how the code runs and where optimizations might be applied. Optimizations that often become evident include avoiding unnecessary array copies, reducing the overall instruction count, replacing divisions with reciprocal multiplications, and vectorizing the inner-block elements of the code. One interesting metric reported by Intel VTune Amplifier is the vectorization intensity, which indicates how well the application is vectorized.
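Two of the optimizations above, replacing division with a reciprocal and vectorizing the inner loop, can be sketched in a few lines of C. This is an illustrative example, not code from the benchmarks; the function names and the element-scaling kernel are hypothetical:

```c
#include <stddef.h>

/* Hypothetical kernel: scale each element by 1/divisor.
 * The naive form issues one floating-point division per element. */
void scale_div(float *out, const float *in, size_t n, float divisor)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] / divisor;
}

/* Optimized form: hoist the division out of the loop as a reciprocal
 * and hint to the compiler that the loop is safe to vectorize.
 * Multiplication is much cheaper than division and maps well to
 * the Xeon Phi's wide SIMD units. */
void scale_recip(float *out, const float *in, size_t n, float divisor)
{
    const float r = 1.0f / divisor;   /* one division instead of n */
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * r;
}
```

With `divisor` values whose reciprocal is exact (such as powers of two) the two forms produce identical results; otherwise the reciprocal form can differ in the last bit, which is a trade-off the developer must accept.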
Benchmarks can be run by taking an OpenCL version and duplicating it in an OpenMP version, in which parts of the code are offloaded to the Intel Xeon Phi coprocessor. These techniques enable developers to move code from an OpenCL-based application to an OpenMP-based one that can take advantage of vectorization. Performance results over a range of benchmarks from the Rodinia suite show the OpenMP versions performing anywhere from on par to four times better. It is worth noting that, without closely investigating the application, simply moving from an OpenCL to an OpenMP implementation may give worse results. In many cases, OpenMP gives developers the ability to identify vectorization opportunities and exploit data locality in ways that OpenCL developers may be constrained from doing. The most important factors developers should be aware of are the choice of algorithm, the memory layout, and the data access patterns.
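The offload pattern described above can be sketched with the OpenMP 4.0 `target` construct, which Intel's compilers map onto the Xeon Phi. The SAXPY-style kernel and function name here are hypothetical stand-ins for a benchmark's hot loop; on a machine without an offload device the region simply falls back to the host CPU:

```c
#include <stddef.h>

/* Hypothetical hot loop moved from OpenCL to OpenMP offload.
 * The map clauses copy the arrays to the coprocessor and copy the
 * result back; the parallel-for-simd pair spreads iterations across
 * the Phi's cores and vectorizes each core's share of the work. */
void saxpy_offload(float *y, const float *x, size_t n, float a)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Because the pragmas only annotate an ordinary C loop, the same source compiles and runs correctly without OpenMP support, which makes incremental porting from an OpenCL kernel straightforward.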
Source: Intel, Russia