Preparing Code For Parallel Execution

With the tremendous compute density of new processors, it is important to understand whether an application can take advantage of multicore hardware. “Developers should understand if an application might be ready to run in a highly vectorized or many core environment before attempting to do the work necessary to obtain the high performance that might be expected.”

Helping the Compiler Speed Intel Xeon Phi

Vectorizing code for the Intel Xeon Phi coprocessor is similar in many ways to vectorizing code for the main CPU, and the performance improvement from coding smartly and using the available tools can be tremendous. The Intel Xeon Phi coprocessor can show very large gains in performance due to its extra-wide vector processing units. “Although it is time consuming to look at each and every loop in a large application, by doing so, and both telling the compiler what to do, and letting the compiler do its work, performance increases can be quite large, leading to shorter run times and/or more complete results.”
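
As a rough illustration of “telling the compiler what to do,” a loop like the one below becomes a straightforward auto-vectorization candidate once the compiler knows the arrays cannot overlap. The function and its arguments are invented for this sketch; the compiler’s vectorization report can confirm what was actually vectorized.

    /* Hypothetical kernel: restrict-qualified pointers assure the compiler
       that the arrays do not alias, so it is free to generate packed
       vector instructions for this loop. */
    void multiply_arrays(float *restrict out, const float *restrict a,
                         const float *restrict b, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = a[i] * b[i];
    }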

Compiler Directives for High Performance Computing

“Directives can be used as hints to the compiler to vectorize a loop. The developer would have better knowledge of any dependencies than a compiler, which must adhere to a number of rules when deciding if a loop can be vectorized. Directives force the compiler to vectorize, based on the knowledge of the developer; thus, if something does not work correctly, it is the responsibility of the developer to fix it, rather than blame the compiler.”
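
For example, the Intel compilers accept the ivdep directive, which tells the vectorizer to ignore assumed (unproven) dependencies. The indexed-store kernel below is a hypothetical sketch: the developer asserts that the index values never collide, and correctness then rests with the developer, exactly as the quote describes.

    /* Hypothetical sketch: the developer knows idx[] contains no duplicate
       values, so the assumed dependence between stores can be ignored. */
    void scatter_scale(float *a, const float *b, const int *idx, int n)
    {
        #pragma ivdep
        for (int i = 0; i < n; i++)
            a[idx[i]] = 2.0f * b[i];
    }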

Vectorization Steps

The process of vectorizing application code is very important and can result in major performance improvements when coupled with vector hardware. In many cases, incremental work can mean a large payoff in terms of performance. “Applications that have successfully been implemented on supercomputers or have made use of SIMD instructions such as SSE or AVX are excellent candidates for a methodology to take advantage of modern vector capabilities in servers today.”
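
One common incremental step, sketched below with hypothetical code, is loop fission: splitting a loop so that the arithmetic the compiler can vectorize is separated from a call (here a diagnostic print) that it cannot.

    #include <stdio.h>

    /* Before: the call inside the loop prevents vectorization of the body. */
    void step_before(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] * b[i];
            if (a[i] < 0.0f)
                printf("negative value at %d\n", i);
        }
    }

    /* After: the arithmetic loop is now a clean candidate for the vectorizer,
       and the non-vectorizable reporting is kept in its own loop. */
    void step_after(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * b[i];

        for (int i = 0; i < n; i++)
            if (a[i] < 0.0f)
                printf("negative value at %d\n", i);
    }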

Using Vectors on Intel Xeon Phi

The use of vector instructions can speed up applications tremendously when used correctly. The benefit is that much more work can be done in a clock cycle than by performing operations one at a time. The Intel Xeon Phi coprocessor was designed with strong support for vector-level parallelism. “When these techniques are used either individually or in combination in different areas of the application, the performance will surely be increased, in many cases without a lot of effort.”
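
To make the clock-cycle argument concrete: the 512-bit vector registers on the Intel Xeon Phi hold 16 single-precision (or 8 double-precision) values, so a single vector fused multiply-add does the work of 16 scalar multiply-adds. The classic saxpy kernel below is an illustrative sketch of a loop that maps directly onto those registers.

    /* saxpy: y = a*x + y. Once vectorized for 512-bit registers, each
       vector fused multiply-add instruction processes 16 single-precision
       elements of this loop at once. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }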

Intrinsic Vectorization for Intel Xeon Phi

“It is important to be able to express algorithms, and then the code, in an architecture-independent manner to gain maximum portability. Vectorization, using the available CPUs and coprocessors such as the Intel Xeon Phi coprocessor, is critical for HPC applications where performance is of the highest importance. However, since architectures change over time and become more powerful, using libraries that can adjust to the new architectures is quite important.”
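
Intrinsics illustrate the trade-off the quote points to: they expose the full vector width but tie the code to one instruction set. The sketch below assumes a processor with AVX-512 support (such as the Knights Landing generation of Intel Xeon Phi), 64-byte-aligned arrays, and a length that is a multiple of 16; the function name is invented for the example.

    #include <immintrin.h>

    /* Adds two float arrays 16 elements at a time with AVX-512 intrinsics. */
    void vec_add(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m512 va = _mm512_load_ps(&a[i]);   /* load 16 aligned floats */
            __m512 vb = _mm512_load_ps(&b[i]);
            __m512 vc = _mm512_add_ps(va, vb);   /* 16 additions in one instruction */
            _mm512_store_ps(&c[i], vc);          /* store 16 results */
        }
    }

A library or compiler-generated version of the same loop, by contrast, can be retargeted as instruction sets evolve, which is the portability argument made above.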

Modernizing Code with the Intel Vectorization Advisor

Threading and vectorization are two techniques known to increase the performance of an application on modern CPUs and coprocessors, and applying them together can increase performance more than either technique alone. However, a deep understanding of the application is needed in order to make the right decisions and to rewrite portions of the application to take advantage of these techniques. In cases where the developer might not be familiar with the code, an automated tool such as the Intel Vectorization Advisor can assist the developer.
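
The combination the Advisor points developers toward can be as simple as the hypothetical kernel below, where OpenMP threads divide the iteration space across cores and the simd clause asks each thread to vectorize its own chunk.

    /* Threading plus vectorization in one construct (OpenMP 4.0 or later):
       'parallel for' distributes iterations across threads, 'simd' vectorizes
       each thread's portion of the loop. */
    void threaded_vectorized(float *restrict y, const float *restrict x, int n)
    {
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            y[i] = x[i] * x[i] + 1.0f;
    }

Building with the compiler’s OpenMP option (for example, -qopenmp with the Intel compilers or -fopenmp with GCC) enables both parts of the directive.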

OpenMP and SIMD Instructions on Intel Xeon Phi

“Vector instruction sets have progressed over time, and it is important to use the most appropriate vector instruction set when running on specific hardware. The OpenMP SIMD directive allows the developer to explicitly tell the compiler to vectorize a loop. In this case, human intervention will override the compiler’s sense of dependencies, but that is OK if the developer knows their application well.”
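
As a minimal sketch of the directive in use, the dot product below carries a reduction that the compiler might otherwise treat conservatively; the simd directive and its reduction clause tell it exactly how to vectorize the loop. The function itself is illustrative, not from the original article.

    /* OpenMP SIMD directive: the developer explicitly requests vectorization
       and declares the reduction so partial sums are combined correctly. */
    float dot_product(const float *a, const float *b, int n)
    {
        float sum = 0.0f;
        #pragma omp simd reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }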

James Reinders Presents: Vectorization (SIMD) and Scaling (TBB and OpenMP)

James Reinders from Intel presented this talk at the Argonne Training Program on Extreme-Scale Computing. “We need to embrace explicit vectorization in our programming. But, generally use parallelism first (tasks, threads, MPI, etc.).”