The use of vector instructions can speed up applications tremendously when used correctly. The benefit is that much more work is done per clock cycle than when performing operations one at a time. The Intel Xeon Phi coprocessor was designed with strong support for vector-level parallelism.
For applications whose algorithms are well understood, developers can write or rewrite code so that the compiler can easily identify where vector instructions can be used. A number of compilers on the market today can recognize where to use vector instructions. Using libraries is an excellent start: an application can be vectorized simply by calling a library where appropriate.
In addition to using a library of functions that have already been vectorized, support is available for the developer to specify where the compiler should issue vector instructions and the data they operate on. There are a number of methods to achieve vectorization, which can be summarized as:
- Use a math library as described above; for example, the Intel Math Kernel Library (Intel MKL).
- Let the compiler vectorize the code where it can. This is also referred to as auto-vectorization. An example would be, inside a for loop, c(i) = a(i) + b(i) .
- Directives in the application that assist the compiler, such as #pragma simd , which direct it to generate Single Instruction, Multiple Data (SIMD) instructions.
- Using array notation. This can be done in C or FORTRAN and reflects the array operations that the developer implements. For example, c[i:MAX] = a[i:MAX] + b[i:MAX]
- Elemental functions. These allow a developer to apply vectorization when a program performs one operation at a time that could be done in parallel.
When these techniques are used, individually or in combination, in different areas of an application, performance can increase substantially, in many cases without a lot of effort.
Source: Intel, USA