To get maximum performance from the main CPUs or coprocessors, a developer needs to understand the algorithms and help the compiler generate the fastest possible code. The compiler can do an excellent job given the right conditions, but there is a lot the developer can do to create those conditions:
- Create loops that can be vectorized – understand the limits of the compiler and write code that it can vectorize.
- Aid the compiler by placing directives in the source code. Directives let the compiler generate better-optimized code, in some cases by overriding its own internal checks.
- Use SIMD directives with care. Because they can override the compiler’s checks, make sure you fully understand the algorithm before applying them.
- VECTOR and NOVECTOR directives – These directives can be used as hints to the compiler.
- Data alignment – make sure that the data is aligned for the target machine. Data objects placed on the right memory boundaries can be loaded and stored faster.
- Tell the compiler through a pragma that the data is aligned, since it cannot always determine this on its own. Alignment also allows faster movement of data between the main CPU and a coprocessor such as the Intel Xeon Phi coprocessor.
- Use array sections, which encourage the compiler to vectorize.
- Don’t overlap array sections.
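Several of the points above (a vectorizable loop, aligned data, and a directive that tells the compiler about the alignment) can be sketched in C. This is a minimal illustration under stated assumptions, not Intel's reference code: the names `ALIGNMENT`, `alloc_aligned_floats`, `scale_add`, and `scale_add_demo` are hypothetical, and the 64-byte alignment is an assumed value chosen to match a 512-bit vector register. The Intel-specific `#pragma vector aligned` is guarded so that other compilers simply skip it, and an OpenMP `simd` directive is used instead when OpenMP is enabled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64-byte alignment, which matches a 512-bit vector register. */
#define ALIGNMENT 64

/* Hypothetical helper: allocate n floats on an ALIGNMENT-byte boundary.
 * aligned_alloc (C11) expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    return aligned_alloc(ALIGNMENT, rounded);
}

/* Computes y[i] = a * x[i] + y[i] for n elements.
 * restrict promises that x and y never overlap, which removes the aliasing
 * doubt that often stops a compiler from vectorizing a loop. */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    /* The directives below assert that x and y are aligned. They are only
     * safe if the caller really allocated the data that way. */
#if defined(__INTEL_COMPILER)
#pragma vector aligned
#elif defined(_OPENMP)
#pragma omp simd aligned(x, y : ALIGNMENT)
#endif
    /* Counted loop, unit stride, no branches, no dependence between iterations. */
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical driver: returns the number of wrong elements,
 * or -1 if an allocation fails. */
int scale_add_demo(size_t n)
{
    float *x = alloc_aligned_floats(n);
    float *y = alloc_aligned_floats(n);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1;
    }

    /* Check the promise made to the compiler in scale_add. */
    assert((uintptr_t)x % ALIGNMENT == 0 && (uintptr_t)y % ALIGNMENT == 0);

    for (size_t i = 0; i < n; i++) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }

    scale_add(y, x, 2.0f, n);

    int mismatches = 0;
    for (size_t i = 0; i < n; i++) {
        if (y[i] != 2.0f * (float)i + 1.0f) {
            mismatches++;
        }
    }

    free(x);
    free(y);
    return mismatches;
}
```

Calling `scale_add_demo(1024)` should return 0 if the loop computed the expected values. To see whether the loop was actually vectorized, a compiler report helps: for example `-qopt-report` with the Intel compiler or `-fopt-info-vec` with GCC. With Intel's array-section notation, the loop body could instead be written as `y[0:n] = a * x[0:n] + y[0:n];`, which is only valid here because the `x` and `y` sections do not overlap.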
Vectorizing code for the Intel Xeon Phi coprocessor is similar in many ways to vectorizing code for the main CPU. The performance improvement from coding smartly and using the available tools can be tremendous, and the Intel Xeon Phi coprocessor can show especially large gains because of its extra-wide vector processing units.
Examining every loop in a large application is time consuming. However, by both telling the compiler what to do and letting the compiler do its work, the performance increases can be quite large, leading to shorter run times, more complete results, or both.
Source: Intel, USA