Applications that simulate real-world physics or chemistry may contain millions of lines of code. These programs are written in high-level languages such as Fortran or C and typically contain many loops. These loops perform some operation over a range of values and are great candidates for running in parallel. When profiling a large application to find its hot spots, the areas that deserve a closer look will most likely involve loops.
In the past, developers got the best results by unrolling loops by hand, that is, duplicating the loop body as many times as needed so that the operations could work on full vectors. The unroll factor would reflect the hardware that the code was targeted at. Since the application may have to run on different hardware in the future, code tuned this way for today's generation of hardware may perform poorly on the next one.
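As a rough illustration of what hand unrolling looked like, the sketch below unrolls an element-wise addition by a factor of four. The function name `add_unrolled4` and the factor of four are hypothetical; the factor stands in for an assumed vector width of four floats on the target machine.

```c
#include <stddef.h>

/* Hand-unrolled by 4. The factor is an assumption about the target's
   vector width, which is what makes this style hard to carry forward
   to new hardware. */
void add_unrolled4(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;

    /* Main body: four independent additions per trip. */
    for (; i + 4 <= n; i += 4) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

If the vector width doubles on newer hardware, both the main body and the remainder logic have to be rewritten by hand.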
In fact, it is better to let modern compilers do the unrolling. Whether guided by a #pragma in C or C++, or a !DIR$ directive in Fortran, compilers today can unroll loops more efficiently, leading to better performance. Keeping the loop in its simplest form allows the compiler to do its work and also makes the code more readable.
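A minimal sketch of the compiler-driven approach is shown below. The function name `add_simple` is hypothetical, and the pragma spellings are assumptions about the toolchain: `#pragma unroll(4)` is accepted by the Intel and Clang compilers, while GCC 8 and later use `#pragma GCC unroll 4`. In many cases the hint can be dropped entirely and the compiler left to choose the factor.

```c
#include <stddef.h>

/* The loop stays in its simplest form; the unroll hint is optional.
   `restrict` (C99) promises the compiler that the arrays do not overlap. */
void add_simple(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n)
{
#if defined(__INTEL_COMPILER) || defined(__clang__)
#pragma unroll(4)
#elif defined(__GNUC__)
#pragma GCC unroll 4
#endif
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Compared with a hand-unrolled version, there is no remainder loop to maintain, and retargeting to different hardware is a matter of recompiling.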
There are some general rules to follow to allow the compiler to vectorize loops in an application.
- In a nest of loops, normally only the innermost loop is vectorized. Use OpenMP to parallelize the outer loops. In some cases, a smart compiler will be able to interchange the inner and outer loops to enable vectorization.
- A loop that contains “straight-line” code can be vectorized. Branches or jumps to other code within the loop generally prevent vectorization, although simple conditional assignments that the compiler can turn into masked operations may still be vectorized.
- The loop must be “countable”. That is, the number of iterations must be known before the loop starts executing, although it does not have to be known at compile time.
- There must not be backward loop-carried dependencies. For example, the loop cannot require statement 2 of one iteration to be executed before statement 1 of the next iteration in order to get correct results.
- There should not be subroutine or function calls within the loop. However, certain functions that compilers have vectorized versions of can be included, such as the math functions sin, cos, etc. A more complete list of math functions that may be included within loops is available online.
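The rules above can be sketched with a small loop nest. The function name `scale_rows` and its parameters are hypothetical, and the code assumes a row-major matrix and a compiler invoked with OpenMP enabled; without OpenMP the pragma is simply ignored and the outer loop runs serially.

```c
#include <stddef.h>

/* out[r][j] = scale * in[r][j] + offset for a rows x cols matrix
   stored in row-major order. */
void scale_rows(float *out, const float *in, size_t rows, size_t cols,
                float scale, float offset)
{
    /* Outer loop: spread across threads with OpenMP. */
    #pragma omp parallel for
    for (ptrdiff_t r = 0; r < (ptrdiff_t)rows; r++) {
        const float *src = in + (size_t)r * cols;
        float *dst = out + (size_t)r * cols;

        /* Inner loop: the vectorization candidate. It is countable
           (cols is fixed before the loop starts), straight-line (no
           branches), free of function calls, and each iteration is
           independent of the others. */
        for (size_t j = 0; j < cols; j++) {
            dst[j] = scale * src[j] + offset;
        }
    }
}
```

Each thread works on whole rows, so threading and vectorization do not compete for the same loop.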
A loop that the compiler can vectorize looks something like:
for (i = 1; i < MAX; i++) {
    a[i] = b[i] + c[i];
    d[i] = e[i] - a[i-1];
}
In this case, the loop can be vectorized, since a[i-1] has always been computed by the time it is used, even when several iterations are executed together.
However, the following loop is not vectorizable as written, since with several iterations executing together a[i-1] might be needed before it has been computed.
for (i = 1; i < MAX; i++) {
    d[i] = e[i] - a[i-1];
    a[i] = b[i] + c[i];
}
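To see why statement order matters, the sketch below emulates vector execution in plain C: within each chunk of `width` iterations, one statement is applied to the whole chunk before the next statement starts. This is a simplified model under stated assumptions, not what any particular compiler emits; the function name `run_loop` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Runs the two-statement loop in chunks of `width` iterations.
   width == 1 is ordinary scalar execution; width == 4 models a
   4-element vector unit. a_first != 0 selects the first ordering
   (a[i] assigned before d[i]); a_first == 0 selects the second. */
void run_loop(float *a, float *d, const float *b, const float *c,
              const float *e, size_t n, size_t width, int a_first)
{
    for (size_t i = 1; i < n; i += width) {
        size_t end = (i + width < n) ? i + width : n;
        size_t k;

        if (a_first) {
            for (k = i; k < end; k++) a[k] = b[k] + c[k];
            for (k = i; k < end; k++) d[k] = e[k] - a[k - 1];
        } else {
            /* d reads a[k - 1] before this chunk has updated it. */
            for (k = i; k < end; k++) d[k] = e[k] - a[k - 1];
            for (k = i; k < end; k++) a[k] = b[k] + c[k];
        }
    }
}
```

With the first ordering, width 4 produces the same d as width 1. With the second ordering, the results differ, because most elements of d in a chunk are computed from stale values of a.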
The tips above are just some simple points to consider when writing code that modern compilers can vectorize. It is important to understand the flow of the application code on a larger scale, as well as the lower-level loop structure, in order to let the compilers do what they do best.