Vectorizing application code can yield major performance improvements when the code runs on vector hardware. In many cases, a small amount of incremental work brings a large payoff in performance.
Applications that have been successfully implemented on supercomputers, or that already use SIMD instructions such as SSE or AVX, are excellent candidates for a methodology that takes advantage of the vector capabilities of today's servers.
The first step is to measure the baseline performance of what would be the production code, before any vectorization is attempted. This build should not use any debugging flags and should use the usual compiler optimization flags (-O2 or -O3).
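A baseline measurement can be as simple as timing the unmodified workload a few times and keeping the fastest run. The following is a minimal sketch in C, not a prescribed harness: `struct workload`, `run_workload`, and `baseline_seconds` are hypothetical names introduced here for illustration, and the sum-of-squares loop only stands in for the real application.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the application's real workload. */
struct workload {
    const double *data;
    size_t n;
    double result;
};

/* Sum of squares over the input; the result is stored so the work
 * has an observable effect and cannot be discarded. */
void run_workload(void *arg)
{
    struct workload *w = arg;
    double sum = 0.0;
    for (size_t i = 0; i < w->n; i++)
        sum += w->data[i] * w->data[i];
    w->result = sum;
}

/* Call kernel(arg) `repeats` times and return the fastest CPU time
 * in seconds, or -1.0 if repeats < 1. */
double baseline_seconds(void (*kernel)(void *), void *arg, int repeats)
{
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        kernel(arg);
        clock_t stop = clock();
        double s = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || s < best)
            best = s;
    }
    return best;
}
```

Recording this number for the -O2 or -O3 build, with the same input data each time, gives a reference point against which every later vectorization attempt can be compared.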
The second step is to use profiling tools to identify hotspots in the code during a normal execution run or a set of runs. Identifying these hotspots makes it easier to decide which parts of the code, down to the loop level, deserve attention. Also decide in advance how many hotspots to examine, either by the percentage of run time they account for or by setting a limit on the share of the code to investigate.
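One way to make the cutoff concrete is to sort the profile by time share and take the leading entries until a chosen percentage of run time is covered. The sketch below assumes the profiler output has already been reduced to name and percentage pairs; `struct hotspot` and `select_hotspots` are hypothetical names, not part of any profiling tool.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical profile entry: a loop or function and its share of run time. */
struct hotspot {
    const char *name;
    double percent;   /* share of total run time, 0..100 */
};

/* qsort comparator: larger time share first. */
static int by_percent_desc(const void *a, const void *b)
{
    double pa = ((const struct hotspot *)a)->percent;
    double pb = ((const struct hotspot *)b)->percent;
    return (pa < pb) - (pa > pb);
}

/* Sort entries by descending time share and return how many of the
 * leading entries are needed to cover at least target_percent of the
 * run time (or n if the target cannot be reached). */
size_t select_hotspots(struct hotspot *h, size_t n, double target_percent)
{
    qsort(h, n, sizeof *h, by_percent_desc);
    double covered = 0.0;
    size_t k = 0;
    while (k < n && covered < target_percent) {
        covered += h[k].percent;
        k++;
    }
    return k;
}
```

With a target of 80 percent, for example, a profile of 60, 25, 10, and 5 percent would select only the first two entries, which keeps the investigation focused on the loops that dominate the run.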
The next step is to look at reports that indicate whether a loop is a candidate for vectorization. Compiler reports can show where to look more closely at the code to see whether some of the calculations can be done in parallel, for example across the elements of an array.
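To illustrate what such a report distinguishes, the sketch below contrasts a loop that is usually a good candidate with one that is not, as written. The function names are hypothetical. As an assumption about tooling, GCC can print vectorization remarks with -fopt-info-vec and -fopt-info-vec-missed, and Clang with -Rpass=loop-vectorize and -Rpass-missed=loop-vectorize; the exact wording of the remarks varies by compiler and version, so no sample output is shown.

```c
#include <stddef.h>

/* Likely candidate: unit-stride access and independent iterations.
 * `restrict` promises the compiler that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Unlikely candidate as written: each iteration reads the value
 * that the previous iteration just wrote (a running sum). */
void running_sum(size_t n, float *a)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```

A report would be expected to flag the first loop as vectorized or vectorizable and to name a dependence as the obstacle in the second, which is the cue to look at the algorithm rather than at compiler flags.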
Once these reports have been generated and the areas that can be vectorized are known, use the compiler's auto-vectorization on the parts of the code that might yield performance gains. Examine the advice that the tools give to determine whether the auto-vectorization is correct.
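Checking that a vectorized build is still correct usually means comparing its output with a trusted reference. The sketch below is one possible approach, with hypothetical function names: the kernel is a float reduction, which a compiler typically reorders only when relaxed floating-point flags or an explicit SIMD reduction hint allow it, and the reference accumulates in double in strict index order. The tolerance is an assumption that depends on the application.

```c
#include <stdbool.h>
#include <stddef.h>

/* Kernel under test: the loop the compiler is asked to vectorize. */
float dot_kernel(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Reference: the same math accumulated in double, in index order. */
double dot_reference(const float *x, const float *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)x[i] * (double)y[i];
    return sum;
}

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* True if `got` is within rel_tol of `ref`, measured relative to
 * |ref| (or to 1.0 when |ref| is smaller than 1.0). */
bool close_enough(double ref, double got, double rel_tol)
{
    double scale = abs_d(ref) > 1.0 ? abs_d(ref) : 1.0;
    return abs_d(got - ref) <= rel_tol * scale;
}
```

Running this comparison on representative inputs after each change in flags or source makes it harder for a wrong answer to go unnoticed behind a faster run time.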
The advice given by these tools can tell the developer whether there are dependencies in the code that need to be eliminated. It is important that the developer understands the algorithms and does not follow the advice blindly: the new code might be vectorized, yet the answers might come out wrong. Consider how each loop would behave if it were run backwards, which helps to identify dependencies between iterations.
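The run-it-backwards check can be tried directly on small inputs. The following sketch runs a loop body over the same data in ascending and descending index order and compares the results; `loop_body`, `body_scale`, `body_prefix`, and `order_independent` are hypothetical names for illustration. A mismatch demonstrates an order dependence, while a match on one input does not prove that none exists.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One loop iteration, expressed as a function of the array and index. */
typedef void (*loop_body)(double *a, size_t i);

/* Independent: iteration i touches only a[i]. */
void body_scale(double *a, size_t i) { a[i] = 2.0 * a[i]; }

/* Dependent: iteration i reads a[i - 1], written by iteration i - 1. */
void body_prefix(double *a, size_t i) { if (i > 0) a[i] += a[i - 1]; }

/* Run the body over [0, n) forwards and backwards on separate copies
 * of the input. Returns true if both orders give identical results,
 * false if they differ or if memory could not be allocated. */
bool order_independent(loop_body body, const double *input, size_t n)
{
    if (n == 0)
        return true;
    double *fwd = malloc(n * sizeof *fwd);
    double *bwd = malloc(n * sizeof *bwd);
    bool same = false;
    if (fwd && bwd) {
        memcpy(fwd, input, n * sizeof *input);
        memcpy(bwd, input, n * sizeof *input);
        for (size_t i = 0; i < n; i++)
            body(fwd, i);
        for (size_t i = n; i-- > 0; )
            body(bwd, i);
        same = memcmp(fwd, bwd, n * sizeof *input) == 0;
    }
    free(fwd);
    free(bwd);
    return same;
}
```

On the input 1, 2, 3, 4 the scaling body gives the same result in both directions, while the prefix-sum body gives 1, 3, 6, 10 forwards and 1, 3, 5, 7 backwards, which exposes the dependence between iterations.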
It is important to repeat these steps until all identifiable loops and sections of the code have been investigated. Continuing to iterate with this process can lead to improved performance, but is no substitute for restructuring the code with more parallel algorithms.
Source: Intel, USA