For maximum performance, data needs to flow smoothly into and out of the vector units. Several aspects of how data is organized in memory affect this: data layout, alignment, prefetching, and store operations.
On systems whose vector registers hold 16 single-precision floating-point values (512 bits), it is advantageous to align all values moving into and out of the registers on 512-bit (64-byte) boundaries. This helps achieve maximum performance. Consider c = a + b, where a, b, and c are vectors: if the vectors are laid out aligned in memory, then all is well. However, if the data is not aligned in memory, more instructions will be needed to get it into and out of the registers.
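The c = a + b case above can be sketched in C as follows. This is a minimal illustration, not a tuned kernel: the helper names (`alloc_aligned_floats`, `is_aligned`, `vec_add`) are hypothetical, and it assumes a C11 library that provides `aligned_alloc`. Whether the loop is actually vectorized depends on the compiler and its flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 64 /* 512 bits = 64 bytes = 16 single-precision floats */

/* Allocate room for n floats starting on a 64-byte boundary.
 * aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. Hypothetical helper. */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (rounded == 0)
        rounded = VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, rounded);
}

/* Return 1 if p sits on a 64-byte boundary, 0 otherwise. */
int is_aligned(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}

/* c = a + b, element by element. The restrict qualifiers tell the
 * compiler the arrays do not overlap, which makes vectorization easier. */
void vec_add(float *restrict c, const float *restrict a,
             const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Memory obtained this way is released with `free`. Allocating a, b, and c with the same helper means all three vectors start on a boundary that matches the register width.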
It is also important to understand how data moves from main memory to the L2 cache and then to the L1 cache. If data is to be reused, the instructions that use it should be close together, so that the data is still in the L1 cache when it is needed again.
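One way to keep reuse close together is to fuse loops that touch the same data. The sketch below is an assumed example, not taken from the source: both hypothetical functions scale a vector and sum the result, but the fused version uses each scaled value immediately, while it is still in a register or the L1 cache.

```c
#include <stddef.h>

/* Two passes: for a large n, y[i] may have left the L1 cache by the
 * time the second loop reads it back. */
float scale_then_sum_two_pass(float *y, const float *x, float s, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        y[i] = s * x[i];
    for (size_t i = 0; i < n; ++i)
        sum += y[i];
    return sum;
}

/* One fused pass: each y[i] is produced and consumed back to back. */
float scale_and_sum_fused(float *y, const float *x, float s, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        y[i] = s * x[i];
        sum += y[i];
    }
    return sum;
}
```

Both functions compute the same values; only the distance between the write and the later read of each element differs.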
Data alignment matters because the computational units perform best when data is aligned to 64-byte boundaries, as on the Intel Xeon Phi coprocessor, for example. Depending on the length of a matrix row, it can be advantageous to pad each row so that every row starts on an aligned boundary. A number of compiler directives can assist with alignment.
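Row padding can be sketched as follows, assuming a row-major matrix of single-precision values and a C11 `aligned_alloc`. The names `padded_row_floats` and `alloc_padded_matrix` are hypothetical. Each row is rounded up to a multiple of 16 floats (64 bytes), so if the first row is aligned, every later row is too.

```c
#include <stddef.h>
#include <stdlib.h>

#define ROW_ALIGN_BYTES 64
#define FLOATS_PER_LINE (ROW_ALIGN_BYTES / sizeof(float)) /* 16 */

/* Round a row length up to the next multiple of 16 floats. */
size_t padded_row_floats(size_t cols)
{
    return (cols + FLOATS_PER_LINE - 1) / FLOATS_PER_LINE * FLOATS_PER_LINE;
}

/* Allocate a rows x cols matrix with padded rows. The padded row length
 * (in floats) is returned through stride; element (r, c) lives at
 * m[r * stride + c]. The total size is a multiple of 64 bytes because
 * the stride is a multiple of 16 floats. */
float *alloc_padded_matrix(size_t rows, size_t cols, size_t *stride)
{
    *stride = padded_row_floats(cols);
    return aligned_alloc(ROW_ALIGN_BYTES, rows * (*stride) * sizeof(float));
}
```

For a row of 100 floats the stride becomes 112, which costs 48 bytes of padding per row in exchange for aligned row starts.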
Prefetching is also extremely important in HPC applications that use coprocessors. If the vectors are aligned, the data can be streamed to the math units very efficiently, with data being prefetched rather than the system having to load registers from scattered memory locations. For example, when a vector is stored contiguously in memory, it can be moved efficiently to the L2 cache and then to the L1 cache. If the data is scattered across the memory subsystem, delays occur as the individual data elements have to be moved toward the processor one at a time.
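Software prefetching over a contiguous vector might look like the sketch below. It relies on `__builtin_prefetch`, a GCC and Clang builtin, so it is not portable to every compiler. The function name and the prefetch distance of 64 elements are assumptions chosen for illustration; a useful distance depends on the hardware. For simple unit-stride loops like this one, hardware prefetchers and the compiler often do this work already, so the explicit hint mainly illustrates the idea.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 64 /* elements ahead; an assumed value */
#define FLOATS_PER_LINE 16   /* 64-byte cache line / 4-byte float */

/* Sum a contiguous vector while hinting upcoming cache lines. */
float sum_with_prefetch(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        /* One hint per cache line is enough. The bounds check keeps the
         * address inside the array. Arguments: address, 0 = read,
         * 3 = high temporal locality (keep in all cache levels). */
        if (i % FLOATS_PER_LINE == 0 && i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 3);
        sum += x[i];
    }
    return sum;
}
```

The hint does not change the result. It only asks the memory system to start moving the line toward the L1 cache before the loop reaches it.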
Source: Intel, USA