Parallel execution is the most common technique for speeding up large applications, but it alone is not sufficient for all cases. In particular, Amdahl’s Law states that the inherently sequential portion of a program is the limiting factor of speed improvements.

An additional tool for performance enhancement is to reduce the effective latency for fetching data. For example, the memory system could be made entirely from transistors (SRAM) rather than a combination of transistors and capacitors (DRAM). Such an architecture would be prohibitively expensive for most users, so many designs today call for a memory hierarchy featuring some SRAM and a lot of DRAM. This of course exploits the **principle of locality** in which data tends to be organized in arrays or accessed in loops.

Therefore, it is beneficial for a program to organize data to be accessed repeatedly and sequentially. Consider the follow matrix-matrix multiplication:

for ( i = 0; i < N; i++ ) for ( j = 0; j < N; j++ ) for ( k = 0; k < N; k++ ) C[i][j] += A[i][k] * B[k][j];

Languages like C have row-major ordering, in which the left-most index is interpreted first. Therefore, to improve locality of reference, the program should fix the row and vary the column. The variable `i` is used twice as a row index, `k` once as a row index, and `j` never as a row index. Therefore, the loop statement that iterates with `i` should be the outermost, the statement with `k` next, and `j` last. Thus, the above code should be transformed as follows:

for ( i = 0; i < N; i++ ) for ( k = 0; k < N; k++ ) for ( j = 0; j < N; j++ ) C[i][j] += A[i][k] * B[k][j];

All we've done here is swap the order of two loop statements. This operation is legal as addition is both commutative and associative. That is, the definition of the program is not changed when these two loop statements are swapped.

The performance gains from such a simple transformation can be tremendous as we have reduced the number of cache misses. The real beauty of this transformation, though, is that it requires no special hardware beyond what is already present on most machines. That is, this optimization can be performed on just about any platform.

As a general rule of thumb, locality is improved when the loop statements are structured such that immediate data accesses are all present within a cache block. Happy transformations!