Most computer architectures today can execute a few instructions in parallel on a single core by providing multiple control and arithmetic-logic units. The order of execution may be scheduled by the compiler (in a VLIW processor) or by the hardware itself (in a superscalar processor). Of course, not all instructions can be executed in parallel, because some depend on the results of others.
Software pipelining, which really has nothing to do with hardware pipelining, is a loop optimization technique to make statements within an iteration independent of each other. The goal is to remove dependencies so that seemingly sequential instructions may be executed in parallel. Consider a vector computation:
z = 2x^2 + y
Expressed in C, this computation would be:
for ( i = 0; i < N; i++ ) {
    xsquared    = x[i] * x[i];
    twoxsquared = xsquared << 1;
    z[i]        = twoxsquared + y[i];
}
The statements within an iteration must be executed sequentially because each one depends on the result of the previous one. However, each computation of z[i] is independent of the others. Therefore, we may restructure the loop so that partial computations of several z values are interleaved. This then allows the compiler or CPU to partially parallelize the resulting instructions.
We begin by unrolling the loop and arranging the iterations into a pipeline:
xs = x[i] * x[i];    |                        |
txs = xs << 1;       | xs = x[i+1] * x[i+1];  |
z[i] = txs + y[i];   | txs = xs << 1;         | xs = x[i+2] * x[i+2];
...                  | z[i+1] = txs + y[i+1]; | txs = xs << 1;
                     | ...                    | z[i+2] = txs + y[i+2];
Notice on the third row that we essentially have the original loop statements, but in reverse order. This is where the instruction-level parallelism comes into play: these statements belong to different iterations, so they have no dependencies among themselves and may overlap. This row is our new loop body. With the addition of some starter and finishing code, the end result is as follows:
xs  = x[0] * x[0];
txs = xs << 1;
xs  = x[1] * x[1];
for ( i = 0; i < N-2; i++ ) {
    z[i] = txs + y[i];
    txs  = xs << 1;
    xs   = x[i+2] * x[i+2];
}
z[N-2] = txs + y[N-2];
txs    = xs << 1;
z[N-1] = txs + y[N-1];
The statements in the new loop body are not entirely independent, because xs and txs are shared among iterations. Still, this is better than before: the multiply, shift, and add now belong to different iterations and may overlap, and the loop effectively pre-fetches upcoming array elements.
There is another caveat: this loop requires the array to have at least two elements. If shorter arrays are possible, the programmer must handle that case separately.
Software pipelining can really mangle code, as this example shows. Thus, for maintainability, it should be one of the last optimizations a programmer considers. But when every functional unit of the processor absolutely must be kept busy, removing intra-loop dependencies in this manner is one of the most effective techniques available.