What is loop-level parallelism?

Most high-performance compilers aim to parallelize loops to speed up technical codes. Fully automatic parallelization is possible but extremely difficult, because the compiler must prove that the semantics of the sequential program are preserved. Therefore, most users provide hints to the compiler. A very common method is to use a standard set of directives known as OpenMP, in which the user marks the sections of the code that are to run in parallel via a pragma.
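
As an illustration (the loop and variable names here are hypothetical, not taken from any particular code), a minimal OpenMP parallel loop in C looks like this:

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    /* Initialize the input arrays serially. */
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    /* The pragma asks the compiler to divide the iterations of the
       following loop among the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[42] = %f\n", c[42]);
    return 0;
}
```

Compiled without OpenMP support, the pragma is simply ignored and the loop runs serially, which is part of what makes the directive approach attractive.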

Ordinarily with OpenMP, the program is executed serially by a single master thread. Upon reaching a parallel loop, slave threads are spawned to perform some of the iterations. When the loop completes, the threads rejoin the master, which continues executing alone and sequentially. An important trait of this master/slave model is that the program's execution is decoupled from the system resources: the code expresses what may run in parallel, not how many processors to use. There are a few rules about what can be parallelized, though. For example, the loops must have a deterministically countable number of iterations; that is, the exact number of iterations must be determinable before the loop executes.
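
A hypothetical sketch of the countability rule: the first loop below has a trip count known on entry and can carry a parallel-for directive, while the second cannot, because its iteration count depends on values computed inside the loop.

```c
#include <math.h>

/* Countable: the trip count n is known before the loop starts,
   so the iterations can be divided among threads up front. */
void scale(double *x, int n, double factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        x[i] *= factor;
    }
}

/* Not countable: how many times this loop runs is only discovered
   as it executes, so it cannot take a simple parallel-for pragma. */
double halve_until(double x, double tol) {
    while (fabs(x) > tol) {
        x *= 0.5;
    }
    return x;
}
```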

Within the loop, the user specifies the scope of the variables. Variables may be shared across all threads, or may have private allocations within each thread. The only communication between two threads, then, is through one of these shared variables. By default, OpenMP will privatize the index of the outermost loop and leave all other variables shared. Correctness is left to the user.
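
A sketch of explicit scoping, with illustrative variable names: the arrays are shared, while the inner index and the temporary accumulator get a private copy in each thread.

```c
#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N][N], row_sum[N];
    double temp;
    int i, j;

    /* a and row_sum are shared by all threads; j and temp are private
       per thread.  The outer index i is privatized automatically. */
    #pragma omp parallel for shared(a, row_sum) private(j, temp)
    for (i = 0; i < N; i++) {
        temp = 0.0;
        for (j = 0; j < N; j++) {
            temp += a[i][j];
        }
        row_sum[i] = temp;
    }

    printf("row_sum[0] = %f\n", row_sum[0]);
    return 0;
}
```

Forgetting to privatize j or temp here would not be flagged by the compiler; the threads would silently race on them, which is what "correctness is left to the user" means in practice.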

To handle race conditions, the user marks the critical sections. These sections are abstractions for mutex locks and are usually sufficient for synchronization. Dealing with dependencies, however, is left to the user.
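
A hypothetical example of a critical section protecting a shared accumulator (in real code a reduction clause would be the idiomatic and faster choice, but the critical block shows the mechanism):

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
    }

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        /* Only one thread at a time may execute this block, so the
           concurrent updates to the shared variable sum do not race. */
        #pragma omp critical
        {
            sum += x[i];
        }
    }

    printf("sum = %f\n", sum);
    return 0;
}
```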

OpenMP has a few mechanisms for handling load imbalance. By default, the scheduling of loop iterations among the threads is performed statically. However, the user may request a dynamic schedule, in which case fixed-size chunks of iterations are handed out at run time to threads as they become idle. A guided self-scheduling scheme, which uses chunks that shrink as the loop progresses, is also available.
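
A sketch of requesting a dynamic schedule; the chunk size of 16 and the artificial workload are arbitrary choices for illustration.

```c
#include <math.h>

/* Iterations with larger i do more work, so a static schedule would
   leave some threads idle near the end.  schedule(dynamic, 16) hands
   out chunks of 16 iterations at run time to whichever thread finishes
   first; schedule(guided) would use chunks that shrink as the loop
   drains. */
void uneven_work(double *y, int n) {
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        double v = 0.0;
        for (int k = 0; k < i; k++) {
            v += sin((double)k);
        }
        y[i] = v;
    }
}
```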

While these features make OpenMP a great tool for SMPs, it performs poorly on NUMA machines because there is no means of directing data placement, either statically or dynamically. Some vendors have added language extensions that resemble HPF's directives for data distribution. Others have experimented with sophisticated page-migration schemes that move data closer to the thread that needs it. None of these approaches is standard yet.

There have also been recommendations to privatize as much data as possible and use shared data only to communicate between threads. But one has to wonder whether this simply degenerates into MPI-style programming. For now, there is no clear solution for using OpenMP beyond SMPs.