What is data-parallel programming?

In the task-parallel model represented by OpenMP, the user specifies the distribution of iterations among processors, and the data then travels to the computations. In data-parallel programming, the user specifies the distribution of arrays among processors, and only the processors that own the data perform the computation (the owner-computes rule). In OpenMP's master/slave approach, all code is executed sequentially on one processor by default; in data-parallel programming, all code is executed on every processor in parallel by default.
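As a minimal illustration of the task-parallel style (the array names and bounds here are hypothetical), an OpenMP directive distributes the iterations of a loop among threads, and each thread fetches whatever data its iterations happen to touch:

    !$OMP PARALLEL DO
    DO I = 1, N
       A(I) = B(I) + C(I)      ! iterations are divided among threads;
    END DO                     ! the data follows the computation
    !$OMP END PARALLEL DO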

The most widely used standard set of extensions for data-parallel programming is that of High Performance Fortran (HPF). With HPF, a user declares how to DISTRIBUTE data among abstract processors, usually in a BLOCK or CYCLIC fashion, the former intended for applications with nearest-neighbor communication and the latter for load balancing. Additionally, the user may ALIGN data elements with each other. Array elements that have been mapped in this way are assigned to exactly one processor; all other (non-mapped) data is replicated on every processor.
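A sketch of what these declarations typically look like follows; the array names, sizes, and processor arrangement are illustrative only:

    REAL, DIMENSION(1000)     :: A, B
    REAL, DIMENSION(1000,100) :: C
    !HPF$ PROCESSORS P(4)
    !HPF$ DISTRIBUTE A(BLOCK)    ONTO P   ! contiguous blocks: nearest-neighbor codes
    !HPF$ DISTRIBUTE C(CYCLIC,*) ONTO P   ! round-robin rows: load balancing
    !HPF$ ALIGN B(I) WITH A(I)            ! keep B(I) on the same processor as A(I)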

To parallelize a loop, the user declares its iterations to be INDEPENDENT. Data within the loop that has been given the NEW attribute remains private; all other data is replicated on each processor. Correctness is left to the user: INDEPENDENT is an assertion that the compiler does not verify.
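A minimal sketch of such a loop (the variable names are hypothetical) might be:

    !HPF$ INDEPENDENT, NEW(TMP)
    DO I = 2, N-1
       TMP  = A(I-1) + A(I+1)   ! TMP is private to each iteration
       B(I) = 0.5 * TMP         ! no iteration writes data read by another
    END DO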

Although HPF was designed for NUMAs, its performance in practice has been unpredictable. The reason does not appear to be the communication itself, but rather the data manipulation that occurs before and after the communication. It is important to observe that with modern high-performance networks, the cost of moving the data is less of an issue than the software overhead that surrounds it.

Another key observation is the "optimization envy" of the parallelizing-compiler groups. Just as some OpenMP users seek to exploit NUMAs, a few HPF users hope to achieve better performance on SMPs. One approach has been to add DYNAMIC and GUIDED attributes to the SHARE clause associated with INDEPENDENT. The compiler then produces multithreaded object code rather than message-passing code. An important difference between this approach and the standard is that non-mapped data is owned by a single master thread and is globally accessible to all other threads. The performance benefit is that no software translation of global addresses is required, since the hardware already provides this support; such address translation is a major source of run-time overhead.
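Because the exact spelling of this extension is vendor-specific and not part of standard HPF, the following is only an illustrative sketch of how such a scheduling attribute might be attached to an INDEPENDENT loop:

    !HPF$ INDEPENDENT, SHARE(GUIDED)    ! non-standard: thread scheduling hint
    DO I = 1, N
       A(I) = SUM(B(1:I))               ! iteration cost grows with I, so guided
    END DO                              ! scheduling helps balance the threads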