To reconcile task parallelism (as in OpenMP) with data parallelism (as in HPF), a different programming model has emerged that claims to offer the best of both. It is the partitioned global address space (PGAS) model, and it has been applied to a variety of languages, the most widely used being Unified Parallel C (UPC).
In UPC, all variables are private by default, and every instruction is executed by each thread. However, the user may declare some data to be shared and may specify its distribution in either block or cyclic form. Furthermore, the user may divide the iterations of a parallel loop either by data ownership (as in HPF) or by thread index (as in OpenMP). UPC also provides synchronization mechanisms and allows the user to choose between strict and relaxed memory consistency.
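The constructs just described can be sketched in UPC itself. The array size and block size below are illustrative, and the snippet requires a UPC compiler rather than a plain C compiler:

```c
#include <upc_relaxed.h>  /* relaxed consistency; <upc_strict.h> selects strict */

#define N 1024

shared [4] double a[N];   /* block distribution: blocks of 4 elements */
shared double b[N];       /* default blocksize 1, i.e. cyclic */

int main(void) {
    int i;

    /* Thread-based division, as in OpenMP: the integer affinity
       expression i assigns iteration i to thread i % THREADS. */
    upc_forall (i = 0; i < N; i++; i)
        b[i] = 1.0;

    upc_barrier;          /* all threads synchronize here */

    /* Owner-computes division, as in HPF: the affinity expression
       &a[i] assigns iteration i to the thread that owns a[i]. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = 2.0 * b[i];

    return 0;
}
```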
While UPC’s constructs appear well suited to parallel programming, its performance has not yet lived up to expectations. One major cause for concern is address translation. Ordinary pointer manipulation requires only integer arithmetic; manipulating pointers to shared data in UPC requires considerably more computation. The runtime system does not know where a shared array’s elements lie until the translation has been performed, even when the element in question resides in local memory. Performance studies indicate that the time required to access a locally stored element through a shared array is close to the time required to retrieve a remotely stored element! To access a locally stored element more quickly, the user must cast its address to a private pointer.
Given these results, one might wonder whether an SMP or a ccNUMA machine would be a better target platform for UPC. One study on the matter did demonstrate substantially better access times when translation was forgone in favor of simple loads and stores.