The European PRACE initiative has published a Best Practices Guide for GPU Computing.
“Graphics Processing Units (GPUs) were originally developed for computer gaming and other graphical tasks, but for many years have been exploited for general purpose computing across a number of areas. They offer advantages over traditional CPUs because they have greater computational capability, and use high-bandwidth memory systems (where memory bandwidth is the main bottleneck for many scientific applications). This Best Practice Guide describes GPUs: it includes information on how to get started with programming GPUs, which cannot be used in isolation but as “accelerators” in conjunction with CPUs, and how to get good performance. Focus is given to NVIDIA GPUs, which are most widespread today.”
Many of the guide's best practice recommendations aim for performance portability:
- Keep data resident on the device, since copying data between the host and the device is a bottleneck.
- Overlap data transfers with computations on the host or the device using asynchronous data transfers.
- Use `schedule(static, 1)` for distributing threads when using the Clang compiler (this is the default for Cray, and is not supported by GCC).
- Prefer to include the most extensive combined construct relevant for the loop nest, i.e. `#pragma omp target teams distribute parallel for simd` (however, this is not available in GCC 6.1).
- Always include `parallel for`, `teams`, and `distribute`, even if the compiler does not require them.
- Include the `simd` directive above the loop you require to be vectorised.
- Neither `collapse` nor `schedule` should harm functional portability, but they might inhibit performance portability.
- Avoid setting `num_teams` and `thread_limit`, since each compiler uses different schemes for scheduling teams to a device.