“Graphics Processing Units (GPUs) were originally developed for computer gaming and other graphical tasks, but for many years they have been exploited for general-purpose computing across a number of areas. They offer advantages over traditional CPUs because they have greater computational capability and use high-bandwidth memory systems (memory bandwidth is the main bottleneck for many scientific applications). This Best Practice Guide describes GPUs: it includes information on how to get started with programming GPUs, which cannot be used in isolation but only as “accelerators” in conjunction with CPUs, and on how to get good performance. The focus is on NVIDIA GPUs, which are the most widespread today.”
Many of the following best practice recommendations aim for performance portability:
- Keep data resident on the device, since copying data between the host and the device is a bottleneck.
- Overlap data transfers with computations on the host or the device using asynchronous data transfers.
- Use schedule(static, 1) for distributing threads when using the Clang compiler (this is the default for Cray, and is not supported by GCC).
- Prefer to use the most extensive combined construct relevant for the loop nest, i.e. #pragma omp target teams distribute parallel for simd (however, this is not available in GCC 6.1).
- Always include parallel for and distribute, even if the compiler does not require them.
- Include the simd directive above the loop you require to be vectorised.
- Using schedule should not harm functional portability, but it might inhibit performance portability.
- Avoid setting thread_limit, since each compiler uses a different scheme for scheduling teams to a device.