PRACE Publishes Best Practices for GPU Computing


The European PRACE initiative has published a Best Practices Guide for GPU Computing.

“Graphics Processing Units (GPUs) were originally developed for computer gaming and other graphical tasks, but for many years have been exploited for general purpose computing across a number of areas. They offer advantages over traditional CPUs because they have greater computational capability, and use high-bandwidth memory systems (where memory bandwidth is the main bottleneck for many scientific applications). This Best Practice Guide describes GPUs: it includes information on how to get started with programming GPUs, which cannot be used in isolation but as “accelerators” in conjunction with CPUs, and how to get good performance. Focus is given to NVIDIA GPUs, which are most widespread today.”

Many of the guide's best practice recommendations aim for performance portability:

  1. Keep data resident on the device, since copying data between the host and the device is a bottleneck.
  2. Overlap data transfers with computations on the host or the device using asynchronous data transfers.
  3. Use schedule(static, 1) for distributing threads when using the Clang compiler (this is the default for Cray, and is not supported by GCC).
  4. Prefer to include the most extensive combined construct relevant for the loop nest, i.e. #pragma omp target teams distribute parallel for simd (however not available in GCC 6.1).
  5. Always include parallel for as well as teams and distribute, even if the compiler does not require them.
  6. Include the simd directive above the loop you require to be vectorised.
  7. Neither collapse nor schedule should harm functional portability, but they might inhibit performance portability.
  8. Avoid setting num_teams and thread_limit since each compiler uses different schemes for scheduling teams to a device.
