“Learn how to program NVIDIA GPUs using Fortran with OpenACC directives. The first half of this presentation will introduce OpenACC to new GPU and OpenACC programmers, providing the basic material necessary to start successfully using GPUs for your Fortran programs. The second half will be intermediate material, with more advanced hints and tips for Fortran programmers with larger applications that they want to accelerate with a GPU. Among the topics to be covered will be dynamic device data lifetimes, global data, procedure calls, derived type support, and much more.”
In this slidecast, Doug Miles from Nvidia describes the new features and performance gains in the PGI 2014 release. “The use of accelerators in high performance computing is now mainstream,” said Douglas Miles, director of PGI Software at Nvidia. “With PGI 2014, we are taking another big step toward our goal of providing platform-independent, multi-core and accelerator programming tools that deliver outstanding performance on multiple platforms without the need for extensive, device-specific tuning.”
We heard some very good reviews of a talk given by Doug Miles of The Portland Group at the bi-annual Clouds, Clusters and Data for Scientific Computing technical meeting outside of Lyon, France in mid- September. Most of the talks from that meeting are available online at the CCDSC 2012 website, but the PGI talk did not include any slides. PGI has provided the Exascale Report with a copy of the transcript from the talk, which we have reproduced here with a few minor edits.
Most of today’s CPU-only large-scale systems have a similar look-and-feel: many homogeneous nodes communicating via MPI, each node has a few identical processor chips, each chip has multiple identical cores, each core has some SIMD processing capability. Programming one such system is very like programming any other, regardless of chip vendor, number of total cores, number of cores per node, SIMD width or interconnect fabric.
Setting aside Accelerator-enabled systems for a minute, how did we get here? How did we reach this level of homogeneity from such heterogeneous HPC system roots? 25 or 30 years ago we had vector machines, VLIW machines, SMP machines, massively parallel SIMD machines, and literally scores of different instruction set architectures. How did systems become so homogeneous?