As I mentioned in a post here several weeks ago, I was struck by the diversity of vendors and partners exhibiting technology based on NVIDIA’s GPUs at ISC last month in Germany. At least 19 companies were offering products that use, facilitate, or support GPUs as part of a larger solution, including everyone from OS and library providers to HPC system vendors.
That’s a long list for a product that only relatively recently stepped onto the computation stage. Other accelerators have popped up from time to time over the past decade, but it seems to me that the development of a really diverse partner ecosystem may be one of the things that could help GPUs stick around as part of the HPC solution space. With companies integrating GPU hardware into their solutions, and other companies developing tools to make the GPUs themselves easier to use, GPUs are starting to benefit from a real network effect.
And that makes PGI’s recent announcement a good move for both them and for GPU users and providers. The Portland Group, Inc — or just PGI to their friends — is a long-time provider of compilers that focus on the HPC user community. In late June they announced that version 9.0 of their compiler suite has new capabilities that enable HPC users to get at the power of GPUs without all the pain typically associated with GPU programming.
While GPUs can offer significant speedups over more traditional CPU-only parallel programming for certain classes of application, they also present significant barriers to adoption. Even if an application is well suited to the GPU model, making effective use of the added resources on a GPU can be tough. Using tools like CUDA on NVIDIA’s GPUs requires substantial effort from application developers, who must explicitly manage the transfer of data to the GPU’s processors, fetch the answer back from the GPU, and restructure their operations to exploit the various levels of parallelism within the hardware (both vector and multiprocessor). OpenCL begins to address some of the concerns that developers have with CUDA, in that OpenCL has the potential to be supported cross-platform while CUDA is limited to NVIDIA products. From a programming point of view, however, OpenCL is still rather low level, requiring the developer to rewrite the computational kernel, allocate and explicitly manage device memory, and so on. OpenCL also does not directly address Fortran programming for accelerators, which makes it even harder to use with most HPC applications.
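To make that "substantial effort" concrete, here is a minimal sketch of what a single array computation looks like in raw CUDA C. The kernel and function names are illustrative (not from any PGI or NVIDIA example), and error checking is omitted for brevity; the point is how much explicit bookkeeping surrounds one loop.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Kernel: each GPU thread computes one element of the result.
__global__ void square_sum(const float *a, float *r, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = sinf(a[i]) * sinf(a[i]) + cosf(a[i]) * cosf(a[i]);
}

void run_on_gpu(const float *host_a, float *host_r, int n) {
    float *dev_a, *dev_r;
    size_t bytes = n * sizeof(float);

    // 1. Explicitly allocate memory on the device.
    cudaMalloc((void **)&dev_a, bytes);
    cudaMalloc((void **)&dev_r, bytes);

    // 2. Explicitly copy the input data to the GPU.
    cudaMemcpy(dev_a, host_a, bytes, cudaMemcpyHostToDevice);

    // 3. Choose a block/grid decomposition and launch the kernel.
    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    square_sum<<<blocks, threads>>>(dev_a, dev_r, n);

    // 4. Explicitly fetch the answer back from the GPU.
    cudaMemcpy(host_r, dev_r, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_r);
}
```

Every one of those numbered steps is work the developer must write and maintain by hand, and it is exactly this layer that directive-based approaches aim to push back into the compiler.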
Release 9.0 of PGI’s compiler gives programmers access to the PGI Accelerator Programming Model, an innovation that the company feels does “for GPU programming what OpenMP did for thread programming.” The idea is that programmers need only add directives to C and Fortran codes, and the compiler does the rest. Of course (and as with OpenMP), letting the compiler “do the rest” will often get you something that works and is faster than what you started with, but you may still need to dig in and help things along to get the best performance. PGI’s compiler helps you out here, too, in the form of informational messages that direct you back to the specific lines of your code where a little help from you will yield better performance.
The nice thing about this approach is that it shifts management of data movement between CPUs and GPUs — i.e., tasks not related to the actual function of the application — away from the human and back to the machine, where it belongs. It also allows developers to make incremental changes to their applications, returning to iteratively refine key sections only as needed, while not breaking the original source tree on other platform/compiler combinations: PGI’s directives appear as comments to compilers that don’t recognize them.
So, what do these directives look like? PGI has a nice section of their web site devoted to the Accelerator Programming Model, including videos that compare the performance of various codes compiled with and without GPU acceleration. But to give you an example, here is a basic routine with the accelerator directives added (the !$acc stuff):
!$acc region
   DO I = 1,N
      R(I) = SIN(A(I)) ** 2 + COS(A(I)) ** 2
   ENDDO
!$acc end region
From this, the compiler will generate both the host code and the GPU code needed for the final application. If you wanted to take things further and tune the data movement and the kernel mapping, your code for a matrix multiply might look like this:
!$acc region copyin(b(1:n,1:p),c(1:p,1:m))
!$acc& copy(a(1:n,1:m))
!$acc do parallel
   DO J = 1, M
!$acc do seq
      DO K = 1, P
!$acc do parallel, vector(128)
         DO I = 1, N
            A(I,J) = A(I,J) + B(I,K)*C(K,J)
         ENDDO
      ENDDO
   ENDDO
!$acc end region
Of course these examples are in Fortran, but analogous changes apply for C programmers. Introducing these changes needn’t break the rest of your tool chain, either: other than the addition of a new -ta=xxx (“target accelerator,” where xxx today is nvidia) argument on your PGI compile and link lines, your code deployment workflow should be unchanged. PGI’s unified binary ensures that your binary can run on non-GPU-enabled hardware as well.
Today PGI’s compiler supports NVIDIA’s GPUs, but the company believes the approach is general and doesn’t anticipate problems as it moves to support ATI cards, Cell blades, Intel’s future Larrabee platform, and others.
And for those already invested in CUDA, take heart: PGI is working on better performance for you too. Along with this announcement, PGI and NVIDIA announced that they are teaming up to jointly develop a new Fortran compiler that supports CUDA. This move will finally bring CUDA natively to the Fortran applications that still make up the lion’s share of the scientific and engineering computing code base. Today, Fortran developers have to code their CUDA kernels in C and link them to the rest of their Fortran application — not a great workflow. CUDA support is expected in PGI’s Fortran compilers by November 2009 (more in the press release).