This interview appears courtesy of The Exascale Report.
Discussions around Titan always circle back to the tools and the programming environment. As systems get more complex, we all agree there will be an increasing need for enhanced compilers and development tools. At the core of this discussion, we always hear reference to the same company – The Portland Group.
It’s no surprise to many that PGI compilers will be an integral component of the user application development environment, not just for Titan, but for many generations of systems to come.
To talk about the compiler heartbeat of this new system and the changing user environments of large petascale deployments, The Exascale Report talked with Doug Miles, Director of The Portland Group.
The Exascale Report: Most Cray systems include PGI compilers. Will PGI compilers be available on Titan?
MILES: Yes, in fact Oak Ridge upgraded some time ago to our GPU-enabled PGI Accelerator compiler suite to enable early software work and preparation for Titan. By the time Titan is installed, the latest versions of the PGI compilers, including support for AVX, the new AMD Interlagos CPUs, and the latest NVIDIA GPUs, will already be in place.
TER: Titan will represent an entirely new level of complexity in terms of the programming and application development environment. What is being done today to ensure the most efficient development environments possible, for Titan and other platforms, as we start to deploy these massive and complex systems?
MILES: You are right that this is a new level of complexity. In addition to the distributed-memory dimension and the SMP dimension, developers now need to consider the accelerator dimension when porting and optimizing applications. Fortunately, with the right compiler and tools support, it is all very tractable. PGI’s approach is based on three elements: aggressive x86 optimization and OpenMP directives for efficient use of CPU resources, GPU directives to make it easy to get started accelerating applications with GPUs, and native CUDA Fortran/C/C++ language support for full control over GPU porting and optimization. Obviously debugging will also be a key factor; companies like Allinea and TotalView are doing great work in that area, and we’ll be working with them to ensure our tools are interoperable.
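To illustrate the directive-based entry point Miles describes, here is a minimal sketch in C using OpenACC-style #pragma acc syntax, which grew out of the PGI Accelerator model; the exact pragma spellings accepted by a given PGI compiler release may differ, and the routine and variable names are hypothetical. The directive asks the compiler to generate GPU code for the loop and handle the associated data movement; removing it leaves an ordinary CPU loop.

    /* Hypothetical sketch of directive-based GPU acceleration.
       The pragma requests GPU offload of the loop and the associated
       data transfers; without it, this is an ordinary CPU loop. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma acc kernels loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }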
TER: What are the biggest considerations and challenges users will face in moving from Jaguar to Titan?
MILES: In moving from Jaguar to Titan, one factor that stands out is the sheer number of cores that will be available in the full-up Titan system – into the millions if you count each CUDA core individually. Developers and users will need to either run much larger problems or expose orders of magnitude more parallelism in their algorithms in order to use the system efficiently.
Another challenge will be to minimize the cost of data movement between CPU memory and GPU memory. We’ve added features to the PGI Accelerator compilers in the past year that enable developers to allocate data in GPU memory and re-use it across kernel invocations, and even up and down procedure call chains. That is a big step forward in enabling whole-program data management between CPU memory and device memory; it doesn’t eliminate all data movement, but it ensures that data movement can be minimized.
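As a rough sketch of the kind of data re-use Miles describes, again in OpenACC-style C syntax with hypothetical names: a data region allocates the arrays on the GPU once, the enclosed compute regions re-use that device-resident data across many kernel invocations, and the result is copied back to host memory only when the region ends.

    /* Hypothetical sketch: keep arrays resident in GPU memory across
       repeated kernel invocations instead of copying every iteration. */
    void smooth(float *a, float *b, int n, int steps)
    {
        #pragma acc data copy(a[0:n]) create(b[0:n])
        {
            for (int s = 0; s < steps; ++s) {
                #pragma acc kernels loop present(a, b)
                for (int i = 1; i < n - 1; ++i)
                    b[i] = 0.5f * a[i] + 0.25f * (a[i-1] + a[i+1]);

                #pragma acc kernels loop present(a, b)
                for (int i = 1; i < n - 1; ++i)
                    a[i] = b[i];
            }
        } /* a[] is copied back to host memory here, once */
    }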
Another issue in the move from Jaguar to Titan will be effectively using CPU and GPU resources *together*. You don’t want to write code in a way that leaves 16 very powerful CPU cores idle while the GPUs are in use. Users will need to determine which parts of their applications are most suitable for GPU acceleration, which parts are most suitable for execution on the CPU cores, and structure their codes to maximize use of all the available computing resources. When programming in CUDA (C/C++ or Fortran), this kind of overlap is supported by default in the programming model.
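As a hedged illustration of that division of labor in CUDA C (the kernel and routine names are hypothetical): a kernel launch returns control to the host immediately, so the CPU cores can process another slice of the problem while the GPU computes, with an explicit synchronization before the results are combined.

    /* Hypothetical sketch: overlap GPU and CPU work in CUDA C.
       gpu_part() and the host loop stand in for application-specific work. */
    #include <cuda_runtime.h>

    __global__ void gpu_part(float *d_data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_data[i] *= 2.0f;              /* placeholder GPU work */
    }

    void step(float *d_data, int n_gpu, float *h_data, int n_cpu)
    {
        /* The launch is asynchronous: control returns to the host at once. */
        gpu_part<<<(n_gpu + 255) / 256, 256>>>(d_data, n_gpu);

        /* Meanwhile, keep the CPU cores busy on another part of the problem,
           for example with an OpenMP-parallel loop. */
        #pragma omp parallel for
        for (int i = 0; i < n_cpu; ++i)
            h_data[i] *= 2.0f;              /* placeholder CPU work */

        /* Wait for the GPU before combining results. */
        cudaDeviceSynchronize();
    }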
The PGI Accelerator compilers will support asynchronous execution of GPU regions and CPU regions by early next year.
TER: Has the criticism regarding the difficulty in programming the GPU been fair and accurate – and what is PGI’s perspective on this?
MILES: Look, NVIDIA did a beautiful job of opening up GPUs for general-purpose parallel computation using CUDA. It was just about the right mix of API and language extensions to allow skilled programmers to extract the performance of GPUs on a wide variety of algorithms. If they had gone completely with an API approach, a la OpenCL, it would have been so low-level as to be almost unusable by scientists and engineers. If they had taken a purely high-level language-based approach out of the gate, they would have been betting the success or failure of GPGPU on some extremely complex and advanced compiler technology. I think we’ve all seen cases before where that has not panned out so well.
All that said, the barriers to getting started with GPUs can be and are being lowered rapidly. We’re at the point in the adoption curve where higher-level programming models for GPUs are appearing from all sides – PGI with OpenMP-like directives, Microsoft with C++ AMP, the OpenCL committee with their HLM initiatives, and even Google’s Renderscript for Android is touted as a way to write portable GPGPU code for mobile devices. All of these are a recognition of the fundamental value of GPGPU, and all will likely make contributions in certain disciplines to ensure that GPGPU programming becomes progressively easier and more mainstream for all types of developers.
TER: What role will PGI compilers play in creating usable, productive environments as we move to 20-petaflop and larger systems?
MILES: Back when we started working on the PGI Accelerator directive-based GPU compilers in early 2008, we made a bet that accelerators would become a mainstream technology in technical computing. Our goal in the race to 20 petaflops and beyond is to deliver compilers and tools that allow PGI users to extract the full performance potential of many-core and accelerator-based systems without getting locked into any specific architecture or platform. We do, of course, react aggressively to demand, and in that sense we put more emphasis on some platforms than others, but fundamentally PGI is agnostic regarding processors and operating systems. That objectivity is what gives PGI its unique position in the HPC market, and why we have historically been a bellwether on HPC market directions – from microprocessor-based scalable systems, to Linux/x86 clusters, to the rise of x86-64, and now toward GPGPU and beyond. I think you’ll continue to see us play that role on the inexorable march toward Exascale systems.