This week NVIDIA announced the latest update to its CUDA platform for parallel computing. To learn more, I caught up with Will Ramey, NVIDIA's Senior Product Manager for GPU Computing.
insideHPC: When we talk about a new CUDA platform, are we talking about the CUDA Toolkit plus the SDK? Does this new update have a version number?
insideHPC: Specifically, what components comprise the platform?
Will Ramey: There are 3 key components to this release (version 4.1):
- The CUDA Toolkit is a comprehensive development environment for C and C++ developers building GPU-accelerated applications. Version 4.1 of the CUDA Toolkit includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing application performance. You’ll also find programming guides, user manuals, API references, and other documentation to help programmers quickly add GPU acceleration to their applications. More info at: http://developer.nvidia.com/cuda-toolkit
- The CUDA Driver provides a system-level interface for CUDA applications to communicate with the GPUs, and is included in the NVIDIA drivers installer.
- NVIDIA also provides the GPU Computing SDK, with over 100 code samples as well as white papers, to help developers quickly add GPU acceleration to their applications. More info at: http://developer.nvidia.com/gpu-computing-sdk
Developers need to install the CUDA Toolkit to build CUDA applications, and the latest NVIDIA drivers so their applications can communicate with the GPUs in their system. Developers can also choose to install the SDK code samples to learn from the large collection of examples.
To run CUDA applications, end-users only need to install the latest NVIDIA drivers.
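As a concrete sketch of that developer workflow, a CUDA source file is compiled with the toolkit's nvcc compiler and the resulting binary talks to the GPU through the installed driver. The filename below is hypothetical, and the commands assume the CUDA Toolkit is on your PATH:

```shell
# Build a CUDA application with the toolkit's nvcc compiler
# (hypothetical source file; requires the CUDA Toolkit to be installed)
nvcc -O2 -o vector_add vector_add.cu

# Run it; at runtime the application only needs the NVIDIA driver
./vector_add
```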
insideHPC: What is new within the updated platform?
Will Ramey: In addition to the new LLVM-based compiler that delivers up to 10 percent faster performance, there are a number of significant new features in this release:
- New & Improved “drop-in” acceleration with GPU-Accelerated Libraries
  - Over 1,000 new image processing functions in the NPP library
  - New cuSPARSE tri-diagonal solver, up to 10x faster than MKL on a 6-core CPU
  - New support in cuRAND for the MRG32k3a and Mersenne Twister (MTGP11213) RNG algorithms
  - Bessel functions now supported in the standard CUDA math library
  - Up to 2x faster sparse matrix-vector multiply using the ELL hybrid format
- Enhanced & Redesigned Developer Tools
  - Redesigned Visual Profiler with automated performance analysis and an expert guidance system
  - CUDA-GDB support for multi-context debugging and assert() in device code
  - CUDA-MEMCHECK now detects out-of-bounds accesses for memory allocated in device code
  - Parallel Nsight 2.1 CUDA warp watch visualizes variables and expressions across an entire CUDA warp
  - Parallel Nsight 2.1 CUDA profiler now analyzes kernel memory activities, execution stalls, and instruction throughput
- Advanced Programming Features
  - Access to 3D surfaces and cube maps from device code
  - Enhanced no-copy pinning of system memory: cudaHostRegister() alignment and size restrictions removed
  - Peer-to-peer communication between processes
  - Support in nvidia-smi for resetting a GPU without rebooting the system
- New & Improved SDK Code Samples
  - simpleP2P sample now supports peer-to-peer communication with any Fermi GPU
  - New grabcutNPP sample demonstrates interactive foreground extraction using iterated graph cuts
  - New samples show how to implement the Horn-Schunck method for optical flow, perform volume filtering, and read cube map textures
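To make one of the new programming features concrete, here is a minimal sketch of assert() in device code, which CUDA-GDB can now trap. It assumes a Fermi-class GPU (compute capability 2.0 or later) and is illustrative rather than a complete application; error checking is omitted for brevity:

```cuda
#include <assert.h>

// Minimal kernel using device-side assert(), new in CUDA 4.1.
// If the condition fails on any thread, the kernel aborts and the
// failure can be inspected in CUDA-GDB.
__global__ void checkPositive(const int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        assert(data[i] > 0);  // traps if any element is non-positive
}

int main()
{
    const int n = 256;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i + 1;  // all positive, so the assert holds

    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    checkPositive<<<(n + 127) / 128, 128>>>(d, n);
    cudaDeviceSynchronize();  // surfaces any device-side assert failure

    cudaFree(d);
    return 0;
}
```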
insideHPC: How do the new components ease code development?
Will Ramey: The new LLVM-based compiler compiles code faster than the old compiler, increasing developer productivity. As you might expect, the compile time saved varies by application, but we’ve seen some large applications compile more than 60 minutes faster than with the old compiler.
The NVIDIA Visual Profiler has been completely re-designed to streamline developers’ performance analysis workflow. The new automated performance analysis feature quickly identifies bottlenecks and opportunities to improve application performance, and is integrated with the “Best Practices” documentation guiding developers through the process of optimizing their applications. Developers can now achieve the full potential of GPU acceleration in their application with significantly less effort.
The new image and signal processing functions in NPP make it easier for developers to accelerate more of their algorithms on the GPU.
The new tri-diagonal solver in cuSPARSE lets developers call a pre-optimized routine from the library instead of writing their own.
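As a sketch of what calling that library routine looks like, the single-precision tri-diagonal solver in cuSPARSE takes the three diagonals of the system and overwrites the right-hand side with the solution. This is illustrative only; the wrapper function name is hypothetical, all arrays are device pointers, and you should check the cuSPARSE documentation for the exact entry-point signature in your toolkit version:

```cuda
#include <cusparse_v2.h>

// Sketch: solve a tridiagonal system A*x = b with cuSPARSE's gtsv solver.
// dl, d, du hold the lower, main, and upper diagonals of A (length m each);
// b holds the right-hand side and is overwritten with the solution x.
// All pointers refer to device memory.
void solveTridiagonal(cusparseHandle_t handle, int m,
                      const float *dl, const float *d, const float *du,
                      float *b)
{
    // One right-hand side (n = 1); b serves as an m x 1 matrix with ldb = m.
    cusparseSgtsv(handle, m, 1, dl, d, du, b, m);
}
```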
insideHPC: How do the new components help speed developer code?
Will Ramey: The new LLVM-based compiler includes several new optimization techniques that allow the compiler to generate more efficient code. This is another case where the performance improvement will vary depending on the application, but we’re seeing up to 10 percent performance improvement across a variety of applications.
The new RNGs in cuRAND, the image and signal processing functions in NPP, the tri-diagonal solver in cuSPARSE, and the other library additions all give developers pre-optimized routines that exploit the hundreds of cores on the GPU.
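For example, using one of the new cuRAND generators from host code takes only a few calls. This is a minimal sketch, with error checking omitted for brevity, that fills a device buffer with uniform random numbers using the MRG32k3a generator added in 4.1:

```cuda
#include <curand.h>
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 20;  // one million samples
    float *devData;
    cudaMalloc(&devData, n * sizeof(float));

    // Create the new MRG32k3a pseudo-random generator added in CUDA 4.1
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // Fill the device buffer with uniform samples in (0, 1]
    curandGenerateUniform(gen, devData, n);

    curandDestroyGenerator(gen);
    cudaFree(devData);
    return 0;
}
```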
insideHPC: If I had the most current version of CUDA yesterday, what’s new that I can download today?
Will Ramey: Today you can download the new CUDA Toolkit, SDK code samples, and drivers. They are available for Linux, Mac OS X, and Windows.