In this talk, we introduce the rCUDA remote GPU virtualization framework, which has been shown to be the only one that supports the most recent CUDA versions, in addition to leveraging the InfiniBand interconnect for the sake of performance. We also present the latest developments within this framework, related to the use of low-power processors, enhanced job schedulers, and virtual machine environments.”
“This presentation will provide an overview of the Nvidia Tesla Deployment Kit (TDK) from both a user's and a system administrator's point of view. The TDK contains the Nvidia Management Library (NVML) and nvidia-healthmon, a tool for detecting and troubleshooting known GPU issues in a cluster environment. Usage models within a cluster environment will be presented along with a discussion on how existing resource management tools can be extended to improve allocation and accounting of GPU resources.”
“SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides workload management on many of the most powerful computers in the world and its design is very modular with dozens of optional plugins. This talk will present an overview of SLURM and an analysis of the Consumable Resource Allocation Plugin and its utilization in connection with GPUs.”
In this video from GTC 2014, Steve Oberlin from Nvidia describes his new role as Chief Technical Officer for Accelerated Computing. Along the way, he discusses the HPC lessons learned from the CRAY T3E and other systems, Nvidia’s plans to tackle the challenges of the HPC Memory Wall, the current status of Project Denver, and how Nvidia plans to couple to the POWER architecture in future systems.