In this video from the HPC Advisory Council Swiss Conference 2014, Sadaf Alam from CSCS presents: Monitoring and Management Interfaces for GPU Devices in a Cluster Environment.
This presentation will provide an overview of the Nvidia Tesla Deployment Kit (TDK) from a user and a system administrator point of view. TDL contains Nvidia Management Library (NVML) and nvidia-healthmon–a tool for detecting and troubleshooting known GPU issues in a cluster environment. Usage models within a cluster environment will be presented along with a discussion on how existing resource management tools can be extended to improve allocation and accounting of GPU resources.”
Sign up for our insideHPC Newsletter.