In this video from the Univa Breakfast Briefing at ISC 2018, Duncan Poole from NVIDIA describes how the company is accelerating HPC in the Cloud.
Today’s groundbreaking scientific discoveries are taking place in HPC data centers. Using containers, researchers and scientists gain the flexibility to run HPC application containers on NVIDIA Volta-powered systems including Quadro-powered workstations, NVIDIA DGX Systems, and HPC clusters.
The rise of AI workloads on GPU-enabled systems like RAIDEN in Japan introduces a corresponding and compelling demand for Univa Grid Engine. Aside from being the de facto standard for enterprise-class deployments of shared computational infrastructures for managed HPC and AI workloads, Univa Grid Engine delivers industry-leading integrations with NVIDIA GPUs and Docker containers:
- Through the differentiating abstraction of resource maps (RSMAPs), anything from isolated GPUs to densely packed ‘collections’ of them can be identified, used, monitored, and reported upon for their computational capabilities. Thus Deep Learning frameworks such as distributed TensorFlow can employ GPUs in executing applications and workflows.
- Through use of (optionally cached) images from the public Docker Hub or a private registry, containerized applications execute along traditional lines – meaning that they are controlled, limited, accounted for, etc., in precisely the same fashion as traditional (i.e., non-containerized) applications.
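As a rough sketch of how these two integrations come together in practice (the complex name `gpu`, the per-host device count, and the Docker image name below are illustrative assumptions – consult the Univa Grid Engine documentation for your release), an administrator defines an RSMAP complex once, and users then request GPUs and a container image at submission time:

```shell
# Administrator: define an RSMAP complex (via 'qconf -mc'), e.g.:
#   gpu   gpu   RSMAP   <=   YES   HOST   0   0
# ...and assign concrete device ids to each GPU host (via 'qconf -me <host>'):
#   complex_values   gpu=4(gpu0 gpu1 gpu2 gpu3)

# User: submit a containerized Deep Learning job that requests two GPUs
# from the map and runs inside a Docker image (image pattern is illustrative):
qsub -l gpu=2 \
     -l docker \
     -l docker_images="*tensorflow/tensorflow*" \
     -b y python train.py
```

Because the RSMAP grant names specific device ids rather than a bare count, the scheduler can account for exactly which GPUs each containerized job holds.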
In the case of workload management for RAIDEN, Univa Grid Engine delivers combined support for GPUs and Docker containers – meaning that AIP Center researchers can run their Deep Learning applications within Docker containers that make abstracted use of ‘external’ GPUs via device mappings (i.e., between a container and a physical host). To ensure highly reproducible results, these mappings can be bound in a fashion that both optimizes and guarantees allocations – even in the case of this shared environment where a multitude of AI applications compete for RAIDEN’s resources in real time.
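To illustrate how a job can honor such a binding, here is a minimal Python sketch. It assumes Univa Grid Engine exposes the granted RSMAP ids through an `SGE_HGR_<complex>` environment variable and that the site names its device ids `gpu0`, `gpu1`, … (both are assumptions; adjust to your configuration). The sketch translates the grant into `CUDA_VISIBLE_DEVICES` so the application, containerized or not, only sees its assigned devices:

```python
import os

def granted_gpus(complex_name="gpu"):
    """Translate an RSMAP grant such as 'gpu2 gpu3' into the
    comma-separated device indices CUDA expects, e.g. '2,3'.
    The SGE_HGR_<complex> variable and the 'gpuN' id format are
    assumptions; adapt to your site's RSMAP id naming."""
    grant = os.environ.get("SGE_HGR_" + complex_name, "")
    indices = [tok[len("gpu"):] for tok in grant.split() if tok.startswith("gpu")]
    return ",".join(indices)

# Pin this job (and any container it launches) to its granted devices.
os.environ["CUDA_VISIBLE_DEVICES"] = granted_gpus()
```

Setting `CUDA_VISIBLE_DEVICES` from the scheduler's grant, rather than hard-coding device indices, is what keeps results reproducible when many AI jobs share the same hosts.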
Univa Grid Engine software is an industry-leading distributed resource management (DRM) system used by hundreds of companies worldwide to build large compute cluster infrastructures for processing massive volumes of workload. A highly scalable and reliable DRM system, Grid Engine software enables companies to produce higher-quality products, reduce time to market, and streamline and simplify the computing environment.