

HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD

William Beaudin is Senior Director of Engineering at DDN.

In this video from GTC Digital, William Beaudin from DDN presents: HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD.

Enabling high performance computing with GPUs requires a very large amount of I/O to sustain application performance. This talk covers architectures that enable extremely scalable applications using NVIDIA’s SuperPOD and DDN’s A³I systems.

The NVIDIA DGX SuperPOD is a first-of-its-kind artificial intelligence (AI) supercomputing infrastructure. DDN A³I with the EXA5 parallel file system is a turnkey AI data storage infrastructure for rapid deployment, featuring faster performance, effortless scale, and simplified operations through deeper integration. The combined solution delivers groundbreaking performance, deploys in weeks as a fully integrated system, and is designed to solve the world’s most challenging AI problems.

The groundbreaking performance delivered by the DGX SuperPOD enables the rapid training of deep learning models at great scale. Creating the most accurate image classification, object detection, and natural language models requires large amounts of training data, and this data must be accessed rapidly across the entire SuperPOD. To make full use of the SuperPOD’s computational capabilities, it is essential to pair it with a storage system suited to the task.
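As a rough illustration of why storage throughput matters here, the aggregate read bandwidth a training job demands can be estimated from the number of GPUs, the rate at which each GPU consumes samples, and the size of each sample. The sketch below is a back-of-the-envelope calculation only; the function name and every input value are hypothetical placeholders, not DGX SuperPOD or AI400 specifications or measurements.

```python
def required_read_bandwidth_gbps(num_nodes, gpus_per_node,
                                 samples_per_sec_per_gpu, bytes_per_sample):
    """Estimate the aggregate read bandwidth (in GB/s) needed to keep every GPU fed.

    Assumes every sample is read from shared storage exactly once per use,
    with no local caching. All inputs are illustrative assumptions.
    """
    total_samples_per_sec = num_nodes * gpus_per_node * samples_per_sec_per_gpu
    return total_samples_per_sec * bytes_per_sample / 1e9


# Hypothetical workload: 10 nodes, 8 GPUs each, 1,000 samples/s per GPU,
# 100 KB per sample. These numbers are placeholders, not vendor figures.
demand = required_read_bandwidth_gbps(
    num_nodes=10,
    gpus_per_node=8,
    samples_per_sec_per_gpu=1000,
    bytes_per_sample=100_000,
)
print(f"Estimated aggregate read demand: {demand:.1f} GB/s")
```

Because the demand grows linearly with the node count in this simple model, a storage system that keeps a small cluster busy may fall short once the same job is spread across many more nodes.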

In this paper, the DDN A³I AI400 appliance is evaluated for its suitability to support deep learning (DL) workloads when connected to the DGX SuperPOD. The AI400 appliance is a compact, low-power storage solution that delivers high raw performance by using NVMe drives for storage and InfiniBand as its network transport. It runs the EXAScaler file system, an enterprise version of the Lustre parallel file system with increased hardening and additional data management capabilities. Parallel file systems such as Lustre simplify data access and support additional use cases where fast data access is required for efficient training and local caching is not adequate.

