How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems


DK Panda, Ohio State University

In this video from the Stanford HPC Conference, DK Panda from Ohio State University presents "How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems."

This talk will start with an overview of the challenges faced by the AI community in achieving high-performance, scalable and distributed DNN training on modern HPC systems with both scale-up and scale-out strategies. The talk will then focus on a range of solutions being carried out in my group to address these challenges. The solutions include: 1) MPI-driven Deep Learning, 2) Co-designing Deep Learning stacks with high-performance MPI, 3) Out-of-core DNN training, and 4) Hybrid (data and model) parallelism. Case studies on accelerating DNN training with popular frameworks like TensorFlow, PyTorch, MXNet and Caffe on modern HPC systems will be presented.
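To give a flavor of the first item, MPI-driven Deep Learning, here is a minimal sketch of data-parallel training with MPI: each rank computes gradients on its own mini-batch and the gradients are averaged with MPI_Allreduce before every optimizer step. This is a generic illustration of the pattern, not code from the talk; it assumes mpi4py and PyTorch are installed, and the model and data are placeholders.

```python
# Minimal sketch of MPI-driven data-parallel DNN training.
# Launch with an MPI runner, e.g.: mpirun -np 4 python train.py
# (Assumed setup: mpi4py + PyTorch; placeholder model, synthetic data.)
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

model = torch.nn.Linear(128, 10)          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

# Ensure all ranks start from identical parameters.
for p in model.parameters():
    comm.Bcast(p.detach().numpy(), root=0)

for step in range(100):
    # Each rank draws its own local mini-batch (synthetic data here).
    x = torch.randn(32, 128)
    y = torch.randint(0, 10, (32,))

    optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # Sum gradients across all ranks, then divide by the number of
    # ranks to get the average before the optimizer step.
    for p in model.parameters():
        comm.Allreduce(MPI.IN_PLACE, p.grad.numpy(), op=MPI.SUM)
        p.grad /= size

    optimizer.step()
```

The same pattern scales to GPU buffers when a CUDA-aware MPI library (such as MVAPICH2-GDR, developed by Dr. Panda's group) is used, since the allreduce can then operate directly on device memory without staging gradients through the host.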

DK Panda is a Professor and Distinguished Scholar of Computer Science at the Ohio State University. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, Omni-Path, iWARP, AWS EFA, and RoCE. His research group is currently collaborating with National Laboratories and leading InfiniBand, Omni-Path, iWARP and RoCE companies on designing various subsystems of next-generation high-end systems. The MVAPICH (High Performance MPI and MPI+PGAS over InfiniBand, iWARP and RoCE with support for GPGPUs, Xeon Phis and Virtualization) software libraries, developed by his research group, are currently being used by more than 3,075 organizations worldwide (in 89 countries). These software packages have enabled several InfiniBand clusters to get into the latest TOP500 ranking. More than 708,000 downloads of this software have taken place from the project website alone. These software packages are also available with the software stacks of network vendors (InfiniBand, Omni-Path, RoCE, AWS EFA and iWARP), server vendors (OpenHPC), and Linux distributors (such as Red Hat and SUSE).

See more talks from the Stanford HPC Conference

Check out our insideHPC Events Calendar