Scaling Deep Learning Algorithms on Extreme Scale Architectures

In this video from the MVAPICH User Group, Abhinav Vishnu from PNNL presents: Scaling Deep Learning Algorithms on Extreme Scale Architectures.

“Deep Learning (DL) is ubiquitous. Yet leveraging distributed memory systems for DL algorithms is incredibly hard. In this talk, we will present approaches to bridge this critical gap. We will start by scaling DL algorithms on large-scale systems such as leadership-class facilities (LCFs). Specifically, we will: 1) present our TensorFlow and Keras runtime extensions, which require negligible changes in user code for scaling DL implementations, 2) present communication-reducing/avoiding techniques for scaling DL implementations, 3) present approaches to fault-tolerant DL implementations, and 4) present research on semi-automatic pruning of DNN topologies. Our results will include validation on several US supercomputer sites, such as Berkeley’s NERSC, the Oak Ridge Leadership Computing Facility, and PNNL Institutional Computing. We will provide pointers and discussion on the general availability of our research under the umbrella of the Machine Learning Toolkit for Extreme Scale (MaTEx), available at http://github.com/matex-org/matex.”
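The abstract describes runtime extensions that let an ordinary TensorFlow/Keras training script run data-parallel across many nodes with negligible changes to user code. The sketch below is not MaTEx's actual API; it is a minimal illustration of the underlying data-parallel pattern the abstract refers to, assuming mpi4py and TensorFlow are installed and using synthetic data in place of a distributed data reader: every rank holds a replica of the model, trains on its own shard, and gradients are averaged with an MPI allreduce before being applied.

```python
# Minimal data-parallel training sketch (illustrative only, not the MaTEx API).
# Run with, e.g.: mpirun -np 4 python train_sketch.py
from mpi4py import MPI
import numpy as np
import tensorflow as tf

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Small Keras model; every rank holds an identical replica.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(learning_rate=0.01)

# Synchronize the initial weights so every replica starts identically.
model.set_weights(comm.bcast(model.get_weights(), root=0))

# Synthetic data, sharded by rank (a stand-in for a distributed data reader).
rng = np.random.default_rng(seed=rank)
x = rng.random((256, 32), dtype=np.float32)
y = rng.integers(0, 10, size=256)

for step in range(10):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)

    # Average gradients across all ranks before applying them.
    averaged = []
    for g in grads:
        local = g.numpy()
        summed = np.empty_like(local)
        comm.Allreduce(local, summed, op=MPI.SUM)
        averaged.append(tf.constant(summed / size))
    opt.apply_gradients(zip(averaged, model.trainable_variables))

    if rank == 0:
        print(f"step {step}: loss = {float(loss):.4f}")
```

The point of the "user-transparent" extensions described in the talk is to move this gradient-averaging and data-sharding boilerplate into the runtime, so the user's script stays close to ordinary single-node Keras code.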

Abhinav Vishnu is a chief scientist and team lead for scalable machine learning at Pacific Northwest National Laboratory. He focuses on designing extreme-scale Deep Learning algorithms capable of executing on supercomputers and cloud computing systems. His specific objectives are to design user-transparent distributed TensorFlow; novel communication-reducing/approximation techniques for DL algorithms; fault-tolerant Deep Learning/Machine Learning algorithms; and multi-dimensional deep neural networks, along with applications of these techniques in domains such as high energy physics, computational chemistry, and general computer vision tasks. His research is publicly available as the Machine Learning Toolkit for Extreme Scale (MaTEx) at http://github.com/matex-org/matex.