Scaling Deep Learning for Scientific Workloads on the #1 Summit Supercomputer

Jack Wells is the Director of Science for the Oak Ridge Leadership Computing Facility (OLCF).

In this video from GTC 2018, Jack Wells from ORNL presents: Scaling Deep Learning for Scientific Workloads on Summit.

HPC centers have traditionally been configured for simulation workloads, but deep learning is increasingly being applied alongside simulation on scientific datasets. Deep learning frameworks do not always fit well with job schedulers, large parallel file systems, and MPI backends. We’ll discuss examples of how deep learning workflows are being deployed on next-generation systems at the Oak Ridge Leadership Computing Facility. We’ll share benchmarks comparing natively compiled builds with containers on Power systems, like Summit, as well as best practices for deploying deep learning models on HPC resources for scientific workflows.

The biggest problems in science require supercomputers of unprecedented capability. That’s why the US Department of Energy’s Oak Ridge National Laboratory (ORNL) launched Summit, a system 8 times more powerful than ORNL’s previous top-ranked system, Titan. Summit is providing scientists with the computing power to tackle challenges in energy, artificial intelligence, human health, and other research areas that were simply out of reach until now. These discoveries will help shape our understanding of the universe, bolster US economic competitiveness, and contribute to a better future.

Buddy Bland shows off Summit, the world’s fastest supercomputer at ORNL.

Summit Specifications:

Application Performance: 200 PF (currently #1 on the TOP500)
Number of Nodes: 4,608
Node performance: 42 TF
Memory per Node: 512 GB DDR4 + 96 GB HBM2
NV memory per Node: 1600 GB
Total System Memory: >10 PB DDR4 + HBM2 + Non-volatile
Processors per Node (system total):
  • 2 IBM POWER9 CPUs (9,216)
  • 6 NVIDIA Volta GPUs (27,648)
File System: 250 PB, 2.5 TB/s, GPFS
Power Consumption: 13 MW
Interconnect: Mellanox EDR 100G InfiniBand
Operating System: Red Hat Enterprise Linux (RHEL) version 7.4
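
As a rough consistency check, the per-node figures in the list above can be scaled up to the full machine. The short Python sketch below does only that arithmetic; the function name `system_totals` is illustrative, and decimal units (1 PF = 1,000 TF, 1 PB = 1,000,000 GB) are an assumption, since the list does not say which convention it uses.

```python
# Figures copied from the specification list above.
NODES = 4608
CPUS_PER_NODE = 2      # IBM POWER9
GPUS_PER_NODE = 6      # NVIDIA Volta
NODE_PEAK_TF = 42      # node performance in TF
DDR4_GB = 512
HBM2_GB = 96
NVM_GB = 1600          # non-volatile memory per node


def system_totals(nodes=NODES):
    """Scale per-node figures to the whole system (decimal units assumed)."""
    return {
        "cpus": nodes * CPUS_PER_NODE,
        "gpus": nodes * GPUS_PER_NODE,
        "peak_pf": nodes * NODE_PEAK_TF / 1000,                    # TF -> PF
        "memory_pb": nodes * (DDR4_GB + HBM2_GB + NVM_GB) / 1e6,   # GB -> PB
    }


totals = system_totals()
for name, value in totals.items():
    print(f"{name}: {value}")
```

Under those assumptions the CPU and GPU totals match the list exactly (9,216 and 27,648), the summed node performance comes to about 193.5 PF, in line with the roughly 200 PF headline figure, and the combined DDR4, HBM2, and non-volatile memory comes to about 10.2 PB, consistent with the ">10 PB" entry.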

 
Jack Wells is the Director of Science for the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science national user facility and home of the Titan supercomputer, located at Oak Ridge National Laboratory (ORNL). Wells is responsible for the scientific outcomes of the OLCF’s user programs. He previously led both ORNL’s Computational Materials Sciences group in the Computer Science and Mathematics Division and the Nanomaterials Theory Institute in the Center for Nanophase Materials Sciences. Prior to joining ORNL as a Wigner Fellow in 1997, Wells was a postdoctoral fellow within the Institute for Theoretical Atomic and Molecular Physics at the Harvard-Smithsonian Center for Astrophysics. Wells has a Ph.D. in physics from Vanderbilt University and has authored or co-authored over 100 scientific papers and edited one book, spanning nanoscience, materials science and engineering, nuclear and atomic physics, computational science, applied mathematics, and novel analytics measuring the impact of scientific publications.
