1000x Faster Deep-Learning at Petascale Using Intel Xeon Phi Processors

The following guest article from Intel explores how Intel Xeon Phi processors can accelerate deep-learning training.

A cumulative, multi-year effort to scale the training of deep-learning neural networks has produced the first demonstration of petascale deep-learning training performance, and delivered that performance while solving real science problems [1]. The result reflects the combined efforts of NERSC (the National Energy Research Scientific Computing Center), Stanford, and Intel to address real-world use cases rather than simply report performance benchmarks.

Scaling deep-learning training from single-digit nodes just a couple of years back to almost 10,000 nodes now, adding up to more than ten petaflop/s is big news. – Pradeep Dubey, Intel Fellow, Director Intel Labs, Parallel Computing Lab

When discussing the recently reported 15 petaflop/s (PF/s) training performance on the Intel Xeon Phi processor-powered Cori supercomputer at NERSC, Pradeep Dubey (Intel Fellow, Director Intel Labs, Parallel Computing Lab) observed, “Scaling deep-learning training from single-digit nodes just a couple of years back to almost 10,000 nodes now, adding up to more than ten petaflop/s is big news.” This work represents a significant advance and a performance milestone: until recently, the reported scaling of deep-learning applications using TensorFlow, Caffe, and other popular packages was limited to a few tens of nodes. Dubey continued, “Our goal was to scale deep learning neural network training performance to enter the petaflop/s club.” He further observed that these results show that Intel Optimization of Caffe* – in combination with the right algorithms and software infrastructure – can deliver petascale performance using approximately 9,600 of Cori's compute nodes featuring Intel Xeon Phi processors.
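
To make the scaling pattern concrete, below is a minimal sketch of synchronous data-parallel training, the general approach behind this class of result: every node computes gradients on its own shard of the data, and an MPI allreduce averages those gradients each step so all nodes apply the same update. This is an illustrative Python/mpi4py sketch, not the paper's actual implementation (which used Intel Optimization of Caffe*); the model, data, and hyperparameters are hypothetical stand-ins.

```python
# Minimal sketch of synchronous data-parallel SGD with MPI allreduce.
# A hypothetical toy model (linear regression) stands in for the real
# climate/HEP networks; run with e.g. `mpirun -n 4 python sketch.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)  # each rank draws its own data shard
w = np.zeros(16)                        # model weights, replicated on all ranks
lr = 0.01                               # learning rate (arbitrary for the toy)
true_w = np.arange(16.0)                # ground truth the toy model should recover

for step in range(200):
    # Local minibatch on this rank's shard (synthetic data).
    X = rng.standard_normal((32, 16))
    y = X @ true_w + 0.1 * rng.standard_normal(32)

    # Local gradient of the mean-squared error.
    grad = 2.0 * X.T @ (X @ w - y) / len(y)

    # Sum gradients across all ranks, then average: one allreduce per step
    # makes the update equivalent to a single large global minibatch.
    global_grad = np.empty_like(grad)
    comm.Allreduce(grad, global_grad, op=MPI.SUM)
    w -= lr * (global_grad / nranks)

if rank == 0:
    print("max weight error:", np.abs(w - true_w).max())
```

The same script runs unchanged on one process or thousands; at Cori's scale, the paper refines this basic pattern with a hybrid synchronous/asynchronous update scheme to keep communication costs from dominating.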

The results and methodology are published in Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data [1]. Dubey believes the results in the paper can be improved by 2x or more, but he said, “A speedup from 15 PF/s to 30 PF/s is not as big a deal as the 1000x increase we already achieved through scaling.”

Putting this in perspective: achieving 1000x better scaling took multiple years of work by bright people at multiple organizations. A further 2x improvement would reflect an optimization effort rather than a new breakthrough.
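
The back-of-the-envelope arithmetic behind that perspective, using the approximate figures quoted above (the node counts and PF/s numbers are as reported; the comparison itself is a rough sketch):

```python
# Rough numbers behind "1000x from scaling vs. 2x from tuning".
nodes_then, nodes_now = 10, 9600        # single-digit nodes -> ~9,600 Cori nodes
print(f"scaling gain: ~{nodes_now / nodes_then:.0f}x")    # ~960x, i.e. roughly 1000x

pf_now, pf_claimed = 15, 30             # 15 PF/s today; Dubey's 2x headroom estimate
print(f"optimization gain: {pf_claimed / pf_now:.0f}x")   # 2x
```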

What is satisfying is that we achieved our petascale performance on real problems. – Pradeep Dubey, Intel Fellow, Director Intel Labs, Parallel Computing Lab

“What is satisfying,” Dubey states, “is that we achieved our petascale performance on real problems, not simply on benchmark workloads.” The team chose the more difficult challenge of scaling deep-learning training to the petascale on real climate and high-energy physics data sets, rather than on an artificial benchmark such as AlexNet. The choice reflects the team's desire to reach a technological performance milestone while simultaneously benefiting the scientific community.

Publication of the first petascale deep-learning results represents a phase transition for everyone interested in using deep learning, regardless of whether they run on Intel Xeon Phi or Intel Xeon processors. Dubey notes that all of the software components can run on both Intel Xeon and Intel Xeon Phi processors without modification, and can communicate over existing network fabrics such as Ethernet, InfiniBand, and Intel Omni-Path Architecture. Thus, the team's methodology can run pretty much anywhere, including in the cloud or on a local cluster.
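
One way to see why this portability is possible: with MPI-style collectives, the interconnect is chosen by the MPI runtime at launch time, not by the application code. The short, hypothetical microbenchmark below illustrates the point; it times an allreduce on a gradient-sized buffer and runs unchanged over Ethernet, InfiniBand, or Intel Omni-Path Architecture.

```python
# Hypothetical allreduce microbenchmark: the code is identical regardless of
# which network fabric the MPI runtime selects underneath it.
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

buf = np.ones(1 << 20, dtype=np.float32)   # 4 MB payload, like a slab of gradients
out = np.empty_like(buf)

comm.Barrier()                             # line up all ranks before timing
t0 = time.perf_counter()
for _ in range(10):
    comm.Allreduce(buf, out, op=MPI.SUM)
comm.Barrier()
elapsed = (time.perf_counter() - t0) / 10

if comm.Get_rank() == 0:
    print(f"mean 4 MB allreduce: {elapsed * 1e3:.2f} ms")
```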

Once the training run is specified, humans no longer participate in the training process, which means the software determines how “smart” the computer can become as a student. To use a simple analogy, the NERSC Cori supercomputer is, as of this moment, arguably the smartest deep-learning supercomputer in the world by three orders of magnitude. The approach and methodology presented in the paper can be adapted so that other Intel Xeon and Intel Xeon Phi processor-powered supercomputers can quickly become as “smart” as the Cori nodes.

Hear Narayanan Sundaram (Intel) and Thorsten Kurth (Lawrence Berkeley National Laboratory and NERSC HPC consultant) present the details at the Intel HPC Developer Conference just prior to SC17.

[1] Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data