Trains Neural Net in Record Time with Super Convergence

Print Friendly, PDF & Email

Over at the Fast.AI Blog, Jeremy Howard writes that his Startup has achieved an amazing deep learning benchmark milestone – the ability to do an ImageNet training in 3 hours for just $25. is a research lab dedicated to making deep learning more accessible, both through education, and developing software that simplifies access to current best practices. We do not believe that having the newest computer or the largest cluster is the key to success, but rather utilizing modern techniques and the latest research with a clear understanding of the problem we are trying to solve. As part of this research we recently developed a new library for training deep learning models based on Pytorch, called fastai.

Over time we’ve been incorporating into fastai algorithms from a number of research papers which we believe have been largely overlooked by the deep learning community. In particular, we’ve noticed a tendency of the community to over-emphasize results from high-profile organizations like Stanford, DeepMind, and OpenAI, whilst ignoring results from less high-status places. One particular example is Leslie Smith from the Naval Research Laboratory, and his recent discovery of an extraordinary phenomenon he calls super convergence. He showed that it is possible to train deep neural networks 5-10x faster than previously known methods, which has the potential to revolutionize the field.

In the study group we decided to focus on multi-GPU training, in order to get the fastest result we could on a single machine. In general, our view is that training models on multiple machines adds engineering and sysadmin complexity that should be avoided where possible, so we focus on methods that work well on a single machine. We used a library from NVIDIA called NCCL that works well with Pytorch to take advantage of multiple GPUs with minimal overhead.

About the Benchmark

DAWNBench is a Stanford University project designed to allow different deep learning methods to be compared by running a number of competitions. There were two parts of the Dawnbench competition that attracted our attention, the CIFAR 10 and Imagenet competitions. Their goal was simply to deliver the fastest image classifier as well as the cheapest one to achieve a certain accuracy (93% for Imagenet, 94% for CIFAR 10).

In the CIFAR 10 competition our entries won both training sections: fastest, and cheapest. Another student working independently, Ben Johnson, who works on the DARPA D3M program, came a close second in both sections.

In the Imagenet competition, our results were:

  • Fastest on publicly available infrastructure, fastest on GPUs, and fastest on a single machine (and faster than Intel’s entry that used a cluster of 128 machines!)
  • Lowest actual cost (although DAWNBench’s official results didn’t use our actual cost, as discussed below).

Overall, our findings were:

  • Algorithmic creativity is more important than bare-metal performance
  • Pytorch, developed by Facebook AI Research and a team of collaborators, allows for rapid iteration and debugging to support this kind of creativity
  • AWS spot instances are an excellent platform for rapidly and inexpensively running many experiments.

Most papers and discussions of multi-GPU training focus on the number of operations completed per second, rather than actually reporting how long it takes to train a network. However, we found that when training on multiple GPUs, our architectures showed very different results. There is clearly still much work to be done by the research community to really understand how to leverage multiple GPUs to get better end-to-end training results in practice. For instance, we found that training settings that worked well on single GPUs tended to lead to gradients blowing up on multiple GPUs. We incorporated all the recommendations from previous academic papers (which we’ll discuss in a future paper) and got some reasonable results, but we still weren’t really leveraging the full power of the machine. In the end, we found that to really leverage the 8 GPUs we had in the machine, we actually needed to give it more work to do in each batch—that is, we increased the number of activations in each layer.

Read the Full Story

Sign up for our insideHPC Newsletter