Rice Univ. Researchers Claim 15x AI Model Training Speed-up Using CPUs

Reports are circulating in AI research circles that computer scientists at Rice University have achieved an advance in model training acceleration – without using accelerators. Running AI software on commodity x86 CPUs, the Rice computer science team says neural networks can be trained up to 15x faster than platforms that use GPUs.

If valid, the new approach could be a double boon for organizations implementing AI strategies: faster model training using less costly microprocessors.

In collaboration with Intel, the biggest maker of commodity CPUs, the Rice researchers created an algorithm called the “sub-linear deep learning engine” (SLIDE) that reduces the computational overhead of model training. The standard training technique uses back-propagation, which relies on matrix multiplication – a workload for which GPUs are well suited. SLIDE, by contrast – described by the researchers as “a C++ OpenMP based system that combines smart hashing randomized algorithms with modest multi-core parallelism on CPU” – converts the training task into a search problem solved with hash tables.
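The hash-table idea can be sketched roughly as follows. This is a minimal illustration of locality-sensitive hashing (LSH) for neuron selection, not the team's actual C++/OpenMP implementation: each neuron's weight vector is indexed into a hash table, and for a given input only the neurons whose bucket matches the input's hash are activated and updated, sidestepping the full matrix multiplication. All names and parameters below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: LSH-based selection of "active" neurons,
# in the spirit of (but much simpler than) SLIDE's hashed training.
rng = np.random.default_rng(0)

DIM = 64          # input / weight dimensionality (assumed)
N_NEURONS = 1000  # neurons in one fully connected layer (assumed)
N_BITS = 8        # hash length: 2**8 = 256 possible buckets

# SimHash: the sign pattern of projections onto random hyperplanes.
# Similar vectors tend to land in the same bucket.
planes = rng.normal(size=(N_BITS, DIM))

def simhash(v):
    """Hash a vector to a bucket id via signed random projections."""
    bits = (planes @ v) > 0
    return int(np.packbits(bits)[0])  # one byte, since N_BITS == 8

# Index every neuron's weight vector into the hash table once.
weights = rng.normal(size=(N_NEURONS, DIM))
table = {}
for i, w in enumerate(weights):
    table.setdefault(simhash(w), []).append(i)

# At training time, an input activates only its bucket's neurons;
# the forward/backward pass touches this small subset instead of
# multiplying by the full N_NEURONS x DIM weight matrix.
x = rng.normal(size=DIM)
active = table.get(simhash(x), [])
activations = weights[active] @ x  # small dense product over active neurons
```

The savings come from the product at the end: with roughly uniform buckets, the dense work shrinks by about the number of buckets, which is why the approach is described as sub-linear in the layer size.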

“Hash table-based acceleration already outperforms GPU, but CPUs are also evolving,” study co-author Shabnam Daghaghi, a Rice graduate student, told Jade Boyd of Rice University in the publication Tech Xplore. “We leveraged those innovations to take SLIDE even further, showing that if you aren’t fixated on matrix multiplications, you can leverage the power in modern CPUs and train AI models four to 15 times faster than the best specialized hardware alternative.”

“Our tests show that SLIDE is the first smart algorithmic implementation of deep learning on CPU that can outperform GPU hardware acceleration on industry-scale recommendation datasets with large fully connected architectures,” Anshumali Shrivastava, an assistant professor in Rice’s Brown School of Engineering, told Science Daily last month.

In a paper published earlier this year, the research team attributes the speed-up of SLIDE to “several opportunities available in modern CPUs. In particular, we show how SLIDE’s computations allow for a unique possibility of vectorization via AVX (Advanced Vector Extensions)-512. Furthermore, we highlight opportunities for different kinds of memory optimization and quantizations. Combining all of them, we obtain up to 7x speedup in the computations on the same hardware. Our experiments are focused on large (hundreds of millions of parameters) recommendation and NLP models.”

Study co-author Nicholas Meisburger, a Rice undergraduate, said, “CPUs are still the most prevalent hardware in computing. The benefits of making them more appealing for AI workloads cannot be understated.”

Shrivastava said the researchers conducted a comparison test, training the same model on GPU and CPU hardware. “We have one in the lab, and in our test case we took a workload that’s perfect for V100 (GPU), one with more than 100 million parameters in large, fully connected networks that fit in GPU memory,” he told Science Daily. “We trained it with the best (software) package out there, Google’s TensorFlow, and it took 3 1/2 hours to train.

“We then showed that our new algorithm can do the training in one hour, not on GPUs but on a 44-core Xeon-class CPU,” he said.

The Rice research team and Intel presented their findings earlier this month at the machine learning conference MLSys.

Comments

  1. is this a fair comparison? have they applied similar optimizations to the GPU?