TensorFlow Deep Learning Optimized for Modern Intel Architectures

Sponsored Post

Researchers at Google and Intel recently collaborated to extract the maximum performance from Intel® Xeon and Intel® Xeon Phi processors running TensorFlow*, a leading deep learning and machine learning framework. This effort resulted in significant performance gains and leads the way for ensuring similar gains from the next generation of products from Intel.

Optimizing a Deep Neural Network (DNN) framework such as TensorFlow presents challenges not unlike those encountered with more traditional High Performance Computing applications for science and industry:

  • Rewrite code to take advantage of the vector instructions in the latest architectures (Intel AVX2 on Intel Xeon processors and Intel AVX-512 on Intel Xeon Phi processors).
  • Use all available cores efficiently by ensuring parallelization both within an operation and across layers (see the threading sketch after this list).
  • Make data available when needed by employing prefetching, cache blocking, and data layouts that promote locality and minimize latency.
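
For the parallelization point above, TensorFlow itself exposes two thread pools whose sizes can be matched to the core count of an Intel Xeon or Intel Xeon Phi system. Below is a minimal sketch using the TensorFlow 1.x session configuration; the thread counts are placeholder values for illustration, not recommendations from this work.

    # Minimal sketch (TensorFlow 1.x API): steer the two TensorFlow thread pools.
    # The counts below are placeholders to be tuned for a specific system.
    import tensorflow as tf

    config = tf.ConfigProto(
        intra_op_parallelism_threads=44,  # threads used inside a single op (e.g. one convolution)
        inter_op_parallelism_threads=2)   # independent ops that may run concurrently

    with tf.Session(config=config) as sess:
        pass  # build and run the model as usual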

To address these challenges for the AI community, and for deep learning frameworks in particular, Intel has developed a set of primitives, optimized for the latest Intel architectures, that implement algorithms commonly used in DNN applications; a short sketch after the list below shows how they correspond to ordinary TensorFlow operations. In addition to high-performing matrix multiplication and convolution routines, the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) includes:

  • Direct batched convolution
  • Inner product
  • Pooling: maximum, minimum, average
  • Normalization: local response normalization across channels (LRN), batch normalization
  • Activation: rectified linear unit (ReLU)
  • Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale.
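
The operations in an ordinary TensorFlow model map naturally onto these primitives. The short sketch below (TensorFlow 1.x API, with arbitrary illustrative shapes) builds a convolution, ReLU activation, max pooling, and LRN layer; in an MKL-enabled TensorFlow build, these are the kinds of operations that can be executed by the corresponding Intel MKL-DNN primitives.

    # Illustrative only: a few common layers whose operations correspond to the
    # Intel MKL-DNN primitives listed above. Shapes are arbitrary examples.
    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 224, 224, 3])     # NHWC image batch
    w = tf.Variable(tf.random_normal([3, 3, 3, 64]))        # 3x3 convolution filters

    conv = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')  # direct batched convolution
    act = tf.nn.relu(conv)                                           # ReLU activation
    pool = tf.nn.max_pool(act, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='VALID')     # max pooling
    norm = tf.nn.lrn(pool)                                           # local response normalization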

Researchers optimizing TensorFlow started by refactoring the code to leverage the Intel MKL-DNN primitives wherever possible, which enabled scalable performance on the target Intel architectures. But the data layouts used by TensorFlow also had to be converted to take full advantage of the Intel MKL-DNN primitives. Methods were developed to minimize the overhead of these internal format conversions without requiring TensorFlow users to change their existing models.
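
As a simplified illustration of what a layout conversion involves, the sketch below moves an image batch from TensorFlow's default NHWC layout (batch, height, width, channels) to NCHW with a plain transpose. The blocked layouts used internally by Intel MKL-DNN are more elaborate than this, but the cost is the same in kind: every conversion copies the tensor, which is exactly the overhead the researchers worked to minimize.

    # Simplified example of a data layout conversion; the real MKL-DNN layouts are
    # blocked formats, but any conversion implies copying the data.
    import numpy as np

    batch_nhwc = np.zeros((32, 224, 224, 3), dtype=np.float32)
    batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))  # NHWC -> NCHW
    print(batch_nchw.shape)  # (32, 3, 224, 224)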

Integrating the Intel MKL-DNN primitives into the TensorFlow code made possible a number of optimizations that provide performance gains without placing any additional burden on TensorFlow users. In particular, the native TensorFlow data format was found not to be the most efficient layout for CPU utilization in certain tensor operations. Adding conversion operations introduces a performance overhead, so such conversions should be done sparingly. A data layout optimization was therefore developed that identifies sub-graphs whose operations can all be performed using Intel MKL-DNN primitives without conversion; automatically inserted conversion nodes handle the data layout conversions at the boundaries of each such sub-graph.
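
The sketch below is illustrative pseudocode only, not TensorFlow's actual rewrite pass; the op set and function names are invented for the example. It shows the underlying idea: find the graph edges where data crosses between operations that have Intel MKL-DNN implementations and those that do not, because those boundary edges are where conversion nodes would be inserted.

    # Illustrative pseudocode for the sub-graph idea described above.
    # 'MKL_OPS' and 'plan_conversions' are invented names, not TensorFlow APIs.
    MKL_OPS = {'Conv2D', 'Relu', 'MaxPool', 'LRN'}

    def plan_conversions(graph):
        """graph: ordered list of (name, op_type, inputs). Returns the edges
        that would need a layout-conversion node between producer and consumer."""
        op_type = {name: t for name, t, _ in graph}
        conversions = []
        for name, t, inputs in graph:
            for src in inputs:
                # a conversion is needed only where an edge crosses the boundary
                # between MKL-DNN-format ops and native-format ops
                if (op_type[src] in MKL_OPS) != (t in MKL_OPS):
                    conversions.append((src, name))
        return conversions

    # Conv2D -> Relu stay in the MKL-DNN layout; only the edge into the first
    # supported op and the edge out of the last one need conversions.
    g = [('input',  'Placeholder', []),
         ('conv',   'Conv2D',      ['input']),
         ('relu',   'Relu',        ['conv']),
         ('logits', 'MatMul',      ['relu'])]
    print(plan_conversions(g))  # [('input', 'conv'), ('relu', 'logits')]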

Other optimizations to TensorFlow components resulted in significant CPU performance gains for various deep learning models. Using the Intel MKL imalloc routine, TensorFlow and the Intel MKL-DNN primitives share the same memory pools, avoiding premature returns of memory and costly page misses. Threading was also tuned so that the pthreads used by TensorFlow and the OpenMP threads used by the Intel MKL routines coexist rather than compete for CPU resources.
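
As an illustration of that kind of tuning, the Intel OpenMP runtime used by Intel MKL is commonly controlled through environment variables such as OMP_NUM_THREADS, KMP_BLOCKTIME, and KMP_AFFINITY. The values below are placeholders to be tuned per system, not settings taken from this work.

    # Example OpenMP environment settings for an MKL-backed TensorFlow run.
    # Values are placeholders; set them before TensorFlow is imported.
    import os

    os.environ['OMP_NUM_THREADS'] = '44'   # e.g. one OpenMP thread per physical core
    os.environ['KMP_BLOCKTIME'] = '0'      # release cores promptly after MKL parallel regions
    os.environ['KMP_AFFINITY'] = 'granularity=fine,compact,1,0'  # pin threads to cores

    import tensorflow as tf  # import after the environment is configured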

Optimizing TensorFlow with the Intel MKL-DNN primitives means that deep learning applications built using this widely available framework can now run much faster on the latest Intel processors. The scalability of the Intel Xeon Phi processor enables applications to scale out in a near-linear fashion across cores and nodes, reducing the time needed to train machine learning models. As a result, TensorFlow can scale with future performance advancements to handle even bigger and more challenging AI workloads for business, science, engineering, medicine, and society.

The Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) is part of the Intel® Math Kernel Library (Intel® MKL) and is included in Intel® Parallel Studio XE 2017.

Download your free 30-day trial of Intel® Parallel Studio XE 2017.