This sponsored post from Intel explores how to effectively train and execute machine learning and deep learning projects on CPUs.
Whatever the platform, getting the best possible performance out of an application always presents big challenges. This is especially true when developing AI and machine learning applications on CPUs. As with developing any high-performance application, we need to be aware of the unique optimization features that the latest processors bring to the table. If we don’t take advantage of the advanced performance features of the latest architectures, our applications will not achieve the performance we expect.
In a recent article in The Parallel Universe magazine, developers at Intel describe how the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) accelerates machine learning apps built within popular deep learning frameworks such as Caffe, TensorFlow and MXNet. The library provides a number of commonly used deep learning primitives that have been highly optimized for the latest Intel platforms.
Optimizing these primitives involves, for example:
- Refactoring the code to take advantage of modern vector instructions. This means making sure that the most important primitives, such as convolution, matrix multiplication, and batch normalization, are vectorized for the latest SIMD instructions (AVX2 for Intel Xeon processors and AVX-512 for Intel Xeon Phi processors).
- Using all the available cores efficiently, which is required for maximum performance. That means obtaining high levels of parallelization within a given layer or operation as well as across layers.
- Using prefetching, cache blocking techniques, and data formats that promote spatial and temporal locality so that data is always available when the execution units need it.
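To make the cache blocking idea concrete, here is a minimal sketch of a tiled matrix multiplication. It is an illustration only: the function name `matmul_blocked` and the tile size are hypothetical, and this is not how MKL-DNN is implemented. In pure Python the tiling brings no real speedup; the point is the loop structure, which in optimized native code keeps each tile resident in cache while it is reused.

```python
def matmul_blocked(a, b, block=2):
    """Multiply a (n x k) by b (k x m), visiting the data one tile at a time.

    Hypothetical helper for illustration; 'block' is the tile edge length.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):              # tile over rows of a and c
        for p0 in range(0, k, block):          # tile over the shared dimension
            for j0 in range(0, m, block):      # tile over columns of b and c
                # Work inside one tile: the same small pieces of a, b and c
                # are touched repeatedly before moving on (temporal locality).
                for i in range(i0, min(i0 + block, n)):
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        # Innermost loop walks b and c row-wise (spatial locality).
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ip * b[p][j]
    return c
```

The result is the same as that of an untiled triple loop; only the order in which memory is visited changes.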
Frameworks that are built and linked with MKL-DNN automatically take advantage of the hardware and software optimizations for the latest Intel Xeon Scalable processors, without code modifications.
The optimized features of MKL-DNN include:
The open source MKL-DNN library provides C and C++ interfaces to vectorized and threaded building blocks that you can use to implement a number of deep neural network functions. The library can be installed from the source code distribution on GitHub, where it is available under the Apache License Version 2.0. See the latest build instructions for Linux, macOS, and Windows. MKL-DNN is fully integrated with the Intel VTune Amplifier performance analyzer.
There are individual MKL-DNN installation and optimization guides for TensorFlow, Caffe, and MXNet.
Intel-optimized TensorFlow is distributed through pip, Anaconda, and Docker channels, and can also be built directly from source. Some of the optimizations specific to TensorFlow on Intel architectures are described in a separate article.
For Caffe, Intel provides a tutorial describing how to use Intel Optimization for Caffe to build Caffe optimized for Intel architecture, train deep network models using one or more compute nodes, and deploy networks.
For MXNet, Intel has a tutorial explaining Intel Optimization for Apache MXNet.
The Parallel Universe magazine article also gives tips on the best runtime option settings for these deep learning frameworks. For example, various runtime options can greatly affect TensorFlow performance, so understanding them helps ensure you get the best performance out of Intel’s optimizations. Data layout (how multidimensional arrays are stored and accessed in memory) and the efficient use of cache also greatly impact overall performance.
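A small sketch can show why data layout matters. It compares two layouts commonly used by deep learning frameworks for a 4-D tensor, NCHW (batch, channel, height, width) and NHWC (batch, height, width, channel). The helper names are hypothetical, and the choice of these two layouts is an assumption for illustration; the text above does not say which layout the article recommends.

```python
def offset_nchw(n, c, h, w, shape):
    """Flat memory offset of element (n, c, h, w) when stored as NCHW."""
    _, C, H, W = shape
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, shape):
    """Flat memory offset of the same element when stored as NHWC."""
    _, C, H, W = shape
    return ((n * H + h) * W + w) * C + c


# Logical shape: 1 image, 3 channels, 2 x 2 pixels.
shape = (1, 3, 2, 2)

# Distance in memory between two horizontally adjacent pixels of one channel.
step_w_nchw = offset_nchw(0, 0, 0, 1, shape) - offset_nchw(0, 0, 0, 0, shape)
step_w_nhwc = offset_nhwc(0, 0, 0, 1, shape) - offset_nhwc(0, 0, 0, 0, shape)

# Distance in memory between two adjacent channels of one pixel.
step_c_nchw = offset_nchw(0, 1, 0, 0, shape) - offset_nchw(0, 0, 0, 0, shape)
step_c_nhwc = offset_nhwc(0, 1, 0, 0, shape) - offset_nhwc(0, 0, 0, 0, shape)
```

In NCHW, neighboring pixels of a channel sit next to each other in memory, while in NHWC the channels of a single pixel do. Which stride pattern a primitive walks in its inner loop determines how well it uses each cache line.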
To express parallelism, MKL-DNN uses OpenMP directives, controlled by various environment variables. The article recommends several settings to optimize the performance of the OpenMP runtime library beyond the default values.
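As a rough sketch of what such tuning can look like, the snippet below sets a few OpenMP-related environment variables from Python before a framework is imported. `OMP_NUM_THREADS` is a standard OpenMP variable, and the `KMP_*` variables belong to the Intel OpenMP runtime. The specific values, and the helper names `suggested_openmp_env` and `apply_env`, are assumptions for illustration, not the settings recommended in the article; suitable values depend on the machine and the workload.

```python
import os


def suggested_openmp_env(physical_cores):
    """Return illustrative OpenMP settings (assumed values, not the article's)."""
    return {
        # One OpenMP thread per physical core is a common starting point.
        "OMP_NUM_THREADS": str(physical_cores),
        # Milliseconds a thread waits after a parallel region before sleeping.
        "KMP_BLOCKTIME": "1",
        # Pin threads to cores so they do not migrate between them.
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
        # Ask the Intel OpenMP runtime to print its settings at startup.
        "KMP_SETTINGS": "1",
    }


def apply_env(settings, environ=os.environ):
    """Apply settings without overriding values the user has already set."""
    for name, value in settings.items():
        environ.setdefault(name, value)
    return environ
```

Call `apply_env(suggested_openmp_env(n))` before importing the deep learning framework, because the OpenMP runtime reads these variables when it initializes. Measure with and without each setting rather than assuming it helps.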
Significant performance gains are achievable by using the Intel-optimized frameworks to accelerate your deep learning workloads on the CPU.
Free download: Intel Math Kernel Library for Deep Neural Networks.