With the advent of massively parallel coprocessors, numerical optimization for deep learning can now be performed at scale. Complex real-time pattern recognition, such as that used for self-driving cars and augmented reality, can be developed and run at high performance using specialized, highly tuned libraries. With nothing more than the Message Passing Interface (MPI) API, very high performance can be attained on hundreds to thousands of Intel Xeon Phi coprocessors.
Data-driven model fitting can be phrased as a form of function optimization in which a set of model parameters is adjusted to fit a data set with minimum error.
An example is the least mean squares (LMS) objective function. Parallelizing an objective function can yield tremendous performance increases, and some objective functions exhibit very strong scaling. Because most objective functions are floating-point intensive and can be vectorized, these algorithms can take advantage of both Intel Xeon processors and Intel Xeon Phi coprocessors. The combination of scalable performance and efficient vectorization allows for large performance improvements over a single-threaded, scalar implementation.
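To make the vectorization point concrete, here is a minimal sketch (not code from the book) of an LMS-style error computation for a hypothetical linear model y ≈ w·x + b; the function and variable names are illustrative. The error loop is floating-point intensive and vectorizes cleanly, for example with OpenMP SIMD, on both Intel Xeon processors and Intel Xeon Phi coprocessors.

```cpp
#include <cstddef>
#include <vector>

// Minimal LMS-style objective for a hypothetical linear model y ~= w*x + b.
// The model and names are illustrative; the point is that the error loop is
// floating-point intensive and vectorizes cleanly on Xeon and Xeon Phi.
double lms_error(const std::vector<float>& x,
                 const std::vector<float>& y,
                 float w, float b)
{
    double sum = 0.0;
#pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float diff = (w * x[i] + b) - y[i];  // prediction minus target
        sum += static_cast<double>(diff) * diff;   // accumulate squared error
    }
    return sum;
}
```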
One example of such an algorithm is a neural network training code. The following techniques were used to improve the performance of the original application code.
- Naïve implementation – map each node in the neural network to its own processor. This approach demands a great deal of memory bandwidth and network communication.
- Improved method – map multiple network nodes to the same processor, which reduces network communication (see the sketch after this list).
- Farber mapping – use multiple Intel Xeon Phi coprocessors.
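The following is an illustrative sketch (not the original application code) of the improved mapping: each MPI rank owns a contiguous block of the network's nodes, so most node-to-node traffic stays on-processor and only block boundaries require network communication. The network size and block decomposition are assumptions made for the example.

```cpp
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int num_network_nodes = 1 << 20;  // hypothetical network size
    const int base  = num_network_nodes / nranks;
    const int rem   = num_network_nodes % nranks;
    const int first = rank * base + (rank < rem ? rank : rem);  // block start
    const int count = base + (rank < rem ? 1 : 0);              // block length

    // ... allocate and update network nodes [first, first + count) locally ...
    (void)first;
    (void)count;

    MPI_Finalize();
    return 0;
}
```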
Linear speedup out to 3,000 Intel Xeon Phi coprocessor nodes has been measured on the TACC Stampede system. This level of scaling was achieved using the following steps.
- Broadcast the parameters to all processors; the MPI broadcast was found to be extremely efficient.
- Have each MPI node calculate a partial error for the data subset contained in local memory.
- Use as few MPI calls as possible and keep the processors busy (see the sketch after this list).
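Below is a minimal MPI sketch of this broadcast, partial-error, and reduce pattern. It is not the Stampede code itself; the two-parameter linear model, the local data vectors, and the function names are placeholders invented for the example, and the reduction assumes the partial errors simply sum.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Placeholder local data subset held in each rank's memory, and a
// hypothetical LMS partial error over it for a two-parameter model
// y ~= p[0]*x + p[1]. A real code would score its resident training data.
static std::vector<float> local_x, local_y;

static double partial_error(const std::vector<double>& p)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < local_x.size(); ++i) {
        const double diff = p[0] * local_x[i] + p[1] - local_y[i];
        sum += diff * diff;  // squared error for this local example
    }
    return sum;
}

// One objective evaluation following the steps above: broadcast the current
// parameters, let every rank score its local subset, then combine the partial
// errors with a single reduction. Only two MPI calls are needed per evaluation.
double evaluate_objective(std::vector<double>& params, int root, MPI_Comm comm)
{
    MPI_Bcast(params.data(), static_cast<int>(params.size()),
              MPI_DOUBLE, root, comm);                            // broadcast
    const double local = partial_error(params);                   // partial error
    double total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, comm);  // reduce
    return total;
}
```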
While training is computationally expensive, the process maps well onto a large cluster with SIMD hardware. Neural networks can be designed to take advantage of petascale computing power using tools that are easy to obtain.
Source: Rob Farber from TechEnablement.com, as part of the new book, High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches.