In this video from the Intel HPC Developer Conference, Ananth Sankaranarayanan from Intel describes how the company is optimizing machine learning frameworks for Intel platforms. Open-source frameworks are often not optimized for a particular chip, but bringing Intel's developer tools to bear can result in significant speedups.
“The availability of big data, coupled with the fast evolution of better hardware, smarter algorithms, and optimized software frameworks, is enabling organizations to create unique opportunities in machine learning and deep learning analytics for competitive advantage, impactful insights, and business value. Caffe, developed by the Berkeley Vision and Learning Center (BVLC), is one of the most popular open-source frameworks for deep learning applications in image recognition, natural language processing (NLP), automatic speech recognition (ASR), video classification, and other domains of artificial intelligence. Intel has contributed extensively to an optimized fork of Caffe for Intel Xeon, Xeon Phi, and Xeon+FPGA processors. Convolutional Neural Networks (CNNs) are widely used in deep learning training for image recognition to build an accurate model, which can then be used for scoring in applications such as Advanced Driver Assistance Systems (ADAS) for driverless vehicles in the automotive industry, as well as in medicine, finance, and other fields. Deep learning training is highly compute-intensive and can take a very long time, from days to multiple weeks, on large datasets. For meaningful impact and business value, organizations require that the time to train a deep learning model be reduced from weeks to hours. In this talk, we will present the details of the optimization and characterization of Intel-Caffe and the support for new deep learning convolutional neural network primitives in the Intel Math Kernel Library. We will present performance data for deep learning training for image recognition showing a >24X speedup with a single Xeon Phi 7250 compared to BVLC Caffe. In addition, we will present performance data showing that training time is further reduced, a 40X speedup, with a 128-node Xeon Phi cluster over the Omni-Path Architecture. Furthermore, we will present data showing a >17X speedup for image scoring with a 2P Xeon E5-2699 v4.
These performance results were critical components of the KNL (Knights Landing) launch, generating very strong interest in Xeon / Xeon Phi for ML/DL using Intel-Caffe and displacing Nvidia as the only performant solution.”
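To make the Intel-Caffe optimization concrete: in Caffe, each layer in the network definition prototxt can select a compute engine, and Intel's fork adds MKL-backed engines for CNN primitives such as convolution. The sketch below is hypothetical and illustrative only; the exact engine names and syntax depend on the version of the Intel-Caffe fork.

```
# Illustrative Caffe prototxt fragment (assumed syntax for Intel's fork).
# The "engine" field routes the convolution to Intel MKL DNN primitives
# instead of the default CPU implementation.
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 64    # number of output feature maps
    kernel_size: 3
    stride: 1
    engine: MKL2017   # hypothetical: MKL-optimized engine in the Intel fork
  }
}
```

With the rest of the network unchanged, swapping the engine selection is how the fork achieves its speedups without requiring users to rewrite their models.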