Accelerate Your Apache Spark with Intel Optane DC Persistent Memory


Piotr Balcer from Intel

In this video from the 2019 Spark+AI Summit, Piotr Balcer and Cheng Xu from Intel present: Accelerate Your Apache Spark with Intel Optane DC Persistent Memory.

Data volumes are growing rapidly in the big data area, and analytics jobs consume more and more memory, both for computation and for holding intermediate data. For these memory-intensive workloads, end users have to either scale out the compute cluster or extend memory with storage such as HDD or SSD to meet the requirements of their computing tasks. Scaling out the cluster adds cluster management, operation, and maintenance costs, which raise the total cost if the extra CPU resources are not fully utilized. To address this shortcoming, Intel Optane DC persistent memory (Optane DCPM) breaks the traditional memory/storage hierarchy and scales up the computing server with higher-capacity persistent memory. It also delivers higher bandwidth and lower latency than storage such as SSD or HDD. Apache Spark is widely used for analytics such as SQL and machine learning in cloud environments. In the cloud, the low performance of remote data access is typically a bottleneck for users, especially for I/O-intensive queries. ML workloads are iterative, so I/O bandwidth is key to end-to-end performance. In this talk, we will introduce how to accelerate Spark SQL with OAP to achieve an 8X performance gain in the cloud, and how to use the RDD cache to improve K-means performance by 2.5X, both leveraging Intel Optane DCPM. We will also take a deep dive into how Optane DCPM delivers these performance gains.


Cheng Xu is a senior architect on the Intel Big Data team. He has worked in the big data area for more than 6 years. Currently, his focus is on IA optimization for data analytics. Before that, he was a key contributor to the Intel(R) Distribution for Apache Hadoop.

Piotr Balcer is a software engineer with many years of experience working on storage-related technologies at Intel Corporation. He received a B.Eng. from the Gdansk University of Technology in 2014, where he studied system software engineering. For four years now he has been working on the software ecosystem for next-gen non-volatile memory.
