High Performance Big Data Computing Using Harp-DAAL


Sponsored Post

Many businesses are beginning to rely on large-scale data analytics for greater insight into their customers’ behavior and their business requirements. Simplifying the process so that a wider range of employees can draw conclusions from massive amounts of data is important and can lead to more profits and better customer service.  Harp-DAAL is a framework developed at Indiana University that brings together the capabilities of big data (Hadoop) and techniques that have previously been adopted for high performance computing.  With the two combined, employees can become more productive and gain deeper insights into massive amounts of data.

Modern analytics systems are clusters of independent nodes that need to be synchronized in order to make sense of all of the data. Harp-DAAL uses the Intel Data Analytics Acceleration Library (Intel DAAL), which has been previously written about in this column. Intel DAAL is optimized for a variety of underlying architectures and specifically takes advantage of the Intel Xeon CPU and the Intel Xeon Phi processor. High-level tools can be written to orchestrate the exchange of data between nodes, while Intel DAAL accelerates the processing on each node.


Data-intensive computing can be categorized into five different models: sequential; three centralized batch architectures (Map-Only, MapReduce, and Iterative MapReduce); and the HPC-influenced Message Passing Interface (MPI) distributed model. Harp gives developers and users the ability to use all of these classes of data-intensive computation. In addition, Harp is a modular software stack that allows developers to include machine learning directly in their big data analytics applications.  Because Intel DAAL does the work at the lower level, many types of algorithms can be accelerated, both for learning more about the available data and for incorporating predictive analytics.

Intel DAAL provides a native C/C++ API but also provides interfaces to higher-level programming languages such as Java* and Python*. Harp is written in Java and extended from the Hadoop ecosystem, so Java was the natural choice to interface Harp and Intel DAAL.

K-means is a widely used and relatively simple clustering algorithm that provides a clear example of how to use Harp-DAAL. K-means uses cluster centers to model data and converges quickly via iterative refinement. In the example, K-means clustering was performed on a large image dataset from Flickr*, which includes 100 million images, each with 4,096-dimensional deep features extracted using a deep convolutional neural network model trained on ImageNet*. Data preprocessing includes format transformation and dimensionality reduction from 4,096 to 128 using Principal Component Analysis (PCA). (From Intel Parallel Universe Issue 32)

The steps needed to implement this K-means example are as follows:

Step 1: Load Training Data (Feature Vectors) and Model Data (Cluster Centers)

Step 2: Convert Training Data from Harp to Intel DAAL

Step 3: Create and Set Up an Intel DAAL K-means Kernel

Step 4: Convert Center Format from Harp to Intel DAAL

Step 5: Local Computation by Intel DAAL Kernel

Step 6: Inter-Mapper Communication

Step 7: Release Memory and Store Cluster Centers
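
The flow of these steps can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not Harp-DAAL code: it does not call the Harp or Intel DAAL APIs, the function names (local_computation, allreduce, kmeans) are hypothetical, each "mapper" is simulated as a list partition in a single process, and the format conversions in Steps 2 and 4 are skipped because there is no DAAL kernel to hand the data to. The sample data and starting centers are made up for the example.

```python
import random


def local_computation(points, centers):
    """Step 5 stand-in: one mapper computes partial sums and counts per cluster."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        # Assign the point to its nearest center (squared Euclidean distance).
        best = min(
            range(len(centers)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
        )
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]
    return sums, counts


def allreduce(partials):
    """Step 6 stand-in: combine every mapper's partial results element-wise."""
    k = len(partials[0][1])
    dim = len(partials[0][0][0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for c in range(k):
            total_counts[c] += counts[c]
            for d in range(dim):
                total_sums[c][d] += sums[c][d]
    return total_sums, total_counts


def kmeans(partitions, centers, iterations=10):
    """Iterate: local computation on each partition, then a global combine."""
    for _ in range(iterations):
        partials = [local_computation(part, centers) for part in partitions]
        sums, counts = allreduce(partials)
        # New center = mean of assigned points; keep the old center if none.
        centers = [
            [s / counts[c] for s in sums[c]] if counts[c] else centers[c]
            for c in range(len(centers))
        ]
    return centers  # Step 7 would store these cluster centers


# Step 1 stand-in: synthetic feature vectors (two blobs) and initial centers.
rng = random.Random(0)
data = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(60)]
data += [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(60)]
rng.shuffle(data)

partitions = [data[i::3] for i in range(3)]  # three simulated mappers
initial_centers = [[1.0, 1.0], [4.0, 4.0]]
final_centers = kmeans(partitions, initial_centers)
```

The point of the design is that each mapper only ever sends per-cluster sums and counts, never its raw feature vectors, so the amount of data exchanged between nodes depends on the number of clusters and dimensions rather than on the size of the dataset. In Harp-DAAL, the local step would be handled by an Intel DAAL kernel and the combine step by Harp's communication layer.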

Benchmarks have shown a 30X improvement of Harp-DAAL over Spark for K-means on 30 nodes using highly vectorized kernels within Intel DAAL. The current Harp-DAAL system provides 13 distributed data analytics and machine learning algorithms that leverage local computation kernels, such as K-means, from the Intel DAAL 2018 release. In addition, Harp-DAAL is developing its own data-intensive kernels. These include a large-scale subgraph counting algorithm, which can process a Twitter social network graph with billions of edges and subtemplates of 10 vertices in 15 minutes. The Harp-DAAL framework and machine learning algorithms are publicly accessible, so you can download the software, explore the tutorials, and apply Harp-DAAL to other data-intensive applications.

Download Intel® Data Analytics Acceleration Library for free.