In this week’s Sponsored Post, Intel and Tata demonstrate how FSI applications can run efficiently on an HPC architecture. This is the third in a series of articles on “6 Things You Should Know About Lustre.” Other topics cover Lustre in Enterprise, the Cloud, Financial Services, and next-generation storage.
Across industries, companies are beginning to watch the convergence of High-Performance Computing (HPC) and Big Data. Many organizations in the Financial Services Industry (FSI) run their financial simulations on business analytics systems, and some run them on HPC clusters. But they have a growing problem: integrating analytics of unstructured data from sources like social media with their internal data.
“The information from these outside sources can have an important impact on their businesses,” says Ute Gojrzewski from Intel’s High Performance Data Division (HPDD). “They need to be able to correlate this content with their SQL data.”
“For example, insurance companies are watching Google Car,” says Gabriele Paciucci, a Solutions Architect with Intel’s HPDD, “and listening to what people are saying on social media about it and services like Uber.” They are analyzing how these new services could affect their long-term revenues. “During the China crisis recently,” adds Mr. Paciucci, “there was a lot of talk in the financial investments and banking sectors on social media. Companies in these sectors were listening.”
Hadoop is designed to deal with any kind of data—structured and unstructured. So, these FSI companies are evaluating the Hadoop framework. But Hadoop runs as a cluster of distributed nodes with local storage, unlike an HPC infrastructure running a Lustre parallel file system. Intel is working on merging the two.
“This convergence of Big Data on HPC is very recent,” adds Ms. Gojrzewski. “So we at Intel are coming up with ways to make it efficient and performant. We want to show customers how well Big Data can run on HPC.”
Tata Consultancy Services (TCS) was looking at the same problem.
“We provide IT services on a large scale across a wide range of industries,” says Rekha Singhal, Senior Scientist with TCS. “We wanted to see how real financial services applications, not benchmarks, and real data would perform in a Hadoop framework on top of an HPC architecture.”
TCS wanted to know if they could run Hadoop without moving the data into and out of Lustre. “Generally, companies who want to do Big Data analysis are adopting the Hadoop platform,” adds Ms. Singhal. “But, if they have HPC, they have to move data from Lustre to HDFS, do the analysis using MapReduce processing, and then read the data back to Lustre and the HPC applications to control the financial simulations.”
Ms. Singhal and her colleagues were looking at two problems. First, financial and insurance data sets can be massive, four terabytes and larger, so moving that much data between a Hadoop cluster and the Lustre file system is inefficient. Second, creating a new Hadoop cluster with local storage just to run MapReduce jobs would be expensive for customers. “Our objective was to come up with a platform for Hadoop data analysis using an HPC cluster that would give us good performance,” says Singhal.
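As a rough illustration of why staging data between Lustre and HDFS becomes costly as data sets grow, the sketch below estimates the time spent purely on copying. The 1 GB/s copy rate and the assumption that results are 10% of the input size are hypothetical values chosen for illustration; they are not figures from the TCS evaluation.

```python
TB = 10**12  # bytes in a terabyte (decimal)
GB = 10**9   # bytes in a gigabyte (decimal)


def staging_seconds(dataset_bytes, copy_bytes_per_s, result_fraction=1.0):
    """Estimate seconds spent copying data into HDFS and results back to Lustre.

    dataset_bytes    -- size of the input data set held on Lustre
    copy_bytes_per_s -- assumed sustained copy rate between the two systems
    result_fraction  -- assumed size of the results relative to the input
    """
    copy_in = dataset_bytes / copy_bytes_per_s
    copy_out = (dataset_bytes * result_fraction) / copy_bytes_per_s
    return copy_in + copy_out


if __name__ == "__main__":
    # Hypothetical data set sizes, from small to the multi-terabyte range.
    for size_tb in (0.2, 1, 4):
        secs = staging_seconds(size_tb * TB, 1 * GB, result_fraction=0.1)
        print(f"{size_tb} TB: about {secs / 60:.0f} minutes of copying")
```

The copy time grows linearly with data set size and is pure overhead on top of the MapReduce job itself, which is the cost that running Hadoop directly against Lustre avoids.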
Intel and TCS worked together to optimize their financial services and insurance applications for Lustre on HPC using Intel® Xeon® processors, Intel® SSDs, and Intel® True Scale fabric for InfiniBand*.
“We used the adaptors in Intel® Enterprise Edition for Lustre software to connect Hadoop to the Lustre file system and run MapReduce in an HPC environment,” comments Singhal. “We ran two very complex queries using real applications, some with joining operations, Java code, and SQL, on each of the financial and insurance data sets. To exercise the test fully, we used different data sizes and levels of concurrency. When we did the evaluations with Hadoop with Lustre as well as Hadoop with HDFS, we found the solution with Lustre ran three times faster.”
According to Mr. Paciucci, when the data sets are small, up to about 200 gigabytes, HDFS can feed MapReduce and output to Lustre without slowing down the process. “But when the data scales to terabytes, Lustre is faster.” Paciucci points out that with access to petabytes of social media data, a volume that keeps growing, company data sets will continue to expand exponentially. “Lustre is the only file system that can scale with those kinds of data volumes and serve it up efficiently for Hadoop and Big Data analysis.”
“Intel and Tata have shown how actual FSI applications can run fast and efficiently on top of an HPC architecture,” says Ms. Singhal. “We are sharing this information with our customers.” According to Paciucci, other customers in FSI are also looking at this convergence with interest. “Companies in FSI are very conservative, and porting their applications to Hadoop is difficult,” he says. “If they don’t already have HPC, it’s even more complicated.”
“The convergence has started,” says Ms. Gojrzewski. “The full process will take time. But with Intel’s solutions and companies like Tata proving its feasibility and providing the integration, it is coming quickly.”
Learn more about Intel® Solutions for Lustre software.
*Other trademarks and brands may be the property of other companies.