This is the final article in an editorial series that explores high-performance storage and the benefits of Lustre solutions for HPC. This week we look at unleashing the power of parallel storage to boost MapReduce application performance.
From Wall Street to the Great Wall, enterprises and institutions of all sizes are faced with the benefits – and challenges – promised by ‘Big Data’. But before users can take advantage of the near limitless potential locked within their data, they must have affordable, scalable and powerful software tools to manage the data.
High performance infrastructure has expanded beyond its traditional workloads and is now a key technology for today’s forward-looking commercial computer users. Parallel storage solutions powered by Lustre storage software have found a new home in these data-intensive business operations, and Apache Hadoop* has become the framework of choice for big data analytics. Hadoop transforms enormous amounts of data into manageable distributed datasets that applications can more easily analyze.
When organizations operate both Lustre and Hadoop within a shared infrastructure, there is a strong case for using Lustre as the file system for Hadoop analytics as well as HPC storage. Hadoop users can access any Lustre files directly from Hadoop, without the need to copy them over to the Hadoop environment. Using Lustre in combination with Hadoop also makes storage management simpler, since the platform runs a single Lustre file system instance rather than a separate Hadoop file system instance for each cluster, and makes more productive use of storage assets.
Moreover, Hadoop’s own file system, referred to as HDFS, is inconsistent with the HPC paradigm of decoupling computation from storage, as HDFS expects storage disks to be locally attached to individual compute nodes.
And, since HDFS is not POSIX-compliant (meaning that it does not conform to the standards that maintain compatibility between operating systems), using it alongside Lustre incurs the performance overhead of staging extremely large datasets in and out of Lustre. Fortunately, Hadoop uses a storage abstraction layer for accessing persistent data, which makes it possible to plug in different types of file systems. Lustre can be made to comply with Hadoop’s storage requirements by implementing Hadoop’s Java* file system API. Since Lustre is POSIX-compliant and can be mounted like an NFS share, it is able to exploit Java’s inherent support for native file systems.
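The idea of a pluggable storage layer over a POSIX mount can be illustrated with a small sketch. This is not Hadoop's Java API or the Intel connector; it is a minimal Python model of the pattern, and every name in it (`FileSystem`, `PosixMountFileSystem`, `register`, `get_file_system`, the `lustre` scheme) is hypothetical. A temporary directory stands in for a Lustre mount point.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class FileSystem(ABC):
    """Hypothetical stand-in for a framework's storage abstraction layer."""

    @abstractmethod
    def open(self, path, mode="r"):
        ...

    @abstractmethod
    def list_dir(self, path):
        ...


class PosixMountFileSystem(FileSystem):
    """Serves paths from a POSIX mount point using ordinary file calls."""

    def __init__(self, mount_point):
        self.mount_point = mount_point

    def _resolve(self, path):
        # Map an absolute path inside the URI onto the mount point.
        return os.path.join(self.mount_point, path.lstrip("/"))

    def open(self, path, mode="r"):
        return open(self._resolve(path), mode)

    def list_dir(self, path):
        return sorted(os.listdir(self._resolve(path)))


# Implementations are selected by URI scheme (assumed design, for illustration).
REGISTRY = {}


def register(scheme, fs):
    REGISTRY[scheme] = fs


def get_file_system(uri):
    """Return the file system registered for the URI scheme, plus the path."""
    parsed = urlparse(uri)
    return REGISTRY[parsed.scheme], parsed.path


def demo():
    # A temporary directory plays the role of the shared mount.
    with tempfile.TemporaryDirectory() as mount:
        os.makedirs(os.path.join(mount, "data"))
        register("lustre", PosixMountFileSystem(mount))
        fs, path = get_file_system("lustre:///data/input.txt")
        with fs.open(path, "w") as f:
            f.write("hello from the shared mount\n")
        with fs.open(path) as f:
            return f.read(), fs.list_dir("/data")


if __name__ == "__main__":
    print(demo())
```

Because the mount behaves like any other directory, the implementation needs nothing beyond the standard file calls, which is the property the paragraph above attributes to a POSIX-compliant file system.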
The only additional step for mounting Lustre as the file system for Hadoop analytics is to convey to the Hadoop task scheduler that Lustre is indeed a distributed file system and the input data are accessible uniformly from all the compute nodes. This allows tasks to be scheduled on any node independent of data locality, so all Hadoop “compute” nodes can access any data, eliminating the need to move the data itself between nodes. Additional optimization is possible by allowing reducers to read intermediate map outputs directly from the shared file system and eliminating the overhead of streaming large files over HTTP.
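The effect of locality-independent scheduling can be shown with a toy model. This is a deliberately simplified sketch, not Hadoop's scheduler: the function `schedule`, the node and split names, and the block placement are all invented for illustration, and the local-storage case is modeled as a hard constraint even though a real scheduler can fall back to remote reads.

```python
def schedule(splits, nodes, block_locations=None):
    """Assign each input split to the least-loaded eligible node.

    block_locations maps a split to the nodes that hold a local copy
    (the locally attached storage case). None means every node sees all
    data through a shared file system, so any node is eligible.
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for split in splits:
        eligible = nodes if block_locations is None else block_locations[split]
        # min() returns the first node with the lowest current load.
        target = min(eligible, key=lambda n: load[n])
        assignment[split] = target
        load[target] += 1
    return assignment, load


# Hypothetical cluster: data is skewed towards n0, and n2 holds no blocks.
SPLITS = ["s0", "s1", "s2", "s3", "s4", "s5"]
NODES = ["n0", "n1", "n2"]
LOCATIONS = {
    "s0": ["n0"],
    "s1": ["n0"],
    "s2": ["n0", "n1"],
    "s3": ["n0"],
    "s4": ["n1"],
    "s5": ["n0"],
}

if __name__ == "__main__":
    _, local_load = schedule(SPLITS, NODES, LOCATIONS)
    _, shared_load = schedule(SPLITS, NODES)
    print("locality-bound:", local_load)
    print("shared storage:", shared_load)
```

With this invented placement, the locality-bound run puts four tasks on n0 and none on n2, while the shared-storage run spreads the six tasks evenly at two per node.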
Unique to Intel EE for Lustre software are innovative software ‘connectors’ that overcome the challenges posed by HDFS and allow users to run MapReduce* applications directly on Lustre. These ‘connectors’ optimize the performance of MapReduce processing while delivering faster, more scalable and easier-to-manage storage.
We hope you enjoyed this six-part editorial series. If you missed an article, you can download the complete guide to Lustre Solutions as a PDF from the insideHPC White Paper Library, courtesy of Intel.