In this RCE Podcast, Marcel Kornacker from Cloudera describes the Impala project. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.
“This talk will present RDMA-based designs using OpenFabrics Verbs and heterogeneous storage architectures to accelerate multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase), Spark and Memcached. An overview of the associated RDMAenabled software libraries being designed and publicly distributed as a part of the HiBD project.”
Hadoop and Spark clusters have a reputation for being extremely difficult to configure, install, and tune, but help is on the way. The good folks at Cluster Monkey are hosting a crash course entitled Apache Hadoop with Spark in One Day. “After completing the workshop attendees will be able to use and navigate a production Hadoop cluster and develop their own projects by building on the workshop examples.”
Today Florida Atlantic University (FAU) announced that it is using Bright Cluster Manager software for its HPC cluster. The 56-node cluster is used for teaching Hadoop Map Reduce, bioinformatics research and other modeling and visualization work. Administrators say Bright Cluster Manager has significantly increased automation and is easily scalable to meet expected future growth.
“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” These “data lake” systems will hold massive amounts of data and be accessible through file and web interfaces. Data protection for data lakes will consist of replicas and will not require backup since the data is not updated. Erasure coding will be used to protect large data sets and enable fast recovery. Open source will be used to reduce licensing costs and compute systems will be optimized for map reduce analytics. Automated tiering will be employed for performance and long-term retention requirements. Cold storage, storage that will not require power for long-term retention, will be introduced in the form of tape or optical media.”
“In business and commercial computing, momentum towards cloud and big data has already built up to the point where it is unstoppable. In technical computing, the growth of the Internet of Things is pressing towards convergence of technologies, but obstacles remain, in that HPC and big data have evolved different hardware and software systems while Open Stack, the Open Source cloud computing platform, does not work well with HPC.”
As an open source tool designed to navigate large amounts of data, Hadoop continues to find new uses in HPC. Managing a Hadoop cluster is different than managing an HPC cluster, however. It requires mastering some new concepts, but the hardware is basically the same and many Hadoop clusters now include GPUs to facilitate deep learning.
In this video, Alexandru Iosup from the TU Delft presents: Scalable High Performance Systems. “During this masterclass, Alexandru discussed several steps towards addressing interesting new challenges which emerge in the operation of the datacenters that form the infrastructure of cloud services, and in supporting the dynamic workloads of demanding users. If we succeed, we may not only enable the advent of big science and engineering, and the almost complete automation of many large-scale processes, but also reduce the ecological footprint of datacenters and the entire ICT industry.”
Today Intel Corporation and BlueData announced a broad strategic technology and business collaboration, as well as an additional equity investment in BlueData from Intel Capital. BlueData is a Silicon Valley startup that makes it easier for companies to install Big Data infrastructure, such as Apache Hadoop and Spark, in their own data centers or in the cloud.
In this slidecast, Chris Porter and Jeff Kamiol from IBM describe how IBM High Performance Services deliver versatile, application-ready clusters in the cloud for organizations that need to quickly and economically add computing capacity for high performance application workloads.