Chalk Talk: What is a Data Lake?

In this video, George Crump of Storage Switzerland and Fred Oh from Hitachi Data Systems explain the term “Data Lake”, and what it means for today’s analytics tools such as Pentaho, Hadoop, and Cassandra. The discussion includes real-world use cases, and also demonstrates how the underlying storage, compute, virtualization and networking infrastructure is critical for scaling hyper-converged analytics platforms.

“While there will continue to be a high demand for scale up enterprise storage and compute systems, the growth of unstructured data and the value it has for Big Data analysis will require new types of distributed, scale out storage and compute systems. Pentaho CTO James Dixon is credited with coining the term ‘data lake’. ‘If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.’ These ‘data lake’ systems will hold massive amounts of data and be accessible through file and web interfaces. Data protection for data lakes will consist of replicas and will not require backup since the data is not updated. Erasure coding will be used to protect large data sets and enable fast recovery. Open source will be used to reduce licensing costs and compute systems will be optimized for map reduce analytics. Automated tiering will be employed for performance and long-term retention requirements. Cold storage, storage that will not require power for long-term retention, will be introduced in the form of tape or optical media.”
