Embrace Innovation While Reducing Risk: The Three Steps to AI-grade Data at Scale

In this contributed article, Kunju Kashalikar, Senior Director of Product Management at Pentaho, discusses how to dream big without the risk: three steps to AI-grade data. The industry adage of "garbage in, garbage out" has never been more applicable than it is now. Clean, accurate data is the key to winning the AI race, but getting off the starting blocks is the challenge for most organizations. Winning the race means working with data that's match fit for AI.

Chalk Talk: What is a Data Lake?

"If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

These "data lake" systems will hold massive amounts of data and be accessible through file and web interfaces. Data protection for data lakes will consist of replicas and will not require backup, since the data is written once and not updated. Erasure coding will be used to protect large data sets and enable fast recovery. Open source software will be used to reduce licensing costs, and compute systems will be optimized for MapReduce-style analytics. Automated tiering will be employed to meet performance and long-term retention requirements. Cold storage – storage that does not require power for long-term retention – will be introduced in the form of tape or optical media.