A bit about MapReduce


If you’ve heard about Google’s MapReduce (and its open-source clone, Hadoop) but have been too…uh…busy to dive into technical papers, you should check out this post. While making its point about the usefulness of MapReduce in the face of some well-publicized (and roundly refuted) recent criticism, it serves as a good 50,000-foot view of why you should care about it.

Over the past four years, MapReduce has become the standard way in which all large-scale data processing is carried out at Google. More than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day! And Google just happens to be the most well-known “big data” company in the world, so where they lead, others often follow.

…MapReduce/Hadoop, being essentially an implicitly parallel programming model, is really easy for developers to learn and to use effectively on very large scale parallel systems.
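To make "implicitly parallel" a little more concrete, here is a minimal single-process sketch of the programming model in Python, using the classic word-count example. This is not Hadoop's or Google's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. The point is that the developer writes only the map and reduce functions, with no threads, locks, or message passing, and a real framework would be free to run those functions across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key.

    In a real framework this step is done for you, across the network.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent of the others, which is what
    # lets a framework run them in parallel without the developer asking.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    # Each key is reduced independently too, so this step also parallelizes.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the cat", "The dog the"]))
```

Everything here runs sequentially in one process; the sketch only shows the shape of the code a developer writes, not how a cluster schedules it.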

The info in this article is the minimum you should know about MapReduce as an HPC practitioner. IMHO, of course.