"Parallel Programming in the Age of Big Data"

An article by that title over at GigaOM has gotten a lot of attention on the interwebs over the past couple of weeks, so I hereby present you with a link to said article, plus a couple of excerpts to keep you warm while you sink into your pre-Thanksgiving lethargy:

In this context, there is some good news for parallel programming. Data analysis software parallelizes fairly naturally. In fact, software written in SQL has been running in parallel for more than 20 years. But with “Big Data” now becoming a reality, more programmers are interested in building programs on the parallel model — and they often find SQL an unfamiliar and restrictive way to wrangle data and write code. The biggest game-changer to come along is MapReduce, the parallel programming framework that has gained prominence thanks to its use at web search companies.

…SQL provides a higher-level language that is more flexible and optimizable, but less familiar to many programmers. MapReduce largely asks programmers to write traditional code, in languages like C, Java, Python and Perl. In addition to its familiar syntax, MapReduce allows programs to write to and read from traditional files in a filesystem, rather than requiring database schema definitions. MapReduce is such a compelling entryway into parallel programming that it is being used to nurture a new generation of parallel programmers. Every Berkeley computer science undergraduate now learns MapReduce, and other schools have undertaken similar programs. Industry is eagerly supporting these efforts.
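For the uninitiated, here's a minimal sketch of what the model actually looks like: the classic word count, written in plain Python, simulating in a single process what Hadoop would spread across a cluster. The function names (map_fn, reduce_fn, map_reduce) are mine, not Hadoop's API; this shows the shape of the idea, not production code.

```python
# A minimal, single-process sketch of the MapReduce programming model.
# Hadoop distributes these same phases across many machines; here we
# just run them in order to show the structure of the computation.
from collections import defaultdict

def map_fn(document):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all the counts emitted for one word.
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle phase: group every emitted value by its key, then reduce
    # each group independently (this is where the parallelism lives).
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(sorted(map_reduce(docs)))
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

The point is that the programmer writes only the two small sequential functions; the framework takes care of partitioning the input, shuffling intermediate keys, and rerunning failed tasks. That division of labor is exactly why the excerpt above calls it such a compelling entryway into parallel programming.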

If you aren’t at least passingly familiar with MapReduce (or its open-source twin, Hadoop), you should read this article so you can nod convincingly when someone smarter than you suggests that problem X is an ideal candidate for Hadoop and asks, “by the way, why haven’t you tried that yet?”