Bob Grossman, whose work I wrote about last week for HPCwire, has a post at his blog that addresses some of the FAQs regarding the processing of large datasets in a cloud infrastructure. If you are at all interested in clouds for large data, or in scientific computing generally, I recommend giving it a read.
In the post he sets out a framework for discussion (what is large data?) and identifies several cloud solutions suitable for dealing with large data, such as Aster, Sector, Hadoop, and Greenplum.
How do I get started? The easiest way to get started is to download one of the applications and work through some basic examples. The example most people work through first is word count. Another common example is terasort (sorting 10 billion 100-byte records, where the first 10 bytes of each record are the key to be sorted and the remaining 90 bytes are the payload). A simple analytic to try is MalStone, which I have described in another post.
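To see why word count is the canonical starting example, here is a minimal, framework-free sketch of the map/shuffle/reduce pattern it exercises. This is purely illustrative (the function names are my own, not part of any of the systems mentioned); the real Hadoop or Sector versions distribute these same phases across a cluster.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_word(word, counts):
    # Reduce phase: sum all counts emitted for a single word.
    return (word, sum(counts))

def word_count(lines):
    # Shuffle phase: group the intermediate pairs by key,
    # so each reducer sees every count for its word.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reduce_word(w, c) for w, c in groups.items())

text = ["the quick brown fox", "the lazy dog"]
print(word_count(text))
```

In a real MapReduce deployment each phase runs in parallel over partitions of the input, but the logic per record is exactly this simple, which is what makes word count a good first exercise.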
I also commend Bob’s blog, From Data to Decisions, to your RSS reader.