Gary Orenstein has an interesting post at GigaOm called "How Yahoo, Facebook, Amazon & Google Think About Big Data." These companies have all developed their own approaches to storing petabytes of data that, unlike much of the data in high-end computing, actually gets used more than once after it is written.
Yahoo! has MObStor, Facebook has Haystack, Amazon has Dynamo, and then, of course, there is the Google File System.
Judging by when information about it was released, MObStor is the new kid on the block, so let’s take a look at some of its standout characteristics:
- It’s designed for petabyte-scale content that is site-generated, partner-generated, or user-generated
- It handles tens of thousands of page views every second
- Reads dominate writes (most data is WORM: write-once read-many)
- Only a low level of consistency is required
- It is designed to scale quickly and efficiently
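
To make the write-once read-many pattern from the list concrete, here is a minimal Python sketch. It is not MObStor's actual design or API, which the post does not describe; the `WormStore` class and its `put` and `get` methods are hypothetical names invented purely for illustration.

```python
class WormStore:
    """Toy in-memory write-once read-many store (hypothetical, not MObStor's API)."""

    def __init__(self):
        self._objects = {}  # key -> immutable bytes
        self.reads = 0      # number of get() calls
        self.writes = 0     # number of successful put() calls

    def put(self, key, data):
        # Write-once: a key can be stored exactly one time.
        if key in self._objects:
            raise KeyError(f"{key!r} already written; objects are immutable")
        self._objects[key] = bytes(data)
        self.writes += 1

    def get(self, key):
        # Read-many: reads never modify the stored object.
        self.reads += 1
        return self._objects[key]


# One write followed by many reads, mimicking a read-dominated workload.
store = WormStore()
store.put("photo/1", b"jpeg bytes")
for _ in range(1000):
    store.get("photo/1")
```

Because an object never changes once it is written, copies of it can in principle be cached or replicated without coordinating updates, which is plausibly one reason a low level of consistency is enough for this kind of workload.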
One thing that all of these approaches have in common is really smart software on top of really cheap hardware, which is not how most of the storage technology in HPC is built. It will be interesting to see what happens to our storage technologies as more HPC applications come online to handle the incredible volumes of unstructured data that businesses and researchers increasingly need to deal with. I wonder whether they will push our community into a crisis akin to the one created by the economics of the shift to commodity CPUs.