Amazon drives compute utilization by acting as librarian for public data sets

From the Wired blog comes news that Amazon is hosting public data sets on its servers:

As of Thursday, the annotated human genome, US Census data, and countless 3D renderings of molecules are available on the elastic servers. And users can upload their own boatload of information, without a fee. But there is a catch: If you want to crunch numbers on their server, or store the output there, it will not be free.

…“Public Data Sets on AWS will enable me and many of my colleagues to collaborate with each other by sharing our commonly used data sets, research environments and tools,” Peter Tonellato of Harvard Medical School said in the press release. “We can set up a controlled environment in minutes, run our computational analysis for a couple of hours, and shut down the environment. Our results are completely repeatable. I only pay for the compute time I use, and more importantly I can spend more time focusing on research, not downloading and setting up computational infrastructure.”

Lest you forget that Amazon is a business: it looks like you cannot treat Amazon as a universal data directory, downloading just the bits you want and computing on your own resources. From Amazon’s own page on the new offering:

Select public data sets are hosted on Amazon EC2 for free as Amazon Elastic Block Store (Amazon EBS) snapshots. Amazon EC2 customers can access this data by creating their own personal Amazon EBS volumes, using the public data set snapshots as a starting point. They can then access, modify and perform computation on these volumes directly using their Amazon EC2 instances and just pay for the compute and storage resources that they use. If available, researchers can also use pre-configured Amazon Machine Images (AMIs) with tools like Inquiry by BioTeam to perform their analysis.
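
The workflow Amazon describes (create a personal EBS volume from a public snapshot, then attach it to your own EC2 instance) can be sketched roughly as below. This is a minimal sketch, not Amazon's own tooling: it assumes a present-day boto3 EC2 client is passed in as `ec2`, and the function name, the default device name, and all IDs shown are hypothetical placeholders.

```python
def volume_from_public_snapshot(ec2, snapshot_id, availability_zone,
                                instance_id, device="/dev/sdf"):
    """Create a personal EBS volume from a public snapshot and attach it.

    `ec2` is assumed to behave like a boto3 EC2 client. The default
    device name is an assumption; adjust it for your instance type.
    """
    # Create a private, writable copy of the public data set.
    # The volume must live in the same availability zone as the instance.
    volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    volume_id = volume["VolumeId"]

    # Wait until the new volume is ready before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the volume so the instance can mount and read the data.
    ec2.attach_volume(
        VolumeId=volume_id,
        InstanceId=instance_id,
        Device=device,
    )
    return volume_id
```

With boto3 installed and credentials configured, a call might look like `volume_from_public_snapshot(boto3.client("ec2"), "snap-EXAMPLE", "us-east-1a", "i-EXAMPLE")`, where both IDs are placeholders to be replaced with a real public snapshot and your own instance. This is where the catch bites: the snapshot is hosted for free, but the volume and the instance are billed for as long as they exist.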

