NY Times: 11 million PDFs, Amazon, and MapReduce


Blog post at the NY Times today from the tech guy responsible for converting the paper's archive content, 1851 through 2002, to PDF. He did it in under 24 hours with 100 Amazon EC2 instances, Hadoop, and some scripts.

I had been using Amazon's S3 service for some time and was quite impressed. And in late 2006 I had begun playing with Amazon EC2. So the basic idea I had was this: upload 4TB of source data into S3, write some code that would run on numerous EC2 instances to read the source data, create PDFs, and store the results back into S3. S3 would then be used to serve the PDFs to the general public. It all sounded pretty simple, and that is how I got the folks in charge to agree to such an idea — not to mention that Amazon S3/EC2 is pretty easy on the wallet.

…For deployment, I created a custom AMI (Amazon Machine Image) for EC2 that was based on a Xen image from my desktop machine. Using some simple Python scripts and the boto library, I booted four EC2 instances of my custom AMI. I logged in, started Hadoop and submitted a test job to generate a couple thousand articles — and to my surprise it just worked.
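For anyone curious what that boot step might look like, here is a minimal sketch in Python against the classic boto EC2 API. The AMI ID, instance count, and instance type are placeholders; the post doesn't publish the actual scripts.

    import boto

    # Credentials come from the environment or the ~/.boto config file.
    ec2 = boto.connect_ec2()

    # Boot four instances of a custom AMI, as the quote describes.
    # 'ami-12345678' is a made-up ID, not the NYT's actual image.
    reservation = ec2.run_instances('ami-12345678',
                                    min_count=4, max_count=4,
                                    instance_type='m1.small')

    for instance in reservation.instances:
        instance.update()  # refresh state from the EC2 API
        print(instance.id, instance.state)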

4 TB of data in, and 1.5 TB of new PDFs out. And now NYT content back to 1851 is free.
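The S3 half of that pipeline is just as compact. Here is a hedged sketch of the round trip the quote describes, again with boto; the bucket and key names are invented for illustration, and the PDF-generation step itself is elided.

    import boto

    s3 = boto.connect_s3()
    src = s3.get_bucket('nyt-source-data')       # hypothetical bucket names
    out = s3.get_bucket('nyt-generated-pdfs')

    # Pull one source bundle down to local disk.
    bundle = src.get_key('articles/1851/0001.tar')
    bundle.get_contents_to_filename('/tmp/0001.tar')

    # ... run the PDF-generation step on the local files ...

    # Store the result back into S3 and make it publicly readable,
    # so S3 itself can serve the PDF to readers.
    pdf = out.new_key('pdfs/1851/0001.pdf')
    pdf.set_contents_from_filename('/tmp/0001.pdf')
    pdf.make_public()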

Comments

  1. Very interesting. How did he upload 4 TB?

  2. Mark – not sure… the article didn't specify. If I run across anything, I'll post it here.