NY Times: 11 million PDFs, Amazon, and MapReduce


Blog post at the NY Times today from the tech guy responsible for converting the paper's archive content, 1851 through 2002, to PDF. He did it in under 24 hours with 100 Amazon EC2 instances, Hadoop, and some scripts.

I had been using Amazon's S3 service for some time and was quite impressed. And in late 2006 I had begun playing with Amazon EC2. So the basic idea I had was this: upload 4TB of source data into S3, write some code that would run on numerous EC2 instances to read the source data, create PDFs, and store the results back into S3. S3 would then be used to serve the PDFs to the general public. It all sounded pretty simple, and that is how I got the folks in charge to agree to such an idea — not to mention that Amazon S3/EC2 is pretty easy on the wallet.

…For deployment, I created a custom AMI (Amazon Machine Image) for EC2 that was based on a Xen image from my desktop machine. Using some simple Python scripts and the boto library, I booted four EC2 instances of my custom AMI. I logged in, started Hadoop and submitted a test job to generate a couple thousand articles — and to my surprise it just worked.
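For anyone curious what that boot step might look like, here is a minimal sketch in Python against the classic boto EC2 API. The AMI ID, instance count, and instance type are placeholders; the post doesn't publish the actual scripts.

    import boto

    # Credentials come from the environment or the ~/.boto config file.
    ec2 = boto.connect_ec2()

    # Boot four instances of a custom AMI, as the quote describes.
    # 'ami-12345678' is a made-up ID, not the NYT's actual image.
    reservation = ec2.run_instances('ami-12345678',
                                    min_count=4, max_count=4,
                                    instance_type='m1.small')

    for instance in reservation.instances:
        instance.update()  # refresh state from the EC2 API
        print(instance.id, instance.state)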

4 TB of data in, and 1.5 TB of new PDFs out. And now NYT content back to 1851 is free.
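The S3 half of that pipeline is just as compact. Here is a hedged sketch of the round trip the quote describes, again with boto; the bucket and key names are invented for illustration, and the PDF-generation step itself is elided.

    import boto

    s3 = boto.connect_s3()
    src = s3.get_bucket('nyt-source-data')       # hypothetical bucket names
    out = s3.get_bucket('nyt-generated-pdfs')

    # Pull one source bundle down to local disk.
    bundle = src.get_key('articles/1851/0001.tar')
    bundle.get_contents_to_filename('/tmp/0001.tar')

    # ... run the PDF-generation step on the local files ...

    # Store the result back into S3 and make it publicly readable,
    # so S3 itself can serve the PDF to readers.
    pdf = out.new_key('pdfs/1851/0001.pdf')
    pdf.set_contents_from_filename('/tmp/0001.pdf')
    pdf.make_public()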

Comments

  1. Very interesting. How did he upload 4 TB?

  2. Mark – not sure… the article didn't specify. If I run across anything, I'll post it here.