You’ll recall that Ed compared an NCSA supercomputer with Amazon’s EC2 on the NAS parallel benchmarks and showed that EC2 was way slower, even on jobs that require no internode communication. At the end of my post I observed that such comparisons are only meaningful if you have a choice; in the absence of access to a supercomputer, EC2 at least lets you get the job done, even if it takes longer.
Ian makes a different point:
However, before we conclude that EC2 is no good for science, I’d like to suggest that we consider the following question: what if I don’t care how fast my programs run, I simply want to run them as soon as possible? In that case, the relevant metric is not execution time but elapsed time from submission to the completion of execution. (In other words, the time that we must wait before execution starts becomes significant.)
This is right. Ian’s example is that one of the runs from Ed’s paper takes 300 seconds to start up on EC2 and 100 seconds to run, but the probability of that same job starting in 400 seconds on the NCSA supercomputer is only about 34%.
In fact, a 25 second job is somewhat unrepresentative of my own user community, and I was curious what the statistical start time would look like for a 25 hour job. It’s even worse. The probability of a start within 400 seconds is only about 15% in this case; there is only a 32% chance the job will start within 10 hours, and at 100 hours it’s still less than 83%. So, if a guaranteed start time matters, EC2 may be a good option.
The run times on Abe and EC2 for the LU example that Ian pulls out of Ed’s paper are 25 seconds and 100 seconds, respectively. That’s a 4x expansion in walltime. If we assume the same factor for my 25 hour job on Abe, we predict that EC2 will take 100 hours to run the job. So, EC2 takes 100 + .08 hours to complete the job (run time plus the 300 seconds of start up time from Ian’s example). Abe has a 72% chance of finishing my job in the same amount of total time (75 hours wait, 25 hours run). So in my example, if you care about when your job finishes and factor queue wait times into the mix, it’s a tie or better for Abe a little less than 3 out of 4 times. Something else to think about.
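For anyone who wants to redo this arithmetic with their own job length, here is a minimal Python sketch. It assumes the figures quoted above (25 and 100 second LU run times, and the 300 second EC2 start up from Ian's example) and that the 4x slowdown carries over unchanged to a longer job. The function names are my own, purely for illustration, and the queue wait probabilities are not computed here; those have to come from a queue prediction service.

```python
# Figures quoted in the post; all names below are illustrative only.
ABE_RUN_S = 25.0       # LU run time on Abe, in seconds
EC2_RUN_S = 100.0      # LU run time on EC2, in seconds
EC2_STARTUP_S = 300.0  # EC2 start up time from Ian's example, in seconds


def ec2_turnaround_hours(abe_run_hours, slowdown, startup_s):
    """Start up time plus run time on EC2, assuming a fixed slowdown factor."""
    return abe_run_hours * slowdown + startup_s / 3600.0


def abe_wait_budget_hours(abe_run_hours, slowdown, startup_s):
    """Longest queue wait on Abe that still finishes no later than EC2."""
    return ec2_turnaround_hours(abe_run_hours, slowdown, startup_s) - abe_run_hours


slowdown = EC2_RUN_S / ABE_RUN_S  # 4x expansion in walltime
ec2_total = ec2_turnaround_hours(25.0, slowdown, EC2_STARTUP_S)
wait_budget = abe_wait_budget_hours(25.0, slowdown, EC2_STARTUP_S)

print(f"EC2 turnaround: {ec2_total:.2f} hours")          # 100.08 hours
print(f"Abe can wait up to: {wait_budget:.2f} hours")    # 75.08 hours
```

The wait budget is the number to look up in the queue wait predictions: the chance that Abe starts the job within roughly 75 hours is the chance that Abe finishes no later than EC2.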