When assessing cloud HPC, don't forget queue wait times

This is the point of Ian Foster’s post from Wednesday on his own blog. Ian starts with Ed Walker’s paper, which I talked about this week after I finally pulled it out of my “to read” queue.

You’ll recall that Ed compared an NCSA supercomputer with Amazon’s EC2 on the NAS parallel benchmarks and showed that EC2 was way slower, even on jobs that require no internode communication. At the end of my post I observed that such comparisons are only meaningful if you have a choice; in the absence of access to a supercomputer, EC2 at least lets you get the job done, even if it takes longer.

Ian makes a different point:

However, before we conclude that EC2 is no good for science, I’d like to suggest that we consider the following question: what if I don’t care how fast my programs run, I simply want to run them as soon as possible? In that case, the relevant metric is not execution time but elapsed time from submission to the completion of execution. (In other words, the time that we must wait before execution starts becomes significant.)

This is right. Ian’s example is that one of the runs from Ed’s paper takes 300 seconds to start up on EC2 and 100 seconds to run, but the probability of that same job starting in 400 seconds on the NCSA supercomputer is only about 34%.

In fact, a 25 second job is somewhat unrepresentative of my own user community, so I was curious what the statistical start time would look like for a 25 hour job. It’s even worse: the probability of a start within 400 seconds is only about 15%, there is only a 32% chance the job will start within 10 hours, and even at 100 hours it’s still less than 83%. So, if a guaranteed start time matters, EC2 may be a good option.

The run times on Abe and EC2 for the LU example that Ian pulls out of Ed’s paper are 25 seconds and 100 seconds, respectively. That’s a 4x expansion in walltime. If we assume the same factor for my 25 hour job on Abe, we predict that EC2 will take 100 hours to run it. So EC2 takes 100 + .08 hours to complete the job (run time plus 300 seconds of startup time). Abe has a 72% chance of finishing my job in the same amount of total time (75 hours wait, 25 hours run). So in my example, if you care about when your job finishes, Abe finishes no later than EC2 a little less than 3 out of 4 times once you factor queue wait times into the mix. Something else to think about.
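
For anyone who wants to plug in their own numbers, here is a rough sketch of the arithmetic above. The function names are made up for illustration, and the 4x slowdown is just the LU ratio from Ed’s paper carried over to a longer job, which is an assumption rather than a measurement. The 72% figure is not computed here; it comes from the queue wait statistics and would have to be looked up for the break-even wait this prints.

```python
SECONDS_PER_HOUR = 3600.0


def turnaround_hours(wait_hours, run_hours):
    """Elapsed time from submission to completion: wait (or startup) plus run time."""
    return wait_hours + run_hours


def break_even_wait_hours(cloud_startup_hours, cloud_run_hours, hpc_run_hours):
    """Longest queue wait for which the supercomputer still finishes no later than the cloud."""
    return turnaround_hours(cloud_startup_hours, cloud_run_hours) - hpc_run_hours


# Figures taken from the post; the slowdown is assumed to carry over to a 25 hour job.
abe_run = 25.0                          # hours on Abe
slowdown = 100.0 / 25.0                 # LU: 100 s on EC2 vs 25 s on Abe
ec2_run = abe_run * slowdown            # predicted hours on EC2
ec2_startup = 300.0 / SECONDS_PER_HOUR  # 300 s startup, in hours

ec2_total = turnaround_hours(ec2_startup, ec2_run)
max_wait = break_even_wait_hours(ec2_startup, ec2_run, abe_run)

print(f"EC2 turnaround: {ec2_total:.2f} h")
print(f"Abe break-even queue wait: {max_wait:.2f} h")
```

With the numbers from the post this prints an EC2 turnaround of 100.08 hours and a break-even wait on Abe of 75.08 hours, which is where the "75 hours wait, 25 hours run" comparison comes from.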

Trackbacks

  1. […] contrary opinions from the folks who actually run supercomputers. And thanks to John West over at InsideHPC, I read a blog post by Ian Foster, associate division director for Mathematics and Computer Science […]

  2. […] right in that we still have a long way to go given some of the results we’re seeing now (see here and here, for example, on EC2 results) but the flexibility this would give center operators in […]

  3. […] motivator, but as you’ll read in this series of posts between Ian Foster and I (here, here, and here) it always underpins ‘when’ your answer is available. Penguin’s offering is not virtualized, […]

Comments

  1. “…what if I don’t care how fast my programs run, I simply want to run them as soon as possible? In that case, the relevant metric is not execution time but elapsed time from submission to the completion of execution. (In other words, the time that we must wait before execution starts becomes significant.)”

    Those are two different metrics. Elapsed time from submission to completion of execution is called “sojourn time”; the time we must wait before execution starts is called “waiting time.” Those concepts are widely used in fields such as stochastic systems, non-deterministic scheduling theory, etc. Google Scholar and CiteSeer are your friends.