Doug Eadline posted over at Linux World this week about a possibly useful feature for virtualization in an HPC context that I’ve been thinking about for a while: migration and checkpointing.
One of the questions I am often asked is “What is the virtualization play for HPC?” I usually reply that there are issues that need to be resolved before virtualization and HPC walk hand in hand, but process migration in the form of check pointing would be a great thing to have. Thinking about the “over there” vs “here” problem in terms of virtualization, however, may just be the killer HPC/Virtualization application that solves a big problem.
The context (the “over there” vs. “here” part of that quote) is that when you migrate a process, one of the things you worry about “over there” is whether the “stuff” you need to run the job will be available when it gets there.
Imagine creating a tested, working image of your application, operating system, and file system and running it on a virtual HPC machine. The “over there” vs. “here” problem goes away because “over there” is “what is here.” Of course, we are talking about scale, and pushing a large number of images out to thousands of nodes is an issue. And notice I threw in the file system part. I believe that before HPC can be virtualized (or “clouded”), the I/O issue (both compute and file system) needs to be resolved. I suspect this will be through some form of I/O specification that travels with the job image. The specification will allow the cloud to run the application on the right hardware. The current cloud definition is rather loose when it comes to I/O (i.e. it will be there, it just can’t say exactly how fast or consistent it will be).
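To make the idea a bit more concrete, an I/O specification that travels with a job image might look something like the sketch below. This is purely illustrative — every name and field here is a hypothetical assumption of mine, not anything from Doug’s post or any existing cloud API — but it shows the shape of the thing: the job declares its compute- and file-system-side I/O needs, and a scheduler checks them against a candidate node before placing the image.

```python
from dataclasses import dataclass

@dataclass
class IOSpec:
    """Hypothetical I/O requirements attached to a job image."""
    min_read_bw_mb_s: int         # sustained file-system read bandwidth (MB/s)
    min_write_bw_mb_s: int        # sustained file-system write bandwidth (MB/s)
    needs_parallel_fs: bool       # requires a shared parallel file system
    min_interconnect_gb_s: float  # compute-side (e.g. MPI) bandwidth (Gb/s)

@dataclass
class NodeCapabilities:
    """What a candidate node (or cloud instance type) can actually deliver."""
    read_bw_mb_s: int
    write_bw_mb_s: int
    has_parallel_fs: bool
    interconnect_gb_s: float

def node_satisfies(spec: IOSpec, node: NodeCapabilities) -> bool:
    """A scheduler could run a check like this before pushing the
    image 'over there', so the job only lands on suitable hardware."""
    return (node.read_bw_mb_s >= spec.min_read_bw_mb_s
            and node.write_bw_mb_s >= spec.min_write_bw_mb_s
            and (node.has_parallel_fs or not spec.needs_parallel_fs)
            and node.interconnect_gb_s >= spec.min_interconnect_gb_s)
```

The point isn’t this particular schema; it’s that once the spec travels with the image, the cloud can say something firmer about I/O than “it will be there.”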
Doug’s right that we still have a long way to go given some of the results we’re seeing now (see here and here, for example, on EC2 results), but the flexibility this would give center operators in scheduling would be pretty nifty.