We’ve written about the DOE’s Magellan clouds-for-science experiment before (here and here). It’s not a new thing, but Federal Computer Week is talking about some early results:
“For the more traditional MPI applications there were significant slowdowns, over a factor of 10,” said Kathy Yelick, division director for the National Energy Research Scientific Computing division. NERSC is partnered on the Magellan cloud project with the Argonne National Laboratory.
This isn’t new information; we had a version of it back in 2008, when Walker published his paper (which I wrote about here). But Walker was using Amazon’s EC2 and NERSC is using a purpose-built cloud, so this is a valuable refinement of our picture of the cloud world as it relates to HPC.
But not all HPC jobs are so tightly coupled that inter-processor communication is a limiting factor. The DOD has whole computational areas that are gated by its ability to perform complex parameter space studies with tens or hundreds of thousands of what are essentially single-processor (or single-node) jobs. A time-shared HPC system set up to facilitate dozens of thousand-processor jobs often inhibits the kind of queue-stuffing that the parameter study crowd needs, and their requirements can be bursty, so exploring other alternatives is a good use of time.
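To make the distinction concrete, here’s a minimal sketch of that second kind of workload; the evaluate() function is a hypothetical stand-in for a single-processor model run, not anyone’s actual code. Every case is independent, so the slow interconnect that hobbles tightly coupled MPI codes in a cloud never comes into play.

```python
# Minimal sketch of an embarrassingly parallel parameter study.
# evaluate() is a hypothetical stand-in for one single-processor model run;
# each grid point is independent, so there is no inter-process communication.
import itertools
from multiprocessing import Pool

def evaluate(params):
    """Placeholder for a single-processor simulation at one parameter point."""
    pressure, temperature = params
    return pressure * temperature  # stand-in for a real figure of merit

if __name__ == "__main__":
    # Cartesian product of parameter values: thousands of independent cases.
    grid = list(itertools.product(range(100), range(100)))
    with Pool() as pool:  # one worker per local core
        results = pool.map(evaluate, grid, chunksize=64)
    print(len(results), "cases evaluated")
```

In practice each case would go out as its own cloud instance or batch job rather than a local worker process, but the communication pattern (none) is the point.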
Kathy’s team has identified another area of science that might benefit from a cloud environment:
However, for computations that can be performed serially, such as genomics calculations, there was little or no deterioration in performance in the commercial cloud, Yelick said. Magellan directors recently set up a collaboration with the Joint Genome Institute to carry out some of the institute’s computations at the Magellan cloud testbed.
This article would be much better if it said what the slowdowns are being compared to.
Actually, I thought it was a comparison against some other way of expressing parallelism, and I had to read the original link to find out that I was wrong.
You missed mentioning an important point raised by Ian Foster a while ago on his blog: the job run time may be slower, but the total time to solution can be much better because you don’t have to wait in a queue.
Or maybe he was subliminally pushing for more or bigger supercomputing centers to reduce the queue lengths. 🙂
No, I didn’t miss that point, Greg. In fact, Ian and I had a long comment thread on my site and his at the time I pointed to Walker’s paper last year. In a private (and thus relatively finite-sized) cloud like NERSC’s, a substantial move to a cloud platform for science would still have the wait times that a traditional batch system has (more demand than capacity). Also, knowing that the cloud software/hardware stack itself is inherently slower puts a lower bound on run time.
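To put some numbers on Ian’s point and my caveat (every figure here is made up purely for illustration): total time to solution is queue wait plus run time, so eliminating the queue can more than pay for a slower platform on serial work, while a roughly 10x run-time penalty is hard to claw back for tightly coupled jobs.

```python
# Back-of-the-envelope time-to-solution comparison; all numbers are illustrative.
def time_to_solution(queue_wait_hours, run_hours, slowdown=1.0):
    """Total time to solution = queue wait + run time * platform slowdown."""
    return queue_wait_hours + run_hours * slowdown

run = 4.0          # hours of compute on dedicated, unvirtualized HPC hardware
queue_wait = 12.0  # hypothetical wait in a busy shared batch queue

batch_hpc    = time_to_solution(queue_wait, run)           # fast run, long wait
cloud_serial = time_to_solution(0.0, run, slowdown=1.0)    # serial codes: little slowdown
cloud_mpi    = time_to_solution(0.0, run, slowdown=10.8)   # tightly coupled MPI: ~10x slower

print(f"batch HPC {batch_hpc:.1f} h | cloud, serial {cloud_serial:.1f} h | cloud, MPI {cloud_mpi:.1f} h")
```

And of course in a finite private cloud the queue wait doesn’t actually drop to zero once demand outstrips capacity, which is exactly the caveat above.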
Since there has been some follow-up discussion, I wanted to clarify and add some context. The factor of 10 was a comparison between our unvirtualized Magellan hardware and Amazon’s Elastic Compute Cloud (EC2) using m1.xlarge instances. We ran the NERSC-6 benchmarks to perform the comparison. For the seven applications we tested, the mean slowdown factor for EC2 relative to Magellan was 10.8. The best application, GAMESS, was 2.7 times slower, while the worst performance was with PARATEC, which was 51.8 times slower. Again, the Magellan results were on unvirtualized hardware with an InfiniBand interconnect.
We are in the process of repeating the benchmarks on Magellan and varying aspects of the configuration to better isolate the reasons for the slowdown and to enable more cloud-like features on the Magellan testbed. The goal is to understand the value of clouds for scientific computing.
Kathy – thanks so much for checking in and following up.
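For readers who want to reproduce that kind of comparison themselves: the slowdown factor is just the EC2 wall-clock time divided by the Magellan wall-clock time for the same benchmark. Here’s a sketch with placeholder timings, since the per-application NERSC-6 times aren’t quoted in this thread.

```python
# Slowdown factor = EC2 wall-clock time / Magellan wall-clock time for the same benchmark.
# The timings below are placeholders chosen only to reproduce the quoted factors;
# they are not the measured NERSC-6 numbers, and the 10.8x mean is over all seven codes.
runs = {
    # benchmark: (magellan_seconds, ec2_seconds)
    "GAMESS":  (1000.0,  2700.0),   # quoted as 2.7x slower
    "PARATEC": (1000.0, 51800.0),   # quoted as 51.8x slower
}

for name, (magellan, ec2) in runs.items():
    print(f"{name}: {ec2 / magellan:.1f}x slower on EC2")
```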