MIT team looks at CFD on multicore chips and finds less is more

Dr. Dobb’s is reporting this week on a research effort by a team of MIT researchers to parallelize a CFD application simulating flow in an oilfield on a single 24-core node. They observed a 22x speedup on 24 cores (for this application) by taking advantage of the fact that communication between cores on a single node (or die) is (usually much) faster than communication between nodes.

When such simulations run on a cluster of computers, the cluster’s management system tries to minimize the communication between computers, which is much slower than communication within a given computer. To do this, it splits the model into the largest chunks it can — in the case of the weather simulation, the largest geographical regions — so that it has to send them to the individual computers only once. That, however, requires it to guess in advance how long each chunk will take to execute. If it guesses wrong, the entire cluster has to wait for the slowest machine to finish its computation before moving on to the next part of the simulation.

In a multicore chip, however, communication between cores, and between cores and memory, is much more efficient. So the researchers’ system can break a simulation into much smaller chunks, which it loads into a queue. When a core finishes a calculation, it simply receives the next chunk in the queue. That also saves the system from having to estimate how long each chunk will take to execute. If one chunk takes an unexpectedly long time, it doesn’t matter: The other cores can keep working their way through the queue.
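The queue-based scheduling described above can be sketched in a few lines. This is an illustrative toy, not the MIT team's actual code: `simulate_chunk` is a hypothetical stand-in for one small piece of the CFD domain, and the chunk and worker counts are made up. The point is only that workers pull chunks on demand, so no one has to predict how long each chunk will take.

```python
import queue
import threading
import time

def simulate_chunk(chunk_id):
    # Hypothetical stand-in for computing one small piece of the
    # domain; runtime deliberately varies from chunk to chunk.
    time.sleep(0.001 * (chunk_id % 5))
    return chunk_id

def worker(work_queue, results):
    # Each "core" pulls the next chunk as soon as it finishes one,
    # so an unexpectedly slow chunk never stalls the other workers.
    while True:
        try:
            chunk = work_queue.get_nowait()
        except queue.Empty:
            return
        results.append(simulate_chunk(chunk))

work_queue = queue.Queue()
for chunk in range(100):            # many more chunks than cores
    work_queue.put(chunk)

results = []
threads = [threading.Thread(target=worker, args=(work_queue, results))
           for _ in range(8)]       # pretend we have 8 cores
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))                 # every chunk gets processed
```

On a cluster, by contrast, each over-sized chunk is pinned to one machine up front, which is exactly the static assignment the queue avoids.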

The ideas here — programming for the hardware you have (not the hardware you used to have) and work stealing — aren’t new, but I think this work is of interest because it highlights that a 24-core node on a single board is really not the same as a 24-way SMP from years gone by, and as we move to ever larger computers we need to be thinking again about parallelism at all the scales available to us. Multicore chips are common enough now that an application that uses MPI between nodes and something else for the finer-grained parallelism within a node doesn’t introduce deal-breaking portability issues and orphan forks of the source tree.

Comments

  1. I find it very hard to believe that so many people picked up on this story. At least John correctly identifies that THIS TECHNIQUE IS NOT NEW. Not even close. It actually offends me a bit that this “technique” is being paraded around as a new and/or multicore-specific technique (Dr. Dobbs, HPC Wire, etc.). It’s neither.

    Work sharing is so not new that it was old when I was a grad student (and *that* was a long time ago).

    /me gets off my soap box…

  2. John West says:

    Yes, Jeff, you are 100% right. I covered it mostly because SO MANY other people did. I felt like the story needed to be acknowledged here, and I did use it as a chance to get on my own soap box re SMP vs multicore.

  3. I’ll just point out that this is not an original article written by Dr. Dobb’s. Rather, it is a press release issued by MIT (see http://web.mit.edu/newsoffice/2010/multicore-0426.html). The author, Larry Hardesty, works for the MIT news office, though his affiliation isn’t noted in the article published at DD.

  4. John West says:

    Brian – thanks. I tried to word it such that I didn’t claim it as original content for DD, because they have a bad habit of running press releases without noting them as such. I appreciate you keeping us honest.

