Dr. Dobb’s is reporting this week on an effort by a team of MIT researchers to parallelize a CFD application simulating flow in an oilfield on a single 24-core node. They observed a 22x speedup on 24 cores (for this application) by taking advantage of the fact that communication between cores on a single node (or die) is usually much faster than communication between nodes.
When such simulations run on a cluster of computers, the cluster’s management system tries to minimize the communication between computers, which is much slower than communication within a given computer. To do this, it splits the model into the largest chunks it can (in the case of the oilfield simulation, the largest geographical regions) so that it has to send them to the individual computers only once. That, however, requires it to guess in advance how long each chunk will take to execute. If it guesses wrong, the entire cluster has to wait for the slowest machine to finish its computation before moving on to the next part of the simulation.
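The cost of guessing wrong is easy to quantify: with one big chunk statically assigned per machine, the step time is set by the slowest chunk, not the average. A small back-of-the-envelope sketch (the chunk timings here are made up for illustration, not from the paper):

```python
# Hypothetical illustration of static partitioning: every machine gets
# one big chunk up front, so the step time is the slowest chunk's time.
chunk_times = [1.0, 1.1, 0.9, 2.6]  # seconds; one chunk was mis-estimated

ideal = sum(chunk_times) / len(chunk_times)  # perfectly balanced step time
actual = max(chunk_times)                    # everyone waits for the straggler

print(f"ideal step time:  {ideal:.2f}s")   # 1.40s
print(f"actual step time: {actual:.2f}s")  # 2.60s
print(f"efficiency: {ideal / actual:.0%}")
```

One mis-estimated chunk drags parallel efficiency from 100% down to roughly 54% in this toy case, which is exactly the failure mode the researchers are working around.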
In a multicore chip, however, communication between cores, and between cores and memory, is much more efficient. So the researchers’ system can break a simulation into much smaller chunks, which it loads into a queue. When a core finishes a calculation, it simply receives the next chunk in the queue. That also saves the system from having to estimate how long each chunk will take to execute. If one chunk takes an unexpectedly long time, it doesn’t matter: The other cores can keep working their way through the queue.
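The queue-based scheme can be sketched in a few lines. This is not the researchers’ code, just a minimal illustration of the idea: break the work into many more chunks than cores, put them in a shared queue, and let each worker pull the next chunk whenever it finishes, so no per-chunk runtime estimate is ever needed.

```python
import queue
import threading

def simulate_chunk(chunk_id):
    # Stand-in for a real CFD kernel; here just a trivial computation.
    return chunk_id * chunk_id

work = queue.Queue()
for chunk_id in range(100):   # many more chunks than cores
    work.put(chunk_id)

results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            chunk_id = work.get_nowait()
        except queue.Empty:
            return            # queue drained; this core is done
        r = simulate_chunk(chunk_id)
        with lock:
            results.append(r)

# Four workers standing in for four cores; a slow chunk delays only
# the core that holds it, while the others keep draining the queue.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # all 100 chunks processed
```

On a real multicore node the chunks would be sized so that queue overhead stays small relative to the compute per chunk; the fine granularity is affordable precisely because on-die communication is cheap.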
The ideas here (programming for the hardware you have, not the hardware you used to have, and work stealing) aren’t new, but I think this work is of interest because it highlights that a 24-core node on a single board is really not the same as a 24-way SMP from years gone by, and as we move to ever larger computers we need to be thinking again about parallelism at all the scales available to us. Multicore chips are common enough now that an application using MPI between nodes and something else for the finer-grained parallelism within a node doesn’t introduce deal-breaking portability issues or orphan forks of the source tree.