TACC On Memory Performance in a Cluster


Dan Stanzione and Tommy Minyard from the Texas Advanced Computing Center [TACC] posted an article on Dell’s Enterprise Technology Center website about the perils of relying on pure clock frequency for performance comparisons of real applications.  As many of you undoubtedly know from the twice-yearly Top500 releases, the rankings are based on the High Performance Linpack [HPL] benchmark.  While sensitive to issues other than the core silicon, HPL generally delivers a high percentage of peak performance [when configured correctly].  But what about real-world applications?  How does this compare to my app?

Given that HPL delivers a pretty good fraction of peak performance on most processors, not surprisingly, higher clock rate has meant higher HPL, higher Top 500 number, and the impression that your new cluster is “faster.” The big gotcha here is the not-so-well-kept-secret in the HPC community that *peak* performance and *real* application performance didn’t really have that much to do with one another, and the performance of HPL did not reflect the ability of a cluster to get work done. This has become especially true with the last generation of new quad-core processors. [Minyard, Stanzione]

Stanzione and Minyard make a very important statement in the body of the article: “HPL doesn’t really suffer too much from inadequate memory bandwidth, so the magnitude of the problem hasn’t been quite as obvious.”
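
Here is a quick way to see why.  HPL spends most of its time in DGEMM, which performs roughly 2n^3 floating-point operations while touching only about 3n^2 doubles, so the flops-per-byte ratio grows with the matrix size and the caches keep the memory system largely out of the picture.  Stencil-heavy codes like WRF, by contrast, perform only a handful of flops per value they stream from memory.  A back-of-the-envelope sketch [the matrix size and the flops-per-point figure below are illustrative assumptions on my part, not numbers from the TACC article]:

    #include <stdio.h>

    int main(void) {
        /* DGEMM, the core of HPL: ~2*n^3 flops over ~3*n^2 doubles touched */
        double n = 10000.0;                      /* illustrative matrix size */
        double dgemm_intensity = (2.0 * n * n * n) / (3.0 * n * n * 8.0);

        /* A simple stencil update, WRF-like: a few flops per point streamed */
        double stencil_flops = 4.0;              /* illustrative assumption  */
        double stencil_bytes = 2.0 * 8.0;        /* one load + one store     */
        double stencil_intensity = stencil_flops / stencil_bytes;

        printf("DGEMM   : %7.1f flops/byte\n", dgemm_intensity);
        printf("Stencil : %7.1f flops/byte\n", stencil_intensity);
        return 0;
    }

With hundreds of flops available per byte moved, DGEMM can run near peak even on a starved memory system; a low-intensity kernel cannot, no matter how fast the clock is.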

The duo from TACC go on to describe a series of benchmark runs comparing an Intel Harpertown [E5450] clocked at 3.0 GHz against an Intel Nehalem [X5550] clocked at 2.66 GHz.  The benchmark in question was the Weather Research and Forecasting [WRF] model, version 3.1.1 [no specific details given on which case/variant].  Single-core performance was reasonably similar between the two processors.  At eight cores, however, the Nehalem was nearly four times faster than the [higher-clocked] Harpertown.

So why the disparity?  It’s the operand, not the operation!  The next comparison the duo makes involves STREAM memory bandwidth.  At that point it’s clear who wins the memory argument and, in these tests, gets crowned king of performance.
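
For anyone who hasn’t run it, STREAM measures sustainable memory bandwidth with a handful of simple vector kernels.  Below is a minimal sketch of the triad kernel only [the array size, timing approach, and OpenMP parallel loop are my assumptions; the official benchmark adds repetitions, validation, and the other three kernels]:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 20000000L   /* 160 MB per array: large enough to spill every cache */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double scalar = 3.0;
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)          /* triad: a = b + scalar * c */
            a[i] = b[i] + scalar * c[i];
        double t1 = omp_get_wtime();

        /* three arrays of 8-byte doubles cross the memory bus */
        double gbytes = 3.0 * 8.0 * (double)N / 1.0e9;
        printf("Triad bandwidth: %.1f GB/s (a[0] = %.1f)\n",
               gbytes / (t1 - t0), a[0]);

        free(a); free(b); free(c);
        return 0;
    }

Build it with something like gcc -O2 -fopenmp, run with one thread per core on each box, and the gap you see tracks the WRF gap far better than the clock speeds do.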

Minyard and Stanzione do a great job of laying out a simple example of why clock speed is no longer the cat’s pajamas.  One thing I might add to this evaluation is the individual latency of a memory operation.  Remember, the Harpertown was a classic Front-Side-Bus [FSB] architecture, with a single memory controller handling requests from all devices.  The Nehalem is a System-on-Chip [SOC] style design with an integrated memory controller on each die.  Long story short, the Harpertown has to work twice as hard to cover the latency of any given main memory operation.
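
Latency is also easy to measure yourself with a pointer chase: walk an array in a randomized single cycle so the hardware prefetchers cannot help, forcing every load to wait for the one before it.  A minimal sketch, with the working-set size and the random-cycle construction as my own assumptions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (32L * 1024 * 1024)   /* 256 MB of pointers: far larger than any cache */

    static unsigned long long rng = 88172645463325252ULL;
    static unsigned long long xorshift64(void) {  /* small PRNG, avoids RAND_MAX limits */
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        if (!next) return 1;

        /* Sattolo's shuffle: a random permutation with one long cycle,
           so the chase below visits every element exactly once. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)(xorshift64() % i);
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (size_t i = 0; i < N; i++)   /* each load depends on the previous one */
            p = next[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1.0e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("Average load latency: %.1f ns (ended at %zu)\n", ns / (double)N, p);

        free(next);
        return 0;
    }

On an FSB machine every one of those loads makes a round trip through a shared northbridge, with all the cores on the node queueing behind it; with an on-die memory controller the path is shorter and the queue is split per socket.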

For more info and some great graphs, check out their full writeup here.

Comments

  1. Robert L says

    Weird. Did TACC fall asleep two years ago and just wake up last week? Intel was giving talks in its booth at SC’08 showing that WRF was 3x faster on Nehalem than on equivalently clocked Harpertown.

    My question is that with core counts continuing to climb, and without another quantum leap in memory bandwidth in sight, will we just find ourselves back in the Harpertown scenario in a year or two? A paper answering *that* question would be a paper worth reading.

  2. Robert L, the answer to your question is not 100% clear, but it falls much closer to ‘YES’.  There are plenty of folks who preach “bandwidth, bandwidth, bandwidth,” and they are right.  You’re also correct that there don’t seem to be any quantum leaps in memory bandwidth coming down the pipe in the near future.  However, as bandwidth continues to creep up, we need to remember the ‘other’ memory performance factor: latency.  The latency required to fetch a single cache line from DRAM is quite high [compared to cache and registers], and it has not improved over time at anything like the rate that transistor density has [e.g., Moore’s Law].  The moral of the story is that we need to spend more time focusing on the memory performance of applications and architectures.  That story has two main chapters: memory bandwidth and memory latency [covering latency with outstanding operations, and reducing it with new technologies].

    Do a Google search on 3D stacked DIMM technology; it is an interesting development that might provide a glimmer of hope.