Dan Stanzione and Tommy Minyard from the Texas Advanced Computing Center [TACC] posted an article on Dell’s Enterprise Technology Center website about the perils of relying on pure clock frequency for performance comparisons of real applications. As many of you who have followed the semi-annual releases of the Top500 list know, the rankings are based on the High Performance Linpack [HPL] benchmark. While sensitive to issues other than the core silicon, HPL generally delivers a high percentage of peak performance [when configured correctly]. But what about real-world applications? How does this compare to my app?
Given that HPL delivers a pretty good fraction of peak performance on most processors, not surprisingly, higher clock rate has meant higher HPL, higher Top 500 number, and the impression that your new cluster is “faster.” The big gotcha here is the not-so-well-kept-secret in the HPC community that *peak* performance and *real* application performance didn’t really have that much to do with one another, and the performance of HPL did not reflect the ability of a cluster to get work done. This has become especially true with the last generation of new quad-core processors. [Minyard, Stanzione]
Stanzione and Minyard make a very important statement in the body of the article: “HPL doesn’t really suffer too much from inadequate memory bandwidth, so the magnitude of the problem hasn’t been quite as obvious.”
The duo from TACC go on to describe a series of benchmark runs comparing an Intel Harpertown [E5450] clocked at 3.0GHz against an Intel Nehalem [E5550] clocked at 2.66GHz. The benchmark in question was the Weather Research and Forecasting [WRF] model, version 3.1.1 [no specific details on which piece/variant]. Single-core performance between the two processors was reasonably similar. At eight cores, however, the Nehalem was nearly four times faster than the [higher-clocked] Harpertown.
So why the disparity? It’s the operand, not the operation! The next comparison made by the duo involves the STREAM memory bandwidth benchmark. At this point, it’s clear who wins the memory argument and is eventually crowned king of performance [in these tests].
Minyard and Stanzione do a great job of laying out a simple example of why clock speed is no longer the cat’s pajamas. One thing I might add to this evaluation is the latency of an individual memory operation. Remember, the Harpertown was a classic Front-Side-Bus [FSB] architecture, with a single memory controller handling requests from all devices. The Nehalem is a System-on-Chip [SoC] style architecture with a memory controller integrated on each die. Long story short, the Harpertown has to work twice as hard to cover the latency of any given main memory operation.
For more info and some great graphs, check out their full writeup here.