Are benchmarks really useful?

Quick! Name the fastest supercomputer in the world. According to the TOP500 list, it’s IBM’s BlueGene/L. That ranking is determined by a benchmark known as High-Performance Linpack, which times the solution of a large, dense system of linear equations. But one has to wonder whether this is an accurate reflection of the performance a computer delivers in day-to-day applications. The benchmark does not measure availability or fault tolerance, nor the machine’s ability to handle irregular problems whose poor locality causes frequent cache misses, nor its performance on non-numerical applications in parallel or distributed systems.
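
To make the flavor of a Linpack-style measurement concrete, here is a minimal sketch (not the actual HPL code) that times a naive dense Gaussian elimination solve and converts the elapsed time into a rough FLOP rate. The problem size, the lack of pivoting, and the serial, unblocked algorithm are all illustrative assumptions.

```c
/* Illustrative Linpack-style measurement: time a dense solve and report
   a FLOP rate. This is NOT the HPL benchmark; it is a simplified sketch. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024  /* illustrative problem size */

int main(void) {
    static double a[N][N], b[N], x[N];

    /* Build a diagonally dominant system so no pivoting is needed. */
    srand(42);
    for (int i = 0; i < N; i++) {
        b[i] = 1.0;
        for (int j = 0; j < N; j++)
            a[i][j] = (i == j) ? (double)N : (double)rand() / RAND_MAX;
    }

    clock_t start = clock();

    /* Forward elimination. */
    for (int k = 0; k < N - 1; k++) {
        for (int i = k + 1; i < N; i++) {
            double m = a[i][k] / a[k][k];
            for (int j = k; j < N; j++)
                a[i][j] -= m * a[k][j];
            b[i] -= m * b[k];
        }
    }

    /* Back substitution. */
    for (int i = N - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < N; j++)
            s -= a[i][j] * x[j];
        x[i] = s / a[i][i];
    }

    double secs  = (double)(clock() - start) / CLOCKS_PER_SEC;
    double flops = 2.0 / 3.0 * (double)N * N * N;  /* dominant term of the operation count */
    printf("n = %d: %.3f s, %.2f GFLOP/s\n", N, secs, flops / secs / 1e9);
    return 0;
}
```

Note what the sketch rewards: a dense, regular, floating-point-heavy workload with predictable memory access, which is precisely why a high Linpack number says little about irregular or non-numerical applications.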

Other benchmarks measure the performance of a machine’s middleware. The Intel MPI Benchmarks are the standard for evaluating message-passing bandwidth and latency, whereas SPEC’s OpenMP Benchmark Suite investigates shared-memory (SMP) performance.
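
The spirit of such a middleware measurement can be sketched with a simple MPI ping-pong (in the style of, but not taken from, the Intel MPI Benchmarks): two ranks bounce a message back and forth, and the round-trip time yields per-message time and bandwidth estimates. The message size and repetition count below are illustrative; real suites sweep sizes from a few bytes (to expose latency) up to megabytes (to expose bandwidth).

```c
/* Illustrative MPI ping-pong sketch: run with at least two ranks,
   e.g. mpirun -np 2 ./pingpong. Not the Intel MPI Benchmarks themselves. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    const int reps  = 1000;
    const int bytes = 1 << 20;          /* 1 MiB message, illustrative */
    char *buf = malloc(bytes);
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        double rtt = elapsed / reps;    /* average round-trip time */
        printf("one-way time ~ %.1f us, bandwidth ~ %.2f MB/s\n",
               rtt / 2 * 1e6, 2.0 * bytes * reps / elapsed / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```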

Then there are benchmarks that essentially bundle existing ones to present a more “balanced” picture, such as IPACS and HPC Challenge.

But all of these bring us back to the original question: do they really tell us anything worth knowing? A similar question can be asked of software testing. As much as many quality-assurance departments would hate to admit it, testing can only reveal the presence of bugs; it cannot prove that a program is bug-free. Similarly, benchmarks cannot prove that one system is truly better than another.

The software-testing analogy does, however, show where benchmarks are useful: they can tell system designers and customers when something is wrong with the machine. That is, if the benchmark results of a newly built system deviate greatly from what was expected, the designers know that something is wrong, either in the architecture or in the implementation. Benchmarks thus do serve a purpose, in that they can point out when a newly created system falls short of its intended performance.