In this special guest feature from Scientific Computing World, Adrian Giordani asks what benchmarks should be applied as the nature of supercomputing changes on the way to exascale.
In July, the latest edition of the Top500 list, which ranks the most powerful supercomputers in the world, will be published. There is little room to doubt that, as in the list published in November 2014, the number one spot will be held by the Tianhe-2 based at the National Super Computer Centre in Guangzhou, China. The system has a theoretical performance that is 419,102 times faster than the fastest systems available when the first Top500 list was published way back in 1993.
The Top500 list, published twice a year, uses the widely accepted Linpack benchmark to monitor the performance of the fastest supercomputer systems. Linpack measures how fast a computer will solve a dense n by n system of linear equations.
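As a rough illustration of what a Linpack-style measurement involves, the sketch below solves a random dense system by Gaussian elimination with partial pivoting and converts the elapsed time into a floating-point rate. It is not the official benchmark code: the function names are hypothetical, the pure-Python solver is far slower than a tuned library, and the operation count of 2/3 n^3 + 2 n^2 is the conventional figure for this kind of solve.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on copies so the caller's data is left untouched.
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: move the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_style_rate(n, seed=0):
    """Return (flops per second, max error) for one n by n solve."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # chosen so the exact solution is all ones
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2  # conventional operation count
    max_error = max(abs(xi - 1.0) for xi in x)
    return flops / elapsed, max_error
```

Calling `linpack_style_rate(200)` gives a single number in flops per second, which is the kind of figure the ranking is built on; note that nearly all the work is arithmetic on data that is reused many times.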
But, the tasks for which high-performance computers are being used are changing and so future computations may not be done with floating-point arithmetic alone. There has been a growing shift in the practical needs of computing vendors and the supercomputing community. Other benchmarks have emerged. The Green500, for example, ranks supercomputers by energy efficiency, measured as Linpack flops per watt.
Nonetheless, the goal for the HPC community is to create an Exaflop machine, and the measure of Exaflop (a billion billion calculations a second) derives from the Linpack way of thinking. The first Exaflop system may be within reach in the next decade, provided novel computing architectures are introduced to deal with power consumption, which is the primary design constraint.
But, calling the system ‘exa-anything’ is a bad idea because it sets the scientific computing community up for perceived failure if Exaflops are not reached, according to Horst Simon, Deputy Director of the US Lawrence Berkeley National Laboratory.
“Exascale will provide the computational power needed to address the important science challenges, but that capability will come at an expense of a dramatic change in architectures, algorithms, and software,” said Jack Dongarra, the man who introduced the original fixed-size Linpack benchmark and one of the first contributors to the Top500 list. Dongarra is currently based at the Innovative Computing Laboratory at the University of Tennessee.
Energy and data movement
Energy efficiency is one of the crucial factors that Dongarra cites. On current trends, the power needed to get to Exascale would be unaffordable. Other factors come into play too, such as movement of data, and new applications, for example to tackle simulations of the human brain.
“We started seeing the end of Moore’s law about ten years ago. It’s a little like the statement: ‘The world will run out of fossil fuels by 2025’,” said John Gustafson, an accomplished expert on supercomputer systems and creator of Gustafson’s law (also known as Gustafson-Barsis’ law) in computer engineering. “That’s not what happens. It just gets more and more expensive to keep going the way we’re going, and the economics will lead us to alternatives.”
If you take Moore’s law, which applies to transistors, it offers no improvements to the old-fashioned wiring connections between components, argues Gustafson. Today, the bottleneck within the architecture is that memory transfer takes longer to complete than a floating-point arithmetic operation — in some cases memory is 2,300 times slower, while transistor logic has improved by a factor of trillions.
“Imagine a power cable as thick as a horse’s leg next to a wire just a millimeter in diameter,” said Gustafson. “It is easy to guess that the power cable consumes maybe a thousand times as much energy as the wire. The ratio is similar between ‘on-chip’ transistor arithmetic connections and the connections that go ‘off-chip’ to memory. I sometimes tell people combining today’s ultra-fast arithmetic units with a typical memory system is like mounting a large V8 gasoline engine on a tricycle and expecting it to be a high-performance vehicle.” Running costs for systems associated with this bottleneck inevitably increase.
High bandwidth and low latency
In addition to this problem of data-transfer latency, applications that require more complex computations have become more common. These calculations require high bandwidth, low latency, and data access using irregular patterns – something Linpack cannot test.
In 2012, the team behind the Cray Blue Waters system at the National Center for Supercomputing Applications, University of Illinois in Urbana-Champaign, US, declined to submit an entry to the Top500 list. Blue Waters Project Director Bill Kramer said that the benchmark did not give an indication of real sustained performance and value, and was perhaps doing a disservice to the community.
In a sense the issue is that no single computational task can ever reflect the overall complexity of applications that run within a supercomputer architecture. Jack Dongarra is well aware of the imbalance that is being created and the need to address today’s data-intensive applications. An alternative benchmark he has proposed could better compare computation to data-access patterns. In 2013, Dongarra talked about a new benchmark called the ‘High Performance Conjugate Gradient’ (HPCG), which is intended to match the behaviour of applications that solve differential equations.
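To give a feel for the kind of computation such a benchmark is built around, here is a minimal sketch of the conjugate gradient method applied to a small sparse system. It is a simplification and not the HPCG code itself: it uses a one-dimensional stencil and no preconditioner, and the function names are hypothetical. The point to notice is that each iteration is dominated by a sparse matrix-vector product and vector updates, which touch a lot of memory for very little arithmetic.

```python
def laplacian_matvec(v):
    """Apply the 1D stencil [-1, 2, -1] without storing the matrix."""
    n = len(v)
    out = [0.0] * n
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out[i] = 2.0 * v[i] - left - right
    return out


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A given as a matvec."""
    x = [0.0] * len(b)
    r = b[:]  # residual b - A x, with the initial guess x = 0
    p = r[:]  # first search direction
    rs_old = dot(r, r)
    for iteration in range(1, max_iter + 1):
        ap = matvec(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:
            return x, iteration
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter
```

For example, `conjugate_gradient(laplacian_matvec, laplacian_matvec([1.0] * 50))` should recover a vector of ones to within rounding error.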
“HPCG is getting people to see that ‘real applications’ in general fall far short of the peak performance or the Linpack performance numbers,” said Dongarra.
“This was well known before this test, with many papers available on the subject. HPCG is trying to catch up to what is known,” said Kramer.
“But, since HPCG is not a real application it cannot speak for them; it is just a test of other architectural features, and it is not clear whether it is proportional to application performance overall,” said Kramer.
Dongarra hopes there will be an effort to optimize both hardware and algorithms to work better together.
Kramer said that HPCG is an important step forward to improve the benchmarking situation, but it is insufficient as a single measure of extracting meaning about system and application performance.
Performance evaluation experts describe a term called proportionality, which means if the metric increases by 20 per cent, then the real performance of the system should increase by a similar proportion. This cannot be represented by a single test.
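The idea of proportionality can be written down as a simple check. The sketch below is only an illustration: the function names and the five-percentage-point tolerance are assumptions, not part of any published methodology.

```python
def relative_change(old, new):
    """Fractional change from old to new, e.g. 0.2 for a 20 per cent gain."""
    return (new - old) / old


def is_proportional(metric_old, metric_new, app_old, app_new, tolerance=0.05):
    """True if the benchmark gain and the application gain roughly agree.

    The tolerance is an assumed threshold, expressed as a fraction.
    """
    metric_gain = relative_change(metric_old, metric_new)
    app_gain = relative_change(app_old, app_new)
    return abs(metric_gain - app_gain) <= tolerance
```

A benchmark score that rises from 100 to 120 while application throughput rises from 50 to 60 passes the check; if throughput only reaches 51, it fails.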
“My attitude hasn’t changed: Linpack is useful as a diagnostic, but it’s a very limited indicator of supercomputer performance, particularly within the rules of the Top500 list. A metric has to represent complex components for applications; think of a better metric that comprises enough tests to represent all the major algorithms and uses, such as a composite measure. This will benefit the community,” said Kramer.
A composite metric would test 7 to 12 application characteristics, for example, and is a more accurate and representative measure, according to Kramer. These tests generate representative numbers which would feed into an overall metric of how well the software problems are solved.
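One way such a composite could be assembled is sketched below. It assumes that each test is scored relative to a reference system and that the scores are combined with a geometric mean, a common choice for benchmark suites because no single test can dominate the result. The combining rule, the function name and the test names in the usage note are all assumptions for illustration, not Kramer’s actual method.

```python
import math


def composite_score(measured, reference):
    """Geometric mean of per-test performance ratios against a reference system.

    Both arguments map a test name to a rate where higher is better.
    """
    if set(measured) != set(reference):
        raise ValueError("measured and reference must cover the same tests")
    ratios = [measured[name] / reference[name] for name in measured]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

For instance, `composite_score({"dense_solve": 2.0, "sparse_solve": 8.0}, {"dense_solve": 1.0, "sparse_solve": 1.0})` gives a score of about 4: the system is twice as fast on one hypothetical test and eight times as fast on the other.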
The Standard Performance Evaluation Corporation (SPEC) benchmark is one such metric that enables computer scientists to understand the in-depth behaviour of a system as a whole, when running an application. Kramer recalled a recent presentation in which NASA used SPEC performance to get a realistic idea of how their systems perform.
However, due to SPEC’s fixed-size approach and the way it was originally designed for single-processor workstations, it cannot maintain a single performance database stretching back over two decades for performance trends.
The Sustained Petascale Performance (SPP) metric is another tool used on the Cray Blue Waters system by Kramer and his colleagues so they can get a more detailed understanding of each application’s performance, workload and the overall sustained performance of the entire system.
If a new benchmark is introduced that is drastically different from today’s, it is understandable that many will hesitate to give the current one up and invest in a brand new benchmark. It would take decades to build up a comparable record of results.
“The funds for supercomputers are usually provided by governments who want to show a prominent place in computing, as measured by the Top500 ranking. Even though the procurement people often know better, they have little choice but to pursue a system that is cost-effective for Linpack, but very cost-ineffective for everything else,” said Gustafson.
More radical approaches
Gustafson’s approach is to ‘boil the ocean’ and get people to change the way they think about the purpose of computing, and how they measure results, speed, and the quality of answers obtained. Resources could be focused on increasing the communication speed between servers, or within the server, instead. Then the speed of real applications would increase proportionately.
Gustafson has two potential solutions, one of which is the Fast Fourier Transform (FFT) benchmark, which could be used as part of a larger composite metric as Kramer describes. FFT has historical data that goes back to the 1960s and is scalable. Unlike Linpack, it is communication-intensive and more representative of real, present-day workloads. But, the catch is someone has to do the hard work of mining all the historical FFT benchmark results to create a useable database comparable to the Top500.
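A minimal sketch of an FFT-based measurement is shown below. It uses a textbook recursive radix-2 transform and the conventional nominal count of 5 n log2 n floating-point operations to turn elapsed time into a rate. The function names are hypothetical and this is not the benchmark Gustafson has in mind; on a real machine the interesting part is the data shuffling between stages, which becomes communication once the data is spread across nodes.

```python
import cmath
import math
import time


def fft(x):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_rate(n):
    """Nominal flops per second for one transform of length n (n >= 2)."""
    data = [complex(math.sin(i), math.cos(i)) for i in range(n)]
    start = time.perf_counter()
    fft(data)
    elapsed = time.perf_counter() - start
    return 5.0 * n * math.log2(n) / elapsed
```

Because the operation count grows only slightly faster than the data size, the rate reported by `fft_rate` depends heavily on how quickly data can be moved, in contrast to a dense solve.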
A more fundamental issue according to Gustafson is that Linpack is a physicist’s measure: great for testing numerical calculation capability, which is what physicists wanted supercomputers for originally. But, science’s shift to a more data-centric approach in its computing needs has also been mirrored by the rise of commercial ‘Big Data’.
In his book The End of Error: Unum Computing The Future of Numerical Computing, published by CRC Press in February this year, Gustafson advocates a radically different approach to computer arithmetic: the universal number (unum). The unum encompasses all IEEE floating-point formats as well as fixed-point and exact integer arithmetic. Gustafson believes this new number type is a ‘game changer’. It obtains more accurate answers than floating-point arithmetic yet uses fewer bits in many cases, saving memory, bandwidth, energy, and power.
Efficiency improvements can be anywhere between 1.2 and 8 times depending on the application, according to Gustafson. A unum library created in C, C++ or Python could provide vendors with access to these benefits.
Such ideas could nudge the benchmark measuring system for supercomputers and their applications on to a very different track. Measuring a supercomputer’s speed is going to remain a challenge for some time to come.