Whatever the purpose of an HPC system – from running diverse science tasks to crunching through the same CFD code 24×7 as a single-purpose machine – fundamentally it is nothing without the applications that use it.
The planes and engines that fly us safely, the lives changed by drug discovery, the races won, the scientific knowledge created – all stem from real software in the hands of real everyday users. Software determines what will be achieved from the finite resource that is the system.
Allinea Software is passionate about enabling better and more capable software for HPC. Our unified debugging and performance tools, Allinea DDT and Allinea MAP, make a difference to developers of HPC software every day – saving them time and helping them to create faster and more scalable codes.
But how do you check that, in production, software running on an HPC system is using it efficiently? It is easy to confuse utilization with efficiency – but they are not the same.
We cheer system load graphs and thank resource managers and eager users when we see more than 95% of core hours used – and yet within those core hours, how much do we know about the quality of that utilization?
Are applications stalling on file I/O? Spending more time in MPI communication than real work? Using the processor’s vector units – or missing out on over 90% of the available FLOP/s rate?
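To make the gap between utilization and efficiency concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `JobBreakdown` type, the `useful_core_hours` helper and the numbers are hypothetical, not measurements from any real system or output from any tool.

```python
from dataclasses import dataclass


@dataclass
class JobBreakdown:
    """Hypothetical split of a job's run time, as fractions that sum to 1."""
    compute: float  # time spent doing real work
    mpi: float      # time spent in MPI communication
    io: float       # time spent stalled on file I/O


def useful_core_hours(allocated: float, utilization: float, job: JobBreakdown) -> float:
    """Core hours spent computing, rather than waiting on MPI or file I/O."""
    total = job.compute + job.mpi + job.io
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return allocated * utilization * job.compute


# Illustrative numbers only: a system that is 95% "used",
# running a job that computes for 40% of its run time.
job = JobBreakdown(compute=0.40, mpi=0.35, io=0.25)
allocated = 10_000.0
busy = allocated * 0.95
useful = useful_core_hours(allocated, 0.95, job)

print(f"busy core hours:   {busy:.0f}")
print(f"useful core hours: {useful:.0f}")
```

With these made-up figures the load graph looks healthy, yet well under half of the allocated core hours go to computation, and that is before asking whether the compute time itself uses the vector units well.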
We talked to a number of HPC centers and companies using HPC about what really happens inside jobs running on their systems. We found that many did not have an answer – but all wanted one.
“If you cannot measure it, you cannot improve it,” Lord Kelvin said, some 130 years ago.
Until now, measurement has been hard to do. Each application required expert profiling, frequently with lengthy phases of source code instrumentation, recompilation and investigation.
At Allinea Software, we have just made the task easier through the release of Allinea Performance Reports. This tool provides a one-page HTML report for a job – collecting, analyzing and reporting the key metrics that affect performance during a normal run. It can be used without changing either the source code or the application – removing the barriers and opening access to everyone.
For a system owner or sponsor, the report helps target user support, user access, code development, configuration or hardware and system changes – enabling more effective, efficient core hours.
For a system’s users, their budget – their core-hour allocation – limits the simulations or scientific results that can be achieved. Allinea Performance Reports informs the choices that can maximize the outcomes within that limit.
To us it seems vital to have good access to such previously hard-to-get information – it lets the critical analysis needed for improvement flow.
To paraphrase a popular maxim: if we want to out-compete through out-compute, we must also smart-compute.
About the Author: David Lecomber is CEO and Founder of Allinea Software.