During last month’s PRACE Days in Dublin – where I enjoyed talks on improvements in codes and methods in areas as diverse as CFD, RTM in geophysics, and genomics – I saw once again that “hero” performance improvements happen, and happen regularly.
Today’s processor, memory and I/O characteristics are complex, but there are often quick wins out there. There’s a buzz when you hear results like “2x speedup in just one afternoon” or “25% faster in three days” at conferences. The tragedy is that applications can run inefficiently for months before someone looks at their performance.
One stand-out talk at PRACE Days described a group’s work that led to a 6-fold speedup in a mission-critical industrial code. We were honored to see they chose Allinea Performance Reports to both guide their work and explain the results to the audience.
Their formula is simpler than a PDE: observation plus effort equals performance improvement.
Observation comes first: by benchmarking and characterizing applications (“observation”) we determine the right “effort” and find the wins faster.
- Is I/O the bottleneck? Applications often have sub-optimal data access patterns or make incorrect assumptions about file systems. Nothing beats doing less I/O, or doing it better.
- Are MPI or OpenMP throwing away half of your cycles? Consider how to run the same application differently or put development time into optimization.
- Is memory bandwidth your problem? Main memory is an order of magnitude or more slower to access than L1 cache. Can you restructure your memory access to improve it?
- Is compute using the AVX unit? A 256-bit AVX register holds four double-precision values, so vectorized code can deliver up to 4x the double-precision performance of scalar instructions.
That’s what we created Allinea Performance Reports to find out. It’s our low overhead, zero-configuration performance tool that measures and characterizes applications with no fuss.
It also gives guidance on the next steps to improve performance. For example, don’t fix the vectorization just yet if your real problem is higher up the list – it won’t help – you’ll just make your processor hungry for longer!
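The warning about fixing vectorization too early can be made concrete with Amdahl's law. The sketch below is only an illustration: the time breakdown (20% compute, 50% MPI, 30% I/O) is a made-up profile, not a measurement from any real application or from Performance Reports, and the function name is hypothetical.

```python
def overall_speedup(fraction_improved: float, local_speedup: float) -> float:
    """Amdahl's law: whole-run speedup when only one part of the runtime gets faster.

    fraction_improved: share of total runtime spent in the part being optimized (0..1)
    local_speedup:     how much faster that part becomes (e.g. 4.0 for ideal AVX doubles)
    """
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


# Hypothetical profile (assumed numbers, for illustration only).
profile = {"compute": 0.20, "mpi": 0.50, "io": 0.30}

# Perfect 4x vectorization of the compute part.
vectorized = overall_speedup(profile["compute"], 4.0)

# Merely halving the time lost in MPI.
mpi_halved = overall_speedup(profile["mpi"], 2.0)

print(f"4x faster compute: {vectorized:.2f}x overall")  # 1.18x overall
print(f"2x faster MPI:     {mpi_halved:.2f}x overall")  # 1.33x overall
```

Under these assumed numbers, a modest fix to the dominant cost beats an ideal fix to a minor one, which is why the characterization step comes first.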
A new compiler, a compiler flag (do your users know that -O3 is not always the fastest flag for production code?) or an afternoon hands-on with our profiler Allinea MAP can all be far more effective than hardware changes.
Supercomputing centers such as those at KTH in Sweden or FZ Juelich in Germany have many varied workloads – they are applying Performance Reports to a wide spectrum of applications – discovering those that are running well and those that are not.
Guided to the right applications, scientists and developers can then turn to Allinea Forge, our HPC developers’ tool suite that includes our DDT debugger and MAP profiler.
Performance isn’t just a one-off exercise. Another site has a development team that integrates benchmarking with Performance Reports into its continuous integration and testing regime, so that hard-won improvements are not inadvertently lost!
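A performance gate of this kind can be very small. The following is a minimal sketch, not that site's actual setup and not a Performance Reports API: it assumes the CI job has already collected wall-clock times (in seconds) per benchmark into two dictionaries, and the benchmark names, timings, and 5% tolerance are all invented for illustration.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return benchmarks whose runtime grew by more than `tolerance` (a fraction).

    Both arguments map a benchmark name to its wall-clock time in seconds.
    Benchmarks missing from `current` are skipped rather than treated as failures.
    """
    regressions = {}
    for name, base_seconds in baseline.items():
        new_seconds = current.get(name)
        if new_seconds is None:
            continue
        if new_seconds > base_seconds * (1.0 + tolerance):
            regressions[name] = (base_seconds, new_seconds)
    return regressions


# Hypothetical timings: a stored baseline and the latest CI run.
baseline = {"solver_small": 12.0, "solver_large": 95.0}
current = {"solver_small": 12.3, "solver_large": 110.0}

slow = find_regressions(baseline, current)
for name, (old, new) in slow.items():
    print(f"REGRESSION {name}: {old:.1f}s -> {new:.1f}s")

# A CI job would typically fail the build when `slow` is non-empty,
# for example by exiting with a non-zero status.
```

Here only `solver_large` is flagged: 12.3 s is within 5% of 12.0 s, while 110.0 s is well beyond 5% over 95.0 s.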
HPC-dependent organizations sometimes struggle to focus limited resources on application performance – but who wouldn’t want to leapfrog the competition or increase the research contribution of their organizations immediately?
Quick wins are out there waiting to be found – so give your users the tools, and they’ll finish the job!
David Lecomber is the CEO of Allinea Software.