Optimizing HPC Code with Roofline Analysis

In this special guest feature, James Reinders describes why roofline estimation is a great tool for code optimization in HPC.

James Reinders, parallel programming enthusiast

Roofline Analysis is a technique that brings realism to our optimization targets. It tells us when we have tuned all we can (assuming evolution of our code), which may uncover the unsettling fact that we need a new algorithm (revolution).

As a long-time teacher of optimization techniques, I can confidently say that Roofline analysis is a must-have for anyone optimizing for performance. This has not always been the case. As I will explain, today it is an important technique to draw upon when doing performance optimization.

When I mention Roofline Analysis, I am often asked ‘Hasn’t that been around a while?’, usually followed by ‘What’s new?’

Excellent questions. The answers revolve around two factors:

  1. complexities (latency hiding through parallelism and memory hierarchies) in optimizing for today’s processing architectures – including CPUs, GPUs, and accelerators of all kinds,
  2. new tools, based on new research, to help us deal with these complexities.

In the face of increasingly complicated systems, Roofline Analysis provides us with a step-by-step method to ascertain whether an algorithm has reached the end of its ability to provide more performance through continued optimization work.

Complexities in optimizing for today’s systems

Today we are faced with a great diversity of compute devices, ranging from Intel Xeon Scalable processors and GPUs to more application-specific accelerators enabled by FPGA and ASIC technologies.

It’s not the diversity that demands Roofline analysis; it’s the complexity of the architectures of the individual devices. Specifically, it is their elaborate ability to hide latencies, which depends on sophisticated parallel compute capabilities and multilevel memory subsystems. Years ago, performance optimization was successful if we could reduce the number of instructions being executed. Such optimizations were nearly always rewarded by performance improvements. That is not the case today. Fortunately, Roofline analysis addresses these complications in optimization work.

New tools, new research, how to cope

The technique of Roofline analysis has recently seen a surge in study, resulting in some interesting papers and tutorials. Throughput optimization techniques tend to be effective everywhere. Therefore, tuning investments using roofline analysis done on an Intel Xeon Scalable processor-based server, where the development environments are rich and mature, will lead to optimizations that help other compute devices. We can choose whatever environment with which we are most comfortable, and wherever a tool happens to run best, to get the most important tuning work done to improve throughput.

When roofline confirms our fears (but reduces futile optimization attempts)

Roofline analysis can hint that we should find a new algorithm in two ways:

(1) It reveals that the arithmetic intensity (AI) is low, so the peak capabilities of the machine are not well utilized. When the current approach cannot be optimized further in critical parts of our application, we may need to find an algorithm that can get closer to peak performance.

(2) It reveals that AI is high, but performance falls short of what we need, want, or believe should be possible. If we are already close to a machine’s peak performance, only an algorithmic change can give us better performance on that machine.

If this seems a bit circular, that is because it is. When we have low AI, we seek to make it high, through algorithmic change if optimization is not possible. No matter how we reach high AI, we then face the need for an algorithm change to go further.
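To make the two cases concrete, here is a minimal Python sketch of the standard roofline bound: attainable performance is the smaller of the compute roof and the memory roof (AI multiplied by peak memory bandwidth). The function names, the 80% "near the roof" threshold, and the machine numbers (1000 GFLOP/s, 100 GB/s) are illustrative assumptions. They are not taken from Intel Advisor or from any particular processor.

```python
def attainable_gflops(ai, peak_gflops, peak_bw_gbs):
    """Roofline bound in GFLOP/s for a kernel with arithmetic intensity
    `ai` (FLOPs per byte moved to or from memory)."""
    return min(peak_gflops, ai * peak_bw_gbs)


def ridge_point(peak_gflops, peak_bw_gbs):
    """AI (FLOP/byte) at which the memory roof meets the compute roof."""
    return peak_gflops / peak_bw_gbs


def classify(ai, measured_gflops, peak_gflops, peak_bw_gbs, near=0.8):
    """Return (limiter, hint) for one kernel.

    `near` is an assumed threshold: the fraction of the roofline bound
    at which we consider the kernel to have reached its roof.
    """
    bound = attainable_gflops(ai, peak_gflops, peak_bw_gbs)
    limiter = "memory" if ai < ridge_point(peak_gflops, peak_bw_gbs) else "compute"
    if measured_gflops / bound < near:
        hint = "tune"           # headroom remains under the current roof
    elif limiter == "memory":
        hint = "raise-ai"       # case (1): low AI, at the memory roof
    else:
        hint = "new-algorithm"  # case (2): high AI, at the compute roof
    return limiter, hint


if __name__ == "__main__":
    # Hypothetical machine: 1000 GFLOP/s peak compute, 100 GB/s bandwidth.
    PEAK, BW = 1000.0, 100.0
    for ai, measured in [(0.5, 10.0), (0.5, 45.0), (20.0, 950.0)]:
        print(ai, measured, classify(ai, measured, PEAK, BW))
```

On this hypothetical machine the ridge point is 10 FLOP/byte. A kernel at 0.5 FLOP/byte can never exceed 50 GFLOP/s however well it is tuned, which is the situation described in case (1).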

Being told we need to rewrite using a new algorithm is not necessarily welcome news. The good news about the Roofline analysis technique is that it clarifies for us whether these needs are truly present. Knowing that can prevent a lot of time vainly spent seeking optimizations that simply do not exist. An example of this is ‘reducing cache misses’. Specific ‘stall’ event monitoring counters (emon counters) added to Intel processors (with Intel Xeon Scalable processors offering the greatest support in quantity and diversity) allow tools to find the cache misses that are actually causing delays (stalls) and therefore causing lower AI.

Roofline analysis can incorporate stall information into its technique, helping us avoid chasing optimizations that do not improve performance. I cannot overstate how valuable this is!

Intel automated much of the tedious work in doing a Roofline analysis

Intel has implemented Roofline analysis as a feature of its Intel Advisor tool (free versions available) so we can explore our own applications and get concrete feedback on application-specific bottlenecks.

This sophisticated and easy-to-use instrumentation relies on strong support for stall accounting present in Intel processors, with the broadest capabilities being in the Intel Xeon Scalable processors found in servers and supercomputers.

I highly recommend the variety of reading material available from Berkeley Lab, as well as the Intel Advisor tool, which comes with some excellent tutorials on its usage.

James Reinders is a parallel programming and HPC expert with more than 27 years’ experience working for Intel until his retirement in 2017. Reinders is the author of eight books in the HPC field in addition to numerous papers and blogs.

This story appears here as part of a cross-publishing agreement with Scientific Computing World.
