A New Way to Visualize Performance Optimization Tradeoffs

Sponsored Post

Intel Advisor, one of the powerful tools in Intel Parallel Studio XE 2018, can help identify portions of code that are good candidates for parallelization and vectorization. It can also be used to determine where parallelizing a section of code might not be appropriate for the specific platform, processor, and configuration it’s running on.

A valuable feature of Intel Advisor is its Roofline Analysis (1), which provides an intuitive and powerful visualization of actual performance measured against hardware-imposed performance ceilings. Roofline analysis helps answer the following questions:

  • Is the application running optimally on the underlying hardware? If not, what hardware resources are being underutilized?
  • What is limiting application performance – is the application memory- or compute-bound?
  • What is the right strategy to improve application performance?

Intel Advisor Roofline Analysis creates a chart that plots an application’s achieved floating-point performance and arithmetic intensity against the processor’s maximum achievable performance:

  • Arithmetic intensity (x axis) is measured in floating-point operations (FLOPs) per byte transferred between the CPU/VPU and memory, counted from the actual loop or function code (see the rough sketch after this list)
  • Floating-point performance (y axis) is measured in billions of floating-point operations per second (GFLOPS)
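
To make these two quantities concrete, here is a rough, hand-counted sketch for a simple triad loop. The FLOP and byte counts are illustrative assumptions, not Intel Advisor’s instrumented measurements, which count these values for you:

```c
#include <stdio.h>
#include <stddef.h>

/* Simple triad: a[i] = b[i] + s * c[i]
 * Per iteration, in double precision and assuming no cache reuse:
 *   FLOPs : 2   (one multiply + one add)
 *   Bytes : 24  (load b[i], load c[i], store a[i]; 8 bytes each)
 * Arithmetic intensity = 2 / 24 ≈ 0.083 FLOPs per byte.          */
void triad(double *a, const double *b, const double *c, double s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}

int main(void)
{
    double a[4], b[4] = {1, 2, 3, 4}, c[4] = {5, 6, 7, 8};
    triad(a, b, c, 2.0, 4);

    const double flops_per_iter = 2.0;   /* hand-counted, see above */
    const double bytes_per_iter = 24.0;
    printf("arithmetic intensity = %.3f FLOPs/byte\n",
           flops_per_iter / bytes_per_iter);
    /* GFLOPS for a real run would be (2 * n) / elapsed_seconds / 1e9,
     * with a timer wrapped around the triad() call.                  */
    return 0;
}
```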

The chart plots a dot for each loop or function in the code according to its performance and arithmetic intensity, and looks somewhat like this:

The size and color of each dot on the chart represent relative execution time for each loop or function in the code. Large red dots take the most time, so are the best candidates for optimization. Small green dots take less time, so may not be worth further optimizing. The dots link back into the source code for each loop or function.

The diagonal lines (Memory Cap: L1, Memory Cap: L2, Memory Cap: L3) indicate memory bandwidth limitations preventing loops or functions from achieving better performance without some form of optimization. For example, the L1 Bandwidth roofline represents the maximum amount of work that can get done at a given arithmetic intensity if the loop always hits L1 cache. A loop does not benefit from L1 cache speed if its dataset causes it to miss L1 cache frequently; instead, it is subject to the limitations of the lower-speed L2 cache it is hitting, so the dot representing the loop is positioned somewhere below the L2 Bandwidth roofline.
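
The roofs themselves follow from a simple relationship: at a given arithmetic intensity, attainable performance is capped by the lower of the compute peak and memory bandwidth multiplied by that intensity. Here is a minimal sketch using made-up peak values; Intel Advisor measures the real ceilings for the machine it runs on:

```c
#include <stdio.h>

/* Attainable GFLOPS at a given arithmetic intensity (AI):
 * the lower of the compute peak and bandwidth * AI.        */
static double roofline(double peak_gflops, double bw_gbytes_per_s, double ai)
{
    double memory_bound = bw_gbytes_per_s * ai;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    /* Illustrative numbers only, standing in for measured ceilings
     * (L1/L2/L3/DRAM bandwidth, scalar/vector compute peaks).       */
    double peak    = 100.0;  /* GFLOPS */
    double l1_bw   = 400.0;  /* GB/s   */
    double dram_bw =  60.0;  /* GB/s   */

    for (double ai = 0.05; ai <= 4.0; ai *= 2.0)
        printf("AI %.2f: L1 roof %.1f GFLOPS, DRAM roof %.1f GFLOPS\n",
               ai, roofline(peak, l1_bw, ai), roofline(peak, dram_bw, ai));
    return 0;
}
```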

The horizontal lines (CPU Cap: FMAs, CPU Cap: Vector Add, CPU Cap: Scalar Add) indicate compute capacity limitations preventing loops or functions from achieving better performance without some form of optimization. For example, the Scalar Add Peak represents the peak number of add instructions a scalar loop can execute on this hardware, while the Vector Add Peak represents the peak for a vectorized loop. If a loop is not vectorized, the dot representing it is positioned somewhere below the Scalar Add Peak roofline.
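
As a hedged illustration of the scalar-versus-vector distinction, the loop below may stay scalar if the compiler must assume the pointers overlap; declaring them restrict (or using an OpenMP simd pragma) lets it emit packed vector adds, raising the achievable ceiling from the Scalar Add Peak toward the Vector Add Peak:

```c
#include <stddef.h>

/* Without 'restrict', the compiler may have to assume that a, b, and c
 * alias, which can keep this loop scalar (capped by the Scalar Add
 * Peak). With 'restrict' it is free to emit packed vector adds, so the
 * ceiling rises toward the Vector Add Peak.                           */
void add_arrays(double *restrict a,
                const double *restrict b,
                const double *restrict c,
                size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```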

In general, the greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.

Applications that perform close to the floating-point peak might be bound by the compute capabilities of the current platform. In that case, consider migrating to a highly parallel platform, such as an Intel Xeon Phi™ processor, where the compute ceiling and memory throughput are higher. On the other hand, if performance is well below the compute ceiling, consider an approach that better utilizes the vectorization capabilities of the processor.

Roofline analysis also exposes applications that are memory bound. To improve performance in this case, consider improving the algorithm or its implementation to perform more computations per data item, or migrating to a processor with higher memory bandwidth.
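
One common way to perform more computation per byte loaded is to fuse separate passes over the same data into one. A simplified, hypothetical sketch of the idea:

```c
#include <stddef.h>

/* Two separate passes: x[i] is read from memory in each pass, so the
 * same data travels twice for only 2 FLOPs per element overall.      */
void two_passes(double *y, double *z, const double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) y[i] = 2.0 * x[i];
    for (size_t i = 0; i < n; ++i) z[i] = x[i] + 1.0;
}

/* Fused pass: x[i] is read once and reused for both results, so the
 * FLOP count stays the same while fewer bytes move, raising the
 * loop's arithmetic intensity (its dot shifts right on the chart).   */
void fused_pass(double *y, double *z, const double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double xi = x[i];
        y[i] = 2.0 * xi;
        z[i] = xi + 1.0;
    }
}
```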

Intel Advisor’s vector parallelism optimization analysis and memory-versus-compute roofline analysis, working together, offer a powerful tool for visualizing an application’s complete current and potential performance profile on a given platform.

Intel Advisor is an integral part of both Cluster Edition and Professional Edition Intel Parallel Studio XE 2018.

Get your free download of Intel® Parallel Studio XE

(1) Roofline modeling was first proposed by University of California, Berkeley researchers Samuel Williams, Andrew Waterman, and David Patterson in the 2009 paper “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” See also http://sips.inesc-id.pt/~ilic/roofline.php and A. Ilic, F. Pratas, and L. Sousa, “Cache-aware Roofline Model: Upgrading the Loft,” IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21–24, Jan. 2014.