An often-heard remark after deploying a production application to a new platform is “I was expecting it to run much faster.”
There can be many reasons for this, even though the latest platforms keep adding compute cores. We expect software that has been well parallelized to run faster when more cores are available. However, a number of factors can limit an application’s scalability and overall parallel performance on multicore systems. In most cases, the way the parallelism has been implemented in the code leads to situations such as:
- Load imbalances that idle threads and cores.
- Excessive synchronization requests resulting in wasted CPU time spent waiting on locks.
- Runtime library overheads due to poorly designed API usage.
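The first item, load imbalance, can be illustrated with a small simulation. This is a minimal sketch, not a measurement of any real runtime: the task costs, the worker count, and the helper names (`static_loads`, `dynamic_loads`, `imbalance`) are all hypothetical, and scheduling overhead is ignored. It compares handing each worker one contiguous chunk of tasks against giving the next task to whichever worker becomes free first.

```python
import heapq

def static_loads(costs, workers):
    """Split tasks into contiguous, equal-count chunks, one chunk per worker."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(workers)]

def dynamic_loads(costs, workers):
    """Hand each task, in order, to whichever worker becomes free first."""
    heap = [(0, w) for w in range(workers)]  # (busy-until time, worker id)
    for cost in costs:
        busy_until, w = heapq.heappop(heap)
        heapq.heappush(heap, (busy_until + cost, w))
    loads = [0] * workers
    for busy_until, w in heap:
        loads[w] = busy_until
    return loads

def imbalance(loads):
    """Busiest worker's load divided by the average load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))

# Hypothetical workload: task i costs i time units, so later tasks are heavier.
costs = list(range(1, 17))
static = static_loads(costs, 4)
dynamic = dynamic_loads(costs, 4)
print("static ", static, round(imbalance(static), 2))
print("dynamic", dynamic, round(imbalance(dynamic), 2))
```

With these made-up costs, the contiguous split leaves the busiest worker with roughly 1.7 times the average load, while the first-free assignment brings that down to roughly 1.2. The other workers sit idle for the difference, which is the kind of wasted core time a profiler can reveal.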
But even when steps are taken to keep all the cores busy doing useful work, you might notice that performance begins to plateau as the number of cores in use increases beyond a certain point. This could mean the application is now I/O bound, or is contending for some other resource. Or not enough physical memory is available. Or, more often, the program is now memory bound: the CPUs are requesting data faster than the memory system can deliver it.
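A rough way to picture the memory-bound plateau is a toy model in which run time is the slower of two things: compute, which shrinks as cores are added, and memory traffic, which is capped by a fixed shared bandwidth. The sketch below assumes perfect compute scaling and uses invented workload numbers; the function names and parameters are illustrative only and do not come from any tool.

```python
def run_time(cores, compute_seconds, bytes_moved, bandwidth_bytes_per_s):
    """Estimate run time as the slower of compute and memory traffic.

    Assumes compute scales perfectly with core count while memory
    bandwidth is a fixed, shared resource (a deliberate simplification).
    """
    compute = compute_seconds / cores
    memory = bytes_moved / bandwidth_bytes_per_s
    return max(compute, memory)

def speedup(cores, **workload):
    """Speedup relative to running the same workload on one core."""
    return run_time(1, **workload) / run_time(cores, **workload)

# Hypothetical workload: 64 s of single-core compute, 400 GB moved, 100 GB/s.
workload = dict(
    compute_seconds=64.0,
    bytes_moved=400e9,
    bandwidth_bytes_per_s=100e9,
)
for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, speedup(n, **workload))
```

In this model the speedup tracks the core count up to 16 cores and then stays flat, because moving the data takes 4 seconds no matter how many cores are computing. Real applications rarely show such a sharp knee, but the shape of the curve is the same.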
Discovering where the performance bottlenecks are and knowing what to do about them can be a mysterious and complex art that calls for sophisticated performance analysis tools. That’s where Intel® VTune™ Amplifier XE 2017 comes in. One of the Intel Software Development Tools for high-performance computing and part of Intel Parallel Studio XE, Intel VTune Amplifier collects real-time performance data for the application being analyzed and interprets the results to identify where and how your application can be tuned to improve utilization of all available hardware resources.
Developers typically use Intel VTune Amplifier to gain insight into the following:
- Where are the most time-consuming (hot) functions in the application?
- Which sections of code do not utilize available processor time effectively?
- Which sections of code should be optimized for better sequential performance, and which for better threaded performance?
- Are synchronization objects affecting application performance?
- Where is the application wasting time on inefficient input/output operations?
- Could changing synchronization methods, the number of threads, or the algorithms used improve overall performance?
- Are thread activity and transitions affecting performance?
- Are there hardware-related issues in your code, such as data sharing, cache misses, branch misprediction, or memory latency, affecting performance?
The key to meaningful performance analysis is the ability to collect accurate runtime performance profiles. Here’s where running Intel software tools on Intel hardware has an advantage, because Intel processors have an on-chip performance monitoring unit (PMU). While basic hotspots analysis works on both Intel and compatible processors, the advanced hotspots analysis in Intel VTune Amplifier XE uses the PMU on Intel processors to collect precise data with very low overhead. The finer sampling resolution (approximately 1 ms versus approximately 10 ms) means it can find hotspots even in short functions.
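The effect of sampling resolution on short functions can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not a description of how VTune Amplifier samples: it assumes sample arrival is effectively random relative to the function's execution, so the number of samples landing in the function is approximately Poisson-distributed. The 5 ms function time is a made-up figure.

```python
import math

def expected_samples(function_time_ms, interval_ms):
    """Average number of samples that land inside the function."""
    return function_time_ms / interval_ms

def miss_probability(function_time_ms, interval_ms):
    """Chance that no sample lands in the function (Poisson assumption)."""
    return math.exp(-expected_samples(function_time_ms, interval_ms))

# Hypothetical short function that accumulates 5 ms of CPU time over a run.
for interval_ms in (10.0, 1.0):
    print(
        interval_ms,
        expected_samples(5.0, interval_ms),
        round(miss_probability(5.0, interval_ms), 3),
    )
```

Under these assumptions, a 10 ms interval yields half a sample on average, so the function is more likely than not to be missed entirely. A 1 ms interval yields about five samples and almost never misses it.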
One of Intel VTune Amplifier’s many features is its HPC Performance Characterization analysis, which helps identify how effectively compute-intensive applications use CPU, memory, and floating-point resources. The HPC Performance Characterization analysis can be used as a starting point for understanding the performance aspects of complex HPC applications. Additional scalability metrics are available for applications that use Intel OpenMP* or Intel MPI runtime libraries. This analysis can be run from within the VTune Amplifier GUI or from the command line.
Another specialized feature is Intel VTune Amplifier’s Memory Access analysis, which identifies memory-related issues such as NUMA problems, bandwidth-limited accesses, and cache misses. It can attribute performance events to memory objects (data structures) by instrumenting memory allocations and deallocations and by reading static and global variables from symbol information. Memory Bound metrics show the fraction of cycles spent waiting due to demand load or store instructions.
The 2017 release of Intel VTune Amplifier XE includes support for the latest Intel Xeon Phi™ (codenamed Knights Landing) coprocessor, as well as Intel Atom™ processors (codenamed Apollo Lake and Denverton) and Intel processors codenamed Kaby Lake.
Intel VTune Amplifier XE runs on Microsoft Windows and various Linux distributions, and supports C, C++, C#, Fortran, Java*, Python*, Go*, assembly, OpenMP, Intel Threading Building Blocks, MPI, OpenCL, and more, along with compilers from Microsoft, GCC, Intel, and others. It is fully integrated into Microsoft Visual Studio*, and is distributed standalone or as an integral part of Intel Parallel Studio XE.