Vectorization Now More Important Than Ever

Sponsored Post

Vectorization, the hardware optimization technique synonymous with early vector supercomputers like the Cray-1 (1975), has reappeared with even greater importance than before. More than 40 years later, the AVX-512 vector instructions in the most recent many-core Intel® Xeon and Intel Xeon Phi™ processors can increase application performance by up to 16x for single-precision codes.

Employing a hybrid of MPI across nodes in a cluster, multithreading with OpenMP* on each node, and vectorization of loops within each thread compounds the individual performance gains. Because most application codes run slower on the latest energy-efficient processors when executed sequentially, the full performance potential of today’s processors can only be achieved through multithreading and vectorization. It is the improved power efficiency of these processors that forces individual cores to run purely sequential code slower.

Benchmarks have shown that combining vectorization with multithreading can result in a considerable performance boost over either technique alone. Indications are that this boost will grow with each new hardware generation. The trend toward exascale computing is all about power efficiency.
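As a rough illustration of threading and vectorization combined on a single node, the following C sketch applies OpenMP's combined `parallel for simd` construct to a simple loop. The function name `saxpy_hybrid` is hypothetical, the MPI layer across nodes is left out, and any actual speedup depends on the compiler, its flags, and the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i].
 * With OpenMP enabled (for example -fopenmp with GCC/Clang or -qopenmp with
 * the Intel compilers), the iterations are divided among threads and each
 * thread's share is a candidate for SIMD execution. Without OpenMP the
 * pragma is ignored and the loop runs serially with the same result. */
void saxpy_hybrid(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The `restrict` qualifiers tell the compiler that `x` and `y` do not overlap, which removes one common reason for a compiler to decline vectorization.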

When vector machines like the Cray-1 were introduced, programmers were motivated to vectorize the most productive loops in the scientific/engineering applications that ran on them. Vector architectures expanded data registers to hold more than one operand, so that a single instruction working over multiple operands produces multiple results.

Today, a single Intel AVX-512 instruction can produce eight double-precision results, compared to four for Intel AVX, or two for Intel Streaming SIMD Extensions (Intel SSE). The Intel Xeon Phi x200 processor supports Intel AVX-512 instructions for a wide variety of operations on 32- and 64-bit integer and floating-point data.
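The counts quoted above follow from dividing the register width by the element width. The C sketch below is illustrative only: the function name `simd_lanes` is hypothetical, and the widths to pass in are the architectural register sizes (128 bits for Intel SSE, 256 for Intel AVX, 512 for Intel AVX-512).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of elements ("lanes") that fit in one vector
 * register, given the register width and element width in bits. */
size_t simd_lanes(size_t register_bits, size_t element_bits)
{
    assert(element_bits != 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}
```

For 64-bit elements, `simd_lanes(512, 64)` gives 8, `simd_lanes(256, 64)` gives 4, and `simd_lanes(128, 64)` gives 2. For 32-bit single-precision data, `simd_lanes(512, 32)` gives 16, which is the upper bound behind the 16x figure for single-precision codes.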

Developers typically rely on the compiler’s automatic loop analysis to generate vectorized object code that uses SIMD instructions such as Intel AVX. Still, not all loops can or will be vectorized. The compiler will decline to vectorize a loop when it detects dependencies between iterations that could produce wrong results. The Intel compilers, for example, report on loop vectorization and diagnose why a loop could not be compiled with vector instructions. The developer is left to either restructure the loop to remove those dependencies, or replace the loop with a highly optimized library routine from the Intel Math Kernel Library (Intel MKL) that accomplishes the same task.
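To make the idea of a dependency between iterations concrete, here is a minimal C sketch of one loop written two ways. Both function names are hypothetical, and whether a given compiler vectorizes either form depends on the compiler and its flags; the point is only to show what removing a loop-carried dependence can look like.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-carried dependence: each iteration needs the value of f left behind
 * by the previous iteration, so the iterations form a serial chain. */
void ramp_scale_carried(size_t n, float *x, float r)
{
    assert(n == 0 || x != NULL);
    float f = 1.0f;
    for (size_t i = 0; i < n; ++i) {
        x[i] *= f;
        f += r;
    }
}

/* Restructured: the factor is computed directly from i, so every iteration
 * is independent of the others. Note that 1 + r*i can differ from the
 * repeated sum in the last bits when r is not exactly representable. */
void ramp_scale_independent(size_t n, float *x, float r)
{
    assert(n == 0 || x != NULL);
    for (size_t i = 0; i < n; ++i)
        x[i] *= 1.0f + r * (float)i;
}
```

The rounding caveat in the second comment is one reason a compiler will not always make this transformation on its own for floating-point data: it changes the arithmetic, so the decision is left to the developer.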

The analysis and restructuring of complex loops in C, C++, and Fortran can be difficult and is mostly a matter of trial and error. The same can be said about multithreading an application. Knowing how and where to add OpenMP directives or Threading Building Blocks (TBB) constructs around and within loops requires another level of expertise.

With the latest release of Intel Advisor, Intel Math Kernel Library, and the optimizing Intel compilers, all part of Intel Parallel Studio XE 2018, the process has become easier and more productive for developers on Intel platforms.

Using Intel Advisor, you can survey all the loops in your application and see:

  • Which loops were vectorized and which loops were not
  • What prevented vectorization for the non-vectorized loops
  • Speedup and vectorization efficiency for the vectorized loops
  • Issues that decreased efficiency of the vectorized loops
  • Which vectorized and non-vectorized loops were limited by the memory layout

Where the compiler could not vectorize a loop, Intel Advisor indicates what issues need to be resolved, with recommendations for improving the source code. Typically, a loop might vectorize poorly because of the way data is arranged in memory, causing cache or memory access conflicts. Intel Advisor will point this out and suggest changes.

Intel Advisor is rich with diagnostic and interactive analysis features that focus on vectorization as well as threading with TBB or OpenMP. Using Intel Advisor allows you to easily explore alternative strategies and make tradeoffs by accurately projecting performance and identifying bottlenecks.
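One common instance of the memory-layout issue mentioned above is an array of structures (AoS) versus a structure of arrays (SoA). The C sketch below is an assumption-laden illustration, not Intel Advisor output: the type and function names are hypothetical, and the actual effect on vectorization depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>

/* Array of structures: consecutive x values are sizeof(struct point_aos)
 * bytes apart (typically 12), so a vector of x values is strided in memory. */
struct point_aos { float x, y, z; };

void shift_x_aos(size_t n, struct point_aos *p, float dx)
{
    assert(n == 0 || p != NULL);
    for (size_t i = 0; i < n; ++i)
        p[i].x += dx;
}

/* Structure of arrays: all x values are contiguous (unit stride), which
 * maps directly onto full-width vector loads and stores. */
struct points_soa { float *x, *y, *z; };

void shift_x_soa(size_t n, const struct points_soa *p, float dx)
{
    assert(p != NULL && (n == 0 || p->x != NULL));
    for (size_t i = 0; i < n; ++i)
        p->x[i] += dx;
}
```

Both functions compute the same result; only the layout differs. Moving from the first form to the second is the kind of source change that a memory-layout recommendation might lead to.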

With vectorization and multithreading now more important than ever, Intel Parallel Studio XE 2018 provides developers an efficient way to get faster parallel performance with the latest Intel platforms.

Get your free download of Intel® Parallel Studio XE 2018