More Than Ever, Vectorization and Multithreading Are Essential for Performance


Sponsored Post

Lately, vectorization, an optimization technique dating back to early vector supercomputers like the Cray-1 (1975), has reappeared with even greater importance than before. Exploiting the performance benefits of the AVX-512 vector instructions on the most recent many-core Intel Xeon and Intel® Xeon Phi™ processors can increase application performance by up to 16x for single-precision codes.

Employing a hybrid of MPI across the nodes of a cluster, multithreading with OpenMP* on each node, and vectorization of the loops within each thread compounds the performance gains of each level. In fact, most application codes will run slower on the latest supercomputers if they run purely sequentially. This means that adding multithreading and vectorization to applications is now essential for running efficiently on the latest architectures.
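The division of labor can be sketched in a few lines of C. This is an illustrative example, not code from any Intel benchmark: the MPI layer (one rank per node) is omitted to keep the sketch self-contained, the OpenMP pragma spreads the outer loop across the cores of a node, and the loop body is simple enough for a compiler to vectorize within each thread.

```c
#include <stddef.h>

/* Illustrative hybrid kernel: an MPI rank per node would call this
   routine on its slice of the data (MPI layer omitted here). The
   OpenMP directive splits iterations across threads; within each
   thread, the independent arithmetic can be auto-vectorized. */
void scale_add(const double *a, const double *b, double *c,
               double alpha, size_t n) {
    #pragma omp parallel for           /* threads across cores          */
    for (long i = 0; i < (long)n; ++i) /* body vectorizes across lanes  */
        c[i] = alpha * a[i] + b[i];
}
```

Compiled without OpenMP support, the pragma is simply ignored and the loop runs serially with the same result, which makes incremental adoption painless.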

Certain benchmarks have shown that combining multithreading and vectorization on the latest processors can result in a considerable performance boost (sometimes more than 100x) over just vectorization or multithreading alone. Indications are that this boost will get even greater with each new hardware generation. The trend toward exascale computing is all about power efficiency. The NERSC center at the Lawrence Berkeley Lab notes that on its Cori supercomputer, a Cray XC40 cluster based on the latest Intel Xeon Phi processors, programs will run slower if programmers do nothing with their code. Because these power-efficient designs favor many simpler, lower-clocked cores, each individual core takes longer to run sequential code.

When vector machines like the Cray-1 were introduced, developers of the scientific and engineering applications that ran on them were motivated to vectorize the most productive loops in their codes. Vector architecture meant that data registers could hold more than one operand, and a single instruction could produce more than one result. Today, Intel AVX-512 instructions can produce eight double-precision results, compared to four for Intel® AVX or two for Intel® Streaming SIMD Extensions (Intel® SSE). The Intel Xeon Phi x200 processor supports Intel AVX-512 instructions for a wide variety of operations on 32- and 64-bit integer and floating-point data.
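The ratio between these instruction sets is just register width divided by element width, which a couple of lines of arithmetic make explicit. The widths below are the architectural vector-register sizes for Intel SSE, AVX, and AVX-512; this is back-of-the-envelope math, not a hardware query:

```c
/* Results per vector instruction = register bits / element bits.
   128-bit (SSE), 256-bit (AVX), and 512-bit (AVX-512) registers
   hold 2, 4, and 8 double-precision (64-bit) lanes respectively;
   with single precision (32-bit), AVX-512 holds 16 lanes. */
int lanes(int register_bits, int element_bits) {
    return register_bits / element_bits;
}
```

This is also where the "16x for single-precision codes" figure comes from: 512 bits divided by 32-bit floats is sixteen results per instruction.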


Generally, developers rely on the automatic loop vectorization capabilities of the compiler. Still, not all loops will be vectorized: when the compiler detects dependences between loop iterations that could make vectorized execution produce wrong results, it leaves the loop scalar. The Intel compilers, for example, can report on loop vectorization and diagnose why a loop could not be compiled using vector instructions. Typically, the developer's recourse has been either to restructure the loop to remove those dependences, or to replace the loop with a highly optimized library routine that accomplishes the same task.
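To make the distinction concrete, here are two hypothetical loops (not taken from any Intel example). The first carries a dependence from one iteration to the next, so the compiler must execute it serially as written; the second has fully independent iterations and is a textbook auto-vectorization candidate:

```c
/* Loop-carried dependence: iteration i reads a[i-1], which the
   previous iteration just wrote, so iterations cannot safely
   execute in parallel lanes. Compilers will not vectorize this
   form (though clever rewrites of prefix sums do exist). */
void prefix_sum(double *a, int n) {
    for (int i = 1; i < n; ++i)
        a[i] = a[i] + a[i - 1];
}

/* No dependence between iterations: each y[i] depends only on
   x[i] and y[i], so the compiler can process many lanes at once. */
void saxpy(float *y, const float *x, float s, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = s * x[i] + y[i];
}
```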

The analysis and restructuring of complex loops in C, C++, and Fortran has always been difficult and mostly trial and error. The same can be said about multithreading an application. Knowing how and where to add OpenMP or Intel® Threading Building Blocks (TBB) directives around and within loop constructs requires another level of expertise.
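A minimal sketch of what that expertise looks like in practice, using a dot product as a stand-in (illustrative code, not from Intel's documentation): the placement of the directive is easy, but the `reduction` clause is the part that requires understanding — without it, every thread would race on the shared accumulator and the answer would be wrong.

```c
/* OpenMP dot product. The reduction(+:sum) clause gives each
   thread a private partial sum and combines them at the end;
   omitting it would create a data race on sum. Compiled without
   OpenMP, the pragma is ignored and the loop runs serially with
   the same result. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```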

But the latest releases of Intel Advisor, Intel Math Kernel Library, and the optimizing Intel compilers, all part of Intel Parallel Studio XE 2017, make the process easier and more productive for developers on Intel processors.

Using Vectorization Advisor, you can survey all the loops in your application and see:

  • Which loops were vectorized and which loops were not
  • What prevented vectorization for the non-vectorized loops
  • The speedup and vectorization efficiency for the vectorized loops
  • Any issues that decreased efficiency of the vectorized loops
  • The vectorized and non-vectorized loops that were limited by the memory layout

In those cases where the compiler could not vectorize a loop, Intel Advisor will indicate the issues that need to be addressed and give recommendations for improving the code. Often a loop does vectorize, but poorly, because the way data is arranged in memory causes cache or memory-access conflicts. Intel Advisor will point out these issues and suggest changes.
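A classic example of the kind of memory-layout problem such tools flag is an array-of-structures traversal. The types below are hypothetical, but the pattern is general: gathering one field from an array of structures produces strided loads, while a structure-of-arrays layout turns the same reduction into a contiguous, unit-stride stream that vectorizes cleanly.

```c
/* AoS vs. SoA layouts for the same data. In the AoS version,
   consecutive x values are sizeof(struct PointAoS) bytes apart,
   so vector loads must gather; in the SoA version they are
   adjacent in memory. Both functions compute the same sum. */
#define N 1024

struct PointAoS  { float x, y, z; };            /* x values 12 bytes apart */
struct PointsSoA { float x[N], y[N], z[N]; };   /* x values contiguous     */

float sum_x_aos(const struct PointAoS *p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += p[i].x;        /* strided loads: poor vector efficiency */
    return s;
}

float sum_x_soa(const struct PointsSoA *p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += p->x[i];       /* unit-stride loads: vectorizes cleanly */
    return s;
}
```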

Intel Advisor is rich with diagnostic and interactive analysis features that focus on vectorization as well as threading with TBB or OpenMP. Using Intel Advisor allows you to easily explore alternative strategies and make tradeoffs, resulting in better design decisions by accurately projecting performance and identifying bottlenecks. This is an efficient way to get faster parallel performance while avoiding costly design errors.

Download your free 30-day trial of Intel® Parallel Studio XE 2018