Intel Parallel Studio XE AVX-512: Tuning for Success with the Latest SIMD Extensions and Intel® Advanced Vector Extensions 512

Sponsored Post

In High Performance Computing applications, developers want to get the most out of the underlying hardware. As hardware capabilities continue to expand, it is important that applications take advantage of them wherever possible.

With the introduction of Intel Parallel Studio XE, instructions for utilizing the vector extensions have been enhanced and new instructions have been added. Applications in diverse domains such as data compression and decompression, scientific simulations and cryptography can take advantage of these new and enhanced instructions.

With the introduction of the Intel Xeon Scalable processors there is new Intel AVX-512 Instruction Set Architecture support. The list of new support includes:

  • Intel AVX-512 Foundation:
      • 512-bit vector width
      • 32 512-bit long vector registers
      • Data expand and compress instructions
      • Ternary logic instruction
      • Eight new 64-bit long mask registers
      • Two source cross-lane permute instructions
      • Scatter instructions
      • Embedded broadcast/rounding
      • Transcendental support
  • Intel AVX-512 Doubleword and Quadword Instructions (DQ): QWORD support
  • Intel AVX-512 Byte and Word Instructions (BW): Byte and Word support
  • Intel AVX-512 Vector Length Extensions (VL): Vector length orthogonality
  • Intel AVX-512 Conflict Detection Instructions (CDI): Vconflict instruction

Developers are always challenged to tune applications for more performance. In many cases, by understanding the application and using compiler directives, developers can vectorize more of the code, which speeds up the application. However, developers must keep in mind that the application may need to run on older systems that do not support the latest instruction set. Intel recognizes this and allows a developer using Intel Parallel Studio XE to compile with more than one target ISA in mind. For example, the application can be targeted at the older AVX2 instruction set and the new AVX-512 instruction set at the same time. The -ax flag takes a comma-separated list of targets at compile time, such as -axCORE-AVX512,CORE-AVX2. The compiler also generates a generic code path that can run on any SSE2-capable system.
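As a minimal sketch of what such a multi-target build might look like, the C function below is a simple loop that is a natural candidate for vectorization. The function name saxpy and the file name saxpy.c are illustrative and not taken from Intel's material, and the compile line in the comment assumes the classic Intel C compiler driver (icc).

```c
#include <stddef.h>

/*
 * Illustrative kernel (hypothetical name): y[i] = a * x[i] + y[i].
 *
 * Assumed compile line for a multi-target build with the Intel compiler:
 *   icc -O2 -qopenmp-simd -axCORE-AVX512,CORE-AVX2 -c saxpy.c
 * The compiler may then emit AVX-512 and AVX2 versions of this loop plus a
 * generic baseline version, and pick one at run time based on the CPU.
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* restrict promises that x and y do not overlap; the OpenMP simd
       directive asks the compiler to vectorize the loop. */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The source code is the same for every target; only the compiler flags change, which is what makes it practical to ship one binary for both older and newer systems.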

For popular languages such as C/C++ and Fortran, Intel is innovating with new implementations for OpenMP 4.0. This innovation has led to the creation of new idioms, such as Compress and Expand, Histograms, Conditional Last Private, and Loops with Early Exit. Using these idioms, performance increases of 1.5X and higher have been measured for the Compress and Expand and the Histogram examples.
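To make two of these idioms concrete, here is a sketch of the loop shapes usually meant by "compress" and "histogram". The function names are hypothetical, and whether a given compiler version actually vectorizes these loops depends on the compiler, its flags, and the target ISA.

```c
#include <stddef.h>

/*
 * Compress idiom: copy only the elements that pass a test into a densely
 * packed output array. The output index depends on how many earlier
 * elements passed, which is what AVX-512 compress instructions address.
 * Returns the number of elements written to dst.
 */
size_t compress_positive(size_t n, const float *restrict src,
                         float *restrict dst)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (src[i] > 0.0f) {
            dst[count++] = src[i];
        }
    }
    return count;
}

/*
 * Histogram idiom: increment a bin selected by an index array. Two
 * iterations may hit the same bin, which is the kind of conflict the
 * AVX-512 Conflict Detection Instructions are designed to detect.
 * Assumption: the caller guarantees every idx[i] is a valid bin index.
 */
void histogram(size_t n, const unsigned *restrict idx,
               unsigned *restrict bins)
{
    for (size_t i = 0; i < n; ++i) {
        bins[idx[i]] += 1;
    }
}
```

Both loops carry a dependence between iterations (the running count, and possibly repeated bin indices), which is why they were traditionally hard to vectorize.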

While it takes work to understand the exact effect of new capabilities on each new generation of hardware, the established techniques still hold: compile, compare, identify hot spots and then iterate. Comparing performance across builds made with different -xCORE-AVX* flags (for example, -xCORE-AVX2 and -xCORE-AVX512) gives insight into whether the application can benefit from the new instructions. Setting a baseline is important for ongoing comparisons. Identifying and understanding hot spots shows a developer where the application can improve, so that effort is focused on specific areas of the code.
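One simple way to set such a baseline is to time the same kernel in builds that differ only in the target flag. The sketch below is an assumption-laden illustration: the kernel scale and the helper best_seconds are hypothetical names, the file name bench.c is made up, and the compile lines in the comment assume the icc driver.

```c
#include <stddef.h>
#include <time.h>

/*
 * Assumed builds to compare (same source, different target ISA):
 *   icc -O2 -xCORE-AVX2   -c bench.c
 *   icc -O2 -xCORE-AVX512 -c bench.c
 * Code built with -x requires a CPU that supports the chosen ISA.
 */

/* Hypothetical kernel standing in for an application hot spot. */
static void scale(size_t n, float a, float *restrict v)
{
    for (size_t i = 0; i < n; ++i) {
        v[i] *= a;
    }
}

/* Runs the kernel reps times and returns the smallest CPU time in seconds
   observed for a single run, or -1.0 if reps is not positive. */
double best_seconds(size_t n, float *v, int reps)
{
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        clock_t t0 = clock();
        scale(n, 1.0001f, v);
        clock_t t1 = clock();
        double s = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || s < best) {
            best = s;
        }
    }
    return best;
}
```

Taking the best of several runs reduces noise from other activity on the system. A profiler gives a fuller picture of hot spots, but a repeatable number like this is enough to tell whether a flag change helped.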

The Intel AVX-512 support in the Intel Compilers 18.0 for Intel Xeon Scalable processors can give quite a boost to HPC applications.  Although microkernels can demonstrate the effectiveness of the new SIMD instructions, understanding why the new instructions benefit the code can then lead to even greater performance.

Get your free 30-day trial – Intel® Parallel Studio XE