Vectorization with AVX-512 Intrinsics


Sponsored Post

One of the main benefits of using the new generation of hardware for high performance computing is that the new systems can speed up performance not only through the use of more cores, but also through new types of instructions. As is the case with the Intel Advanced Vector Extensions 512-bit (AVX-512) instructions that are native to the Intel Xeon Phi processor, it is important that applications have easy access to what could potentially be a tremendous performance boost.

The AVX-512 instructions are easily available to programs written in C, C++ and Fortran.

Of course, the first choice would be to have the compiler recognize and generate AVX-512 instructions, because that is easier and makes the code more portable. However, there will be times when the compiler cannot recognize where these instructions can be used.


In these cases, there are AVX-512 intrinsics available that can be used directly from either a C or C++ application. Fortran users will have to go through a C interface. Intrinsics look like function calls, but they do not generate a function call. Rather, there is a direct correspondence to the lower-level SIMD instructions that are available in the Intel Xeon Phi processor hardware. A very large advantage of using intrinsics is that the developer can get the performance of lower-level capabilities without having to resort to writing assembly code. This can be a huge advantage when using the Intel Xeon Phi processor, as some of the performance can be gained with little effort and in a manner that developers are accustomed to. Developers can concentrate on their application and do not have to deal with instruction scheduling and register allocation.

With the Intel compilers, intrinsics are recognized and the instructions are generated inline, which is a tremendous advantage. Since the Intel Xeon Phi processor can perform a tremendous number of floating-point operations per second when using the AVX-512 instructions, it is beneficial to use intrinsics for certain math computations. To use intrinsics, all that is needed is to include the proper header file (immintrin.h) and then call the desired intrinsic function.

As a simple example, let's look at adding together two arrays, each containing 16 floating-point numbers. First we need to load the 16 floating-point values from memory, then the second set of 16 floating-point values, and finally add them together. Assume that a and b were previously defined.

__m512 simd1 = _mm512_load_ps(a);  // load 16 floats from a

__m512 simd2 = _mm512_load_ps(b);  // load 16 floats from b

__m512 simd3 = _mm512_add_ps(simd1, simd2);  // add them together

Developers who want to see the instructions generated by the compiler can use the -S compiler option. An added benefit of using these intrinsics is that the compiler will automatically align __m512 variables to a 64-byte boundary, which significantly helps performance.

A number of compilers support these intrinsics today. Over the years, Intel has introduced increasingly sophisticated instruction sets, which compilers have supported. As the new Intel Xeon Phi processor with the AVX-512 instructions becomes more widely available, it is expected that more compilers will support these instructions with easy-to-use intrinsics.

Download your free 30-day trial of Intel® Parallel Studio XE