Intel AVX Gives Numerical Computations in Java a Big Boost


Sponsored Post

Recent Intel® enhancements to Java enable faster numerical computing. In particular, the Java Virtual Machine (JVM) now uses the Fused Multiply-Add (FMA) instructions on Intel® Xeon Phi™ processors with Intel® Advanced Vector Extensions (Intel AVX) to implement the OpenJDK 9 Math.fma() API. This delivers significant performance improvements for matrix multiplication, the basic computation at the heart of most HPC, machine learning, and AI applications.

Intel AVX instructions perform SIMD vector operations such as FMA, which computes vector times vector plus vector, A = (A*B)+C, in a single fused step. FMA computations predominate in linear algebra-based ML algorithms, deep learning and neural networks (dot products, matrix multiplication), financial and statistical computation models, and polynomial evaluation. The JVM JIT compiler maps FMA operations written in Java to the Intel AVX FMA extensions when the underlying processor supports them.
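A minimal sketch of why the fused operation matters: because Math.fma computes the exact product before the single rounding, the intermediate a*b never overflows or loses precision the way a separate multiply does. (The FmaDemo class name and the sample values are illustrative, not from the article.)

```java
public class FmaDemo {
    public static void main(String[] args) {
        double a = 1e308, b = 2.0, c = -1e308;

        // Two-step evaluation: a*b overflows to Infinity before c is added.
        double twoStep = a * b + c;

        // Fused evaluation: the exact intermediate product 2e308 plus -1e308
        // is 1e308, which fits in a double, so no overflow occurs.
        double fused = Math.fma(a, b, c);

        System.out.println(twoStep); // Infinity
        System.out.println(fused);   // 1.0E308
    }
}
```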

With the release of OpenJDK 9, the FMA API appears in the java.lang.Math class as an intrinsic that maps the Java routine directly to the Intel AVX FMA extensions on Intel Xeon Phi and Intel Xeon Platinum 8180 processors. No additional work is required from the developer.

The FMA intrinsic routine returns the fused multiply-add of its three arguments: the exact product of the first two arguments, summed with the third argument, and then rounded once to the nearest double-precision value. Where no hardware FMA instruction is available, the operation is performed in software using the java.math.BigDecimal class.

The FMA API takes floating-point inputs a, b, and c and returns a floating-point result. Both single- and double-precision overloads are supported, and the FMA computation is carried out in double precision. If all inputs are finite, the following expression computes the exact product of floating-point inputs a and b by explicitly converting them to BigDecimal objects:

BigDecimal product = (new BigDecimal(a)).multiply(new BigDecimal(b));
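Completing that sketch: adding the third argument exactly and converting back to double applies the single rounding the FMA contract requires. This is an illustrative software path for finite inputs only (new BigDecimal(double) rejects NaN and infinities); the FmaFallback class name and method are hypothetical, not from OpenJDK source.

```java
import java.math.BigDecimal;

public class FmaFallback {
    // Exact product, exact sum, then one rounding to the nearest double
    // via the BigDecimal-to-double narrowing conversion.
    static double fmaViaBigDecimal(double a, double b, double c) {
        BigDecimal product = (new BigDecimal(a)).multiply(new BigDecimal(b));
        return product.add(new BigDecimal(c)).doubleValue();
    }

    public static void main(String[] args) {
        // For finite inputs this agrees with the hardware-backed Math.fma.
        System.out.println(fmaViaBigDecimal(0.1, 0.1, -0.01));
        System.out.println(Math.fma(0.1, 0.1, -0.01));
    }
}
```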

When running Java applications on the latest OpenJDK 9 release and source builds, the JVM enables hardware-based FMA intrinsics on processors where FMA instructions are available. In JDK 9, FMA intrinsics are generated for java.lang.Math.fma(a, b, c) calls to compute a*b+c expressions.

Computational kernels in Java applications implemented with Math.fma fully leverage the AVX FMA extensions on the latest Intel processors. The JVM JIT compiler transforms the Math.fma calls into hardware FMA instructions, which auto-vectorization and superword optimizations further combine into SIMD operations. On the latest OpenJDK, BLAS-I DDOT (dot product) performance improves by up to 3.5x on Intel Xeon Phi processors using Math.fma, and BLAS-I DAXPY performance improves by up to 2.3x using the Math.fma intrinsics from the latest OpenJDK source builds.
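The kind of kernel described above can be sketched as follows: a BLAS-I DDOT-style loop written with Math.fma, which the JIT can map to hardware FMA instructions and then vectorize. This is an illustrative example, not the benchmark code; the DdotKernel class and sample vectors are assumptions.

```java
public class DdotKernel {
    // Dot product accumulated with Math.fma: each step computes
    // x[i]*y[i] + sum with a single rounding.
    static double ddot(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum = Math.fma(x[i], y[i], sum);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 5.0, 6.0};
        System.out.println(ddot(x, y)); // 32.0
    }
}
```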