Intel MKL Compact Matrix Functions Attain Significant Speedups


Sponsored Post

Most traditional high-performance computing applications focus on computations on very large matrices. Think seismic analysis, weather prediction, structural analysis. But today, with advances in deep learning, computer vision, and autonomous systems, many applications depend on matrix computations performed over large groups of very small matrices, a workload that doesn't fit efficiently into the traditional model.

The latest version of Intel® Math Kernel Library (MKL) offers vectorized compact functions for general and specialized matrix computations of this type. These functions rely on true SIMD (single instruction, multiple data) matrix computations, and provide significant performance benefits compared to traditional techniques that exploit multithreading but rely on standard data formats.

The latest version of Intel MKL adds six compact functions:

  1. General matrix-matrix multiply
  2. Triangular matrix equation solve
  3. LU factorization (without pivoting)
  4. Inverse calculation (from LU without pivoting)
  5. Cholesky factorization
  6. QR factorization

Intel MKL compact functions pack matrices into a contiguous segment of memory in an interleaved data layout, organized into packs whose length depends on the SIMD register length of the underlying architecture and the size of the matrix elements. Each pack is, in effect, a 3D tensor, with the matrix index incrementing fastest. These compact packs are then loaded into registers and operated on using SIMD instructions.
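
To make the layout concrete, here is a minimal plain-C sketch (not MKL's actual code; all names are hypothetical) that packs a batch of small column-major matrices into an interleaved compact layout and unpacks it again. The pack length `V` stands in for the number of elements that fit one SIMD register:

```c
#include <assert.h>
#include <string.h>

/* Illustrative sketch only, not MKL's implementation. Packs NMAT small
 * column-major M x N matrices into an interleaved "compact" layout of
 * packs of length V (e.g. V = 4 doubles for 256-bit SIMD registers).
 * Within a pack the matrix index varies fastest, so element (i,j) of
 * all V matrices forms one contiguous, vector-loadable run. */
#define M 2
#define N 2
#define V 4
#define NMAT 8
#define NPACKS (NMAT / V)

/* a[k][e]: element e (column-major order) of matrix k. */
static void pack_compact(const double a[NMAT][M * N],
                         double c[NPACKS][M * N][V]) {
    for (int p = 0; p < NPACKS; ++p)        /* which pack            */
        for (int e = 0; e < M * N; ++e)     /* which matrix element  */
            for (int k = 0; k < V; ++k)     /* which matrix in pack  */
                c[p][e][k] = a[p * V + k][e];
}

static void unpack_compact(const double c[NPACKS][M * N][V],
                           double a[NMAT][M * N]) {
    for (int p = 0; p < NPACKS; ++p)
        for (int e = 0; e < M * N; ++e)
            for (int k = 0; k < V; ++k)
                a[p * V + k][e] = c[p][e][k];
}

/* Round-trips NMAT distinct matrices; returns 1 on an exact match. */
static int compact_roundtrip_ok(void) {
    double a[NMAT][M * N], c[NPACKS][M * N][V], b[NMAT][M * N];
    for (int k = 0; k < NMAT; ++k)
        for (int e = 0; e < M * N; ++e)
            a[k][e] = 10.0 * k + e;
    pack_compact(a, c);
    unpack_compact(c, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

After packing, each run `c[p][e][0..V-1]` can be loaded into a single register, which is what lets one SIMD instruction advance the same computation for V matrices at once.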

Complex matrices are packed with the real and imaginary parts separated and alternating in memory. Because most arithmetic on complex elements can then be performed in registers without extra shuffle operations, this packing format is especially well suited to complex functions.
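
A plain-C sketch of why this split layout helps (again illustrative, not MKL code): if each element slot holds V real parts followed by V imaginary parts, a complex multiply reduces to elementwise vector arithmetic on whole runs, with no per-element reshuffling of real/imaginary pairs:

```c
#include <assert.h>

/* Illustrative sketch, not MKL code: one element slot of a complex
 * compact pack stores V real parts, then V imaginary parts. A complex
 * multiply then needs only elementwise arithmetic on whole runs. */
#define V 4

/* z = x * y for V complex numbers stored split: x[0] = reals, x[1] = imags. */
static void cmul_pack(const double x[2][V], const double y[2][V],
                      double z[2][V]) {
    for (int k = 0; k < V; ++k) {   /* loop a compiler can map to SIMD */
        z[0][k] = x[0][k] * y[0][k] - x[1][k] * y[1][k]; /* real part */
        z[1][k] = x[0][k] * y[1][k] + x[1][k] * y[0][k]; /* imag part */
    }
}
```

For example, with every lane holding (1 + 2i) and (3 + 4i), each lane of the result is (-5 + 10i).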

Intel MKL provides utility functions for packing and unpacking matrices into and out of this compact format. Additional functions determine the optimal format for the underlying architecture and the size of the buffer needed to store the compact arrays; those results are then used as parameters for the main packing and unpacking functions.
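
The two queries play roles like the following plain-C stand-ins (hypothetical helpers for illustration, not the MKL API): one picks a pack length from the register width and element size, the other sizes the buffer, padding the last pack up to a full pack:

```c
#include <assert.h>

/* Hypothetical stand-ins (not the MKL API) for the two utility queries:
 * choose a pack length from the SIMD register width, and compute the
 * buffer length needed to hold the packed matrices. */

/* Pack length: how many elements of the given size fit one register.
 * e.g. 256-bit registers and 8-byte doubles give packs of 4. */
static int pack_length(int register_bits, int elem_bytes) {
    return register_bits / (8 * elem_bytes);
}

/* Elements needed to hold nmat m x n matrices when the last pack is
 * padded up to a full pack of v matrices. */
static long compact_buffer_len(long nmat, int m, int n, int v) {
    long npacks = (nmat + v - 1) / v;   /* round up to whole packs */
    return npacks * (long)m * n * v;
}
```

So 10 matrices of size 3x3 with packs of 4 occupy 3 packs, i.e. a buffer of 108 elements, even though only 90 elements carry data.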

Packing and unpacking matrices to and from this compact format adds overhead, so compact functions offer the greatest performance benefit when multiple computations on the packed matrices are chained and performed in sequence, unpacking only at the end.
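
The chaining pattern can be sketched in plain C (illustrative only, not MKL's kernels): a batched 2x2 matrix multiply that reads and writes the interleaved layout directly, so several operations can run back-to-back on packed data with a single unpack at the end:

```c
#include <assert.h>

/* Plain-C sketch of chained computation on packed data, not MKL code.
 * Layout per pack: element (i,j) of the V matrices is one contiguous
 * run of V doubles; elements are ordered column-major:
 * (0,0), (1,0), (0,1), (1,1). */
#define V 4

/* c = a * b for one pack of V 2x2 matrices in the compact layout. */
static void gemm2x2_pack(const double a[4][V], const double b[4][V],
                         double c[4][V]) {
    for (int k = 0; k < V; ++k) {   /* one SIMD-friendly lane per matrix */
        c[0][k] = a[0][k] * b[0][k] + a[2][k] * b[1][k];
        c[1][k] = a[1][k] * b[0][k] + a[3][k] * b[1][k];
        c[2][k] = a[0][k] * b[2][k] + a[2][k] * b[3][k];
        c[3][k] = a[1][k] * b[2][k] + a[3][k] * b[3][k];
    }
}
```

Two calls in sequence, such as C = A*A followed by E = C*A, operate entirely on packed data; the pack/unpack cost is paid once rather than per operation.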

Intel MKL compact matrix functions are optimized for 128-, 256-, and 512-bit SIMD registers. Compact packs are processed individually by function kernels vectorized with intrinsics for the specific instruction set. Calculations on separate compact packs are inherently independent and easily threaded. Intel MKL compact functions implement this simple form of parallelism behind the scenes, improving performance when compact functions are called from multiple threads.

There are many potential applications for compact BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) functions. Computer vision requires dense linear algebra operations on large groups of very small matrices. For example, anomaly detection in images requires simultaneously solving thousands of dense linear systems via their Cholesky factorizations; these systems are independent, making them strong candidates for speedup with Intel MKL compact functions.

Partial differential equation (PDE)-based simulations over a mesh are another: a linear solver for compressible dynamics simulations achieved up to a 6X speedup using compact matrix-matrix multiplication, triangular solve, and LU factorization.

Another typical use case is calculating the inverse from a non-pivoting LU factorization. Even when packing and unpacking overhead is included, the Intel MKL compact functions still provide consistently good speedups, with some configurations demonstrating up to a 4X speedup compared to calls to the generic Intel MKL functions.

Intel Math Kernel Library is an integral part of Intel Parallel Studio.

Get your free download of Intel® Math Kernel Library