Heterogeneous Streams with Intel Xeon Phi

Matrix multiplies can be decomposed into tiles and executed efficiently on recent generations of coprocessors. Intel has developed the hStreams library, which supports task concurrency on heterogeneous platforms. The concurrency may be across nodes (Xeon, KNC, KNL-SB, KNL-LB), within a node for small matrix operations, and in the overlap of computation and communication, particularly for tiled solutions. hStreams relieves the user of the complexity of thread affinitization, offloading, memory types, and memory affinitization.
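The tile decomposition mentioned above can be sketched in plain C. This is only an illustration of the tiling idea and does not use the hStreams API; the function name `matmul_tiled`, the row-major layout, and the `tile` parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes; used to clip tiles at the matrix edge. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical helper: C = A * B for row-major n x n matrices,
 * computed tile by tile. */
void matmul_tiled(size_t n, size_t tile,
                  const double *a, const double *b, double *c)
{
    assert(tile > 0);

    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* (ii, jj) names one output tile of C. Distinct output
             * tiles touch disjoint parts of C, so they are the units
             * a task runtime could schedule concurrently. */
            for (size_t kk = 0; kk < n; kk += tile) {
                const size_t i_end = min_size(ii + tile, n);
                const size_t j_end = min_size(jj + tile, n);
                const size_t k_end = min_size(kk + tile, n);

                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        const double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

The tile size would normally be chosen so that three tiles fit in cache; the sketch handles matrix sizes that are not a multiple of the tile size by clipping the last tile in each dimension.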

Morton Ordering on the Intel Xeon Phi

Morton order, also known as Z-order, is a mapping of multidimensional data to one dimension that preserves locality of the data. “By using Morton ordering as an alternative to row-major or column-major data storage, significant speedups can be achieved on the Intel Xeon Phi coprocessor or Intel Xeon CPU when performing matrix multiplies or matrix transposes.”
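For a two-dimensional array, the Morton index of an element can be computed by interleaving the bits of its row and column numbers. The following is a minimal sketch, assuming 16-bit row and column indices and the convention that column bits occupy the even bit positions; the names `spread_bits16` and `morton_index` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of x so that a zero bit sits between
 * each pair of original bits (bit i moves to bit 2*i). */
static uint32_t spread_bits16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Morton (Z-order) index of element (row, col): column bits go to
 * even positions, row bits to odd positions. */
uint32_t morton_index(uint32_t row, uint32_t col)
{
    assert(row <= 0xFFFFu && col <= 0xFFFFu);
    return (spread_bits16(row) << 1) | spread_bits16(col);
}
```

With this convention the four elements of every aligned 2 x 2 block receive consecutive indices, and the same holds recursively for 4 x 4 blocks, 8 x 8 blocks, and so on, which is where the locality comes from.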

Sparse Matrix Multiplication

“A parallel implementation of SpMV can be implemented, using OpenMP directives. However, by allocating memory for each core, data races can be eliminated and data locality can be exploited, leading to higher performance. Besides running on the main CPU, vectorization can be implemented on the Intel Xeon Phi coprocessor. By blocking the data in various chunks, various implementations on the Intel Xeon Phi coprocessor can be run and evaluated.”
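As a rough illustration of SpMV with an OpenMP directive, the sketch below multiplies a sparse matrix by a dense vector. It assumes compressed sparse row (CSR) storage, which the quoted text does not specify, and it does not reproduce the per-core memory allocation described there. Instead, it splits the work by rows, so each iteration writes only its own element of `y` and no data race arises. The function name `spmv_csr` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x, with A stored in CSR form:
 *   row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i,
 *   col_idx[k] is the column of nonzero k, val[k] its value. */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    assert(row_ptr && col_idx && val && x && y);

    /* Rows are independent, so the loop can be shared among threads.
     * The pragma takes effect only when OpenMP is enabled at compile
     * time (for example -fopenmp with GCC or Clang); otherwise the
     * loop simply runs serially. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n_rows; ++i) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

A static schedule is a reasonable starting point when rows have similar numbers of nonzeros; matrices with very uneven rows may balance better with a dynamic schedule.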

Fast Matrix Multiply with OpenMP

Many scientific and technical applications use matrix multiplies somewhere in the algorithm, and thus in the code. With today’s multicore CPUs, proper use of compiler directives can speed up matrix multiplies significantly.
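A minimal sketch of a directive-based matrix multiply follows, assuming square row-major matrices of doubles; the function name `matmul_omp` is hypothetical, and this is one common arrangement rather than the specific code discussed in the article.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for row-major n x n matrices. */
void matmul_omp(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);

    /* Each thread computes whole rows of C, so no two threads write
     * the same element. Without OpenMP enabled at compile time the
     * pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double *ci = c + (size_t)i * (size_t)n;
        const double *ai = a + (size_t)i * (size_t)n;

        for (int j = 0; j < n; ++j)
            ci[j] = 0.0;

        /* i-k-j loop order: the innermost loop walks rows of B and C
         * with unit stride, which helps caching and vectorization. */
        for (int k = 0; k < n; ++k) {
            const double aik = ai[k];
            const double *bk = b + (size_t)k * (size_t)n;
            for (int j = 0; j < n; ++j)
                ci[j] += aik * bk[j];
        }
    }
}
```

Only the outer loop is parallelized here; the directive is a single line, and the rest of the function is the same as the serial version.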