

Prefetching Data for Intel Xeon Phi

Prefetching of data is known to improve performance for many applications. Through a combination of hardware and software, most modern systems include some level of data prefetching. Issuing data accesses before the data is actually required reduces the latency of memory access.

Sophisticated prefetchers can detect memory access patterns and use this knowledge to reduce cache misses. While many prefetching techniques are handled by the compiler or by the hardware, knowledge of the application's algorithms can greatly benefit performance by getting data to the processor with reduced latency.
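That algorithmic knowledge can be expressed directly in source code. A minimal sketch using the GCC/Clang `__builtin_prefetch` builtin (the 64-element prefetch distance is an assumption that must be tuned per platform):

```c
#include <stddef.h>

#define PREFETCH_DIST 64  /* elements ahead; an assumed value, tune per platform */

/* Sum an array, issuing a software prefetch a fixed distance ahead of
 * the current element so the data is (ideally) already in cache when
 * the loop reaches it. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            /* args: address, 0 = read access, 3 = high temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

Because the access pattern here is a simple unit stride, hardware prefetchers usually handle it well on an out-of-order CPU; the explicit prefetch matters more on in-order cores.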

Prefetching on a coprocessor such as the Intel Xeon Phi coprocessor can be more important than on a main CPU such as the Intel Xeon CPUs. Since the cores on the Intel Xeon Phi coprocessor execute in order, they cannot hide memory latency the way an out-of-order CPU can. In addition, since the coprocessor does not have an L3 cache, an L2 miss must go straight to the slower memory subsystem.

There are two types of prefetch instructions on the Intel Xeon Phi coprocessor:

  • VPREFETCH1 – prefetches data from memory to L2
  • VPREFETCH0 – prefetches data from L2 to L1

Ideally, data should be prefetched from memory to L2 and from L2 to L1 before being used by the core. Thus, when the core needs the data, it is in the closest cache to the processor.
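On mainstream compilers, this two-level scheme can be approximated with the locality hint of `__builtin_prefetch`: a far prefetch with a low-locality hint (roughly playing the role of VPREFETCH1, memory to L2) and a near prefetch with a high-locality hint (roughly VPREFETCH0, L2 to L1). The two distances, and the mapping of locality hints onto specific cache levels, are assumptions; on the coprocessor itself the Intel compiler emits the VPREFETCH instructions directly.

```c
#include <stddef.h>

#define FAR_DIST  128  /* elements: staged toward L2 (assumed distance) */
#define NEAR_DIST 16   /* elements: staged toward L1 (assumed distance) */

/* Scale an array in place with a two-level software-prefetch scheme:
 * a far prefetch stages data into an outer cache level well ahead of
 * use, and a near prefetch pulls it close just before use. */
void scale_two_level(double *a, size_t n, double s)
{
    for (size_t i = 0; i < n; i++) {
        if (i + FAR_DIST < n)
            __builtin_prefetch(&a[i + FAR_DIST], 1, 1);  /* write, low locality */
        if (i + NEAR_DIST < n)
            __builtin_prefetch(&a[i + NEAR_DIST], 1, 3); /* write, high locality */
        a[i] *= s;
    }
}
```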

An example of this use is STREAM, a synthetic benchmark that measures sustainable memory bandwidth. Four vector kernels are measured: Copy, Scale, Add, and Triad. By first analyzing the application with Intel VTune Amplifier, the number of cache misses can be determined, along with other performance data. One example is the coverage metric, defined as how well prefetches replace demand misses at execution. Perfect coverage would mean that all the required data is already in the cache and there are no cache misses (other than the prefetches themselves).
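The Triad kernel itself is a one-line vector operation, which makes it a convenient place to experiment with software prefetching. A sketch with prefetches on the two streamed inputs (the distance is an assumption to tune; the kernel form matches STREAM's Triad):

```c
#include <stddef.h>

#define DIST 32  /* prefetch distance in elements; an assumed value, tune per platform */

/* STREAM Triad kernel: a[i] = b[i] + scalar * c[i], with software
 * prefetches issued on both input streams a fixed distance ahead. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n) {
            __builtin_prefetch(&b[i + DIST], 0, 3);  /* read, high locality */
            __builtin_prefetch(&c[i + DIST], 0, 3);
        }
        a[i] = b[i] + scalar * c[i];
    }
}
```

Comparing the cache-miss counts of this version against the plain loop in a profiler such as VTune is one way to observe the coverage metric described above.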

There’s more to come about prefetching in a future article, so stay tuned.

Source:  Penn State University, USA; Intel, USA; Intel, Romania

