PreFetch for Intel Xeon Phi – Part 2

Prefetching data can greatly enhance the performance of an application, whether it runs on a host Intel Xeon CPU or on the Intel Xeon Phi coprocessor. Prefetching techniques can be categorized as either software or hardware methods.

An important aspect of prefetching is the distance: how many loop iterations ahead of the data currently in use a prefetch instruction is issued. This parameter is critical for success. The compiler determines a prefetch distance automatically, and the value it chose can be found in the compiler optimization reports; the option -qopt-report=5 gives the highest level of detail. When compiling an application, a developer can override the defaults with the -qopt-prefetch-distance=n1,n2 compiler option. Here n1 is the distance for prefetching from memory into the L2 cache, while n2 is the distance for prefetching from L2 into the L1 cache. L1 and L2 are the two cache levels.
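To make the idea of distance concrete, here is a minimal sketch of a manual software prefetch in C. It assumes a compiler that provides the `__builtin_prefetch` builtin (GCC, Clang, and the Intel compiler do); the function name `sum_with_prefetch` and its `distance` parameter are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Sum an array while requesting data `distance` iterations ahead.
 * `distance` plays the role of the prefetch distance described above;
 * a good value depends on the hardware and must be found by measurement. */
double sum_with_prefetch(const double *a, size_t n, size_t distance)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* Only prefetch addresses that lie inside the array. */
        if (i + distance < n) {
            /* Second argument 0 = prefetch for reading,
             * third argument 3 = keep the line in all cache levels. */
            __builtin_prefetch(&a[i + distance], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

A prefetch is only a hint, so the result is identical for any distance; only the timing changes. With the Intel compiler, the distance can instead be adjusted without source changes through an option such as -qopt-prefetch-distance=64,8, where the two values here are purely illustrative.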

As an example, prefetch tuning can be applied on a coprocessor. Hardware performance counters make it possible to determine the coverage and the efficiency of prefetching for a given application. Coverage is a measure of how well prefetches replace demand misses during the execution of the code; higher coverage leads to higher performance. Efficiency measures how expensive prefetching was. If many prefetches are issued but turn out to be useless, the cost will outweigh the benefits. If the efficiency is low, then more work has to be done to get the data where it needs to be.
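The two metrics can be sketched as simple ratios. The article does not give exact formulas, so the definitions below are commonly used ones and should be read as assumptions: coverage is the fraction of would-be misses that a prefetch removed, and efficiency is the fraction of issued prefetches that were actually used. The struct and its field names are hypothetical and do not correspond to specific Xeon Phi counter names.

```c
#include <stdint.h>

/* Hypothetical counter values gathered from a profiling run. */
typedef struct {
    uint64_t prefetches_issued;  /* all prefetches sent to the memory system */
    uint64_t prefetches_useful;  /* prefetched lines later hit by a demand access */
    uint64_t demand_misses;      /* misses that still occurred despite prefetching */
} prefetch_counts;

/* Assumed definition: useful prefetches / (useful prefetches + remaining misses). */
double prefetch_coverage(const prefetch_counts *c)
{
    uint64_t would_have_missed = c->prefetches_useful + c->demand_misses;
    if (would_have_missed == 0) {
        return 0.0;  /* nothing to cover */
    }
    return (double)c->prefetches_useful / (double)would_have_missed;
}

/* Assumed definition: useful prefetches / issued prefetches. */
double prefetch_efficiency(const prefetch_counts *c)
{
    if (c->prefetches_issued == 0) {
        return 0.0;  /* no prefetches were issued */
    }
    return (double)c->prefetches_useful / (double)c->prefetches_issued;
}
```

For instance, 1000 issued prefetches of which 500 were useful, with 1500 demand misses remaining, would give a coverage of 0.25 and an efficiency of 0.5 under these definitions.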

For the STREAM TRIAD benchmark, a simple comparison can be made across three configurations:

1 – no prefetching

2 – hardware only prefetching

3 – hardware and software prefetching

As expected in this type of benchmark, the more cache misses, the lower the performance. The best result came from a combination of hardware and software prefetching.
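For reference, the TRIAD kernel computes a[i] = b[i] + q * c[i] over large arrays. The sketch below shows how the software side of such a comparison might be toggled in source; it is not the code used for the measurements reported here. The function name `triad` and the `sw_prefetch` and `distance` parameters are hypothetical, and `__builtin_prefetch` is assumed to be available. Hardware prefetching is a platform setting and cannot be switched from this code.

```c
#include <stddef.h>

/* STREAM TRIAD: a[i] = b[i] + q * c[i].
 * When sw_prefetch is nonzero, both input streams are prefetched
 * `distance` iterations ahead of the element being computed. */
void triad(double *a, const double *b, const double *c,
           double q, size_t n, int sw_prefetch, size_t distance)
{
    for (size_t i = 0; i < n; ++i) {
        if (sw_prefetch && i + distance < n) {
            __builtin_prefetch(&b[i + distance], 0, 3);  /* read stream b */
            __builtin_prefetch(&c[i + distance], 0, 3);  /* read stream c */
        }
        a[i] = b[i] + q * c[i];
    }
}
```

Timing this function with `sw_prefetch` set to 0 and to 1, on a system with hardware prefetching left enabled, would correspond roughly to configurations 2 and 3 above.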

In summary, prefetching can affect the performance of most applications, particularly when the developer understands the memory access patterns better than the compiler does. It is important to test various prefetching compiler options as part of any development activity.

Source: Penn State University, USA; Intel, USA; Intel, Romania
