The Intel Xeon Phi coprocessor is an example of a many-core system that can greatly increase the performance of an application when used correctly. Simply recompiling a serial application and expecting tremendous performance gains will not work; parts of the application must be rewritten to take advantage of the coprocessor's architecture.
Peak performance is the theoretical maximum for a given piece of computing hardware. Users will never reach it, because it assumes all-out execution: every instruction slot is filled and memory access is never a bottleneck. For the Intel Xeon Phi coprocessor, peak performance is given by the following equation:
Peak Performance = clock frequency × number of cores × 16 lanes × 2 flops/cycle
For double-precision performance, the 16 lanes are reduced to 8. Currently, this gives a peak of 2129.6 gigaflops for single precision and 1064.8 gigaflops for double precision; both figures exceed 1 teraflop.
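Plugging numbers into the equation reproduces both figures. A minimal sketch, assuming a 61-core coprocessor clocked at 1.091 GHz (the exact clock and core count vary by model and are an assumption here, chosen because they yield the peaks quoted above):

```python
def peak_gflops(clock_ghz, cores, lanes, flops_per_cycle=2):
    """Peak performance in gigaflops: clock x cores x SIMD lanes x flops/cycle."""
    return clock_ghz * cores * lanes * flops_per_cycle

# Assumed model: 61 cores at 1.091 GHz; 16 SP lanes, 8 DP lanes
sp = peak_gflops(1.091, 61, lanes=16)   # single precision
dp = peak_gflops(1.091, 61, lanes=8)    # double precision
print(round(sp, 1), round(dp, 1))       # 2129.6 1064.8
```

Halving the lane count is the only change between the single- and double-precision cases, which is why the double-precision peak is exactly half the single-precision one.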
Memory subsystem bandwidth is also critical when designing applications for this type of hardware. The GDDR5 memory the Intel Xeon Phi uses has a peak bandwidth of 352 gigabytes per second, and achievable bandwidth is typically in the 50% to 60% range of that value. It is important to understand these values when architecting an HPC application.
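As a quick sanity check, the achievable-bandwidth window follows directly from the 352 GB/s peak and the 50% to 60% rule of thumb stated above:

```python
PEAK_BW_GBS = 352.0  # GDDR5 peak bandwidth, in gigabytes per second

# Rule-of-thumb achievable range: 50% to 60% of peak
low, high = 0.50 * PEAK_BW_GBS, 0.60 * PEAK_BW_GBS
print(f"Expect roughly {low:.0f} to {high:.1f} GB/s in practice")
```

So a kernel sustaining around 176 to 211 GB/s is already near the practical ceiling, even though it is well short of the advertised peak.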
Since the Intel Xeon Phi coprocessor actually runs a copy of Linux, users must either log into it as a node or have the coprocessor start jobs automatically. To begin, an application is compiled with flags that generate Many Integrated Core (MIC) instructions, which then run on the coprocessor. Simple programs can be written and run on the Intel Xeon Phi coprocessor to measure peak performance and memory bandwidth.
Further investigation of an application can reveal areas that need to be optimized and run with more than one thread. Using only a single thread on the coprocessor wastes resources; applications need to take advantage of the number of cores available. Understanding the algorithms used in an application is critical to gaining better performance and moving toward peak.
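The point about spreading work across every available core can be illustrated with a generic loop-decomposition sketch. Plain Python threads stand in here purely to show the decomposition pattern (a real Xeon Phi code would typically use OpenMP in C or Fortran, and Python's interpreter lock would prevent an actual speedup for compute-bound work); the worker count is an arbitrary example value:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """Sum one contiguous chunk of the iteration space."""
    lo, hi = bounds
    return sum(range(lo, hi))

def parallel_sum(n, workers):
    # Split the loop into one contiguous chunk per worker, mirroring
    # how an OpenMP static schedule divides iterations across cores.
    step = (n + workers - 1) // workers
    chunks = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(1_000_000, 8) == sum(range(1_000_000)))  # True
```

The decomposition only pays off when each chunk carries enough work to amortize the cost of starting and synchronizing the threads, which is exactly why understanding the application's algorithms matters before parallelizing.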
Source: Intel, USA