Applications that use 3D Finite Difference (3DFD) calculations are numerically intensive and can be optimized heavily to take advantage of the accelerators available in today’s systems. The performance of an implementation built on numerical stencils can and should be optimized. Choices made when designing and implementing the algorithms affect the Arithmetic Intensity (AI), a measure of how efficient an implementation is that compares the floating-point operations (flops) it performs with the memory it accesses.
For a given computer system, the peak performance in flops can be calculated as a function of the clock speed, the number of cores and sockets, and the processing power per CPU cycle. The achievable rate can be determined from two benchmarks: LINPACK, a highly optimized linear equation solver, and Stream Triad, which measures memory bandwidth. The AI of a system can be calculated in two ways.
- Divide the theoretical peak performance by the theoretical memory bandwidth: AI (theoretical) = CPU (theoretical) / Memory Bandwidth (theoretical).
- Divide the LINPACK result, in flops, by the memory bandwidth measured with Stream Triad: AI (achieved) = LINPACK / Stream.
These values can be computed for either the main Intel Xeon CPU system or the Intel Xeon Phi coprocessor system.
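The two system-level ratios above can be sketched in a few lines. This is a minimal illustration, not a measurement: the function names are made up for the example, and every hardware figure (clock, core count, flops per cycle, bandwidth, LINPACK and Stream results) is a hypothetical placeholder to be replaced with values for the machine at hand.

```python
def peak_gflops(clock_ghz, cores_per_socket, sockets, flops_per_cycle):
    """Theoretical peak in GFLOP/s: clock x cores x sockets x flops per cycle."""
    return clock_ghz * cores_per_socket * sockets * flops_per_cycle


def system_ai(gflops, bandwidth_gbs):
    """System arithmetic intensity in flops per byte."""
    return gflops / bandwidth_gbs


# Hypothetical two-socket machine; all numbers are placeholders.
peak = peak_gflops(clock_ghz=2.5, cores_per_socket=8, sockets=2, flops_per_cycle=8)

# AI (theoretical): peak flops over the theoretical memory bandwidth.
ai_theoretical = system_ai(peak, bandwidth_gbs=100.0)

# AI (achieved): stand-ins for a measured LINPACK rate and Stream Triad bandwidth.
ai_achieved = system_ai(gflops=280.0, bandwidth_gbs=80.0)

print(f"peak = {peak} GFLOP/s")
print(f"AI (theoretical) = {ai_theoretical:.2f} flops/byte")
print(f"AI (achieved)    = {ai_achieved:.2f} flops/byte")
```

The same two functions apply unchanged to a host CPU or a coprocessor; only the input figures differ.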
An additional measurement of performance can be calculated when the exact instruction mix and memory access pattern are understood for the application, or more specifically for the inner kernels of the code. The AI of a kernel can be calculated by adding the number of ADDs and MULs and dividing by the number of LOADs and STOREs multiplied by the word size. With some simple modifications, these AI values (flops/byte) can be plotted as rooflines, from which the maximum achievable performance can be read.
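As a rough sketch of that calculation, the snippet below computes a kernel AI from operation counts and then applies the usual roofline bound, the smaller of the compute peak and AI times memory bandwidth. The operation counts describe a hypothetical 7-point stencil update in single precision with no cache reuse assumed, and the peak and bandwidth figures are placeholders, not data from the article.

```python
def kernel_ai(adds, muls, loads, stores, word_bytes=4):
    """Arithmetic intensity of a kernel in flops per byte."""
    flops = adds + muls
    bytes_moved = (loads + stores) * word_bytes
    return flops / bytes_moved


def roofline_gflops(ai, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s: limited by compute or by memory traffic."""
    return min(peak_gflops, ai * bandwidth_gbs)


# Hypothetical per-point counts for a 7-point stencil (assumed, not measured):
# 6 adds, 2 multiplies, 7 loads, 1 store, 4-byte words.
ai = kernel_ai(adds=6, muls=2, loads=7, stores=1, word_bytes=4)

# Placeholder machine limits.
bound = roofline_gflops(ai, peak_gflops=320.0, bandwidth_gbs=80.0)

print(f"kernel AI = {ai} flops/byte, attainable = {bound} GFLOP/s")
```

With these assumed numbers the kernel sits well to the left of the roofline ridge, so the bound is set by memory bandwidth rather than by peak flops.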
Once this is understood, the code can be optimized. Performance increases can be obtained by removing branches inside loops, reducing the number of cache misses and vectorizing loops. Additional steps might include unrolling loops, factorizing coefficients, improving memory alignment and allocation, and making better use of registers. Detailed knowledge of the application and its mathematical algorithms is needed to make these modifications.
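To make one of these steps concrete, here is a small sketch of coefficient factorization on a symmetric 1D three-point stencil. It is an illustrative example rather than code from the original work: the function names and the test data are hypothetical, and plain Python is used only to show the arithmetic, not to reach high performance. Grouping the two neighbours that share a coefficient saves one multiplication per point.

```python
def stencil_naive(u, c0, c1):
    """Three-point stencil: 3 multiplies and 2 adds per interior point."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = c0 * u[i] + c1 * u[i - 1] + c1 * u[i + 1]
    return out


def stencil_factored(u, c0, c1):
    """Same stencil with c1 factored out: 2 multiplies and 2 adds per point."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = c0 * u[i] + c1 * (u[i - 1] + u[i + 1])
    return out


# Hypothetical input: u[i] = i^2, whose second difference is 2 everywhere.
u = [float(i * i) for i in range(8)]
naive = stencil_naive(u, c0=-2.0, c1=1.0)
factored = stencil_factored(u, c0=-2.0, c1=1.0)
print(naive[1:-1])
print(factored[1:-1])
```

In floating point the two forms can differ in the last bits because the order of operations changes, so results should be compared with a tolerance on real data.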
A three-step methodology can be applied to a variety of applications when performance is critical.
- Estimate the best achievable performance through understanding of the underlying system architecture.
- Tune the application for parallelism, data locality and vectorization.
- Use auto-tuning for the best set of parameters.
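The auto-tuning step can be pictured as an exhaustive search over candidate parameter values, keeping whichever combination measures best. The sketch below is an assumption about what such a tuner might look like, not the tool used by the authors: `autotune`, `time_once`, and the `block_x`/`block_y` parameter names are all hypothetical, and the stand-in cost function in the usage lines replaces a real kernel timing.

```python
import itertools
import time


def time_once(run, params):
    """Wall-clock seconds for one call of run(**params)."""
    start = time.perf_counter()
    run(**params)
    return time.perf_counter() - start


def autotune(run, search_space, measure=time_once, repeats=3):
    """Try every combination in search_space; return (best_params, best_cost).

    search_space maps a parameter name to a list of candidate values.
    measure(run, params) returns a cost where lower is better.
    """
    names = list(search_space)
    best_params, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        # Take the best of several repeats to damp timing noise.
        cost = min(measure(run, params) for _ in range(repeats))
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost


# Usage with a deterministic stand-in cost instead of a real kernel timing.
space = {"block_x": [16, 32, 64], "block_y": [4, 8]}
fake_cost = lambda run, p: abs(p["block_x"] - 32) + abs(p["block_y"] - 8)
best, cost = autotune(run=None, search_space=space, measure=fake_cost)
print(best, cost)
```

Exhaustive search is only practical for small parameter spaces; larger ones would need a smarter search strategy.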
The process for gaining maximum performance from an application should be understood. A combination of hand tuning and automatic optimization will yield the best performance, and the maximum performance indicated by the different AIs should remain the goal throughout.
Source: Intel, France and Intel, USA
One important optimization worth mentioning is cache blocking; a good paper on it is “3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs” [1]. Fortunately, stencil codes are a well-studied subject, so library solutions such as Physis [2] and LibGeoDecomp [3] are readily available.
[1] http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5645463&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5645463
[2] https://github.com/naoyam/physis
[3] https://libgeodecomp.org/