N-Body Methods Optimization

N-Body problems compute the interaction of each of N bodies with every other body, which results in on the order of N² calculations. Although this is computationally very expensive, the process is well understood, and application code can be optimized using compiler directives and other easy-to-understand techniques. N-Body methods are popular in physics calculations as well as in problems in structural mechanics, fluid mechanics and acoustics, although other methods may be better suited in many cases.

For example, a simple N-Body kernel calculates the forces in a double loop: the outer loop runs over the target bodies, while the inner loop runs over the source bodies. To gain performance, the loops can be rewritten with SIMD intrinsics, which map directly to assembly instructions. This gives more control over the generated code and removes the reliance on what the compiler decides to do. The intrinsics tell the compiler to emit load, store, fmadd and rsqrt operations directly. The arithmetic in the loop body does not have to change, but it is expressed with _mm512_* intrinsic functions operating on __m512 vector registers. Because a 512-bit register holds 16 single-precision values, the outer loop can now advance in strides of 16.

These straightforward techniques can deliver a tremendous performance boost. A test with 65,536 bodies was run on a system consisting of an Intel Xeon E5 v2 processor and an Intel Xeon Phi coprocessor, using the Intel compilers. The maximum performance attained was in the 1.5 Teraflop range, using the compiler options "icc -mmic -openmp -fimf-domain-exclusion=15" and adding the "#pragma simd" directive to the original C code. Using the _mm512 intrinsics, performance was about 1.4 Tflops, even for smaller problem sizes.

In summary, vectorizing the inner loop gave better performance for smaller problem sizes, while for N greater than about 16,384, vectorizing the outer loop showed better performance. The vectorization of the outer loop was made possible by the use of the #pragma simd directive.


Source: King Abdullah University of Science and Technology, Saudi Arabia

 
