In some domains, an N-body simulation is key to solving for the movement and forces of a dynamic system of particles. The technique is most widely associated with astrophysics. At each time step, the force that each body exerts on every other body is computed, and from it the new velocity and position of each body. The simulation continues for a desired number of time steps.
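The time step described above can be written down compactly. The formulation below is a sketch under assumptions the text does not state: Newtonian gravity as the force law and a simple Euler-style update in which the new velocity is used to advance the position. Here G is the gravitational constant, m_j the mass of body j, r_i and v_i the position and velocity of body i, and Δt the time step.

```latex
\begin{align}
\mathbf{a}_i(t) &= G \sum_{j \neq i} m_j \,
    \frac{\mathbf{r}_j(t) - \mathbf{r}_i(t)}
         {\lVert \mathbf{r}_j(t) - \mathbf{r}_i(t) \rVert^{3}} \\
\mathbf{v}_i(t + \Delta t) &= \mathbf{v}_i(t) + \mathbf{a}_i(t)\,\Delta t \\
\mathbf{r}_i(t + \Delta t) &= \mathbf{r}_i(t) + \mathbf{v}_i(t + \Delta t)\,\Delta t
\end{align}
```

For N bodies the sum in the first line is evaluated N times with N - 1 terms each, so one time step costs N(N - 1) pair interactions.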
Computing the velocity and the position of every particle or body directly has O(n²) complexity, because each of the n bodies interacts with all the others. While this is computationally expensive, the direct method is still used when simulating particles in the same domain. The known algorithm has been ported to the Intel Xeon Phi coprocessor. Additional optimizations can be used to speed up the processing.
- Convert the code from an array of structures to a structure of arrays. This allows for a better implementation on the Intel Xeon Phi coprocessor, as the values of each field are then consecutive in memory and can be loaded with unit-stride vector accesses.
- Instruct the compiler to generate code that does not enforce strict IEEE floating-point precision.
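The first optimization in the list above can be illustrated with a short C sketch. It is not the code from the study: the type names `ParticleAoS` and `ParticlesSoA`, the field set, and the helpers `soa_alloc`, `soa_free`, and `aos_to_soa` are hypothetical and chosen only to show the change of layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Array-of-structures layout: the fields of one particle are adjacent,
   so consecutive x values are sizeof(ParticleAoS) bytes apart. */
typedef struct {
    float x, y, z;
    float vx, vy, vz;
    float mass;
} ParticleAoS;

/* Structure-of-arrays layout: each field is its own contiguous array,
   so x[i], x[i+1], ... can be fetched with unit-stride vector loads. */
typedef struct {
    size_t n;
    float *x, *y, *z;
    float *vx, *vy, *vz;
    float *mass;
} ParticlesSoA;

/* Allocate one block and carve it into seven arrays of n floats.
   Returns 0 on success, -1 if the allocation fails. */
int soa_alloc(ParticlesSoA *p, size_t n)
{
    assert(p != NULL && n > 0);
    float *block = malloc(7 * n * sizeof *block);
    if (block == NULL)
        return -1;
    p->n = n;
    p->x  = block;         p->y  = block + n;     p->z  = block + 2 * n;
    p->vx = block + 3 * n; p->vy = block + 4 * n; p->vz = block + 5 * n;
    p->mass = block + 6 * n;
    return 0;
}

void soa_free(ParticlesSoA *p)
{
    free(p->x);  /* x points at the start of the single block */
    p->x = p->y = p->z = p->vx = p->vy = p->vz = p->mass = NULL;
    p->n = 0;
}

/* Copy dst->n particles from the AoS array into the SoA arrays. */
void aos_to_soa(const ParticleAoS *src, ParticlesSoA *dst)
{
    for (size_t i = 0; i < dst->n; ++i) {
        dst->x[i]  = src[i].x;  dst->y[i]  = src[i].y;  dst->z[i]  = src[i].z;
        dst->vx[i] = src[i].vx; dst->vy[i] = src[i].vy; dst->vz[i] = src[i].vz;
        dst->mass[i] = src[i].mass;
    }
}
```

After the conversion, a loop over `x[i]` touches only the x array, whereas the same loop over the AoS layout would pull every other field of each particle through the cache as well.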
An initial test with various numbers of particles showed that the single-precision (SP) implementation peaked at between 10,000 and 60,000 particles and that the double-precision (DP) version peaked at about 30,000 particles. On the Intel Xeon Phi coprocessor 7120P, the DP version ran at one third of the speed of the SP implementation.
Another optimization step was to align the data and tell the compiler about it with the “vector aligned” directive. In addition, moving the OpenMP “parallel” construct outside of an inner loop showed a performance gain. With these two additional optimizations, performance increased significantly.
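The two steps above could look roughly like this in C. This is a sketch under stated assumptions, not the code from the study. The function names `alloc_aligned_floats`, `is_aligned`, and `advance_positions` are hypothetical, and the 64-byte alignment is an assumption chosen to match a 512-bit vector register. The portable OpenMP `simd aligned` clause is used here as a stand-in for the “vector aligned” directive. The text does not say which loop the parallel construct was hoisted out of, so a simple position update inside a time-step loop is used for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* bytes; assumed to match one 512-bit vector register */

/* Allocate n floats on a 64-byte boundary. C11 aligned_alloc expects the
   size to be a multiple of the alignment, so the byte count is rounded up. */
float *alloc_aligned_floats(size_t n)
{
    assert(n > 0);
    size_t bytes  = n * sizeof(float);
    size_t padded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    return aligned_alloc(ALIGNMENT, padded);
}

/* Returns 1 if p lies on an ALIGNMENT-byte boundary, 0 otherwise. */
int is_aligned(const void *p)
{
    return (uintptr_t)p % ALIGNMENT == 0;
}

/* Advance positions by `steps` time steps of size dt.
   The parallel region is opened once, outside the time-step loop, so the
   thread team is set up once instead of on every step. The work-sharing
   loop inside only distributes iterations and ends with a barrier.
   The aligned clause promises that x and vx are 64-byte aligned. */
void advance_positions(float *x, const float *vx, size_t n, int steps, float dt)
{
    #pragma omp parallel
    {
        for (int s = 0; s < steps; ++s) {
            #pragma omp for simd aligned(x, vx : 64)
            for (size_t i = 0; i < n; ++i)
                x[i] += vx[i] * dt;
        }
    }
}
```

Without OpenMP enabled at compile time the pragmas are ignored and the function runs serially with the same result. The arrays passed in must really be aligned, for example by allocating them with `alloc_aligned_floats`, because the `aligned` clause is a promise to the compiler, not a check.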
Further improvements focused on optimizing the use of the memory hierarchy in the system, in particular on finding the right tiling factor and how it relates to the L2 cache. With these techniques, performance improved by almost 50 % for the SP version and 66 % for the DP version. With these optimizations in place, performance also reached close to 90 % of the theoretical performance of the Intel Xeon Phi coprocessor.
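The idea of tiling against the L2 cache can be sketched as follows. The code is illustrative only and rests on several assumptions. The names `pick_tile`, `approx_rsqrt`, and `accel_tiled` are hypothetical. The force law is softened Newtonian gravity in units where G = 1. The reciprocal square root is a small Newton-refined approximation used to keep the sketch self-contained, and `sqrtf` from `math.h` would serve equally well. The L2 size and the fraction of it given to a tile are parameters, since the text does not state the tiling factor that was found to work best.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Largest tile (in particles) whose x, y, z and mass values fit in the given
   fraction of an L2 cache, rounded down to a multiple of 16 floats
   (one 64-byte cache line) and never smaller than 16. */
size_t pick_tile(size_t l2_bytes, double fraction)
{
    assert(l2_bytes > 0 && fraction > 0.0 && fraction <= 1.0);
    size_t budget = (size_t)((double)l2_bytes * fraction);
    size_t tile = budget / (4 * sizeof(float));
    tile -= tile % 16;
    return tile < 16 ? 16 : tile;
}

/* Approximate 1/sqrt(v) for v > 0: bit-level first guess, then three
   Newton-Raphson refinement steps. */
static float approx_rsqrt(float v)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &v, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);
    for (int k = 0; k < 3; ++k)
        y *= 1.5f - 0.5f * v * y * y;
    return y;
}

/* Accelerations from all pair interactions, with the source loop tiled.
   One tile of source particles is reused by every target particle before
   the next tile is touched, so the tile can stay resident in cache.
   eps2 > 0 is a softening term; it also makes the i == j term vanish. */
void accel_tiled(size_t n, size_t tile,
                 const float *x, const float *y, const float *z, const float *m,
                 float eps2, float *ax, float *ay, float *az)
{
    assert(tile > 0 && eps2 > 0.0f);
    for (size_t i = 0; i < n; ++i)
        ax[i] = ay[i] = az[i] = 0.0f;

    for (size_t jj = 0; jj < n; jj += tile) {            /* one tile of sources */
        size_t jend = jj + tile < n ? jj + tile : n;
        for (size_t i = 0; i < n; ++i) {                 /* all targets reuse it */
            float sx = 0.0f, sy = 0.0f, sz = 0.0f;
            for (size_t j = jj; j < jend; ++j) {
                float dx = x[j] - x[i];
                float dy = y[j] - y[i];
                float dz = z[j] - z[i];
                float r2 = dx * dx + dy * dy + dz * dz + eps2;
                float inv = approx_rsqrt(r2);
                float w = m[j] * inv * inv * inv;        /* m_j / r^3 */
                sx += w * dx;
                sy += w * dy;
                sz += w * dz;
            }
            ax[i] += sx;
            ay[i] += sy;
            az[i] += sz;
        }
    }
}
```

As a usage example, `pick_tile(512 * 1024, 0.5)` assumes a 512 KiB L2 cache per core with half of it reserved for the tile, which gives 16384 particles per tile. Both numbers are assumptions to be tuned by measurement, not values taken from the study.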
Source: Intel, Spain and Intel, USA.