Simulations of physical processes such as ocean waves or the wake behind a boat, although similar in a number of ways, require different approaches. Because current systems are built from many parallel computational units, it is important to take advantage of the full range of architectural features. The gas dynamics code Hydro2D serves here as a case study in how performance can be examined and improved by exploiting those features.
Scientists always want to add more realism to their simulations, or more cells or atoms, depending on the application and algorithms. Faster performance is always a goal, and algorithms must be tuned continually to take advantage of new hardware features. The compiler is an important part of the computing environment and must be aware of, and take advantage of, the hardware (i.e., the micro-architecture), the runtime, and other development tools. Scientific code may be written by a domain expert and then refined and tuned by an expert programmer. The scientist will know about advances in their field but may not be as expert in computing environments, especially given that code implementation and maintenance may span decades.
Hydro2D is a medium-sized code developed by CEA, a French government agency. At about 5000 lines of C/C++, Hydro2D solves shock hydrodynamics problems in two dimensions. It is a shock-capturing code and uses Godunov’s method to compute solutions to initial boundary-value problems posed by the user. (1)
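To give a feel for the structure of a Godunov-type update, here is a minimal sketch in C. It is not taken from Hydro2D. It assumes the much simpler one-dimensional linear advection equation u_t + a u_x = 0 on a periodic grid as a stand-in for the two-dimensional gas dynamics equations. The names riemann_flux and godunov_step are hypothetical, chosen for this illustration only.

```c
#include <stddef.h>

/* Illustrative assumption: linear advection u_t + a u_x = 0, not the
 * equations solved by Hydro2D. For this equation the exact Riemann
 * solution at a cell interface is simply the upwind state. */
static double riemann_flux(double a, double u_left, double u_right)
{
    return (a > 0.0) ? a * u_left : a * u_right;
}

/* One Godunov step on a periodic grid of n cells (hypothetical helper).
 * Each cell average is updated by the difference of the interface
 * fluxes on its right and left faces. */
void godunov_step(const double *u, double *u_new, size_t n,
                  double a, double dt, double dx)
{
    for (size_t i = 0; i < n; ++i) {
        size_t im = (i == 0) ? n - 1 : i - 1;     /* left neighbour  */
        size_t ip = (i + 1 == n) ? 0 : i + 1;     /* right neighbour */
        double f_left  = riemann_flux(a, u[im], u[i]);
        double f_right = riemann_flux(a, u[i], u[ip]);
        u_new[i] = u[i] - (dt / dx) * (f_right - f_left);
    }
}
```

With a > 0 and a * dt / dx equal to 1, this sketch shifts the profile by exactly one cell per step, which is a convenient sanity check.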
A project was undertaken to improve the performance of Hydro2D. A number of factors were considered, including the memory subsystem, thread-level parallelism, data-level parallelism, and instruction-level parallelism. The test configuration consisted of a two-socket Intel® Xeon® Processor E5-2680 system and an Intel® Xeon Phi™ Coprocessor (Model 7120P). To investigate performance and guide optimization, an initial set of benchmarks was run on the unoptimized code. Initially, the performance of the Intel Xeon Phi was less than half that of the Intel Xeon, and the degree of parallelization was quite low.
A number of optimizations were then performed. Memory usage was examined, and parts of the code were reorganized based on the algorithm. Communication between threads was also reduced. This first set of optimizations increased performance by 2.0 to 3.3X on the Intel Xeon CPUs and by 2.3 to 3.6X on the Intel Xeon Phi.
Data-level parallelism was examined next. Originally the code relied on compiler directives to achieve vectorization, but it was determined that the programmer could do better than the compiler. A closer look at the code led to the use of loop peeling and, on the Intel Xeon Phi, write masking. Loop peeling refers to moving some of the code outside of the loop to handle the first or last iterations, as these can present special or non-optimal cases.
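Loop peeling as described above can be sketched in C as follows. This is an illustrative example, not code from Hydro2D: it assumes a simple backward difference on a periodic array, and the function names diff_branchy and diff_peeled are hypothetical. In the first version the wrap-around case is tested on every iteration. In the second, the first iteration is peeled off so that the remaining loop body is uniform and easier for a compiler to vectorize.

```c
#include <stddef.h>

/* Unpeeled version: the special case i == 0 is checked inside the
 * loop on every iteration. */
void diff_branchy(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        size_t left = (i == 0) ? n - 1 : i - 1;
        out[i] = in[i] - in[left];
    }
}

/* Peeled version: the first iteration is handled outside the loop,
 * leaving a branch-free body for i = 1 .. n-1. */
void diff_peeled(const double *in, double *out, size_t n)
{
    if (n == 0) {
        return;                      /* nothing to do for an empty array */
    }
    out[0] = in[0] - in[n - 1];      /* peeled first iteration (wrap-around) */
    for (size_t i = 1; i < n; ++i) {
        out[i] = in[i] - in[i - 1];  /* uniform body, no special case */
    }
}
```

Both functions compute the same result; only the placement of the special case differs.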
In summary, various techniques can be used to improve the performance of a computationally intensive code. By understanding which parts of an application should run on the coprocessor, in this case the Intel Xeon Phi, substantial performance improvements can be achieved. By studying the code and understanding both the algorithm and the computer architecture, the initial benchmark result was reversed: the coprocessor now outperforms the main processor by 1.3 to 1.5X. Overall, performance increased by up to 12X for the coprocessor and 5X for the processor. The following techniques were used:
- Consider how different mathematically equivalent formulations perform on a given architecture.
- Evaluate hot loops and how vector hardware can be used.
- Look at the work sent to the different processor cores.
- Consider memory hierarchy and how to optimize data movement.
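As an illustration of the first technique in the list, the sketch below shows two mathematically equivalent ways to scale an array. It is an assumed example, not code from Hydro2D, and the names scale_div and scale_mul are hypothetical. Division is typically more expensive than multiplication on most processors, so computing the reciprocal once outside the loop may be faster. The two forms can differ in the last bit of the floating-point result, which is one reason a compiler will not always make this substitution on its own.

```c
#include <stddef.h>

/* Formulation A: one division per element. */
void scale_div(const double *flux, double *out, size_t n, double dx)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = flux[i] / dx;
    }
}

/* Formulation B: one division in total, then one multiplication per
 * element. Equal to A in exact arithmetic; rounding may differ
 * slightly in floating point. */
void scale_mul(const double *flux, double *out, size_t n, double dx)
{
    const double inv_dx = 1.0 / dx;   /* hoisted out of the loop */
    for (size_t i = 0; i < n; ++i) {
        out[i] = flux[i] * inv_dx;
    }
}
```

Which formulation wins depends on the architecture and on how much rounding difference the application can tolerate, so it is worth measuring both.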
(1) Godunov, S.K., 1959. A difference method for numerical calculation of discontinuous solutions of the equations of hydrodynamics. Matematicheskii Sbornik 89 (3) 271-306.
Source: Intel, USA and CEA, France