Sponsored Post
Few problems are more computationally intense than magnetohydrodynamics (MHD) simulations for astrophysics. Even with the best algorithms and hardware, some calculations can take weeks to complete.
Simulation, in the form of mathematical modeling, is used to discover the evolutionary processes that created and continue to shape the universe. Clearly, performing such experiments in a laboratory here on Earth is just not possible. But simulating these complex cosmic processes at high resolution is possible, and it requires the most powerful supercomputers.
At Novosibirsk State University (NSU), a major research and education center in Siberia, astrophysicists needed to optimize performance of the AstroPhi project codes they were developing for Intel® Xeon Phi™ processor-based hardware. This valuable project helps students learn to create numerical simulation codes for massively parallel supercomputers.
A key aspect of the AstroPhi project was optimizing the code for maximum performance on the Intel Xeon Phi processors. Before optimization, the team had difficulty identifying vector dependencies and choosing the best vector sizes. The goals for optimizing the code were to remove vector dependencies that inhibited optimization and to optimize memory load operations by efficiently adapting vector and array sizes for the Intel Xeon Phi architecture. To help achieve these goals, the team turned to Intel Advisor and Intel Trace Analyzer and Collector, tools that are part of Intel Parallel Studio XE.
The NSU team co-designed a new solver for massively parallel architectures based on Intel Xeon Phi processors. They based the solver on Intel Advanced Vector Extensions 512 (Intel AVX-512) instructions. These instructions deliver 512-bit SIMD support and enable programs to pack eight double-precision or 16 single-precision floating-point numbers, or eight 64-bit integers, or 16 32-bit integers within the 512-bit vectors. This enables processing twice the number of data elements that AVX/AVX2 can process with a single instruction, and 4X that of SSE.
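The lane counts above follow directly from dividing the register width by the element width. As a minimal sketch (the helper name `lanes` and the table of register widths are illustrative and not part of the AstroPhi code), the arithmetic can be checked like this:

```python
# Register widths in bits for each SIMD generation mentioned in the text.
REGISTER_BITS = {"SSE": 128, "AVX/AVX2": 256, "AVX-512": 512}


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one vector register."""
    return register_bits // element_bits


if __name__ == "__main__":
    for name, bits in REGISTER_BITS.items():
        print(f"{name}: {lanes(bits, 64)} doubles, {lanes(bits, 32)} floats")
```

For double precision, this gives eight lanes for AVX-512 against four for AVX/AVX2 and two for SSE, which is where the 2X and 4X figures come from.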
On today’s processors, it is crucial to both vectorize (using SIMD instructions such as AVX) and parallelize software to realize the full performance potential of the processor. Using Intel Advisor, part of Intel Parallel Studio XE, the team was able to perform a roofline analysis that highlights poorly performing loops and shows the performance headroom for each loop, identifying which loops can be improved and which are worth improving.
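To make the idea of "headroom" concrete, here is a minimal sketch of the basic roofline model, in which a loop's attainable performance is the smaller of the machine's peak compute rate and the product of the loop's arithmetic intensity and the memory bandwidth. The machine numbers and the loop measurements below are hypothetical, chosen only for illustration; Intel Advisor measures these values itself and draws additional roofs (for example, per cache level) that this sketch ignores.

```python
def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Basic roofline: min(compute roof, arithmetic intensity * memory bandwidth).

    intensity is in FLOP per byte moved, bandwidth in GB/s, result in GFLOP/s.
    """
    return min(peak_gflops, intensity * bandwidth_gbs)


def headroom(measured_gflops: float, intensity: float,
             peak_gflops: float, bandwidth_gbs: float) -> float:
    """How many times faster the loop could run before hitting its roof."""
    return attainable_gflops(intensity, peak_gflops, bandwidth_gbs) / measured_gflops


if __name__ == "__main__":
    # Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
    PEAK, BW = 1000.0, 100.0
    # Hypothetical loop: 0.25 FLOP/byte, currently running at 10 GFLOP/s.
    print(attainable_gflops(0.25, PEAK, BW))  # memory-bound roof for this loop
    print(headroom(10.0, 0.25, PEAK, BW))     # remaining speed-up factor
```

A loop far below its roof is worth optimizing; a loop already at the memory-bound roof only gets faster if its arithmetic intensity increases, for example through better data reuse.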
The team reported that Intel Advisor made it easier to identify bottlenecks and determine the best optimization strategies by forecasting performance gains in various scenarios, greatly reducing wasted implementation time. Intel Advisor gave the project team tips for effective vectorization, along with key data such as trip counts, data dependencies, and memory access patterns, to make vectorization safe and efficient.
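Why data dependencies matter for safe vectorization can be shown with a small simulation. The sketch below is illustrative only and is not taken from the AstroPhi code: it compares a scalar loop with a loop-carried dependence (each element needs the result of the previous iteration) against a naive "vectorized" version in which all lanes load their inputs before any lane stores its result, roughly how a SIMD instruction behaves. The function names and the width of 4 are assumptions made for the example.

```python
def scalar_recurrence(a, b):
    """Reference loop: out[i] depends on out[i-1] from the previous iteration."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + b[i]
    return out


def naive_vector_recurrence(a, b, width=4):
    """Simulated SIMD execution that ignores the dependence.

    Each chunk of `width` iterations loads all its inputs first and only
    then stores, so later lanes read stale values of out[i-1].
    """
    out = list(a)
    for start in range(1, len(out), width):
        stop = min(start + width, len(out))
        loaded = [out[i - 1] for i in range(start, stop)]  # all loads first
        for lane, i in enumerate(range(start, stop)):      # then all stores
            out[i] = loaded[lane] + b[i]
    return out


if __name__ == "__main__":
    a = [1, 0, 0, 0, 0]
    b = [0, 1, 1, 1, 1]
    print(scalar_recurrence(a, b))        # [1, 2, 3, 4, 5]
    print(naive_vector_recurrence(a, b))  # [1, 2, 1, 1, 1]
```

The two results differ, which is why a compiler refuses to vectorize such a loop unless the dependence is removed or restructured, and why a report of dependencies per loop is useful before forcing vectorization.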
Also, using the graphical Intel Trace Analyzer and Collector increased the team’s understanding of the application’s MPI communication behavior across nodes. Here too they were quickly able to find bottlenecks, improve correctness, and maximize the application’s performance on Intel architecture. MPI communications profiling and analysis features helped to improve application scaling.
By optimizing their applications with tools from Intel Parallel Studio XE, and running on the latest Intel hardware, the NSU team achieved a performance speed-up of roughly 3X, cutting the standard time for calculating one problem from one week to just two days.
Intel Parallel Studio XE is a comprehensive software development suite of compilers and tools that gives developers the ability to maximize application performance on today’s and future processors by taking advantage of the ever-increasing processor core count and vector register width.