insideHPC Special Report Optimize Your WRF Applications – Part 3


This special report sponsored by QCT discusses how the company can work with leading research and commercial organizations to lower the Total Cost of Ownership by supplying highly tuned applications that are optimized to work on leading-edge infrastructure. By reducing the time to get to a solution, more applications can be executed, or higher resolutions can be used on the same hardware. QCT also has experts that understand various HPC workloads in detail and can deliver turnkey systems that are ready to use. For customers that wish to modify source code or that develop their own applications, QCT supplies highly tuned libraries and extensive guidance on how to get the most out of your infrastructure, which includes not only servers but also networking and storage.

This technology guide, insideHPC Special Report Optimize Your WRF Applications, shows how to get your results faster by partnering with QCT.

Benchmark Results

The first comparison measured WRF performance across popular compilers. Of the three compilers used to build WRF and its supporting libraries, the Intel® compiler performs best, leading the others by more than 25%. Figure 2 shows the average execution time per WRF computation timestep: the Intel®-compiled WRF takes roughly 25% less time per timestep than the other two.
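
The per-timestep figures above can be reproduced from WRF's own run logs. Below is a minimal sketch, assuming the usual "Timing for main: ... elapsed seconds" lines that WRF writes to its rsl.out.0000/rsl.error.0000 files; adjust the file name and pattern if your build logs differently.

```python
# Minimal sketch: average WRF's per-timestep cost from its rsl log.
# Assumes the standard "Timing for main: ... : <seconds> elapsed seconds"
# lines; adjust the path/pattern if your build's log layout differs.
import re
import sys

TIMING_RE = re.compile(r"Timing for main:.*:\s+([\d.]+)\s+elapsed seconds")

def mean_step_time(log_path: str) -> float:
    times = []
    with open(log_path) as log:
        for line in log:
            match = TIMING_RE.search(line)
            if match:
                times.append(float(match.group(1)))
    if not times:
        raise ValueError(f"no timestep timings found in {log_path}")
    return sum(times) / len(times)

if __name__ == "__main__":
    # e.g. python mean_step.py rsl.out.0000
    print(f"{mean_step_time(sys.argv[1]):.3f} s per computation timestep")
```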

Next to be investigated were the communication libraries. Figure 3 shows that the InfiniBand-enabled MVAPICH2 (v2.3.4) library decreases WRF execution time by roughly 5% compared with the Intel® MPI (v2020 update 1) and Open MPI (v4.0.3) libraries.
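
One straightforward way to run such a comparison is to time the same case under each MPI stack's own launcher. The sketch below is illustrative only: the launcher paths and rank count are placeholders, and each stack normally requires WRF rebuilt against that library.

```python
# Illustrative driver: time the same WRF case under different MPI launchers.
# Paths, rank counts, and binary names are placeholders, not QCT's setup;
# point each entry at a wrf.exe built against the matching MPI library.
import subprocess
import time

LAUNCHERS = {
    "MVAPICH2": ["/opt/mvapich2/bin/mpirun", "-np", "56"],
    "Intel MPI": ["/opt/intel/mpi/bin/mpirun", "-np", "56"],
    "Open MPI": ["/opt/openmpi/bin/mpirun", "-np", "56"],
}

for name, launcher in LAUNCHERS.items():
    start = time.perf_counter()
    subprocess.run(launcher + ["./wrf.exe"], check=True)
    print(f"{name}: {time.perf_counter() - start:.1f} s wall time")
```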

Impact of latency

Next, InfiniBand was compared with Ethernet. WRF was executed on 1, 4, 8, and 12 nodes over InfiniBand HDR and 10G Ethernet to examine the impact of interconnect latency on performance. Node-to-node latency starts at 1.01 microseconds for a 1-byte packet over InfiniBand HDR, versus 8.7 microseconds over 10G Ethernet. WRF performs three times better over InfiniBand than over Ethernet on four nodes, and approximately six times better on 12 nodes. Figure 4 shows these results.
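
Latency numbers of this kind are what a simple 1-byte ping-pong microbenchmark reports. Below is a minimal sketch using mpi4py (an assumption; the published results were not necessarily gathered this way); run it with one rank on each of two nodes to approximate node-to-node latency.

```python
# Minimal 1-byte ping-pong latency sketch (assumes mpi4py is installed).
# Run with two ranks on two different nodes, e.g.: mpirun -np 2 python pingpong.py
# Half the average round-trip time approximates the one-way latency.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1, dtype=np.uint8)   # 1-byte message
iters = 10000

comm.Barrier()
start = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    print(f"one-way latency ~ {elapsed / (2 * iters) * 1e6:.2f} microseconds")
```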

OpenMP allows different cores to share the same segment of memory. WRF performance is best with OMP_NUM_THREADS=4 and drops by more than 10 percent when the thread count exceeds four. The improvement up to four threads is attributable to the four groups of low-latency cores within each 28-core socket shown in Figure 5: keeping a rank's threads within one of these low-latency groups improves WRF performance. WRF also divides its sub-domains by the OMP thread count, so a thread count that does not divide 28 evenly forces some sub-domains to use cores on both sockets, which drastically increases core-to-core latency. The drop in performance when OMP_NUM_THREADS exceeds four shows the impact on WRF of the extra latency incurred by crossing CPU sockets. Process affinity should therefore be carefully arranged so that each rank's threads stay on adjacent cores. Figure 6 below shows performance as a function of the number of OMP threads.
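
A quick way to confirm that affinity is arranged as intended is to print every rank's core binding before launching WRF. The sketch below assumes Linux, mpi4py, and the 2 x 28-core topology used in these tests; adjust the cores-per-socket value for other systems.

```python
# Minimal sketch: report which CPU cores each MPI rank is bound to, so you can
# verify that a rank's OpenMP threads stay inside one socket. Assumes Linux,
# mpi4py, and 28 cores per socket (cores 0-27 on socket 0, 28-55 on socket 1).
import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
cores = sorted(os.sched_getaffinity(0))          # cores this rank may run on
sockets = {0 if c < 28 else 1 for c in cores}    # assumed 28 cores per socket
warning = "  <-- spans both sockets!" if len(sockets) > 1 else ""
print(f"rank {rank:3d}: cores {cores}{warning}")
```

Running this under the same mpirun options and pinning environment used for WRF makes socket-crossing thread groups immediately visible.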

Next, QCT investigated WRF performance with sub-NUMA clustering (SNC) turned on. SNC splits a single Xeon® 8280 CPU into two groups of cores, which decreases core-to-core latencies within each sub-NUMA domain, as shown in Figure 7. QCT found that turning on SNC increases WRF performance by 1-2 percent when the number of OMP threads is less than two, but performance deteriorates drastically at four threads because a 14-core sub-NUMA domain cannot be divided evenly among four threads, forcing thread groups to run across two sub-NUMA domains. The experiments show the importance of grouping the low-latency cores and avoiding imbalanced OpenMP partitioning of WRF sub-domains.
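
The divisibility argument can be checked with simple arithmetic: any thread count that leaves a remainder against the cores in a socket (28) or a sub-NUMA domain (14, with SNC on) forces some thread group across a boundary. A small sketch:

```python
# Back-of-the-envelope check: does an OMP thread count tile evenly into a
# 28-core socket, and into a 14-core sub-NUMA domain when SNC is on?
# A remainder means some rank's thread group must straddle a boundary,
# which is the latency penalty described above.
for domain, cores in [("socket (SNC off)", 28), ("sub-NUMA domain (SNC on)", 14)]:
    for threads in (1, 2, 4, 7, 14):
        fit = "even fit" if cores % threads == 0 else "straddles the boundary"
        print(f"{domain}: {cores} cores / {threads} threads -> {fit}")
```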

Summary of WRF Benchmarks

The performance of WRF v4.1.5 depends heavily on the compiler and on the latencies between processors and across the interconnect. The Intel® compiler shows excellent execution performance for Fortran codes. The tests on interconnect fabric and protocol, as well as on communication between CPU cores, show the impact of increased latency on WRF execution time. QCT highly recommends using InfiniBand and grouping adjacent OMP threads in low-latency memory (such as the cache on each CPU) to reduce the communication overhead.
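
As a hedged illustration of that recommendation, the launch sketch below pins each rank's OpenMP threads to adjacent cores using the standard OMP_PLACES/OMP_PROC_BIND variables (I_MPI_PIN_DOMAIN applies to Intel MPI only); the exact values are examples, not QCT's published settings.

```python
# One possible way to group adjacent OMP threads before launching WRF.
# OMP_PLACES / OMP_PROC_BIND are standard OpenMP env vars; I_MPI_PIN_DOMAIN
# is Intel MPI specific. Rank count and values here are examples only.
import os
import subprocess

env = dict(os.environ,
           OMP_NUM_THREADS="4",      # best-performing thread count in these tests
           OMP_PLACES="cores",       # one place per physical core
           OMP_PROC_BIND="close",    # keep a rank's threads on adjacent cores
           I_MPI_PIN_DOMAIN="omp")   # Intel MPI: one pinning domain per thread group

subprocess.run(["mpirun", "-np", "14", "./wrf.exe"], env=env, check=True)
```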

QCT Expertise

QCT can work with leading research and commercial organizations to lower the Total Cost of Ownership by supplying highly tuned applications that are optimized to work on leading-edge infrastructure. By reducing the time to get to a solution, more applications can be executed, or higher resolutions can be used on the same hardware. QCT also has experts that understand various HPC workloads in detail and can deliver turnkey systems that are ready to use. For customers that wish to modify source code or that develop their own applications, QCT supplies highly tuned libraries and extensive guidance on how to get the most out of your infrastructure, which includes not only servers but also networking and storage.

For more information on how QCT can help you to maximize your HPC environments, please visit:
https://go.qct.io/solutions/data-analytic-platform/qxsmart-hpc-dl-solution/

Over the last several weeks we’ve explored topics surrounding WRF applications and how you can get your results faster by partnering with QCT.

Download the complete insideHPC Special Report Optimize Your WRF Applications, courtesy of QCT.