insideHPC Special Report Optimize Your WRF Applications – Part 2


This special report, sponsored by QCT, discusses how the company works with leading research and commercial organizations to lower Total Cost of Ownership by supplying highly tuned applications optimized for leading-edge infrastructure. Reducing the time to solution means that more applications can be executed, or higher resolutions can be used, on the same hardware. QCT also has experts who understand various HPC workloads in detail and can deliver turnkey systems that are ready to use. For customers that wish to modify source code or develop their own applications, QCT supplies highly tuned libraries and extensive guidance on how to get the most out of the infrastructure, which includes not only servers but networking and storage as well.

This technology guide, insideHPC Special Report Optimize Your WRF Applications, shows how to get your results faster by partnering with QCT.

Introduction to WRF

WRF is a regional weather model with users ranging from researchers to forecasters all over the globe. Noted for being a mature and sophisticated model for weather research, WRF produces initial weather conditions for environmental models, such as air quality models, small-scale Large Eddy Simulation (LES) models, and disaster assessment models. WRF is one of the significant workloads on major High-Performance Computing (HPC) systems, so understanding how WRF performs and behaves under different optimizations could increase HPC efficiency and reduce operating costs.

Similar to other weather and climate models, WRF discretizes the target simulated area into three-dimensional grids. The physical properties of each grid are then dispatched to computational threads to calculate their tendencies (the rate of change of the physical properties in the time step). After each time step is finished, the computed results propagate to the corresponding grids both horizontally and vertically, depending on the calculated direction of the wind. WRF is highly parallelized and can use the distributed-memory method with MPI (for example, MPICH), the shared-memory method with OpenMP, or a hybrid of both techniques.
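To make the distributed-memory idea concrete, the following is a minimal sketch of how a horizontal grid might be split into rectangular patches, one per MPI rank. It is illustrative only and is not WRF's own decomposition routine; the function names, the 100 x 60 grid, and the 4 x 3 process layout are all hypothetical.

```python
def split_1d(n_points, n_parts):
    """Split n_points into n_parts contiguous ranges whose sizes differ by at most 1."""
    base, extra = divmod(n_points, n_parts)
    bounds = []
    start = 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def decompose(nx, ny, ranks_x, ranks_y):
    """Return {rank: ((x0, x1), (y0, y1))} for a ranks_x by ranks_y process grid."""
    x_ranges = split_1d(nx, ranks_x)
    y_ranges = split_1d(ny, ranks_y)
    patches = {}
    for j, y_range in enumerate(y_ranges):
        for i, x_range in enumerate(x_ranges):
            patches[j * ranks_x + i] = (x_range, y_range)
    return patches


# Hypothetical example: a 100 x 60 grid shared by 12 ranks arranged 4 x 3.
patches = decompose(nx=100, ny=60, ranks_x=4, ranks_y=3)
print(len(patches))  # 12 patches, one per rank
print(patches[0])    # ((0, 25), (0, 20))
```

In a hybrid run, each patch would additionally be divided among the OpenMP threads of its rank; the same splitting helper could be reused for that step.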

Characteristics of the workload

Because of the grid approach and the parallelization of WRF, a large amount of data is transferred between grids after each time step is completed. Thus, the overall performance depends on high memory bandwidth and a low-latency interconnect. The output, which is a massive list of variables from all the grids, requires high storage bandwidth. QCT investigated how latency and bandwidth, both inside the processors and on the chosen interconnect, affect WRF performance.
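A rough way to see why the interconnect matters more as a run is spread over more processes is to compare the boundary data a process must receive from its neighbours with the interior data it computes. The sketch below assumes square patches and a boundary (halo) width of 3 cells; both the patch sizes and the halo width are assumed values chosen for illustration, not figures from the QCT benchmark.

```python
def halo_fraction(patch_nx, patch_ny, halo_width):
    """Ratio of halo cells (received from neighbours) to interior cells for one patch."""
    interior = patch_nx * patch_ny
    with_halo = (patch_nx + 2 * halo_width) * (patch_ny + 2 * halo_width)
    return (with_halo - interior) / interior


# Hypothetical square patches of shrinking size; halo width of 3 is an assumption.
for side in (200, 100, 50, 25):
    print(side, round(halo_fraction(side, side, halo_width=3), 3))
```

As the patch shrinks, the halo grows relative to the interior, so a larger share of each time step is spent exchanging data rather than computing.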

Benchmark settings

QCT ran the WRF benchmarks on a total of three QuantaPlex T42D-2U servers. Each T42D-2U server consists of four dual-socket computing nodes in a 2U form factor.

In total, twelve nodes were used to evaluate the scalability of WRF performance. Each node has two Second-Generation Intel® Xeon® 8280 Scalable Processors (28 cores at 2.7 GHz base frequency) and 384 GB of DDR4-2933 memory, for a total of 56 cores and 384 GB of memory per node, or 224 cores and 1536 GB of memory in each four-node T42D-2U system. Each node connects to the other computing nodes and to the storage nodes over 10 Gbit/s Ethernet and InfiniBand HDR-100 (100 Gbit/s) networks. The BeeGFS parallel file system is used as the underlying file system to maximize storage throughput. The hardware specification is listed in the inset image.
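The cluster-wide totals follow directly from the per-node figures. A short sketch of the arithmetic, using only the numbers given above (the variable names are ours):

```python
CORES_PER_SOCKET = 28
SOCKETS_PER_NODE = 2
MEMORY_PER_NODE_GB = 384
NODES_PER_SERVER = 4
SERVERS = 3

cores_per_node = CORES_PER_SOCKET * SOCKETS_PER_NODE       # 56
cores_per_server = cores_per_node * NODES_PER_SERVER       # 224
total_nodes = NODES_PER_SERVER * SERVERS                   # 12
total_cores = cores_per_node * total_nodes                 # 672
total_memory_gb = MEMORY_PER_NODE_GB * total_nodes         # 4608
memory_per_core_gb = MEMORY_PER_NODE_GB / cores_per_node   # about 6.9

print(total_nodes, total_cores, total_memory_gb, round(memory_per_core_gb, 1))
```

The full twelve-node configuration therefore offers 672 cores and 4608 GB of memory, or roughly 6.9 GB per core.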

WRF settings

QCT used WRF V4.1.5 for the benchmark investigation. Following the work of Kyle (2018), QCT created a new CONUS 2.5km domain for version 4 of WRF. The CONUS 2.5km domain, shown in Figure 1 below, consists of 1901 x 1301 grid points and 40 vertical layers. Computation performance was measured by the averaged WRF computation time of each time step, and output performance was measured by the averaged WRF output time of each output time step.
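One way such averages could be collected is by parsing the timing lines that WRF writes to its run log. The sketch below assumes lines of the form "Timing for main: ... elapsed seconds" for computation steps and "Timing for Writing ... elapsed seconds" for output steps; the sample log, its dates, and its timings are invented for illustration, and this is not necessarily the procedure QCT used.

```python
import re
from statistics import mean

# Hypothetical log excerpt; the values are made up for this example.
SAMPLE_LOG = """\
Timing for Writing wrfout_d01_2018-06-17_00:00:00 for domain        1:    4.00000 elapsed seconds
Timing for main: time 2018-06-17_00:00:15 on domain   1:    2.00000 elapsed seconds
Timing for main: time 2018-06-17_00:00:30 on domain   1:    1.00000 elapsed seconds
Timing for main: time 2018-06-17_00:00:45 on domain   1:    1.50000 elapsed seconds
Timing for Writing wrfout_d01_2018-06-17_01:00:00 for domain        1:    3.00000 elapsed seconds
"""

# Capture the kind of step and the elapsed seconds at the end of the line.
TIMING = re.compile(r"Timing for (main|Writing)\b.*:\s*([0-9.]+) elapsed seconds")


def average_timings(log_text):
    """Return (mean seconds per computation step, mean seconds per output step)."""
    compute, output = [], []
    for line in log_text.splitlines():
        match = TIMING.search(line)
        if match is None:
            continue  # ignore lines that are not timing records
        kind, seconds = match.group(1), float(match.group(2))
        (compute if kind == "main" else output).append(seconds)
    return mean(compute), mean(output)


avg_compute, avg_output = average_timings(SAMPLE_LOG)
print(avg_compute, avg_output)  # 1.5 3.5
```

Keeping the two averages separate mirrors the benchmark's distinction between computation time and output time.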

Compiler Options

QCT used three different compilers, each at the latest available version, to compile WRF and its dependent libraries (OpenMP/MVAPICH, NetCDF, HDF5). The three compilers tested were the GNU compiler version 9.2.0, the AOCC compiler version 2.1.0 by AMD, and the Intel® Fortran compiler (part of Intel® Composer XE version 2020). The compiler flags that differ from the default WRF settings are listed below:

  • GNU compiler version 9.2.0 (gcc): -O3 (the default is -O2).
  • AOCC compiler version 2.1.0 (aocc): -O3 and -Mbyteswapio. The settings were adapted from the WRF default GNU compiler settings to CLANG/FLANG, with -O2 changed to -O3; -Mbyteswapio ensures the correct endianness of WRF input/output.
  • Intel® compiler version 19.1.1 (v2020) (ifort): -xCORE-AVX512 (or -xHost when building on an AVX-512 host).
    Optimized for the Cascade Lake Xeon® 8280, utilizing the full 512-bit SIMD instruction set.

Over the next few weeks we will explore topics surrounding WRF applications and how you can get your results faster by partnering with QCT.

Download the complete insideHPC Special Report Optimize Your WRF Applications, courtesy of QCT.