
NCAR plus CUDA equal faster weather simulation, community benefits [UPDATED]

NCAR and NVIDIA have started sharing some details about the progress NCAR has made on their modeling workflow using GPU acceleration.

The Weather Research & Forecasting Model (WRF) is the most widely used model in the world, with users including the National Weather Service, the Air Force Weather Agency, foreign weather services and commercial weather forecasting companies. …

In examining ways to improve overall forecasting speed and accuracy, NCAR technical staff in collaboration with the researchers at the University of Colorado at Boulder turned to NVIDIA GPU Computing solutions. After porting to NVIDIA CUDA, there was a 10X improvement in speed for Microphysics, a crucial but computationally intensive component of WRF. Although Microphysics made up less than 1 percent of the model source code, converting it to CUDA resulted in a 20 percent speed improvement for the overall model.
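The link between the 10X kernel speedup and the 20 percent overall gain is Amdahl's law. As a rough sketch (not from the release), the snippet below inverts that law to estimate what share of the original runtime Microphysics would have to occupy. It assumes "20 percent speed improvement" means an overall speedup of 1.2x; the function names are illustrative, and the release does not state the runtime share itself.

```python
def amdahl_speedup(fraction, kernel_speedup):
    """Overall speedup when `fraction` of the runtime is accelerated by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(overall_speedup, kernel_speedup):
    """Invert Amdahl's law: runtime share the accelerated kernel must have had."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)


# Figures quoted in the release; reading "20 percent" as a 1.2x speedup is an assumption.
KERNEL_SPEEDUP = 10.0
OVERALL_SPEEDUP = 1.2

share = implied_fraction(OVERALL_SPEEDUP, KERNEL_SPEEDUP)
# Best case for this one kernel: Microphysics takes no time at all.
ceiling = amdahl_speedup(share, float("inf"))

print(f"Implied Microphysics share of runtime: {share:.1%}")
print(f"Ceiling if Microphysics took zero time: {ceiling:.2f}x")
```

Under that reading, Microphysics would account for roughly 18.5 percent of the runtime despite being less than 1 percent of the source, and no amount of further tuning of that one kernel could push the overall speedup much past 1.23x.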

Good stuff, but the release (which hasn’t been released yet but which you can read in this blog post at InfoWorld) left me wanting a few details, so NVIDIA’s awesome press person got me in touch with NCAR’s John Michalakes for some more detail. Here is the Q&A with John.

How big is the job (processors, memory, simulation scale)?

Ran on up to 48 MPI tasks of a 16-node NCSA Opteron/GTX 5600 cluster. Scale is a 12 km resolution run over the continental U.S., using a 300×425 horizontal grid by 35 vertical layers. Total number of floating point operations required for the 3 hour compute-only (no I/O) benchmark is 3.3E12. This is considered a moderate to small sized problem and is one of the standard WRF (the Weather Research & Forecasting model) benchmark cases. Not sure, offhand, of aggregate memory requirements, but we use fewer than 512MB per task.
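For a feel of the scale John describes, here is a back-of-the-envelope sketch using only the numbers in his answer. The variable names are mine, and the memory figure is only an upper bound derived from "fewer than 512MB per task", not a measured value.

```python
# Numbers taken from the answer above.
NX, NY, NZ = 300, 425, 35      # horizontal grid and vertical layers
TOTAL_FLOP = 3.3e12            # floating point operations for the 3 hour benchmark
MAX_TASKS = 48                 # largest MPI task count used
MB_PER_TASK_LIMIT = 512        # "fewer than 512MB per task"

grid_points = NX * NY * NZ
flop_per_point = TOTAL_FLOP / grid_points
# Upper bound only: every task is assumed to sit right at the 512MB limit.
aggregate_memory_gb = MAX_TASKS * MB_PER_TASK_LIMIT / 1024

print(f"Grid points: {grid_points:,}")
print(f"Floating point operations per grid point: {flop_per_point:,.0f}")
print(f"Aggregate memory bound at {MAX_TASKS} tasks: under {aggregate_memory_gb:.0f} GB")
```

That works out to about 4.5 million grid points, and a total memory footprint below 24 GB when all 48 tasks are in use.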

Are benefits similar for other size jobs where these dimensions change? (I’m trying to get a sense for the general applicability of the result.)

We expect so, yes, though we’re still working on a more general scaling and performance model.

How much effort (person-weeks, whatever) was required for the port?

One semester (4 months) working part-time. Most of the effort involved the conversion from Fortran to C and then debugging and validation with respect to the original kernel.

Were the changes contributed back to the primary code tree for WRF, or is this a separate/proprietary fork?

WRFV3.0, released in May 2008, was distributed to the community with the CUDA-ized microphysics module, making WRF the first weather model released with an option for GPU.

First off, thanks John (and Kerry) for getting back to me with answers. And second, kudos on contributing the work back to the community fork!

The answer to my level of effort question exposes an interesting wrinkle I hadn’t thought of before: the need to convert the accelerated code segments to C before CUDA-izing. That can be a nontrivial problem, especially for big codes. I do note from NVIDIA’s CUDA docs that a Fortran interface to at least part of the CUDA kernel is available, and I’ve sent in a followup question on why that didn’t work for NCAR and what this might mean for you. Stay tuned.

[UPDATE] And here is John’s answer on the Fortran angle. Essentially the “FORTRAN interface to CUDA kernel” doc I mention above is an example of how to call a CUDA kernel from within a Fortran program, but is not a Fortran solution for writing CUDA kernels.

This file contains an example of how to call a CUDA kernel from within a Fortran program, but is not a Fortran solution for writing CUDA kernels themselves. Note that the kernel code provided with the example is C with CUDA extensions. In producing the CUDA kernel of the weather model microphysics, we were developing the kernel itself from about 1,500 lines of native Fortran, not inserting calls to an already developed CUDA kernel. Hence the need to convert Fortran to C first. According to NVIDIA, a Fortran compiler for CUDA is due later this year, and we expect this will help by automating that Fortran-to-C step in the process of migrating the weather code to the GPU.


  1. Calling CUDA from Fortran is the same as C. The Fortran interface allows a programmer to set up the CUDA data structures in Fortran (by using cudaMalloc), then pass them to CUDA for execution.

    The CUDA compiler looks more like a preprocessor than a compiler. It passes modified code to the underlying C compiler (gcc by default) for compilation. What would be nice is a FUDA compiler, where you can write the kernel (the code that executes on a GPU) in Fortran, so you wouldn’t have to translate your code to C first, then to CUDA.


