Exascale: SuperLU/STRUMPACK Solvers Get Frontier Upgrade

SuperLU and STRUMPACK have been used by simulation codes for projects such as the ITER tokamak, which will be the largest fusion device of its kind when built. Image: US ITER

By Coury Turczyn, Oak Ridge National Laboratory

In 2016, the Department of Energy’s Exascale Computing Project (ECP) set out to develop advanced software for exascale-class HPC systems capable of a quintillion (10¹⁸) or more calculations per second. That meant rethinking and optimizing scientific applications and software tools to leverage a thousandfold increase in computing power. That time has arrived: the first DOE exascale computer — the Oak Ridge Leadership Computing Facility’s Frontier supercomputer — has opened to users around the world. This article is part of the “Exascale’s New Frontier” series exploring the applications and software driving scientific discoveries in the exascale era.

Why Exascale Needs SuperLU and STRUMPACK

Scientists use supercomputers to run simulations that model different natural phenomena — from core-collapse supernovas to the molecular dynamics occurring within a drop of water. Many of these codes must solve sparse linear systems of equations, whose matrices contain mostly zero-valued elements, to produce their simulations. One of the most efficient approaches to solving such systems in large-scale multiphysics and multiscale modeling codes is to use factorization-based algorithms.

SuperLU (named after lower-upper factorization) and STRUMPACK (STRUctured Matrix PACKage) are two such open-source, factorization-based sparse solvers that have been widely used for simulation in both industry and academia — SuperLU since 1999 and STRUMPACK since 2015. However, with the advent of exascale-class supercomputers that enable much larger and higher-resolution simulations, these two CPU-centric packages required major updates to run well on the new GPU-accelerated architectures.
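For readers who want a feel for the factor-then-solve workflow these packages provide, the sketch below uses SciPy’s splu routine, which wraps the serial SuperLU library; the distributed, GPU-accelerated SuperLU_DIST and STRUMPACK expose analogous interfaces in C/C++ with MPI. The tridiagonal test matrix here is chosen purely for illustration.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    # Build a small sparse system Ax = b: a tridiagonal, Poisson-like matrix
    # whose entries are overwhelmingly zero.
    n = 10_000
    main = 2.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")
    b = np.ones(n)

    # Factor once (sparse LU with a fill-reducing column ordering), then solve.
    # splu is SciPy's interface to the serial SuperLU library.
    lu = splu(A)
    x = lu.solve(b)

    # The factorization can be reused for additional right-hand sides.
    print(np.linalg.norm(A @ x - b))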

“Before ECP, both packages had very little support for GPUs. We could get some benefit from running on a single GPU, but without updating the code, even 10 GPUs wouldn’t make it run much faster. We had to redesign a lot of algorithms internally to be able to use GPUs effectively,” said SuperLU/STRUMPACK project leader Sherry Li, a senior scientist and group lead at Lawrence Berkeley National Laboratory.

“Also, from the application side, most of the exascale applications must use GPUs to achieve their speedups. So, if we didn’t optimize these solvers for GPUs, the whole software chain would have become a bottleneck,” Li said.

Technical Challenges

Exascale supercomputers such as Frontier contain thousands of compute nodes, which hold the system’s CPUs and GPUs. Matrices in many simulation codes must be distributed across all these nodes to fully leverage the machine’s massive parallelism. This distributed approach requires the solver’s data and work to be communicated and coordinated among the nodes and their CPUs and GPUs. Although integral to distributed parallel computing, this communication becomes an encumbrance at extreme scales.

“It’s inevitable to do communication during the algorithm’s execution to transfer the pieces of data among different nodes. And that turns out to be very costly, especially for a sparse matrix compared to a dense matrix. In the sparse case, you have relatively less computational work, but more interdependency and communication,” Li said.
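The toy sketch below — an illustration of the general issue, not code from SuperLU or STRUMPACK — partitions a sparse matrix’s rows across hypothetical ranks and counts how many off-rank vector entries each rank would need for a distributed matrix-vector product. The ratio of communicated entries to local nonzeros is what makes the sparse case comparatively communication-heavy.

    import numpy as np
    import scipy.sparse as sp

    # Toy illustration: split a sparse matrix's rows across "ranks" and count
    # which off-rank vector entries each rank would need to receive.
    n, nranks = 1000, 4
    rows_per_rank = n // nranks

    # Random sparse matrix with roughly 5 nonzeros per row.
    A = sp.random(n, n, density=5.0 / n, format="csr", random_state=0)

    for r in range(nranks):
        lo, hi = r * rows_per_rank, (r + 1) * rows_per_rank
        block = A[lo:hi]                          # rows owned by this rank
        needed_cols = np.unique(block.indices)    # vector entries its nonzeros touch
        off_rank = needed_cols[(needed_cols < lo) | (needed_cols >= hi)]
        print(f"rank {r}: {block.nnz} local nonzeros, "
              f"{off_rank.size} off-rank vector entries to communicate")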

Scaling the SuperLU and STRUMPACK algorithms to work effectively across thousands of nodes required the team to develop new factorization methods that demand far less memory and data movement.

“That was one of the big challenges for this project — we had to invent some of the algorithms to avoid communication or to reduce communication. This is one of the major algorithm advances we contributed, and then we implemented that in the software so people can use it,” Li said.

ECP and Frontier Successes

The updated SuperLU and STRUMPACK solvers have been successfully implemented in several important simulation codes:

  • M3D-C1 is used for calculating the equilibrium, stability and dynamics of fusion plasmas.
  • Omega3P is used for cavity design optimization for projects such as linear particle accelerators.
  • MFEM is used for scalable finite element discretization research and application development.

These libraries have also been used in higher-level mathematical frameworks, which build on lower-level linear algebra kernels.

What’s Next?

The SuperLU/STRUMPACK project will continue developing its codes as newer computer architectures arise. Li also plans to optimize the packages for applications in artificial intelligence and machine learning, which can work well using lower-precision arithmetic (i.e., half-precision, 16-bit arithmetic instead of single-precision, 32-bit arithmetic) or a combination of higher and lower precisions (i.e., mixed-precision arithmetic).
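One widely used way to combine precisions, sketched below with NumPy and SciPy as a generic illustration (not the SuperLU/STRUMPACK implementation), is iterative refinement: factor and solve in the fast, low precision, then correct the answer with residuals computed in higher precision. Here float32 stands in for the GPU’s reduced-precision arithmetic.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    # Mixed-precision iterative refinement on a small, well-conditioned dense
    # test problem: float32 plays the role of the cheap low-precision hardware.
    rng = np.random.default_rng(0)
    n = 500
    A64 = rng.standard_normal((n, n)) + n * np.eye(n)
    b64 = rng.standard_normal(n)

    # Factor once in low precision; this is the expensive step being accelerated.
    A32 = A64.astype(np.float32)
    lu32 = lu_factor(A32)
    x = lu_solve(lu32, b64.astype(np.float32)).astype(np.float64)

    # Refine: residuals in float64, cheap corrections using the float32 factors.
    for _ in range(3):
        r = b64 - A64 @ x
        x += lu_solve(lu32, r.astype(np.float32)).astype(np.float64)

    # Relative residual of the refined solution.
    print(np.linalg.norm(A64 @ x - b64) / np.linalg.norm(b64))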

“Nowadays, a lot of the hardware equipped with GPUs provides precision lower than single precision. And some of our algorithms may not work very well for lower-precision arithmetic,” Li said. “There’s still a lot of work that must be done to implement those. But it’s going to be very fruitful because we imagine that leveraging this lower-precision hardware can have a significant speedup for the training of AI/ML models, which is usually the bottleneck right now.”

Support for this research came from the ECP, a collaborative effort of the DOE Office of Science and the National Nuclear Security Administration, and from the DOE Office of Science’s Advanced Scientific Computing Research program. The OLCF is a DOE Office of Science user facility.