Pushing Past the Petaflop
Bykov and Barca started working together several years ago as part of the Exascale Computing Project, DOE’s research, development and deployment effort focused on delivering a capable exascale ecosystem. The duo’s collaboration within the ECP was focused on optimizing decades-old codes to run on next-generation supercomputers built with completely new hardware and architectures. Their goal with EXESS was not only to write a new molecular dynamics code for exascale machines but also to create a simulation that would put Frontier to the test.
The HPE-Cray EX Frontier supercomputer, located at the Oak Ridge Leadership Computing Facility, or OLCF, is currently ranked No. 1 on the TOP500 list of the world’s fastest supercomputers after achieving a maximum performance of 1.2 exaflops. Frontier has 9,408 nodes with more than 8 million processing cores from a combination of AMD 3rd Gen EPYC CPUs and AMD Instinct MI250X GPUs.
The team’s efforts on Frontier were a huge success. They ran a series of simulations that utilized 9,400 Frontier computing nodes to calculate the electronic structure of different proteins and organic molecules containing hundreds of thousands of atoms.
Frontier’s massive computing power allowed the research team to shatter the ceiling of previous molecular dynamics simulations at quantum-mechanical accuracy. It was the first time a quantum chemistry simulation of more than 2 million electrons had exceeded an exaflop when using double-precision arithmetic.
This isn’t the first time the team has raised the bar for these kinds of simulations. Prior to their work with Frontier, they had similar success on the 200-petaflop Summit supercomputer, Frontier’s predecessor, also located at the OLCF. In addition to being 1,000 times larger and faster, the exascale simulations also predict how chemical reactions happen over time, something the team lacked the computing power to do previously.
Run times for the simulations ranged from minutes to several hours. The new algorithm enabled the team to simulate atomic interactions in time steps, essentially snapshots of the system, with significantly lower latency than previous methods. For example, time steps for protein systems with thousands of electrons can now be completed in as little as 1 to 5 seconds.
Time steps are crucial for understanding how certain processes naturally evolve over time. This resolution will help researchers better understand how drug molecules can bind to disease-causing proteins, how catalytic reactions can be used to recycle plastics, how to better produce biofuels and how to design biomedical materials.
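To make the idea of a time step concrete, here is a minimal sketch of a generic molecular dynamics loop. It is an illustration only, not the EXESS algorithm: it assumes a standard velocity Verlet integrator, and the hypothetical `force` function is a toy harmonic spring standing in for the quantum-mechanical force calculation that dominates the cost of each real time step.

```python
def force(x, k=1.0):
    """Toy harmonic restoring force (assumption for illustration).

    In a quantum-accurate simulation this call would be the expensive
    electronic structure calculation performed at every time step.
    """
    return -k * x


def velocity_verlet(x, v, dt, n_steps, mass=1.0):
    """Advance one particle through n_steps time steps.

    Returns a list of (position, velocity) snapshots, one per time step,
    including the initial state.
    """
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # recompute force at new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


# Usage: 1,000 time steps of size 0.01 starting from rest at x = 1.
traj = velocity_verlet(x=1.0, v=0.0, dt=0.01, n_steps=1000)
x_end, v_end = traj[-1]
print(len(traj), total_energy(x_end, v_end))
```

Each pass through the loop produces one snapshot. Because the force must be recomputed at every step, the time per force evaluation sets how long a trajectory can practically be, which is why faster time steps matter for following processes as they evolve.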
“I cannot describe how difficult it was to achieve this scale both from a molecular and a computational perspective,” Barca said. “But it would have been meaningless to do these calculations using anything less than double precision. So it was either going to be all or nothing.”
“Two of the biggest challenges in this achievement were designing an algorithm that could push Frontier to its limits and ensuring the algorithm would run on a system that has more than 37,000 GPUs,” Bykov added. “The solution meant using more computing components, and any time you add more, it also means there’s a greater chance that one of those parts is going to break at some point. The fact that we used the entire system is incredible, and it was remarkably efficient.”
On a personal note, Barca added that after he and his team had worked around the clock for weeks in preparation, the calculation that broke the double-precision exaflop barrier for scientific applications came on the last day of their Frontier allocation, with the very last calculation of the simulation. It was recorded at 3 a.m., not long after Barca had fallen asleep for the first time in a long time.
The team is currently working to prepare their results for scientific publication. After that, they plan to use the high-accuracy simulations to train machine learning models and integrate artificial intelligence into the algorithm. The improvements will provide an entirely new level of sophistication and efficiency for solving even larger and more complex problems.
Support for this research came from the DOE Office of Science’s Advanced Scientific Computing Research program. The OLCF is a DOE Office of Science user facility.
source: Jeremy Rumsey, Oak Ridge Leadership Computing Facility