Durham University Cosmologists, DiRAC and Dell Improve HPC Energy Efficiency without Sacrificing Performance

[SPONSORED GUEST ARTICLE] Since the beginning of human existence, people have looked into the night sky and wondered how the universe began. Cosmologists using the UK’s Memory Intensive DiRAC HPC service at Durham University, with support from Dell Technologies, are working toward an answer to that and other questions about our universe.

When it comes to scientific computing, cosmology requires notably intensive, HPC-class compute and memory capacity. Since 2018, Durham has partnered with Dell on custom compute resources to deliver a Memory Intensive compute capability for DiRAC (the Distributed Research using Advanced Computing facility). Durham’s current 70,000-core cluster is RAM-weighted, with half a petabyte of memory alongside 14 PB of spinning-disk storage, and supports more than 1,000 DiRAC researchers at Durham and around the world. The university’s most recent system upgrade, “COSMA-8,” uses custom-outfitted Dell C6525 servers, including Dell R6525 nodes with AMD Milan processors.

According to Alistair Basden, head of the university’s COSMA HPC Service and a member of the DiRAC Technical Directorate, COSMA-8 simulations can take months to complete.

“Once those large simulations have completed, they are then used for years, or even a decade or more,” Basden said. “They’re analyzed by generations of researchers, PhD students, post-docs and academic staff. There’s a lot of statistical analysis done to match and compare what we get from the simulation with what we see with the James Webb and other high-powered telescopes. So these datasets are very valuable for their scientific output, for years they are the state-of-the-art.”

Basden heads a team of six COSMA-8 system engineers, and they work closely with Dell support staff on system upgrades and ongoing maintenance and tuning.

Durham University and DiRAC chose Dell six years ago, Basden said, “…basically because they understood best what the problem was that we were trying to solve, and they came up with the best offering – the network, the storage, and the compute.”

“We do a lot of co-design work with Dell,” he said, “so when we’re designing a new system, we don’t just ask for a set number of cores or amount of RAM. We actually co-design this cluster, we work through a lot of benchmarks to figure out what the best configuration is in terms of RAM per core, in terms of the fabric and even the rack layout and how it’s cabled. It takes a lot of work, and we work together with Dell to design these things. We’re a Dell Center of Excellence for HPC and AI based on the expertise that we’ve gathered in this area.”

Basden emphasized that cosmology comes with specific system requirements.

“That’s one of the key points, which is that you could have a much larger, much more expensive, general-purpose compute system than COSMA,” he said, “but it would be less capable because it wasn’t designed for this particular task.”

An important challenge for DiRAC at Durham, as for the rest of the HPC community, is controlling energy consumption and the resulting CO2 emissions and electricity costs. This means balancing the scale of cosmology workloads against energy-efficient compute resources. That balance is the focus of a recent case study from Dell, DiRAC and Durham in which Basden and his staff compared the performance of three generations of AMD processors running the AREPO and SWIFT cosmology codes.

The study, which involved several major cosmology codes, showed that both Genoa and Bergamo can offer a significant improvement, in some cases running more than twice as fast as the Milan or Rome generation of processors.

Bergamo and Genoa show the best energy performance for a given science output, requiring less than 85 percent of the energy of the Rome processor to complete the same task. Because the power envelope of the newer processors is only around 30 percent greater, science per watt improves significantly. Researchers observed performance-per-watt gains of more than 1.5 times, yielding significant reductions in carbon dioxide emissions.
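
To see how these figures relate, here is a minimal back-of-envelope sketch in Python. It is not the study's method: it assumes that average power during a run scales with the processor's power envelope, which is a simplification, and the function names and speedup values are illustrative rather than measured results.

```python
def energy_ratio(speedup: float, power_ratio: float) -> float:
    """Energy to finish the same job on the newer chip, relative to the older one.

    Energy = power x time, and time shrinks by `speedup`,
    so the ratio is power_ratio / speedup.
    """
    return power_ratio / speedup


def perf_per_watt_gain(speedup: float, power_ratio: float) -> float:
    """Science output per unit of energy, relative to the older chip."""
    return speedup / power_ratio


# Assumption: ~30% higher power envelope, as quoted in the article.
POWER_RATIO = 1.30

# Hypothetical speedups for illustration; 2.0 is the "twice as fast" case.
for speedup in (1.6, 2.0):
    e = energy_ratio(speedup, POWER_RATIO)
    g = perf_per_watt_gain(speedup, POWER_RATIO)
    print(f"speedup {speedup:.1f}x: energy {e:.0%} of baseline, "
          f"{g:.2f}x science per watt")
```

Under these assumptions, a code that runs twice as fast on a processor drawing 30 percent more power needs about 65 percent of the energy and delivers roughly 1.5 times the science per watt, which is in line with the figures reported above.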

For more details, keep reading.