[SPONSORED GUEST ARTICLE] When we won the tender to build the first European exascale computer at Forschungszentrum Jülich (JSC), we knew it would be a tough gig. Connecting 24,000 GPUs mounted in 125 racks with 36,000 high-performance network cables, for a machine that draws 17MW running full tilt, and getting it all running smoothly as a single system is hard. Doing all that to a very tight deadline, and with no data centre to put it in, is virtually impossible.

This article is intended to peel back the covers on how Eviden did that, look at some of the problems we faced, and tell a little of the story of what we achieved.

Because there was no data centre, JSC took responsibility for providing us with a concrete plinth. Our modular data centre (MDC) team worked alongside Eviden’s HPC systems team to design and build the system, ship it to site, and prove it all worked by running the HPL (High-Performance Linpack) benchmark in time for the TOP500 submission at the ISC25 conference in Hamburg in June 2025.

Back in December 2024, we had a concrete plinth on which to put a data centre and very little else. We had just started to ship modules, which would be connected together to provide workshops, storage space, and a data hall. People sometimes get the wrong idea about MDCs: Eviden doesn’t see them as one-size-fits-all steel boxes. Instead, they are designed around their function and the specific customer requirements, which in this case were provided by JSC. As a result, the MDC has far more in common with a conventional data centre than with a set of shipping containers.

We were building compute modules at our factory in Angers. Each module contains 10 racks and 1,920 GPUs, so by building them in the factory we could test blades, power, cooling and networking up to module level. From January 2025 we shipped these to the JSC site, with a pair of modules (a ‘bimodule’) going out by truck every two weeks. On arrival, they were craned into place, and we then added the transformer units and cooling towers for the bimodule.
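As a quick sanity check, the figures quoted in this article can be cross-checked against each other. The short sketch below uses only those numbers; the variable names are ours, and the "module-equivalents" figure is simple division, not a statement about how the hardware is physically laid out.

```python
# Figures quoted in the article (variable names are illustrative only).
TOTAL_GPUS = 24_000
TOTAL_RACKS = 125
RACKS_PER_MODULE = 10
GPUS_PER_MODULE = 1_920

# GPUs in one rack, derived from the per-module figures.
gpus_per_rack = GPUS_PER_MODULE // RACKS_PER_MODULE

# Scaling the per-rack figure up to the whole machine should
# reproduce the headline GPU count.
gpus_from_racks = TOTAL_RACKS * gpus_per_rack

# GPU capacity expressed in units of one fully populated module.
# This is arithmetic only; the article does not state the module count.
module_equivalents = TOTAL_GPUS / GPUS_PER_MODULE

print(f"GPUs per rack:        {gpus_per_rack}")
print(f"GPUs across all racks: {gpus_from_racks}")
print(f"Module-equivalents:    {module_equivalents}")
```

The per-rack figure of 192 GPUs multiplied by 125 racks gives exactly the 24,000 GPUs quoted for the full system, so the module-level and system-level numbers agree.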

This makes the bimodule a semi-autonomous unit, with its electrical and other services separated from those of the other bimodules. A corridor acts as the spine of the MDC, giving engineers access to each bimodule while also serving as the conduit for network and power cabling across the system.

To lose as little time as possible, each time a bimodule arrived we extended this corridor and moved a temporary door along it, so that power, cooling and network connectivity work could proceed on the growing system. As a result, it took only two weeks from the final delivery to site to power up all of the bimodules.

The challenge of shaking out the power and cooling subsystems and stabilizing the high-performance network could now begin in earnest. The last containers were delivered at the end of April, and the submission date for the TOP500 was May 23rd. So we had hardware engineers working on the system in the morning to find bad connections and clean optics. The stabilization team worked in the afternoon, running diagnostics on progressively larger segments of the machine, from blade to rack to module and across bimodules. Then the benchmarkers could run overnight. Thankfully some of them are in the U.S., allowing us to use the planet’s time zones to our advantage.
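The idea of running diagnostics on progressively larger segments can be sketched in a few lines. This is a minimal illustration only, not the tooling used on site: the `Segment` class, the `stabilize` function and the fake diagnostic are all hypothetical names invented for this example. A segment is only tested once everything beneath it has passed, so a fault is reported at the smallest level where it appears.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Segment:
    """A hypothetical piece of the machine: blade, rack, module or bimodule."""
    name: str
    level: str
    children: List["Segment"] = field(default_factory=list)


def stabilize(segment: Segment,
              diagnostic: Callable[[Segment], bool],
              failures: List[str]) -> bool:
    """Test children first; test this segment only if all of them pass."""
    # Evaluate every child (no short-circuit) so all low-level faults are logged.
    children_ok = [stabilize(c, diagnostic, failures) for c in segment.children]
    if not all(children_ok):
        return False  # no point testing at this scale until the parts are fixed
    if diagnostic(segment):
        return True
    failures.append(segment.name)
    return False


def build_demo_bimodule() -> Segment:
    """A toy bimodule: 2 modules, 1 rack each, 2 blades per rack."""
    modules = []
    for m in range(2):
        blades = [Segment(f"m{m}-r0-b{b}", "blade") for b in range(2)]
        rack = Segment(f"m{m}-r0", "rack", blades)
        modules.append(Segment(f"m{m}", "module", [rack]))
    return Segment("bimodule-0", "bimodule", modules)


# Stand-in diagnostic: pretend exactly one blade has a bad connection.
bad_blades = {"m1-r0-b1"}


def fake_diagnostic(seg: Segment) -> bool:
    return seg.name not in bad_blades


failures: List[str] = []
healthy = stabilize(build_demo_bimodule(), fake_diagnostic, failures)
print(healthy, failures)  # False ['m1-r0-b1']
```

In this toy run, the rack, module and bimodule containing the faulty blade are never tested at their own scale, which mirrors the point of working upwards: larger runs are only worth the machine time once the smaller pieces are known to be good.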

As anyone who has built big systems knows, while you can anticipate and plan for a lot of problems up front, some things only emerge at the largest scale. Performance was growing fine as we added equipment, then suddenly at around 300PF it just stalled. No improvement from more kit. Very worrying, but NVIDIA brought in a subnet expert who was able to quickly identify the problem (an issue with adaptive routing). Thankfully, we were back on track after only a couple of days. After that, the performance graph climbed like a cliff face.

The scale and complexity of this build are phenomenal. Without meticulous planning, organization, logistics, and execution across a range of partners, we would not have been able to get from no system to the fourth-fastest computer on the planet in a matter of weeks.

For more information on Eviden High-Performance Computing and updates on the JUPITER project, visit us at eviden.com.

_________________________

About the author: Dr. Crispin Keable, Senior Systems Architect

With over 30 years of experience in HPC, including roles at Cray Research, SGI, IBM, and Bull, I have contributed to the design and delivery of some of the world’s most powerful supercomputers. Since joining Eviden (formerly Bull/Atos) in 2014, I have led projects across the UK, Europe, and globally, including the latest production system at ECMWF. I am currently designing the JUPITER Exascale Supercomputer at the Jülich Supercomputing Centre in Germany. My work is driven by a focus on system efficiency: maximizing scientific performance while reducing the energy footprint of large-scale systems.