Dell Technologies Interview: Getting ‘More Science Per Pound’ at Durham University’s COSMA HPC Service


[SPONSORED CONTENT]  In this interview, Dr. Alastair Basden of the UK’s Durham University discusses the latest activities at the university’s COSMA HPC service as it tests and incorporates new high performance technologies on its way to exascale. A Dell Technologies HPC and AI Center of Excellence, the organization is driven to get “more science per pound” out of its memory-intensive HPC infrastructure, Basden said. He also updates us on scientists’ cosmological work, including filling in the remaining gaps in the Big Bang theory.

Doug Black: Hi everyone, I’m Doug Black, editor in chief at insideHPC, and today, as part of our series of Dell Technologies interviews, we are with Dr. Alastair Basden, head of COSMA HPC Services at Durham University in the UK. COSMA stands for “cosmology machine,” and cosmology is the science of the origin and development of the universe, the Big Bang theory and all that. The COSMA HPC service at Durham comprises a battery of HPC systems, and the facility is a Dell Technologies HPC and AI Center of Excellence. Alastair, welcome.

Alastair Basden: Thank you.

Black: I understand that Durham is on the path to exascale and that you continually test new technologies. Tell us about some of the technologies you’re looking at as you move toward that milestone in HPC.

Basden: We’ve got a few things that we’re testing that are either novel technologies or that we’re on the leading edge of. We’re always very interested in CPU technologies; we were one of the first HPC facilities, a few years ago, to get the AMD EPYC chips, and we have some cutting-edge (AMD) “Milan” chips … We’re also very interested in HPC fabrics, so we are doing some tests with Rockport Networks at the moment, which is a switchless network fabric. So very soon we’ll be upgrading one half of one of our clusters to use the Rockport network. We’re also looking at BlueField and DPU technologies, and seeing how those can improve our scientific codes.

A lot of our investigations really are looking at how we can get more science out of the machines we’ve got or, when we’re designing future machines, how we can get the best value for money in terms of science per pound. And we’re also interested in carbon savings, so we have a water-cooled system; our latest system is on-chip, direct liquid cooled.

COSMA (credit: Durham University)

And we’re very interested in composability as well. So we have a composable GPU system. What this does is allow us to take a number of physical GPUs and, at the click of a button, move them between different servers. So if there’s a job running that has a demand for a few GPUs, we can provide those. At the same time, there might be jobs that don’t need any GPU, so they can run on servers with no GPUs attached. And then at some point we might have a number of jobs that each need a set of GPUs, and we just spread them out evenly over this system, that sort of thing. So we’re interested in those sorts of technologies as well.

Black: Okay, great. Please update us on the cosmology work you and the organization are doing and the role of HPC moving forward.

Basden: Our facility is funded by one of the UK Research Councils, and the research remit of this council has to do with cosmology, particle physics, astronomy, nuclear physics, black holes and all that sort of thing. So our system here at Durham is what we call a memory-intensive service: it has a high amount of memory per core, and our current system has a terabyte of RAM per node. One of the major workloads that we run on it is cosmology simulation. We have simulations that start with the Big Bang and propagate the universe forward through time. And by tuning different input parameters and different models, things like dark matter and dark energy and so on, we try to match what we get in simulation with what astronomers see with telescopes in real life. By doing a lot of statistical analysis we’re then able to fine tune the input parameters of the models and get a better handle on how the universe is made up.
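The workflow Basden describes, simulate the universe, compare the result statistically with observations, then tune the inputs, is essentially a parameter-fitting loop. The toy Python sketch below illustrates that loop with a single parameter recovered by chi-squared minimization; the model, data and names such as simulate_statistic are invented stand-ins for illustration, not the actual COSMA simulation codes, which run at vastly larger scale across many nodes.

```python
# Illustrative toy only: one tunable parameter, a made-up summary statistic,
# and synthetic "observations". The real workflow compares full cosmological
# simulations with telescope data over many parameters.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)

def simulate_statistic(omega, k):
    """Toy 'simulation': a summary statistic as a function of one parameter."""
    return omega * np.exp(-k) + (1.0 - omega) * k**-1.5

k = np.linspace(0.5, 5.0, 40)            # toy wavenumber-like bins
true_omega = 0.3
sigma = 0.01                             # assumed observational error
observed = simulate_statistic(true_omega, k) + rng.normal(0.0, sigma, k.size)

def chi_squared(omega):
    """Goodness of fit between the toy simulation and the 'observed' data."""
    residual = (simulate_statistic(omega, k) - observed) / sigma
    return np.sum(residual**2)

# Tune the input parameter so the simulation best matches the observations.
best = minimize_scalar(chi_squared, bounds=(0.0, 1.0), method="bounded")
print(f"best-fit parameter: {best.x:.3f} (true value: {true_omega})")
```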

Black: Out of curiosity, would you say that as your work progresses, is the Big Bang Theory increasingly validated or are there holes in that theory?

Basden: It’s not that there are holes in it; it’s validated. It’s just that there are bits of it that we don’t understand. There are always these unknowns. We don’t really understand what 50 to 75 percent of the universe is made from, this dark matter. We don’t really understand what it is, where it comes from, how it interacts. So it’s about finding out more and more about the universe; that’s key.

Black: Okay, so the premise stands, but it’s filling in the picture, I see. Now, can you give us a profile of the Dell server and cluster technologies that you have in place?

Basden: So we’ve got a couple of generations of system at the moment which are provided by Dell. We have a system that was installed in 2018, … (Dell EMC PowerEdge C6420) series servers, and that’s getting on for 500 nodes. Then our newest system was installed just last year, in 2021; that’s, again, similar servers, four servers in 2U. They’re the Dell C6525 series (servers) with AMD EPYC chips inside them, and these are the ones with a terabyte of RAM per node.

We also have a smaller, 24-node test cluster as well, which we use for testing novel technologies. So that’s where the BlueField tests are done, and that’s where the Rockport (Networks) tests have been going on. And we’re … looking at converting half of COSMA over to this Rockport kit. So in terms of the Dell kit, it provides all the servers for this, and we use Dell hardware for our storage as well. So we’ve got multi-petabyte Lustre file systems, which, again, are running on Dell hardware.

Black: Okay, and tell us about your Center of Excellence partnership with Dell and the value of that to the organization and (for) developing your HPC infrastructure.

Basden: One of the key things that (partnership) gives us is access to Dell engineers, so we’re able to get a feel for what’s coming. This is one of the reasons why we were involved with Rockport before they went public. (Dell) put us in contact with Rockport through this Center of Excellence. And we were able then to start testing that sort of kit. So it really gives us early insight into new technologies, new interesting technologies, things that are useful for us. And by doing this we’re able to (give) feedback to Dell about how useful this sort of kit is going to be, and where we would like future technologies to take us. So it’s a two-way thing with benefits for both parties.

Black: All right. Well, it’s been a pleasure. We’ve been with Dr. Alastair Basden of the COSMA HPC service at Durham University in the UK. Alastair, thanks so much.