Dell Technologies Interview: Univ. of Liverpool’s Hybrid HPC Strategy Boosts Scientific Computing with a Burst


[SPONSORED CONTENT]  In a recent Dell Technologies interview on this site, we talked about HPC-as-a-Service with R Systems, a provider of HPC-on-demand resources and technical expertise in partnership with Dell HPC Cloud Services. Now, in this interview, we look at a variant within this HPC segment: bursting to the cloud when an on-premises cluster needs a resource boost.

Faced with this situation was the University of Liverpool’s Advanced Research Computing group within the Computer Services Department. The group, led by Cliff Addison, uses the Dell-based “Barkla” Linux cluster for its scientific computing needs. For times when the group’s needs overtax Barkla, the university worked with Dell Technologies and UK-based Alces Flight, which designs and builds HPC environments for scientists, engineers and researchers. Alces and Dell engineered a burst capability to Amazon Web Services, placing priority on a seamless environment easily adopted and accessed by Advanced Research Computing scientists.

In this interview, Addison explains – among other things – how the AWS capacity was utilized when the COVID-19 pandemic hit.

Doug Black: Hi everyone, I’m Doug Black, editor-in-chief at insideHPC, and today, as part of our series of interviews on behalf of Dell Technologies, we’re speaking with Cliff Addison, head of Advanced Research Computing at the University of Liverpool. Cliff, welcome.

Cliff Addison: Good afternoon, or good morning, depending on what time of day it is. But yes, okay.

Black: So please give us an overview of the HPC system the university set up with Dell’s integration partner, Alces Flight. Now a key aspect of the system, as I understand it, is that it bursts to Amazon Web Services for additional compute and storage resources. Is that correct?


Addison: That’s largely correct. What we did – I’ll step back a bit. In 2017, when we went to tender, we had a number of researchers who had grants that they wanted to use to buy equipment. We needed to have things that were demonstrably high impact, and we also needed an environment that could be expanded to adapt as the computing requirements of our research changed. And we were also looking for something that basically provided a good deal of compute power right from the get-go.

And Dell responded to this with a partnership with Alces Flight and also working with Amazon Web Services to provide us with a system that was very strong on-premises, with very cost-competitive hardware and a very good setup that our researchers took to immediately.

In addition, we started off with a large number of credits from AWS to be able to start working with the cloud, and Alces Flight used their expertise to set up a fairly seamless cloud Barkla environment, where we could jump quite easily from the on-premises system onto the cloud system with the same users, same storage and a very familiar environment for the researchers. So the researchers didn’t really need to worry about a different environment in the cloud – it was very similar to what they had already. And those features together really were a very strong advantage. I’ll talk a little bit later about some of the ways that’s worked out for us.
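For readers unfamiliar with this pattern, the sketch below shows how a burst-aware submission wrapper might look. This is an illustration, not the actual Liverpool setup: it assumes a Slurm scheduler, invents the partition names `local` and `aws`, and reduces the burst decision to a simple queue-depth threshold.

```python
#!/usr/bin/env python3
"""Hypothetical burst-aware job submission wrapper.

Assumes a Slurm scheduler with two partitions: 'local' for the
on-premises cluster and 'aws' for cloud-burst nodes. The partition
names and the queue-depth threshold are illustrative only.
"""
import subprocess
import sys

PENDING_THRESHOLD = 50  # burst to the cloud once this many jobs are queued locally

def pending_jobs(partition: str) -> int:
    """Count pending jobs in a partition via squeue."""
    out = subprocess.run(
        ["squeue", "--partition", partition, "--states=PENDING", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())

def submit(script: str) -> None:
    """Submit to the local partition, or to the cloud if the local queue is deep.

    Because users, home directories and software environments are mirrored
    across both partitions, the same batch script runs unmodified in either place.
    """
    partition = "aws" if pending_jobs("local") >= PENDING_THRESHOLD else "local"
    subprocess.run(["sbatch", "--partition", partition, script], check=True)

if __name__ == "__main__":
    submit(sys.argv[1])
```

The point of the design is exactly what Addison describes: the user never chooses an environment, so bursting is invisible to the researcher.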

Black: Okay, so let’s move on to the work your organization is doing. What’s new in the Liverpool University Advanced Research Computing Group?

Addison: Computational chemistry at Liverpool has always been one of the major users of our facilities. And 10-15 years ago, that was large-scale, parallel, molecular dynamics and…calculations. But what’s happened over the years, and this is consistent with several other groups, is that they’ve moved to a very sophisticated workflow environment where they’re doing the detailed studies occasionally, but they’re driven by very large numbers of fast investigatory tests, along with some machine learning to help guide things.

And so instead of just doing a lot of long computational runs, we see them doing a mixture of very fast investigation runs, machine learning, and then some detailed calculations on certain aspects of molecules that looked promising. And that’s one of the general trends that we’re seeing.
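As a toy illustration of that trend – and not the group’s actual workflow – the following sketch shows an ML-guided screening loop in which a cheap surrogate model decides which candidates deserve an expensive detailed calculation. The descriptors and the stand-in `detailed_calculation` function are invented for this example.

```python
"""Toy sketch of an ML-guided screening loop (illustrative only).

A cheap surrogate model ranks a pool of candidate 'molecules'; only the
top-scoring few are sent to an expensive detailed calculation, and the
new results are fed back to retrain the surrogate.
"""
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def detailed_calculation(x: np.ndarray) -> float:
    """Stand-in for an expensive simulation (e.g. hours of compute per molecule)."""
    return float(-np.sum((x - 0.5) ** 2))  # toy objective

# Pool of 1,000 candidates, each described by 8 cheap descriptors.
pool = rng.random((1000, 8))

# Seed the surrogate with a handful of detailed results.
seen_idx = list(rng.choice(len(pool), size=10, replace=False))
labels = [detailed_calculation(pool[i]) for i in seen_idx]

for round_ in range(5):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(pool[seen_idx], labels)

    # Fast investigatory pass: score every remaining candidate cheaply.
    remaining = [i for i in range(len(pool)) if i not in set(seen_idx)]
    scores = model.predict(pool[remaining])

    # Detailed pass: run the expensive calculation only on the most promising few.
    for i in (remaining[j] for j in np.argsort(scores)[-5:]):
        seen_idx.append(i)
        labels.append(detailed_calculation(pool[i]))

print(f"best value found: {max(labels):.4f}")
```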

Now in addition, with the COVID-19 outbreak, we had several specific requirements that came up. And again, the cloud Barkla environment with AWS cloud bursting was fundamental to being able to start. One of our groups was doing deep learning to try to detect COVID in computed tomography images and X-ray scans, and they just didn’t have the resources available. We applied to AWS and we were given research credits, and then again, with the Alces Flight environment, these researchers were able to seamlessly get onto AWS, do some of their data analysis and data cleaning on the local cluster, and then very seamlessly move on to the GPU nodes on AWS to do the detailed computing. And that worked extremely well – we were able to present results at the Supercomputing 2020 conference, and they’ve just recently submitted their results to an online journal, where the paper is in the process of being accepted.
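The staged pipeline Addison describes – clean the data on premises, then train on cloud GPUs – can be expressed as chained scheduler jobs. The sketch below assumes Slurm job dependencies; the partition names and batch scripts are hypothetical.

```python
"""Hypothetical two-stage submission for a pipeline like the one described:
data cleaning on the on-premises partition, then GPU training in the cloud.
Partition and script names are invented; the dependency mechanism is
standard Slurm."""
import re
import subprocess

def sbatch(*args: str) -> str:
    """Submit a job and return its Slurm job ID (parsed from 'Submitted batch job NNN')."""
    out = subprocess.run(["sbatch", *args],
                         capture_output=True, text=True, check=True).stdout
    return re.search(r"\d+", out).group()

# Stage 1: clean and stage the imaging data on the local cluster.
clean_id = sbatch("--partition=local", "clean_scans.sh")

# Stage 2: GPU training in the cloud, started only if stage 1 succeeds.
sbatch("--partition=aws-gpu", "--gres=gpu:1",
       f"--dependency=afterok:{clean_id}", "train_model.sh")
```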

Black: So Cliff, you all started with the Barkla cluster in 2017 – tell us about the evolution of the system’s capabilities, in terms of nodes, and the current updates you’re working on now.

Addison: Well, we bought the system with a good deal of expansion capability in mind. We started off with 96 Skylake nodes, each with 40 cores – 3,840 cores in total – and we’ve been able to expand that over time to 140 nodes now. And I’m satisfied that the research groups that have been working on it have been very happy with that result.

But recently, another research group came to us and said we would like to have some enhanced GPU capability for our PhD students. We think we probably also need some fast storage to sit behind that. And I was able to contact Dell and Alces Flight, and they were able to come back with some ideas in terms of (NVIDIA) A100 nodes and some fast NVMe storage. And when our researchers looked at the options, they were very pleased. And we’ve just decided on a mix of configurations, and Dell and Alces are now going to put that together. And hopefully we’ll get that later on in the year.

Black: Nice. Really interesting. So now with the pandemic, and more working and learning from home going on, how has that impacted your team?

Addison: Well, it’s interesting – our team has managed fine. We are able to get good remote access to our services on premises. And again, the hooks into the cloud are basically via that on-premises system, so we could get onto the cloud whenever we needed to. It was the researchers who struggled, because of course, one of the major lessons learned is that home broadband is not nearly as fast as a good university network. And so we had researchers trying to download large application packages that were tens of gigabytes in size to run on their home systems. And we kept saying it’s best not to do that, it’s best to use our facilities on campus, and don’t do the heavy compute on your home systems. And eventually, I think we got that through. So once people accepted how to use the on-premises systems better, that worked out fine, but our researchers did take some time to get used to it, particularly when they’re dealing with large datasets.

Black: So generally speaking, how important is the AWS burst connection for you? And do you have any tips for other HPC site managers?

Addison: One of the things that we found was we liked AWS, we liked the AWS people. The environment does have a reasonably steep learning curve, and you need to use it quite a bit to become familiar with managing it. But Alces Flight as a third party provided a very seamless environment. And there are several other companies out there who are able to do similar sorts of things. And I would encourage HPC groups to look to partner up with somebody who has that expertise rather than try to reinvent things for themselves. It makes a huge difference to have someone else manage this: to set things up, do the accounting for you, do the node setup, and make certain that when nodes aren’t being used they’re powered down so you’re not paying for them – these sorts of things. It really does make it a much more pleasant experience.
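To make that last point concrete, here is a minimal sketch of the kind of idle-node cost control Addison credits to a managed environment. It assumes the cloud partition is visible to Slurm and that each EC2 instance carries a hypothetical SlurmNode tag mapping it back to its node name; a provider such as Alces Flight automates this kind of housekeeping.

```python
"""Minimal sketch of idle cloud-node shutdown (assumptions labeled below).

Assumes: a Slurm partition named 'aws' for cloud nodes, and EC2 instances
tagged 'SlurmNode' with their Slurm node name. Both are illustrative."""
import subprocess
import boto3

ec2 = boto3.client("ec2")

def idle_nodes() -> list[str]:
    """Return cloud node names that Slurm currently reports as idle."""
    out = subprocess.run(
        ["sinfo", "--partition=aws", "--states=idle", "--noheader", "-o", "%n"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()

def stop_instances(node_names: list[str]) -> None:
    """Stop the EC2 instances behind idle nodes so they stop accruing charges."""
    if not node_names:
        return
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:SlurmNode", "Values": node_names},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )
    ids = [i["InstanceId"]
           for r in resp["Reservations"] for i in r["Instances"]]
    if ids:
        ec2.stop_instances(InstanceIds=ids)

if __name__ == "__main__":
    stop_instances(idle_nodes())
```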

Black: Yeah, that sort of smooth transition back and forth, from cloud back to on-premises – that’s absolutely a big key, so that people aren’t constantly struggling to relearn a user interface.

Addison: That’s right. But also from the local point of view, we often have issues with being short-staffed in terms of HPC people, and we don’t really have the extra capacity to do a lot of the first-hand cloud management that would be required for such a good environment. So being able to work through a third party makes life considerably easier for us. We can concentrate on helping the users; we don’t need to worry about the management and the accounting directly. We’re able to do that through a third party, and that we found was a big, big win.

Black: Great. All right, Cliff. Well, it’s been a pleasure speaking with you. We’ve been with Cliff Addison at the University of Liverpool’s Advanced Research Computing Group. Thanks so much.

Addison: Thank you very much.