In this interview conducted on behalf of Dell Technologies, insideHPC spoke with Carol Song, who leads the Scientific Solutions Group at Purdue University’s Rosen Center for Advanced Computing and is a senior research scientist for Information Technology at Purdue (ITaP) Research Computing.
Song is the principal investigator (PI) and project director for Purdue’s Anvil supercomputing cluster, built in partnership with Dell. Anvil consists of 1,000 Dell PowerEdge server nodes, each with two 64-core AMD EPYC “Milan” processors, and will annually deliver over 1 billion CPU core hours to the National Science Foundation’s (NSF) XSEDE (Extreme Science and Engineering Discovery Environment) program, with a peak performance of 5.3 petaflops. Anvil’s nodes will be interconnected with 100 Gbps Mellanox HDR InfiniBand and include 32 large memory nodes, each with 1 TB of RAM. It also includes 16 PowerEdge server nodes, each with four NVIDIA A100 Tensor Core GPUs, providing 1.5 petaflops of single-precision performance to support machine learning and artificial intelligence applications.
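As a rough sanity check on the "over 1 billion CPU core hours" figure, the node and core counts quoted above can be multiplied out. The short Python sketch below is illustrative only: it counts just the 1,000 CPU nodes, and the assumption of a non-leap year at full availability is ours, not a figure from Purdue or Dell.

```python
# Figures quoted in the introduction above.
NODES = 1_000             # Dell PowerEdge CPU nodes
SOCKETS_PER_NODE = 2      # two AMD EPYC "Milan" processors per node
CORES_PER_SOCKET = 64     # 64 cores per processor

# Assumption (not from the article): a non-leap year at 100 percent uptime.
HOURS_PER_YEAR = 365 * 24


def annual_core_hours(nodes: int, sockets: int, cores: int, hours: int) -> int:
    """Upper bound on the core hours a cluster could deliver in one year."""
    return nodes * sockets * cores * hours


total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
core_hours = annual_core_hours(
    NODES, SOCKETS_PER_NODE, CORES_PER_SOCKET, HOURS_PER_YEAR
)

print(f"Total CPU cores: {total_cores:,}")
print(f"Upper bound on annual core hours: {core_hours:,}")
```

Under these assumptions the ceiling is about 1.12 billion core hours per year, which is consistent with the stated figure once maintenance windows and other downtime are allowed for.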
Anvil, funded by a $10 million NSF award, will leverage a diverse set of storage technologies anchored by a 10+ PB parallel file system boosted with over 3 PB of flash disk.
Here, Song discusses her background, her leadership experience at Purdue, the goals driving the development of Anvil and where the system fits in Purdue’s HPC infrastructure.
Doug Black: Could you tell us a little bit about your background and some of the highlights during your leadership tenure at Purdue?
Carol Song: Definitely. I graduated from the University of Illinois at Urbana-Champaign with my PhD in computer science. After that, I did a whole bunch of things, working in industry on medical imaging and also at networking and communications companies, both startups and big corporations. In 2005, I joined Purdue, and the reason I joined was that at that time, Purdue was really taking off in high performance computing.
Now Purdue has a long history of doing high performance computing, even before then, but in 2005 was around the time when the TeraGrid Program started, so I was at the right place at the right time. So I joined Purdue and I’ve been leading the Purdue HPC program since then. Through TeraGrid, I’m also the PI for the current NSF XSEDE program leading our staff in serving the national science community. And my work also covers building data frameworks and science gateways, which are pieces that connect these advanced cyber-infrastructure resources to the end users — the end users being researchers. And Anvil is really the high point of Purdue research computing in my own career.
Black: Okay, thank you for that. We’re interested today in hearing about the new Anvil supercomputing cluster under development at Purdue in partnership with Dell. Please tell us about the scale of the system, how many servers it will include and the expected compute throughput it will deliver.
Song: Absolutely. So Anvil is funded by the National Science Foundation through its Advanced Computing Systems and Services program. It’s funded as a Category One capacity system, so there are a couple of things we focused on. One is capacity, both in terms of the quantity of computing hours we can provide and also the state-of-the-art HPC technologies that are in the system.
The second thing is usability. NSF wants these systems to be highly accessible and usable to a broad range of researchers around the nation. In terms of scale, Anvil has 1,000 compute nodes, each featuring AMD third-generation (EPYC CPU) processors, and it has a peak performance of 5.3 petaflops. Accompanying the compute nodes, we have a 10-petabyte storage system and three petabytes of all-flash memory to accelerate the data movement within the system.
Anvil is a comprehensive system. By that I mean it also has other components, such as 32 large memory nodes, to support applications that need to load in a lot of data all at once. It also features 16 GPU nodes. These are the newest GPU nodes from NVIDIA, providing an additional 1.5 petaflops of single-precision computing power.
Black: How does Anvil fit in with Purdue’s overall HPC infrastructure?
Song: I love that question. First of all, it is the largest system we’ve ever built at Purdue. Ever since 2005, we’ve been building pretty much one big cluster every year. So we’ve built 15 or 16 clusters; I lost count. Anvil has 1,000 nodes, so obviously it’s the largest, and it also features state-of-the-art HPC hardware. As for where it fits: it’s the largest-capacity system we’ve ever built, and it’s also the most diverse. It includes various components that are all integrated in one place: as I mentioned, the large memory nodes and the GPU nodes, along with the large number of compute nodes.
We also have a composable system as part of Anvil, basically an on-prem cloud system that’s orchestrated by Kubernetes. And this gives us the capacity to support not only the traditional HPC computational jobs, but also the newer, more heterogeneous workflows that researchers encounter these days. This could include both simulations and data analytics, and also ways for them to share their software, their data and their workflows with other researchers.
To elaborate a little bit on where it fits with Purdue: for these NSF-funded systems, 10 percent of the cycles are discretionary and available to Purdue. We plan to really leverage that capacity for establishing and supporting important initiatives, such as industry partnerships and collaborations in large, important programs that, without such a large system, would be impossible to support.
Black: We understand Anvil will have a role with the National Science Foundation’s XSEDE program. Please tell us about the impact the system is expected to have for research discovery and XSEDE.
Song: Anvil is integrated with XSEDE. That means it’s allocated through the XSEDE allocation process, which is a peer-review process: researchers around the nation can submit proposals for the hours they need on the XSEDE systems. Most importantly, Anvil provides 1 billion CPU core hours to XSEDE users every year, and it also provides access to GPUs and large memory nodes in our cloud subsystem. It’s also integrated with XSEDE through the frontline user support and training. XSEDE has a wealth of training materials that are already available. We’re going to contribute to that training program, and our users can benefit from the current materials.
And as part of XSEDE, I think it will also help us expand our partnerships with a broader community. As an example, Anvil is now part of the current COVID-19 Consortium, which provides computing power to researchers who are studying problems associated with the COVID-19 pandemic.
Black: What challenges in general are you hoping to address with Anvil? Tell us about the workloads you expect Anvil to take on.
Song: Yes. When we proposed Anvil, one of the major issues for the national research community was that there were not enough cycles. There was a lack of capacity, and the XSEDE systems at the time were always oversubscribed, over-requested. So bringing in such a large-capacity system will definitely help solve that problem. Anvil is targeting moderate-sized computational jobs, which are also the bulk of XSEDE workloads.
One of the challenges we’re focusing on is accessibility. We provide interactive computing environments that help users ramp up into high performance computing, because a lot of the communities we’re engaging now don’t traditionally use HPC: for example, liberal arts and researchers in psychology and geography. Their applications (don’t fit) the traditional HPC mode of operations. So we’re providing the environments and software tools to help them get on HPC faster and make HPC easier to use.
The other challenge is that science is becoming more and more data-driven. The workloads are very often workflows, a sequence of steps researchers have to go through. That may include compute-intensive simulations and data-driven analytics. And also, as part of their research process, they want to share their software, tools and data with other researchers, and possibly also with the general public. So (by) having a comprehensive ecosystem that includes different capabilities in Anvil under one system, we hope to address these increasingly complex workflows.
Black: Carol, thanks for that update from Purdue University. We’ve been with Carol Song, senior research scientist at Purdue in the Rosen Center for Advanced Computing. On behalf of insideHPC and Dell Technologies, it’s been a pleasure to be with you today.
Song: Thank you very much for having me.