
Let’s Talk Exascale: Chandrasekaran on Teaching Supercomputing and Leading ECP’s SOLLVE Project

In this episode of the Exascale Computing Project’s Let’s Talk Exascale, the ECP’s Scott Gibson interviewed Sunita Chandrasekaran, the new principal investigator of the ECP SOLLVE (Scaling OpenMP With LLVM for Exascale Performance and Portability) project. She replaces Barbara Chapman in the role, who ECP Software Technology Director Mike Heroux said has been an invaluable contributor to ECP. The SOLLVE project is advancing the OpenMP specification and its implementations to address exascale application challenges. Sunita will tell us more about that.

Gibson: Sunita, please give us a broad picture of your background and your work.

Chandrasekaran: I’m currently an associate professor with the Department of Computer and Information Sciences at the University of Delaware [UD], and the area of interest, or the research activities here with my group, has been high-performance computing, machine learning, data science, and interdisciplinary science, where we work with several domain scientists—aka non-computer scientists—to understand their algorithms and work with them to advance their science and be able to port their programs to large-scale computers just like the upcoming Frontier, if you like.

I also teach computer architecture and parallel programming. In fact, I’m teaching parallel programming this semester. And I also teach a vertically integrated project at UD where several undergraduate students and I, we just sit together and we hack, and that’s a very nice hands-on course.

So all of this, I thought, was forming a very nice background for ECP SOLLVE, especially because some of my students and I have already been working on a subproject with SOLLVE over the past three, three-and-a-half years as of now. And this subproject has been on validation and verification of several OpenMP compiler implementations. So I should say that we are not super-foreign to SOLLVE because we have been part of a subproject. Having said that—how do I put it—it helped [us] understand what is going on within SOLLVE, and now with the SOLLVE PI hat on, I guess I’m trying to learn the different challenges, the different pieces, while leveraging the parallel programming and architecture and high-performance computing background that I’ve been gathering over the past several years. So I guess it’s a nice blend, and I’m also able to take pieces of the SOLLVE project and its activities and challenges back to my class while I teach. I’m already telling them about Frontier; I’m already telling them about the TOP500 supercomputer [list] to look up, you know. Things may change in November at the SC21 conference when the TOP500 list gets updated. Right, so I guess there definitely is a strong connection between what I’m doing at UD and the SOLLVE PI activities and goals of the project.

Gibson: Is there anything you want to add about your work at Brookhaven National Laboratory [BNL] and the Computational Science Initiative?  

Chandrasekaran: Right. So with Brookhaven Lab I took up this position of computational scientist with the Computational Science Initiative, directed by Kerstin Kleese Van Dam. Brookhaven has already been involved with SOLLVE for the past couple of years through Barbara Chapman, who was leading the SOLLVE project. There are several initiatives within BNL—both on high-performance computing as well as machine learning—and several of their physics-based domain science application projects. So I hope to bring forward some of the experience that I have with machine learning projects here at UD, where we work with Nemours hospital for children, and apply those machine learning techniques to physics-based problems at Brookhaven Lab. So it looks like there’s a ton of opportunities where both HPC and machine learning could definitely be leveraged for the Computational Science Initiative within BNL.

Gibson: What is the ECP SOLLVE project all about?

Chandrasekaran: ECP SOLLVE, as the name says, stands for Scaling OpenMP—which is one of the two directive-based programming models—with LLVM, which is a cohort, a set of tools, software, functionalities, and libraries, for exascale performance and portability. The long and short of it is that SOLLVE is tasked with several different pieces, and some of them include working with the OpenMP standards organization, which ratifies features for different types of functionality when you’re trying to use two different types of devices—for example, CPUs and GPUs—which is exactly what the Frontier system is also going to be, and many systems are. Broadly, we could call these heterogeneous systems, and how do you define what amount of work and data should be offloaded from one device to another?

So to do this, and getting down to the bottom of it: there are several features that usually the OpenMP organization works on, but there also needs to be implementation of these features, right? And there are many vendors working on OpenMP implementations, and SOLLVE is predominantly and closely tied with LLVM, where our goal is to work on the compiler, the runtime, different aspects of OpenMP features, and its offloading features for GPUs. With respect to Frontier, that would be AMD. With respect to Perlmutter at Lawrence Berkeley National Lab, it would be Nvidia GPUs, so there’s already a variety, because GPUs, in the grand scheme of things, help achieve the best performance for large scientific codes. But fundamentally the way you would program them can be slightly different, and that’s where the compilers and other implementations come into play. So compiler work, runtime work, [and] offloading to GPUs is a giant piece of SOLLVE, I should say, and then comes validation, verification, and the test suite, where we also write test codes to test, validate, or verify the implementations from vendors—and these could be Nvidia, AMD; could be IBM, LLVM; could be anybody.

Anyone building implementations of OpenMP … How do we ensure that the implementation conforms to the OpenMP standard specification? And has the implementation really been implemented correctly? So how do you validate and verify? That’s the other piece of the SOLLVE project, and here is where my students from UD have been involved with Oak Ridge National Lab over the past three or four years, and we divide and conquer the set of features to build test cases for. And we evaluate these test cases on pre-exascale systems—at Oak Ridge National Lab, also Cori [at the National Energy Research Scientific Computing Center, NERSC], and several other systems that we can possibly get access to. And we’re also trying to get access to early access machines at Argonne, basically to find out: how are the implementations behaving across a varied set of systems?

The other piece of the project involves coordinating with the application developers, and this is equally important because, I guess, fundamentally you are creating implementations and validating them for whom? For the users, basically. Who is using the OpenMP implementation? So we need to understand what features the application developers want, and do the compilers have implementations for those features? Are these implementations validated and verified? Can there be performance benchmarking of these features? That’s a matter of supreme importance to application developers—the whole nine yards. Somebody taking up an application, someone wanting to use OpenMP, and the group trying to understand: how much performance can be achieved through this OpenMP code on the next big system? So there’s coordination to be done between the OpenMP specification, OpenMP implementations, and OpenMP users, who are the application developers. So there’s a lot of coordination required, which is also part of our piece, where we coordinate between different projects, try to understand the needs, and try to connect the dots. All of this leads to performance benchmarking, where implementations from different vendors could perform differently. So how do we draw a line there? How do we compare and contrast the different implementations, and how is performance achieved across them?

It’s a maze, but I think it’s important to find the path in this maze. The different pieces of this whole puzzle are well connected. So I guess SOLLVE tries to connect these different pieces of this ECP project by working with different teams doing different things.

Gibson: Sunita, given your particular credentials in HPC, what do you hope to bring to the PI role?

Chandrasekaran: I guess what I have enjoyed working on for HPC … For a long time, I was working on the programming models side of things, and that nicely translated into, or nicely morphed into, okay, let us find out who wants to use the programming models—which means, who are the application scientists that would care to use programming models and their features? So I believe it comes with learning how to talk to domain scientists, because domain scientists and computer scientists talk two different languages, which can be very funny in the very beginning because we’re trying to understand each other, trying to find out where the gap is, and then trying to bridge the gap. So I believe the past and ongoing experience of working with several domain scientists—including biophysicists, nuclear physicists, and solar physicists, who are my other collaborators on projects at UD with my PhD students—will help.

What I think I would love to accomplish through SOLLVE and my team is understanding the application developers’ needs as they get ready to target the next massively parallel supercomputer—Frontier, for example—and the one that’s coming after. And not just at Oak Ridge National Lab, but maybe the ones at Lawrence Berkeley National Lab, the one at Argonne National Lab, and so on.

So there is a gap between what the application developers want to achieve and what the machine can offer—application characteristics and machine characteristics—so how do you bridge this gap? How would you address the challenges? So I believe that is what I am trying to bring with my PI hat on—find out what the software challenges are with respect to OpenMP implementations, validation and verification challenges, performance benchmarking challenges; find out what the needs are from the application developers; and try to bridge that gap enough to have the applications ready when the system is ready, which is actually pretty soon. And I believe my team … they have this expertise. Some are developers. Some are test writers. Some build performance benchmarking suites. Some are trainers for hackathons. So it’s a wonderful team to work with, because different people in the team bring different expertise to the table.

Gibson: Especially in the context of ECP building an ecosystem for exascale, what would you like to see the enduring legacy of ECP SOLLVE be?

Chandrasekaran: I would love to see SOLLVE go beyond the immediate requirements of the project, because when we build software, the software needs to be sustainable, in the sense that you don’t want to invest so much personnel time and effort—and money, of course—and build software that can be used only on a short-term basis. So we want to build software that can go beyond Frontier, that can go beyond Perlmutter, for example, and be sustainable, which means it’s increasingly important to understand the underlying hardware architectures. The project is structured in a way that we are dealing with abstractions. We’re dealing with abstractions in programming models, in the sense of: from an application developer’s standpoint, what should the programming model look like so that the scientists can look at the code and know that this is still their code base? Because the closer you get to the machine, the more complicated the code becomes, because you’re touching the wires and logic gates and whatnot.

So coming back to the main question that you are asking, I guess it’s important to build software that goes beyond a couple of years, which means the software needs to be sustainable, which means it needs to be well validated, well verified, well evaluated across more than just a few systems. So building the grand table of X applications, Y compiler implementations, Z platforms, and just looking at the table, right, gives us a ton of info as to what else we should address, or where the gap is, or what the challenges are. So I guess that is what I would like to get to, because a few years from now, I would want SOLLVE to still be part of this ECP legacy and the huge software stack that I guess we are all thinking about right now.

Gibson: Is there anything else that you’d like to cover for our audience?

Chandrasekaran: Sure, I would like to emphasize the importance of interdisciplinary science. I understood the meaning of it and appreciated the need for different scientific disciplines to come together only after I started to work with several domain science collaborators. It takes time because we are expressing the code in two different ways. The domain scientists tend to express the science algorithmically for the problem they are solving, and computer scientists tend to express algorithms for the machine they are targeting. So they are two different problems. And I think it is really important, even with my SOLLVE hat on and my teacher hat on, or my advisor hat on back with my PhD students … We keep asking: how do you ask the right questions of an interdisciplinary scientist to understand their problem? Because I believe this is more than just programming—this is scientific advancement. Programming enables scientific advancement. So I guess it’s an art to understand the science and be able to advance it using software tools and techniques. I’m glad high-performance computing is able to open doors to huge applications—be it weather, be it COVID, be it next-generation sequence alignment, be it solar physics—and to show how HPC is enabling all of this and moving code to bigger and bigger systems, which wasn’t possible even ten years ago. And I think this needs to keep going. So that would be my two cents to the next-generation workforce.

Gibson: Well, thank you so much for being on the podcast.

Chandrasekaran: Thank you for having me.
