At SC18 in Dallas, I had a chance to catch up with Gary Grider from LANL. Gary is currently the Deputy Division Leader of the High Performance Computing Division at Los Alamos National Laboratory, where he is responsible for managing the personnel and processes required to stand up and operate major supercomputing systems, networks, and storage systems for the Laboratory, for both the DOE/NNSA Advanced Simulation and Computing (ASC) program and LANL institutional HPC environments.
insideHPC: Gary, thanks for having me today. We haven’t seen each other for a while. I remember you and I were on a panel in Manhattan, I don’t know, something like 10 years ago at the Structure conference. Anyway, can you tell me more about this new organization you are part of?
Gary Grider: So we’re forming a consortium to chase efficient computing. Many HPC sites today seem to be headed down the path of buying machines that work really well for very dense linear algebra problems. The problem is that hardcore simulation is often not a great fit for machines built for high Linpack numbers.
The organization is called the Efficient Mission-Centric Computing Consortium (EMC3). Why EMC3? Efficient, because the world seems to be headed down a path of chasing machines that are frankly Linpack killers. A lot of applications aren’t dense; many applications are sparse, and some have very irregular meshes that don’t work very well on Linpack-killer machines. They get very low efficiencies, in the neighborhood of 1% or less, and so we feel like we need to do more with our silicon and use more than 1% of it well. Why do I say mission-centric? Well, if your mission is to do dense problems, then the machines many sites are buying are fine, but if your mission is not that, you need to chase a different architecture. And so that’s why we’re forming this consortium: a consortium of both vendors that sell and produce technology for HPC and user organizations that have problems similar to ours.
insideHPC: Can you describe how this group is made up? You’re from the national labs, and you’re a thought leader in storage, that’s the way I think of you, and I’m not trying to put you in a box, but who else is in that room?
Gary Grider: Los Alamos is certainly extremely interested in the EMC3 goals, but there are other HPC providers and user sites showing interest. The first vendor that joined was actually DDN, and they’re very interested in disk drive failure at scale, in particular correlated failure, something LANL is interested in as well.
insideHPC: A storage company?
Gary Grider: Yes. You might ask why a storage company for EMC3. Well, efficiency is not just limited to compute and memory; we want efficient infrastructure like storage as well. Mellanox is joining, and they’re very interested in figuring out how to compute in the network, because using dark silicon all over the place will increase efficiency. Cray is joining because they’re interested in our Grand Unified File Indexing technology, which allows users to do file metadata management efficiently. In the processor/memory area, perhaps you saw that we had a recent press release about funding Marvell and another about installing a modest Cray/Marvell ThunderX2 XC50 system. This effort is all about moving HPC Arm solutions toward higher efficiency over time. We measure efficiency along multiple dimensions: usable ops per watt and usable ops per capital dollar. Notice the word “usable,” which is important, because we are not interested in some peak flop/op number; it’s about how much useful work we get. We also care about efficiency in wall clock time for workflows, which brings in network, storage, and system software concerns as well. Finally, we measure efficiency as usable science per programmer hour. If we have to completely rewrite codes to gain efficiency, that may not be as efficient as a solution that is perhaps less performant but takes far less manpower for code rewrites. Anyway, supporting the Arm ecosystem for HPC is an important way to encourage an alternative, potentially more efficient architecture.
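To make those metrics concrete, here is a minimal back-of-the-envelope sketch; the machine names and numbers are entirely hypothetical, not EMC3 figures, and only illustrate how “usable” efficiency differs from peak:

```python
# Hypothetical comparison of "usable" efficiency in the spirit of the metrics
# Grider lists. Every number below is made up for illustration only.
machines = {
    # name: (peak Tflop/s, fraction of peak achieved on a sparse code, power in kW, capital $M)
    "linpack-killer": (100_000, 0.01, 20_000, 500),
    "bandwidth-rich": (20_000, 0.10, 15_000, 400),
}

for name, (peak_tflops, usable_frac, kw, capital_m) in machines.items():
    usable_tflops = peak_tflops * usable_frac              # usable ops, not peak
    flops_per_watt = usable_tflops * 1e12 / (kw * 1e3)     # usable flop/s per watt
    tflops_per_dollar = usable_tflops / (capital_m * 1e6)  # usable Tflop/s per capital dollar
    print(f"{name}: {usable_tflops:,.0f} usable Tflop/s, "
          f"{flops_per_watt:.2e} flop/s per W, {tflops_per_dollar:.2e} Tflop/s per $")
```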
insideHPC: With the Arm guys?
Gary Grider: Right. So they have the new ThunderX2 chip, and it’s a pretty good chip. It’s got more memory bandwidth than most out there today, but where we’re headed is a server-class chip in 2020 or so that would have far more bandwidth, give us much higher efficiencies than we can get today, and give us more potential options for higher-efficiency computing for our next set of machine purchases.
insideHPC: Yeah, so Gary, what is the endgame? Is it to produce new silicon with different characteristics? If you are totally successful, what is the result?
Gary Grider: The result is much higher efficiency, period, and we don’t really care much how we get it, although it feels very much like it’s not just going to be us changing our software; it’s going to be people changing silicon to work better with our problems. And that’s very likely going to mean much, much higher-bandwidth memories, maybe lower-latency kinds of capabilities. Maybe we’ll go back to the old very-high-bytes-per-clock and scatter-gather kinds of things that we used to do. All those things that allowed us to run at much, much higher efficiencies in the past are gone; we need to bring some of them back.
insideHPC: So when you talk about that, is that like the vector days, like the old Cray X-MP kind of thing, where it was only four processors but it was optimized for really fast memory, etc.?
Gary Grider: Yeah. In fact, let me give you an example of that. The Cray-1, X-MP, and Y-MP family had 24 bytes of memory bandwidth per clock, or per op: two 8-byte words read in to multiply together and one 8-byte quantity written out every processor clock. Every cycle you got 24 bytes in and out of memory. Machines that are being purchased today are at 0.1 bytes per clock, or worse. Sure, there are registers and caches and creative compilers and programming environments that help you, but this is one of the largest factors in complex simulation getting such poor efficiency. We are talking two orders of magnitude less memory bandwidth per clock. That’s awfully hard to overcome.
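The arithmetic behind those figures is simple; a quick sketch using only the numbers Grider quotes (24 bytes per clock then, roughly 0.1 now):

```python
# Bytes of memory traffic per clock for a triad-style operation (a = b * c):
# two 8-byte operands read in and one 8-byte result written out.
bytes_per_clock_vector_era = 2 * 8 + 1 * 8   # 24 bytes/clock on the old Cray vector machines
bytes_per_clock_today = 0.1                  # rough figure Grider cites for current machines

ratio = bytes_per_clock_vector_era / bytes_per_clock_today
print(f"{ratio:.0f}x less memory bandwidth per clock")   # ~240x, i.e. over 2 orders of magnitude
```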
insideHPC: So you’re going backwards?
Gary Grider: We’ve been going backwards ever since the Cray Y-MP, right?
Yep. Memory bandwidth is killing us, but memory latencies are even worse. If you have indirect references and branchy code, the data and instruction streams stall quite frequently, especially in complex simulation with irregular data layouts. We’re talking about hundreds to thousands of cycles to recover from some stalls. Memory bandwidth and latency are at the heart of why we’re at a very, very low percent of peak. We feel like we need to do something about that as a community, and we’re starting an effort to try to do that.
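For readers who haven’t run into it, the kind of indirect reference Grider is describing shows up in almost any unstructured-mesh kernel. The sketch below is hypothetical, but every load through the neighbor index array is exactly the sort of data-dependent access that caches and prefetchers handle poorly:

```python
# Hypothetical unstructured-mesh kernel: each cell gathers values from its
# neighbors through an index array, so every load is an indirect (gathered)
# reference rather than a predictable strided access.
import numpy as np

n_cells = 1_000_000
neighbors = np.random.randint(0, n_cells, size=(n_cells, 4))  # irregular connectivity
field = np.random.rand(n_cells)

# flux[i] depends on field values at scattered neighbor locations,
# so the memory access pattern is data-dependent and hard to prefetch.
flux = field[neighbors].sum(axis=1) - 4 * field
```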
insideHPC: So Seymour Cray delivered the first Cray-1 supercomputer to you guys at Los Alamos with no operating system. That’s how important what you’re describing was to the lab mission?
Gary Grider: In fact, it was worse than that. The first Cray-1 worked fine in Minnesota and didn’t work at all in Los Alamos, because the memory wasn’t protected and we had so many cosmic-ray events that it would only run for about a minute and then die. So the first six months were spent sending it back and getting the memory redone. We wrote an operating system for it because there wasn’t one; that was the thing called BOS, which was more or less like a post today. But yeah, we were deadly serious. We needed a machine that could get way more throughput than the CDC machines at the time. The same thing is true today: we need a major leap forward in efficiency for our complex simulations, which are highly irregular and suffer from indirection and instruction branching and all the things that make memory bandwidth and latency our top problem.
insideHPC: So Gary, fast forward to today: nobody can afford to build a big SMP with that kind of crossbar, because the market for it is something you could count on your fingers. So the big vector machines are long gone.
Gary Grider: I don’t think we’re really going to chase a huge SMP. But if you look at the building blocks that we build machines out of today, they deliver 0.1 bytes per flop, and if we can just get the piece parts up to one byte per flop, that’s a 10X jump, which we think will improve overall application efficiency several times over. And that’s a big deal, right? Because we’re going to spend a lot on a machine, and what we are effectively getting is a small fraction of the potential value, because architectures have moved away from serving our kinds of applications well.
insideHPC: Yeah. 1% is all you’re getting out of effective performance…
Gary Grider: Right. And that’s true of most complicated simulations. It’s not just us; it’s the petroleum business, it’s DreamWorks, so it’s many if not most sites that have highly challenging physical simulation as the bulk of their computing needs. Many if not most of those doing hardcore physical simulation on today’s machines are simply not getting very high efficiencies.
insideHPC: So Gary, I know you are in the early days, but what are the next steps? I mean, you formed this thing and now you’re getting some heavy hitters in the industry to say, “We’re with you.” What happens now?
Gary Grider: Well, the next steps are to plan the technology investments we’re going to make to accelerate different features. We need help from the HPC-providing industry players that are willing to partner with us and want to help chase higher efficiency for simulation and similar problems. And we also need other user organizations, both US government and commercial, to join in and make sure the HPC industry players understand the importance and the size of the market that needs this higher efficiency for complex problems.
insideHPC: Well, Gary, my audience is mostly users and a lot of vendors as well. If they’re listening to this and their ears perk up, how do they engage with this?
Gary Grider: There’s two press releases out there so far and you can just go to EMC3.org.
insideHPC: Hey, great. Thanks for sharing this. I hope you have a great week here in Dallas.
Gary Grider: I hope you do too. Thank you.