By Mike Bernhardt, The Exascale Report
The award of Titan to Cray is a milestone event for the HPC community. But for many of the hardware and software engineers at Cray, it’s just another day in the office.
This is what Cray does. They build very large systems and they push the envelope.
But when the team at Cray talks about Jaguar and the other milestone systems they have delivered to their customers, they don’t brag about their engineering prowess. They beam with pride talking about the science and discovery their systems are making possible.
The Exascale Report is pleased to bring you this feature interview with Pete Ungaro, CEO of Cray, Inc.
The Exascale Report: So Pete, congratulations on behalf of an entire community on what you’ve been able to accomplish here – I know it’s been a very long time, and a lot of work put into this, and now we’re finally able to talk about Titan and all the great things this system will be able to do.
UNGARO: I just want to say thanks. It’s been a long road but it’s been a great one.
TER: So I’d like to start off by asking you what Cray has learned over the past year that you believe will enable Titan to hit its goals.
UNGARO: Well we know now that we can build systems at this scale. Jaguar was roughly the same size machine – 200 cabinets, so we know that we can build systems at that size and scale them up and hand them over to the researchers who can get some incredible science out of them. We feel like this is a pretty straightforward process for us. We’ve been able to get our architecture, our interconnect, our software all up to this scale and we think Titan just represents one more turn of the crank for us along that line.
TER: But this is obviously beyond anything that’s taken place up to this point, and knowing how volatile we are when we start to push the envelope, why do you have such confidence that Cray can deliver on such an ambitious project especially after what we saw with IBM / NCSA / Blue Waters this year?
UNGARO: I really think that the key for us is that we’ve really built our company to be focused on delivering these kinds of systems to the market. We’ve delivered the most high-end supercomputers in the world. We’ve been able to deliver systems like this over and over again. And while you are absolutely correct that Titan is going to be a huge step beyond where we’ve been in the past, from a systems architecture standpoint and from a system delivery standpoint, we believe that it’s a pretty straightforward path for us. I think our sole focus on supercomputing is really the biggest area that differentiates us from our competition today. It’s not just about our technology, which is also pretty strong. I think it’s that by building a company like we’ve done, one that focuses on one part of the market and on doing things like this, along with some great partnerships with our customers like Oak Ridge, we’re able to do this pretty successfully and we’re able to repeat it over and over again.
TER: How about for the users? They are going to be exposed to an application development environment that in some ways may be unlike anything they have had to deal with in the past – the programming environment for sure – so what will you be able to do to help transition the users from Jaguar to Titan?
UNGARO: That’s a great question, Mike. And maybe even the most important part of our story and one that we spend a lot of time with Oak Ridge talking about. All the applications that today run on Jaguar will be able to run on Titan, so there is a similar AMD Opteron computing environment, a similar programming model and such. Where the difference comes in is all the GPUs that we’re adding. And so we’ve developed a programming model that allows users to much more easily take advantage of GPUs. So in addition to just using CUDA programming, which most people use today, we can take advantage of a more directives-based approach using our compiler or Portland Group’s. I think one big advantage to our roadmap right now, with our overall vision of adaptive supercomputing, is that it really says that within a programming environment, or programming model, we can take advantage of these different kinds of processing capabilities. In fact, we have a lot of experience that says when we start to use our programming model to code for a GPU it actually improves the code’s performance on a CPU. That’s been a big leverage point for a lot of users to go through the effort of programming for GPUs because they know that their codes will not just run faster on systems with GPUs but also on systems with CPUs, and today, being able to be portable among many different machines is very important for the user community. So I think you have to have a programming environment that allows for that smooth and easy transition but also compels the programmers or scientists to want to do the extra work to take advantage of the underlying hardware.
TER: So, as you are figuring out what the production environments will look like for a system like Titan, will the compilers and debuggers, you mentioned Cray’s and the Portland Group’s, will they have to be modified to a significant level in order to work for that environment?
UNGARO: Yeah, there is definitely work that needs to be done in companies like us and The Portland Group, and PathScale. You know, a number of companies that we’ve worked with have been doing that work for over a year or two now, so there’s been a lot of progress in the marketplace to enable these new kinds of hybrid architectures, especially where you have a CPU and a GPU, which is quite a different architecture from the more traditional one where you have a couple of CPUs on a single node. So it’s been something that we’ve worked on with our partners quite a bit over time, and we also of course have a lot of our own tools, from compilers to debuggers to performance toolkits to different monitoring applications, that we’ve had to enable for this type of environment. So it is a lot of work, but the key has been to try to do that work without forcing the user to completely rewrite their entire application. So, how can we do that in such a way that allows the user to bridge from one environment to the other – and back and forth?
TER: So Pete what is really different about building a system the size of Titan as compared to some of the commodity clusters with GPUs?
UNGARO: Part of what goes on today is that we get so focused on talking about the peak performance of the machine or the Linpack performance of the machine, that we forget that there are different types of machines, and that real sustained performance is really key. I think that what makes a Cray system different from building a ten petaflops GPU system out of commodity components is our whole system environment. We very tightly integrate the hardware with our software with our interconnect, and we build that all together from a single system view, not just aggregating a lot of components from a lot of different places and trying to tie them together and integrate them, but building the system from the ground up. I think that’s what has really been, quite honestly, the success of systems like Jaguar, and what I think is going to be the difference for systems like Titan in the market, from people that just go out and get a bunch of fast processors or GPUs and put them together with InfiniBand.
TER: What impact do you see Titan having on science and industry, and let’s try to put this in a timeframe. Will the system be up and running with users by the time we roll into Supercomputing 2012?
UNGARO: Oak Ridge has laid out a timeframe for the system, and actually the first phase is upgrading all of the system to the latest AMD Interlagos processors and our brand new Gemini interconnect, and that should be available for users in early 2012. And then we’ll start to bring on the GPUs later in 2012, and it should be up for users kind of late 2012 or early 2013 as that gets integrated and goes through its acceptance process and into the Department of Energy allocation process that they use at Oak Ridge. So that’s the current timeline that we have, Mike.
TER: So, we’re probably looking at somewhere in the 2013, 2014, 2015 timeframe to seeing some pretty significant developments based on applications finally figuring their way around the system, and by that time, you guys will probably be ready for another phase – or Titan Plus – won’t you?
UNGARO: LOL. You know one of the interesting things that happened with Jaguar was – within the first couple weeks of Jaguar being up, even before it got through its acceptance process, we already had world record science being done on the machine. So I think that one of the big things that Oak Ridge has done differently with their user community is – they have a great team there that works very closely with our team at Cray and that helps enable their scientists to really take advantage of the machine right when it comes up. So I will be really disappointed if we don’t have some breakthrough science even next year before the machine is fully in production. I know that they are working very hard with their science teams around that and I expect that we’ll have a lot of success there. So hopefully we won’t have to wait until 2013, 2014, or 2015, but we do already have some ideas on what Titan Plus might look like.
TER: So I spoke with Jack Wells at Oak Ridge, and Steve Scott at NVIDIA, and I get the sense that – and I know there are other players involved and I don’t mean to exclude other companies, but from the three of you as the hub of what’s going on here, I sense higher levels of enthusiasm and energy and excitement than I’ve seen with many projects over the past years. It just seems to me, as someone looking at this from the outside, that you guys do have an unusually good synergy established between Oak Ridge, NVIDIA, and Cray. Thoughts?
UNGARO: I would agree and I’d actually include AMD and a couple of other companies in that mix, but I definitely agree with that. We have a long-term partnership that we’ve established with Oak Ridge as we went from kind of one machine generation to the next. I think that’s really started to build a cohesiveness, and not just between vendor community companies like Cray and AMD and NVIDIA and an organization like Oak Ridge, but also extending out into the user community. We’ve had so many opportunities to spend direct time with the people using these machines and understanding what their needs are, so when we bring the next machine out it’s even more ready for what the user community needs. I believe Titan is going to be a big step forward, not just in performance but also in functionality that the user community needs to be very successful within the Office of Science. I’m pretty excited about it and I can’t wait. You know, building machines like this is something that makes me get up in the morning so I’m pretty excited and I know our partners are too.
TER: So Pete, some people in the press are already making a big deal out of Titan being able to challenge Japan for the title of the world’s fastest supercomputer, so I have to ask you, is this really even important?
UNGARO: If you define the world’s fastest supercomputer as what is number one on the Top 500 list, I actually don’t believe that’s very important. I would tell you clearly that’s not a goal of ours at Cray and I can tell you that’s not the goal of Oak Ridge or the Department of Energy. I think what’s really important is being the system that delivers world class science off the machine. So how can we find alternative energy sources and what simulations can we do and how realistic can we make those simulations to convince us about developing alternative energy sources, or about understanding climate change and the impact of energy policy on climate change. Those are things that are going to really be a difference maker for us. Curing cancer on a supercomputer would be something that I think is way more important than being number one on the Top 500 list. Not to say that that isn’t a great thing too. One of the things that Jaguar has done is be the first system in the world to sustain petaFLOPS performance across a set of applications. There are five applications right now that are running at over a petaFLOPS of performance on Jaguar, which is pretty amazing. I think that number is going to grow quite immensely when we get to Titan.
TER: Fantastic. Very well said. I’ve been saying for some time that if we really are serious in this country about ending the nation’s dependence on oil, we should be putting more money into exascale research and not into offshore drilling – but I don’t want to make a political statement of course!
But I’m also very excited about the potential. I wish you all the luck in the world and we’ll check back in with you on a regular basis to report on Titan’s progress. So any closing comments for the user community?
UNGARO: Mike, I just want to thank you for the interview. Obviously this is a huge milestone for Cray. It has a big impact on our company overall and we couldn’t be happier about what this machine is going to do, and even more so, we’re super excited to be able to deliver it to Oak Ridge and the Office of Science.
This article reprinted with permission of The Exascale Report™.