The Path to Exascale: Learning From The Past; Planning For The Future

Are we moving in the right direction to achieve necessary levels of global collaboration on exascale development? Is 2018 an achievable goal? Are China, Europe and the rest of the world outpacing the U.S. with technology research? Are we doing enough? What lessons are we learning and, more importantly, are we using what we’ve learned to make better decisions?

In this feature interview, Dave Turek, Vice President of Deep Computing at IBM, dives into some of today’s most pressing questions to give us a thought-provoking, balanced perspective. Dave explores topics ranging from global collaboration to heterogeneous versus homogeneous systems, and shares his own personal observations on what the scientific computing community can do today – and over the next seven years – to prepare for the exascale systems of tomorrow.


Note: This interview was transcribed by the staff of The Exascale Report. Any typos or grammar problems are the result of that transcription and should not be attributed to IBM.

The Exascale Report: Good morning, Dave. Thank you very much for joining us for this interview with The Exascale Report. I’d like to start off with something very simple, a very fundamental question. We’ve written quite a bit about the fact that industry pundits are saying we will require an unprecedented level of global cooperation in order for us to achieve exascale systems. Do you agree with this — that an unprecedented level of global cooperation/collaboration is necessary? Are you seeing an adequate level of this collaboration now?

Dave Turek: Well I think there is going to be a tremendous amount of collaboration — whether it’s unprecedented or not is for others to decide. The issues are driven by the magnitude of the innovations and inventions that need to take place to reach exascale. I think as you and your listeners and readers know, we’ve really reached kind of a barrier with respect to the evolution of a lot of the technologies, and getting over those barriers will require a tremendous amount of innovation. It is not within the purview of any particular company to execute – there is actually an ecosystem of hardware and software players that need to get engaged and make this happen. So to that extent I think it is unprecedented in the HPC space to observe the magnitude of innovation that will be required to get to this next level of computational capability. As a consequence it will drive a tremendous amount of international collaboration. A lot of which we are seeing today, by the way.

TER: We also talk about Europe a lot. Europe seems to have taken quite an aggressive stance toward exascale development, particularly with the EESI, and is in fact home to some significant exascale research activities. Why Europe, and what is your perspective on this?

Turek: I think Europe sees this the same way America sees it, the same way China sees it, and so on, which is that computational capability is a necessary ingredient to either stimulate or perpetuate economic competitiveness. So the application of computational science, computational engineering, and so on are avenues to bring products to market more quickly, to bring better products to market, and to bring products that are better designed than one could achieve through trial and error or other, less sophisticated kinds of technological approaches. So I think the universal theme that you’re seeing is acceptance by governments around the world that there is an equivalency between the embrace of computational science at the extreme and the innovations that will percolate from it and through their economies. Europe is no different than what we are seeing elsewhere in the world. They’ve got programs in place to try to effectuate this, and as a consequence you’re seeing a fair amount of activity.

TER: Yes, we have seen that and you’re right. We’ve touched on it a little bit here – the international scene in China, Japan and Russia — how do you see the activity in these different segments affecting the global HPC community in the short term?

Turek: I think they all have similar aspirations. I think they have different pathways they are exploring for how to get there. They have different views of how technology companies will participate.
But I think perhaps the most important element that comes out of this is that this will act as a catalyst for innovation – number one. Because I think innovative companies will see that there’s not simply demand emanating from a parochial government agency or a government somewhere in the world, even if it is fairly substantial. They’ll see there’s really a broad-based global opportunity. And so that promise of opportunity, I think, will accelerate the whole rate of progress in innovation to support exascale, and as a consequence you’ll see benefits accrue more quickly. I think the other thing you’ll see is that as this becomes a strategic pillar in the notion of how one achieves economic competitiveness, it should give rise to the stimulation of greater and greater amounts of skill, and especially a lot of innovation on the software side, which has certainly been slower in the marketplace over the last couple of decades than what we’ve seen in terms of innovation on the hardware side. So there’ll be a transformative effect in terms of the magnitude and types of skills that get created as well as the kinds of technologies that get produced as a result of this.

TER: Do you think the US Government agencies today are doing enough to drive a coordinated effort toward exascale research – to the level of what we’re seeing, particularly in Europe?

Turek: Enough is always in the eye of the beholder, or in the eyes of the futurist who looks back with 20/20 hindsight and makes those declarations. I think there are a number of government agencies in the US that have materially engaged — and have been for quite some time. The notion of exascale is not a recent phenomenon. In the case of IBM in particular, we’ve been working on this for about 3 years already, and that has given us a lot of insight in terms of what some of these technological transformations need to be. Beyond that though, I think what the US government has done terrifically well has been the identification of problem domains that will benefit from exascale kinds of capability. A lot of this, frankly, because it emanates through the Department of Energy, swirls around things like the smart grid, carbon sequestration, climate models, etc., etc. But even within the DOE, with its focus through the national laboratories, which goes beyond just energy-related issues and rolls into materials science, the biological sciences, etc., you are beginning to see the emergence of a very, very broad portfolio of application domains that will make use of this technology. That becomes a proxy for the marketplace of demand possibilities as well, so as entrepreneurial players and innovative companies begin to look at these domains of opportunity as they emerge, they begin to cast them in economic terms, motivating investments as well.

So, the activity of the US Government agencies — and it’s not the DOE alone by a long shot — but those kinds of activities, relative to the way the US economy works and so on, I think are quite appropriate.

The economic models in Europe, China and Russia, I think, are different. As a consequence you see different kinds of behaviors – in some cases they come about because of the need to establish broad-based international consensus, as, for example, with the structure of the EU, and in other cases from the more industrial-policy-related set of activities that you might see coming out of China. So, they all differ, but they all have similar objectives, and they will all make investments and stimulate activity to try to motivate achievement of the goal.

TER: Another point that kind of splits a lot of folks in the community is the realization of real production-class exascale systems. Pretty much everyone says we can hit an exascale benchmark — whatever we decide that benchmark should be. However, getting real functional systems out there — are you on the side that believes the realization of production-class exascale systems is inevitable?

Turek: Well nothing is inevitable unless you work to make it your goal and take the actions to make it happen. I think our industry has for far too long been caught up in the notion of achieving benchmark results without assuming the responsibility to actually get productive use out of the systems. I think the landscape is littered with systems that have been deployed for perhaps less than strategic benefit and rationale.

From our design perspective, from the very beginning we made just a couple of very simple statements. One — the creation of systems to pursue benchmarks is nothing more than a science fair experiment, which we have no interest in, and we’ve had that as our philosophy for quite some time, even leading up to petascale. You’ll recall we were the first company to produce sustained petaflops applications in an operational context, and that was a consequence of being committed to making sure there is an alignment between what the clients want to achieve and the kind of technology that needs to be deployed to achieve that ambition.

So I think it would be detrimental to the industry broadly if there were simply a race to get to exascale without regard to the principles of trying to achieve real production-quality kinds of systems.

A concrete example has to do with this whole notion of reliability. We know, and I think finally everybody in the industry knows, that to reach exascale-class computers we need to deploy millions of parts in a computer. Millions, potentially tens of millions, of microprocessors associated with other chips, networking capability and so on. And while we were all brought up from our infancy to understand the basic proposition that integrated circuits are highly reliable, which we sort of cavalierly equated to “well, they never fail,” the fact of the matter is that when you start producing millions of things, the statistics will demonstrate that yes, you will observe failures, and you’ll observe them on what appears to be a pretty regular basis. You can’t turn your back on that issue. You can’t just say, well, that’s the way it is — everyone has got to learn to live with it. We think that there are new paradigms that have to be developed — there are new models and new technologies that have to be developed — and that issue has to be resolved at a fundamental level to achieve this notion of production-quality exascale systems. The abandonment of reliability and availability as principal design constructs is an abandonment of seriousness with respect to how a company would approach this problem. So, for us, this is the first order of business. Yes, get to exascale, but by golly get there in a way that works, and is economical, and is going to allow people to work in a way they are comfortable with working.
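
Note: To make the statistics Turek alludes to concrete, here is a minimal back-of-the-envelope sketch. The component counts and per-part MTBF figures are illustrative assumptions, not IBM data; under a simple independent-failure model, the system-level mean time between failures is roughly the component MTBF divided by the number of components.

```python
# Back-of-the-envelope sketch of system-level failure rates. The numbers
# are illustrative assumptions (not IBM figures) under a simple model of
# independent, exponentially distributed component failures.

def system_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
    """Approximate system MTBF when any single component failure counts
    as a system-level fault."""
    return component_mtbf_hours / num_components

# Suppose each part is individually very reliable: a 100-year MTBF.
component_mtbf = 100 * 365 * 24  # hours

for parts in (1_000_000, 10_000_000):
    mtbf = system_mtbf_hours(component_mtbf, parts)
    print(f"{parts:>12,} parts -> expect a fault roughly every {mtbf:.2f} hours")

# With ten million such parts, a fault every few minutes becomes the norm,
# which is why resilience has to be designed in rather than assumed away.
```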

TER: Let’s talk about some of the lessons that we’ll learn in the short term here, particularly with Blue Waters. Our perspective is, when it hits the TOP500, you’ll have something that is 4-5 times faster than anything else out there. What I’d like to get from you is some insight — what do you think IBM will gain from deploying Blue Waters that will help us on this path to exascale?

Turek: I think there are a number of lessons to be gained from Blue Waters, just as there were a number of lessons we garnered from Roadrunner at Los Alamos, and there are a number of lessons we are learning inside IBM today with respect to the Sequoia system at Lawrence Livermore National Laboratory, which is based on Blue Gene technology and targeted for a year after Blue Waters. I think the lessons cascade across a whole array of software-related issues first of all, that deal with programmability, as well as simpler kinds of things — at least intellectually — like manageability and the ability to observe a system and know how to intervene, etc. And again, because you are getting to a scale, in terms of the number of parts and so on, that precludes you from applying very casual kinds of approaches to these kinds of issues, you really have to dig in and develop these systems from the ground up, if you will, to make sure that as you get to an operational stage, the people who are working these systems can actually monitor and manage them, program them, and repair them, and do all those things in a reasonable kind of fashion.

So I think the software threads are important – first order. I think there is a tremendous amount that we learned in the past, and that we continue to learn, in terms of system design for energy efficiency. We’ve set for ourselves, for example, a target to produce an exascale system at 20 megawatts. Now that’s still a pretty hefty amount of energy use, but on the other hand, you see systems in the petascale range today that are operating at roughly 10 megawatts. And so you know those models can’t possibly scale into the future. There is a need for radical redesign.
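
Note: The arithmetic behind that claim is worth spelling out. Using the round numbers quoted above (roughly 1 petaflops at 10 megawatts today versus a 1 exaflops target at 20 megawatts), delivering 1,000 times the performance for only twice the power requires roughly a 500-fold gain in energy efficiency. A small illustrative calculation, not an IBM specification:

```python
# Rough energy-efficiency arithmetic using the round numbers from the
# interview (about 1 PF at ~10 MW today, 1 EF at a 20 MW target).
# Purely illustrative.

PETAFLOPS = 1e15   # floating-point operations per second
EXAFLOPS = 1e18

petascale_power_w = 10e6   # ~10 MW for a petascale system today
exascale_power_w = 20e6    # 20 MW exascale target mentioned above

today_flops_per_watt = PETAFLOPS / petascale_power_w    # ~1e8  (100 MFLOPS/W)
target_flops_per_watt = EXAFLOPS / exascale_power_w     # ~5e10 (50 GFLOPS/W)

print(f"Today:  {today_flops_per_watt:.1e} FLOPS/W")
print(f"Target: {target_flops_per_watt:.1e} FLOPS/W")
print(f"Required gain: {target_flops_per_watt / today_flops_per_watt:.0f}x")
# A ~500x efficiency improvement is why simply scaling today's designs
# cannot work and a ground-up redesign is needed.
```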

So beginning in 1999, with the advent of our commitment to Blue Gene, we’ve embraced this notion of design for energy efficiency, and we’ve seen it cascade through all the subsystems and technologies in the system. At exascale, for example, memory subsystems will drive megawatts of power. Networks, for communication and so on, will drive megawatts of power. It’s not just the microprocessors that one has to worry about. It’s the total system design, the total system architecture for energy efficiency, which is quite critical.

The third thing is space. It’s imperative that these systems be designed so they can be housed in reasonable kinds of data centers or other kinds of facilities. One has to really focus on this because it is untenable from a business perspective to engage in a model of exascale design that is so sloppy in its footprint considerations that, whereas all the other design attributes may be on the mark, you go to a client and say, “By the way, to house this you need to build a new building.” New buildings of this magnitude can easily get someone into the 100 million dollar range, so it’s imperative that we look at the totality of these issues: design for energy efficiency, design for space efficiency, and consideration of software models, all the way from programmability to manageability, that are compatible with conventional ways in which people want to operate these kinds of systems. All those lessons began with our efforts on Roadrunner, they carry through in our efforts on Blue Waters, they will develop further with Sequoia, and they’ll be leveraged quite dramatically as we get into the exascale realm.

TER: Thank you so much for that detailed answer, Dave. Let me ask you, do you consider Blue Waters to be a one-off system?

Turek: No. We consider Blue Waters to be a very forward-thinking kind of design, especially with respect to some of the software and I/O issues that are imperative to how you go forward. Remember, Blue Waters came about through the HPCS program at DARPA, and our response to that program was not about building a ‘go fast’ computer, but about building a very, very fast computer that was focused on ease of use and programmability. The philosophical thread that Blue Waters represents is not to stretch the limits of how fast you can build a computer, but how to build really, really fast computers that achieve the dual goals of ease of use and programmability. So, it’s principally that thread that we have leveraged from Blue Waters that will bring progressive degrees of insight in terms of our designs going forward.

TER: If someone came along, some agency at some point and asked you to develop and build just one exascale system, would you do it?

Turek: No. One of our philosophical tenets is that that would be the rough equivalent of a science fair experiment, and our interest here is to move the industry at large forward. Because you see, whatever ambition IBM has with respect to exascale, it can’t be executed unilaterally. We need help from the networking community. We need help from the semiconductor industry. We need help from the memory industry. None of these are necessarily core IBM businesses. We may do elements of work in these sectors, but they are not our core business. So we actually need to bring the industry forward in lock step, to innovate across the board, to help realize this ambition. Now along the way, if there are very special projects that come up — that we think will provide insight or will help stimulate the development of key technologies — under that kind of proviso we would absolutely consider something like that. But to build something peculiar that’s one-off — that you do one of — is highly problematic, if for no other reason than the opportunity cost associated with taking some of the best and brightest and allocating them to what’s likely going to be a multi-year effort on a single instance, when those people could probably be better deployed against the problem of trying to bring the industry forward in lock step to reach a more broadly acceptable model of exascale computing.

TER: So, do you see the next architecture for exascale being heterogeneous? Could you talk about the role of GPUs or ARM processors in the development of exascale?

Turek: So, I think that first of all, nobody has a perfect crystal ball on this front in terms of GPUs, or ARM processors, or FPGAs, DSPs or whatever the case may be. I know there is a current enthusiasm in the marketplace for GPUs, and there appears to be a growing enthusiasm for ARM processors. There’s a rationale to explain all these levels of enthusiasm in the market segments which are approaching them aggressively versus market segments that are maybe taking more of a ‘wait and see’ attitude. From our perspective, it is always wiser to read signals from the marketplace than to more arrogantly think that we have particular insight into the motivational behaviors of clients by and large. Remember, a lot of things get pressed because of the nature of the individual client that might be pursuing them — a government agency, a national lab, a major university, or what have you. But you always have to keep your eye on the marketplace in the broadest possible instantiation to get a real, true picture of what the trends and directions are. So, in answer to your question, I think that future architectures, if they are respectful of design for energy efficiency and space efficiency, will look very carefully at the partitioning of execution responsibility — among potentially heterogeneous kinds of processors.

So we know first of all, everything is multi-core. That is not debatable. Whether those cores are going to be homogeneous or heterogeneous — the marketplace will help sort that out over the next year or two.

I think that we’ll see a lot of evolutionary thought in terms of whether or not specialized processors need to come into play here. I think all those individual threads on technologies that people are somewhat comfortable discussing always have to be balanced against the other side of the equation, which is software. One can postulate a role for FPGAs, GPUs, or DSPs — or whatever technology you come up with, and there’s been a lot of evolution in terms of hardware advances, etc. — but it always comes down to: what about the software? Are there programming models that support it? Is there a way to migrate forward the software that exists today? Today’s software install base is measured in trillions of dollars. Is there an imperative that will drive people to migrate it forward? Is there a way to migrate it forward, or are we at a point in time where it’s going to be throw it all away and do everything from scratch again? I think those issues are the key issues that will get sorted out over the next two to three or four years, and they will help provide the kind of insight that drives things forward. So, there are no preordained winners or losers here. There is simply a process that has to be undergone, and there has to be an examination of the results, which everybody has to do judiciously.

There is innovation that is going to come along in the next two or three years that will be novel relative to what everybody knows about today, and the marketplace will help sort this out. Now we’ll make some big bets — in IBM’s case in particular, the magnitude of our bets will be quite large. Ten-figure kinds of bets. And so we pay very close attention to all technology trends, but we pay equally close attention to the business side of what’s going on — behavioral issues and other kinds of issues as well. So it’s unsettled for now. There will be an opportunity for the marketplace to make declarations on all these things in the next two or three years.

TER: What about the POWER Architecture? Where does it go from here?

Turek: The POWER architecture continues to evolve. We have one tremendous advantage as we pursue exascale over everyone else in the industry, which is that we actually have the wherewithal and the ability to pull levers on all the critical elements of technology needed to achieve exascale. It manifested itself in terms of the radical degree of system integration we did with the emergence of the Blue Gene program in the early part of the last decade, and it carries forward into the future. We don’t have to rely on the generosity of partners, shall we say, to do the kinds of innovation on some of the key components that we think are fundamental and necessary to really drive some of the design issues that I mentioned earlier and to achieve the goals of what we are trying to get to.

It doesn’t mean we have control over everything — I talked about that before as well. But we’ve already engaged with other companies in the industry on a collaborative and partnership basis to ensure that, as we evolve POWER and as we evolve our thoughts on system design and architecture, other critical technologies that are going to be germane to this discussion will be there for us to use.

And in some cases, by the way, that means really leveraging IP that IBM has produced in a wide array of areas. You’ve seen the statistics, as current as this week, in terms of the number of patents granted to IBM. That active patent portfolio is a critical element to help facilitate some of the collaborations and partnerships in new technologies that we think will be important to the achievement of exascale.

TER: Great! I’d like to chat a little bit about power and liquid cooling technologies and what you see IBM doing in this area, but we may be running out of time, so let me end on a couple of things, and then maybe we can get back to you for the next issue — and talk specifically about those two technologies.

Turek: That’s fine.

TER: OK – we’re facing a lot of political challenges, funding challenges, the technology challenges that are out there — but what about the science? What can scientists do, if anything, to start preparing for 1000X performance improvements in computation? For all the new computational science folks coming through the university system who in 7-8 years will be using exascale systems — what can we offer them to get ready for this new world?

Turek: Well, I think — and this is going to sound, perhaps at one level, a bit silly — but I think that what scientists can do is start stretching their imaginations. What I mean by that is, for the last several years I’ve actually gone to universities and other institutions and posed a question to a lot of the researchers: “What would you do if you had an infinite amount of computing?” Now, an infinite amount of computing is of course untenable, but the way the question was posed was meant to shock people into thinking outside of the box they have put themselves in. Well, why are people in a box? If you go to the universities and you look at some of the junior faculty, you find out they are held hostage by budgetary constraints, or by their ability to garner grants from the NSF or NIH or somebody like that, and as a consequence you find very brilliant people doing terrific science with very, very small amounts of computing, simply because of the way grants are structured and so on. There hasn’t been the ability, or frankly the money, to fund everybody to the degree they might have wanted to be funded. Well, the point is that when you get accustomed to using a particular kind of tool, you think that that’s the kind of tool you have to use for everything. And your imagination sort of wilts a little bit in terms of exploring what the possibilities might be. The trivial example is — when you give a little kid a hammer, suddenly everything looks like a nail. Well, if you give someone a workstation, suddenly everything they try to do scientifically they try to do on a workstation. They have no imagination for what would happen if they suddenly had a million workstations. So, there is tremendous power associated with simply having people ask themselves the question: what would they do if they had a million times more compute power than they have today? A million more processors. A million more workstations. Whatever the metric is that would resonate with the person one was talking to. That’s a critical thing.

I think the second thing is — I think people in the scientific community need to start to do a retrospective examination of what the origins are — or had been — of some of the key technologies that they have a dependency on, and begin to explore quite critically whether or not they are being held hostage by those dependencies. What I mean by that is, if you look at the evolution of the HPC community over the last 40 years or so, you observe that a lot of critical software came out of NSF grants, and other kinds of grants, at universities in the 1960s and early 1970s, that gave rise to the development of algorithms, numerical methods and the implementation of software approaches that frankly have remained persistent to this day. And if you look back at the state of computing in the 60s and 70s and you try to compare it to what is available today, there is no comparison. This gives rise to the fallacious kind of conclusion that many people draw.

They look at the HPC market and they say, why would you build a system of this scale and this size when our software doesn’t scale to anything more than 64 nodes or 128 nodes? The fallacy of the argument is that the conclusion is based on approaches to computational science that originated in an era when no one could envision computers with a million processors. We’ve seen work at places like Argonne National Lab, Lawrence Livermore, and certainly other places around the world that have taken tried-and-true algorithms and, with access to modern computing platforms, recast the algorithms to really map onto these new architectures, and suddenly applications that could only scale to 128 nodes are now at 1,000 nodes or 200,000 nodes. I think that many people who work with numerical methods — who work with simulation — who have a dependency on commercial codes and so on, need to start looking very carefully at the extent to which their science is being held hostage by having a dependency on approaches that are archaic relative to not only where computing architectures will be five or six years from now, but even relative to where they are today. That is a huge, huge issue that I think people need to step up to.
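
Note: One way to see why codes written for a few dozen nodes stall at extreme scale is Amdahl’s law. The sketch below uses illustrative serial fractions (not figures from the interview) to show how even a small non-parallel remainder caps speedup long before hundreds of thousands of nodes, and why recasting the algorithm pays off.

```python
# Amdahl's-law sketch: why algorithms tuned for ~100 nodes stall at
# hundreds of thousands of nodes. The parallel fractions below are
# illustrative assumptions, not measurements quoted in the interview.

def amdahl_speedup(parallel_fraction: float, n_nodes: int) -> float:
    """Ideal speedup on n_nodes when only `parallel_fraction` of the work
    parallelizes and the remainder stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_nodes)

for p in (0.99, 0.999, 0.99999):
    s_small = amdahl_speedup(p, 128)
    s_large = amdahl_speedup(p, 200_000)
    print(f"parallel fraction {p}: 128 nodes -> {s_small:6.1f}x, "
          f"200,000 nodes -> {s_large:9.1f}x")

# With 1% serial work, 200,000 nodes buy less than a 100x speedup;
# shrinking the serial remainder (recasting the algorithm) is what
# actually unlocks the extra nodes.
```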

TER: OK, so let’s wrap this up with you personally, Dave. What’s your vision? What excites you the most when you think about where we might be at the end of this decade?

Turek: I think from my perspective, the excitement really comes from the application domains that one is able to envision being explored in ways people haven’t thought of. People talk about multi-physics or multi-scale problems — this whole notion of being able to look at problems from the atomic level all the way up to the super-macro level — and begin to look at the applications of that, for example, in human medicine, where it’s not simply a matter of modeling the cell, but being able to model the nature of what’s going on at the atomic level within the cell, and then aggregating this in the context of what it means as those cells conspire to act as an organ, and then those organs conspire with others to form a human, and then that human is exposed to a disease vector. What does it all mean, from the atomic level all the way up through all the other subsystems that constitute a human being?

And you see this whole domain of multi-scale problems in everything we do. So, if you think about designing new airplanes or new automobiles, being able to look at things from the atomic level up to the macro level becomes equally important, providing insight into new materials and giving rise to the creation of more efficient automobiles, airplanes and so on. And so I think it’s this notion of scale at the computational level, which lets us approach problems that we see everywhere from a multi-scale perspective, that will drive tremendous innovation, tremendous insight into new products, new approaches to healthcare, new approaches to industrial design, new insights into behavior, climate, etc., etc., that I think will be transformational to the economies that embrace these approaches. And so it’s the creation of the technology, the application of it to problems, coupled with the industrial or political will of particular economies to actually deploy this capability in a way that achieves a more enlightened state in terms of how we operate economies around the world. That’s what excites me!

TER: Fantastic. Dave, thank you so much for taking the time for this interview. We really appreciate it and we look forward to talking with you again in the near future. Good luck with everything that takes you and IBM down that path.

Turek: Thanks very much. I look forward to continuing the discussion.
