There is a right way, and a wrong way, to exascale

Dan Reed has reposted on his blog an essay that recently appeared at the CACM blog, in which he talks about the shortcuts (my word, not his) we took to get to petascale, and his hope that we take a longer view on the way to exascale.

He writes (referring to some of the original petascale activities in the early 1990s):

At the time, most of us were convinced that achieving petascale performance within a decade would require some new architectural approaches and custom designs, along with radically new system software and programming tools. We were wrong, or at least so it superficially seems. We broke the petascale barrier in 2008 using commodity x86 microprocessors and GPUs, Infiniband interconnects, minimally modified Linux and the same message-based programming model we have been using for the past twenty years.

However, as peak system performance has risen, the number of users has declined. Programming massively parallel systems is not easy, and even terascale computing is not routine. Horst Simon explained this with an interesting analogy, which I have taken the liberty of elaborating slightly. The ascent of Mt. Everest by Edmund Hillary and Tenzing Norgay in 1953 was heroic. Today, amateurs still die each year attempting to replicate the feat. We may have scaled Mt. Petascale, but we are far from making it a pleasant or even routine weekend hike.

This raises the real question, were we wrong in believing different hardware and software approaches were needed to make petascale computing a reality? I think we were absolutely right that new approaches were needed. However, our recommendations for a new research and development agenda were not realized. At least in part, I believe this is because we have been loathe to mount the integrated research and development needed to change our current hardware/software ecosystem and procurement models.

Reed’s suggested solution?

I believe it is time for us to move from our deus ex machina model of explicitly managed resources to a fully distributed, asynchronous model that embraces component failure as a standard occurrence. To draw a biological analogy, we must reason about systemic, organism health and behavior rather than cellular signaling and death, and not allow cell death (component failure) to trigger organism death (system failure). Such a shift in world view has profound implications for how we structure the future of international high-performance computing research, academic-government-industrial collaborations and system procurements.

I agree with this point of view, and it echoes some of the comments Thomas Sterling made at the HPCC conference in Newport a couple of weeks ago, in the sense that both advocate a revolutionary, rather than an evolutionary, approach to exascale. My own reason for agreeing is that while, yes, we can build petascale machines, we are getting between 1% and 5% of peak on general applications. This is what an evolutionary model gets you. We are well past the point when a FLOP is worth more than an hour of an application developer’s time. We need to encourage the development of integrated hardware/software systems that help programmers write correct, large-scale applications that get 15%, 20%, or even 30% of peak performance. To mangle Hamming, the purpose of supercomputing is discovery, not FLOPS.

Not that I think it will happen. The government has been stubbornly unwilling to coordinate its high-end computing activities around any of the several research agendas whose creation, but not implementation, it has funded (you could pick an arbitrary starting point with the PITAC reports, or move either way in time to find sad examples of neglect). My own observation from inside part of this system is that the government has largely begun to think of HPC as “plumbing” that should “just work” in support of R&D, not as an object of R&D itself. There are a few exceptions (mostly in parts of DOE), but without leadership that starts in the President’s office (probably with the science advisor pushing an effort to get POTUS to make his department secretaries fall in line), this is not likely to change on its own.

Our curse is that we have something that kind of works. One of my grad school professors used to say that the most dangerous computational answers are those that “look about right.” If we had a model that was totally broken, we’d be forced to invest in new models of computation and because of the scale of that investment we’d be encouraged to make a coordinated effort of it. But our model isn’t totally broken, and as long as it kind of works, I don’t see anyone willing to dump out the existing rice bowls and start over.

Comments

  1. Jeff Layton says:

    I think this topic of new concepts to reach even greater levels of performance is very interesting. To be honest, I don’t really have a strong feeling one way or the other, but being interested in technology, I am curious what new technologies could do for performance.

    One thing I find interesting is John’s and Dan’s comments about how HPC technology has fallen on hard times because of a lack of leadership. If we take a capitalist approach, economic demand would drive the development of ever-faster machines. But since we’re not seeing new technology for faster performance, except perhaps the development of accelerators, one could conclude that the economy isn’t asking for it. While I don’t consider myself a right-wing capitalist, I have seen competition and the market drive some very interesting technologies, and we haven’t seen “it” driving any new high-end technologies. While I’m not prepared to say that this means the world is happy with what we have, I do think it is an interesting data point that the economy and world market are seemingly not asking for exascale systems (or even petascale ones).

    Alternatively, we do have certain government agencies asking for more and more computational ability, and rightly so. And as John and Dan both point out, the government has not taken a lead on developing hardware to get to the needed level of performance. But let me play devil’s advocate for a moment. Let’s suppose the government did take a commanding lead that is better funded and better coordinated, ultimately funding companies to do research and development into new technologies for faster performance. One of the possible problems I see with this approach is that only a very few companies would be able to do this R&D; arguably we would be down to maybe two companies in this field. The systems would presumably be very expensive, and the companies would have to charge government customers a great deal of money for them. Would the government be willing to invest R&D dollars into new systems and then turn around and pay lots of money for them? I think that government customers (and funding agencies) have gotten used to the low price/performance ratios that clusters and open source have brought to the table; they would not be willing to pay the high prices and would buy fewer systems, driving up the price. All of a sudden, bingo, we are in the vicious death spiral of expensive government-funded programs. The only way out of this is either to produce a true revolution in price/performance with new technology or to have the government customers commit to buying a number of these large systems.

    Before anyone jumps down my throat and says that I don’t understand the market or the need for exascale systems, etc., let me just say that I’m not advocating government funding, and I’m not advocating letting the market drive development. I’m just observing what is happening and trying to make sense of things. While I think some government-funded development is fine at any point in time (and I worked in the aerospace industry, which lives on government funding), I also think that it is a good idea to see what the market is “telling” you in general. I’ve seen the argument a million times that government investment in very cutting-edge, risky technology will work its way into mainstream consumer products; personally, I don’t see too much of this lately, particularly in HPC.

    I don’t know about government investment in cutting-edge HPC technologies. Part of me says yes, it’s a great thing and should be pursued with vigor, and part of me thinks that the market does a pretty good job of pushing technology in the direction that people need. I guess I’m saying that a combination is perhaps the best solution, but I don’t know what the ratio should be.

    Anyway, some early morning thoughts. :)

    Jeff
