Dan Reed has reposted on his blog an essay that recently appeared at the CACM blog, in which he talks about the shortcuts (my word, not his) we took to get to petascale, and his hope that we take a longer view on the way to exascale.
He writes (referring to some of the original petascale activities in the early 1990s):
At the time, most of us were convinced that achieving petascale performance within a decade would require some new architectural approaches and custom designs, along with radically new system software and programming tools. We were wrong, or at least so it superficially seems. We broke the petascale barrier in 2008 using commodity x86 microprocessors and GPUs, Infiniband interconnects, minimally modified Linux and the same message-based programming model we have been using for the past twenty years.
However, as peak system performance has risen, the number of users has declined. Programming massively parallel systems is not easy, and even terascale computing is not routine. Horst Simon explained this with an interesting analogy, which I have taken the liberty of elaborating slightly. The ascent of Mt. Everest by Edmund Hillary and Tenzing Norgay in 1953 was heroic. Today, amateurs still die each year attempting to replicate the feat. We may have scaled Mt. Petascale, but we are far from making it a pleasant or even routine weekend hike.
This raises the real question, were we wrong in believing different hardware and software approaches were needed to make petascale computing a reality? I think we were absolutely right that new approaches were needed. However, our recommendations for a new research and development agenda were not realized. At least in part, I believe this is because we have been loath to mount the integrated research and development needed to change our current hardware/software ecosystem and procurement models.
Reed’s suggested solution?
I believe it is time for us to move from our deus ex machina model of explicitly managed resources to a fully distributed, asynchronous model that embraces component failure as a standard occurrence. To draw a biological analogy, we must reason about systemic, organism health and behavior rather than cellular signaling and death, and not allow cell death (component failure) to trigger organism death (system failure). Such a shift in world view has profound implications for how we structure the future of international high-performance computing research, academic-government-industrial collaborations and system procurements.
I agree with this point of view, and it has echoes of some of the comments Thomas Sterling made at the HPCC conference a couple of weeks ago in Newport as well, in the sense that both advocate a revolutionary, rather than an evolutionary, approach to exascale. My own reason for agreeing with this point of view is that while, yes, we can build petascale machines, we are getting between 1% and 5% of peak on general applications. This is what an evolutionary model gets you. We are well past the point when a flop is worth more than an hour of an application developer’s time. We need to encourage the development of integrated hardware/software systems that help programmers write correct, large-scale applications that get 15, 20, or even 30% of peak performance. To mangle Hamming, the purpose of supercomputing is discovery, not FLOPS.
Not that I think it will happen. The government has been stubbornly unwilling to coordinate its high-end computing activities around any of the several research agendas whose creation, but not implementation, it has funded (you could pick an arbitrary starting point with the PITAC reports, or move either way in time to find sad examples of neglect). My own observation from inside part of this system is that the government has largely begun to think of HPC as “plumbing” that should “just work” in support of R&D, not as an object of R&D itself. There are a few exceptions (mostly in parts of DOE), but without leadership that starts in the President’s office (probably with the science advisor pushing an effort to get POTUS to make his department secretaries fall in line) this is not likely to change on its own.
Our curse is that we have something that kind of works. One of my grad school professors used to say that the most dangerous computational answers are those that “look about right.” If we had a model that was totally broken, we’d be forced to invest in new models of computation, and the scale of that investment would encourage us to make a coordinated effort of it. But our model isn’t totally broken, and as long as it kind of works, I don’t see anyone willing to dump out the existing rice bowls and start over.