I have deployed many systems that, in their day, were large scale: several top 30 and a couple of top 20 systems. They were deployed in stable programs with established user communities and did not, for the most part, represent a radical departure from what had been done in the past. We worried about clearing floor tiles and getting the power and cooling in the right places. After that, it was all about installing, configuring, and wringing the bugs out of a machine that was built principally from proven parts, for people with existing codes that they anticipated would run as expected. Which they usually did.
Other than a new user manual, an updated FAQ, and perhaps a training course the vendor threw in to sweeten the deal, there wasn’t much thought given to anything else. But as we cross through the trans-petaFLOPS regime on the way to exascale computing, a new pattern is emerging for high-end deployments.
Machines at 10, 20, and 50 petaFLOPS are large enough to break just about everything in the computing ecosystem. Hardware aggregated at this scale starts to show increased failure rates that, in the past, we had worried about only in theory. Interconnects and system software have never been tested at this scale, and both hard and soft errors take flight from previously uninteresting corners of the system. And users begin to dig deeply into both their applications and the science supporting those applications as they attempt to break down the next important barriers in science. All of these stressors become more pronounced the larger the machine, and most people anticipate that the transition to exascale will be disruptive both to users and to the centers fielding these machines.