
Christmas, failure, and the exascale

Dan Reed was up late last night penning a new blog post that starts, appropriately for these post-Halloween weeks, with that most horrible chore of the year:

I speak, without hesitation or ambiguity, regarding the curse (sometimes literal) of serially wired holiday lights. My father and mother kept a healthy supply of spare bulbs for laborious and sequential replacement and testing whenever a single bulb failed, and a strand of thirty bulbs went dark. I still remember my excitement and delight when we first purchased strands with parallel circuit wiring.

…If there is an Aesop-like moral in this tale from my childhood, it relates to systemic design for resilience rather than component resilience alone. Parallel circuit resilience trumps serial circuit resilience, and the extra cost is repaid in greater systemic reliability. Alas, I fear we have not learned this lesson in parallel computer system design and parallel programming models and applications.

Dr. Reed goes on to draw an analogy between our highly failure-intolerant programming schemes and those serial Christmas lights. With enough lights (processes), you are guaranteed to spend a lot of your holiday season with dark spots on the tree. We need to inject failure tolerance into our application programs and system stacks, and Dan suggests a place to start looking for prior art, large-scale cloud computing:

To understand this potential shift in perspective, I heartily recommend Werner Vogels’ analysis of the power of eventual consistency for large-scale web services at Amazon. Eric Brewer’s thoughts on the CAP theorem, drawn from his Inktomi experiences, have also shaped theoretical and empirical assessments of large-scale system reliability. For those not familiar with the CAP theorem, it postulates that one can choose any two of Consistency, Availability or Partition tolerance. More generally, it offers a framework to reason about conflicting objectives.

He avoids claiming the cloud as a panacea, but he is right that there is a lot of IP already developed that our community should be carefully studying.
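To put rough numbers on the Christmas-light analogy, here is a minimal sketch of the arithmetic. It assumes each bulb (or process) works independently with the same probability `p`; the function name and the value 0.999 are illustrative choices of mine, not figures from Dr. Reed's post.

```python
def serial_survival(p: float, n: int) -> float:
    """Probability that a serially wired strand of n components stays up.

    Assumes independent failures: every component must work, so the
    strand survives with probability p ** n.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    p = 0.999  # assumed per-component reliability, purely illustrative
    for n in (30, 1_000, 100_000):
        # One failure takes down the whole serial strand, so survival
        # shrinks geometrically as the component count grows.
        print(f"n={n:>7}: serial strand up with probability {serial_survival(p, n):.3g}")
```

Under those assumptions a thirty-bulb strand stays lit roughly 97% of the time, a thousand-component "strand" only about 37% of the time, and at a hundred thousand components the chance is effectively zero. A parallel-wired strand under the same assumptions simply loses about 0.1% of its bulbs on average and keeps shining, which is the systemic resilience the post is arguing for.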


  1. “the CAP theorem…postulates that one can choose any two of Consistency, Availability or Partition tolerance.” More reasons to tailor design to purposes and objectives.
