At a workshop in March, U.S. experts met to discuss issues relating to the fault-tolerance of today’s and tomorrow’s petascale and exascale computing systems. The group explored past practices and common pitfalls, and discussed strategies to ensure that these systems and the applications they run can tolerate the inevitable faults.
As part of his presentation on experiences with FLASH (a physics code), Lamb suggested a solution:
…called Fault Tolerance Backplane, that could keep the application informed about the state of the machine and use this knowledge to write a checkpoint before an imminent failure, thereby avoiding the expensive recovery scenario.
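To make the idea concrete, here is a minimal Python sketch of warning-driven checkpointing. It is not the Fault Tolerance Backplane API or anything taken from FLASH: `HealthMonitor`, `failure_imminent`, `write_checkpoint`, `load_checkpoint` and `run` are hypothetical names invented for illustration, and the "machine state" is simulated by a fixed list of steps at which a warning fires.

```python
import json
import os
import tempfile


class HealthMonitor:
    """Hypothetical stand-in for a service that reports machine state."""

    def __init__(self, warning_steps):
        # Steps at which the simulated machine reports an imminent failure.
        self.warning_steps = set(warning_steps)

    def failure_imminent(self, step):
        return step in self.warning_steps


def write_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read back the state saved by write_checkpoint."""
    with open(path) as f:
        return json.load(f)


def run(monitor, n_steps, path, state=None):
    """Advance a toy computation, checkpointing only when a warning arrives."""
    state = state or {"step": 0, "total": 0}
    checkpoints_written = []
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if monitor.failure_imminent(state["step"]):
            write_checkpoint(path, state)
            checkpoints_written.append(state["step"])
    return state, checkpoints_written
```

A restart would then call `run(monitor, n_steps, path, load_checkpoint(path))` and pick up from the saved step rather than from the beginning. A real system would presumably combine such warnings with periodic checkpoints, since not every failure announces itself in advance.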
Presenter John Daly advocated a shift from fault-tolerance in systems to resilience in applications:
Resilience, on the other hand, an application-centric paradigm, aims to protect applications from data corruption and Byzantine faults, Daly said. It aims to do so in a timely and efficient manner (considering tradeoffs in power, productivity and performance) and in the presence of hardware or software degradations and failures.
There is more in the article, a recommended read on a significant issue for our community.