
Fault tolerance at scale: embrace your inner slacker

iSGTW is running a feature article this week with an overview of the issues discussed at the Fault Tolerance for Extreme Scalability Workshop sponsored by the NSF in March.

At a workshop in March, U.S. experts met to discuss issues relating to the fault-tolerance of today’s and tomorrow’s petascale and exascale computing systems. The group explored past practices and common pitfalls, and discussed strategies to ensure that these systems and the applications they run can tolerate the inevitable faults.

As part of his presentation on experiences from FLASH (a physics code), Lamb suggested a solution

…called Fault Tolerance Backplane, that could keep the application informed about the state of the machine and use this knowledge to write a checkpoint before an imminent failure, thereby avoiding the expensive recovery scenario.
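
To make the idea concrete, here is a minimal sketch of proactive checkpointing driven by a health notification. It is not the Fault Tolerance Backplane's actual interface, which the article does not describe; `Backplane`, `HealthEvent`, `Simulation`, and the JSON checkpoint format are all hypothetical names invented for illustration. The sketch assumes the application receives a warning callback and writes its checkpoint at the next safe point in its main loop.

```python
import json
import os
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HealthEvent:
    """Hypothetical message describing the state of one node."""
    node: str
    failure_imminent: bool


@dataclass
class Backplane:
    """Hypothetical stand-in for a machine-state notification channel."""
    subscribers: List[Callable[[HealthEvent], None]] = field(default_factory=list)

    def subscribe(self, callback: Callable[[HealthEvent], None]) -> None:
        self.subscribers.append(callback)

    def publish(self, event: HealthEvent) -> None:
        # Deliver the event to every registered application callback.
        for callback in self.subscribers:
            callback(event)


class Simulation:
    """Toy application whose entire state is a step counter."""

    def __init__(self, checkpoint_path: str) -> None:
        self.step = 0
        self.checkpoint_path = checkpoint_path
        self.checkpoint_requested = False

    def on_health_event(self, event: HealthEvent) -> None:
        # Only record the request here; the checkpoint itself is written
        # at the end of the next step, when the state is consistent.
        if event.failure_imminent:
            self.checkpoint_requested = True

    def write_checkpoint(self) -> None:
        # Write to a temporary file and rename it, so a crash mid-write
        # cannot leave a truncated checkpoint under the final name.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": self.step}, f)
        os.replace(tmp_path, self.checkpoint_path)
        self.checkpoint_requested = False

    def advance(self) -> None:
        self.step += 1
        if self.checkpoint_requested:
            self.write_checkpoint()
```

In use, the application would call `backplane.subscribe(sim.on_health_event)` once at startup and then run its usual loop of `sim.advance()` calls. Deferring the write to the end of a step is a design choice in this sketch; it trades a short delay for a checkpoint that always reflects a completed step.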

Presenter John Daly advocated a shift from fault tolerance in systems to resilience in applications:

Resilience, on the other hand, an application-centric paradigm, aims to protect applications from data corruption and Byzantine faults, Daly said. It aims to do so in a timely and efficient manner (considering tradeoffs in power, productivity and performance) and in the presence of hardware or software degradations and failures.

There is more in the full article, which is a recommended read on an issue of real significance to our community.
