Dan Katz at Louisiana State University sent me a note this morning to let us know that the TeraGrid’s 2009 Fault Tolerance Workshop, “Fault Tolerance for Extreme-Scale Computing,” started today in Albuquerque. From the abstract:
The purpose of this workshop is to discuss fault-tolerance on large systems for running large, possibly long-running applications. The main point of the workshop will be to have systems people, middleware people (include FT experts), and apps people talk about the issues and figure out what needs to be done, mostly at the middleware and app levels, to run such apps on the coming petascale systems, without having faults cause large numbers of application failures.
A detailed program is available at the link above. This is a huge issue for our community going forward, and I’m looking forward to reading the product of the workshop.