Over at InfoWorld, Joab Jackson writes that system resiliency will be a primary concern for exascale computing: a machine with many millions of components will have something failing at any given moment. Now, an SC12 presentation by David Fiala from North Carolina State University may offer a solution.
Fiala presented technology, developed with fellow researchers, that may help improve reliability. The technology addresses the problem of silent data corruption, in which systems make undetected errors writing data to disk. Basically, the researchers’ approach consists of running multiple copies, or “clones,” of a program simultaneously and then comparing the answers. The software, called RedMPI, runs in conjunction with the Message Passing Interface (MPI), a library for splitting running applications across multiple servers so the different parts of the program can be executed in parallel.
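The idea of running clones and comparing answers can be illustrated with a toy sketch. This is not RedMPI's code or API: the `vote` and `compute` functions below are hypothetical names invented for illustration, the "replicas" are plain function calls rather than MPI processes, and the corruption is injected by hand.

```python
from collections import Counter

def vote(replica_results):
    """Compare replica outputs; return (majority value, indices that disagree).

    Hypothetical helper for illustration only, not part of RedMPI.
    """
    counts = Counter(replica_results)
    value, freq = counts.most_common(1)[0]
    # A strict majority is required before trusting a value.
    if freq <= len(replica_results) // 2:
        raise RuntimeError("replicas disagree and no majority exists")
    disagreeing = [i for i, r in enumerate(replica_results) if r != value]
    return value, disagreeing

def compute(data, corrupt=False):
    """Stand-in for one clone of the program; `corrupt` simulates a silent error."""
    total = sum(data)
    return total + 1 if corrupt else total

data = [1, 2, 3, 4]
# Three clones run the same computation; the second one is silently corrupted.
results = [compute(data), compute(data, corrupt=True), compute(data)]
value, disagreeing = vote(results)
print(value, disagreeing)  # 10 [1]
```

In this sketch, two clones are enough to notice a mismatch but not to tell which answer is right, while three allow the odd one out to be outvoted.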
Read the Full Story.