Video: Effect and Propagation of Silent Data Corruption in HPC Applications

“Modern HPC systems are complex due to the sheer number of components that comprise them. With this complexity comes the reality of failures. One particularly damaging and little-understood type of failure is silent data corruption (SDC). SDC occurs when program state changes without the intervention of the application or the system. An understanding of how applications handle state perturbations and how these corrupted values propagate through HPC applications is key to mitigating the effects of SDC. In this talk, we present our results from fault injection experiments on an Algebraic Multigrid linear solver.”
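
To make the idea of a fault injection experiment concrete, here is a minimal Python sketch. It is not the speakers' methodology or tooling: the helper names `flip_bit` and `jacobi` are hypothetical, the fault model (a single bit flip in one IEEE 754 double) is an assumption, and a plain Jacobi iteration on a tiny 3x3 system stands in for the Algebraic Multigrid solver discussed in the talk.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return corrupted


def jacobi(A, b, iterations, inject=None):
    """Run a fixed number of Jacobi sweeps on A x = b.

    `inject` is an optional (iteration, index, bit) triple (an assumed
    fault model): just before that sweep, one bit of x[index] is flipped
    and nothing else in the program is told about it.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(iterations):
        if inject is not None and inject[0] == k:
            _, i, bit = inject
            x[i] = flip_bit(x[i], bit)  # simulated silent corruption
        # Standard Jacobi update: every entry uses only the previous iterate.
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x


# Small diagonally dominant system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]

clean = jacobi(A, b, 60)
corrupted = jacobi(A, b, 60, inject=(5, 1, 61))  # flip an exponent bit mid-solve
```

In this toy setup the flipped exponent bit collapses one entry of the iterate to nearly zero, yet the remaining sweeps pull the solution back toward [1, 1, 1], because a convergent stationary iteration damps any perturbation of its iterate. Other bit positions can behave very differently (a flip that produces a huge value may overflow instead), which is the kind of variation a fault injection study sets out to measure.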