In this video from the NCSA Blue Waters Symposium, Jon Calhoun from the University of Illinois at Urbana-Champaign presents: Effect and Propagation of Silent Data Corruption in HPC Applications.
Modern HPC systems are complex due to the sheer number of components that comprise them. With this complexity comes the reality of failures. One particularly damaging and little-understood type of failure is silent data corruption (SDC). SDC occurs when program state changes without the intervention of the application or the system. Understanding how applications handle state perturbations, and how corrupted values propagate through HPC applications, is key to mitigating the effects of SDC. In this talk, we present our results from fault injection experiments on an Algebraic Multigrid (AMG) linear solver. We explore the sparse matrix-vector multiply, a fundamental component of AMG and other HPC applications. In addition, we explore the effects of SDC on other applications and HPC computation kernels. Finally, we discuss algorithm-level fault tolerance for SDC detection.
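To make the ideas in the abstract concrete, here is a minimal sketch of fault injection and checksum-based detection on a sparse matrix-vector multiply. It is an illustration only and not the code or method used in the talk: the function names (`flip_bit`, `spmv`, `column_sums`, `checksum_ok`), the single-bit-flip fault model, the CSR layout, and the tolerance `rtol` are all assumptions chosen for the example. The check relies on the identity sum(A x) = (column sums of A) · x.

```python
import math
import struct


def flip_bit(value, bit):
    """Return `value` with one bit (0-63) of its IEEE 754 double encoding flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def spmv(indptr, indices, data, x, fault=None):
    """Compute y = A x for a CSR matrix.

    `fault` is a hypothetical (row, bit) pair: one bit of that row's
    result is flipped to emulate a silent data corruption.
    """
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        if fault is not None and fault[0] == row:
            acc = flip_bit(acc, fault[1])
        y.append(acc)
    return y


def column_sums(indices, data, ncols):
    """Checksum vector c with c[j] = sum of column j of A, computed once."""
    c = [0.0] * ncols
    for k in range(len(data)):
        c[indices[k]] += data[k]
    return c


def checksum_ok(y, c, x, rtol=1e-10):
    """Return True if sum(y) matches c . x within a relative tolerance."""
    expected = sum(ci * xi for ci, xi in zip(c, x))
    actual = sum(y)
    if not math.isfinite(actual):
        return False  # a flip that produced inf or NaN is always flagged
    scale = max(abs(expected), abs(actual), 1.0)
    return abs(actual - expected) <= rtol * scale


# 3x3 tridiagonal matrix [[2,-1,0],[-1,2,-1],[0,-1,2]] in CSR form.
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
c = column_sums(indices, data, ncols=3)

clean = spmv(indptr, indices, data, x)
high_flip = spmv(indptr, indices, data, x, fault=(2, 62))  # exponent bit
low_flip = spmv(indptr, indices, data, x, fault=(2, 0))    # lowest mantissa bit

print(checksum_ok(clean, c, x))      # fault-free run passes the check
print(checksum_ok(high_flip, c, x))  # large perturbation is detected
print(checksum_ok(low_flip, c, x))   # tiny perturbation slips under rtol
```

The last case shows a limitation of this kind of check: a flip in a low-order mantissa bit perturbs the result by less than the tolerance needed to absorb ordinary floating-point rounding, so it goes undetected. Whether such small errors matter depends on how they propagate through the rest of the solver.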