Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems

In this video from PASC18, Christian Engelmann from Oak Ridge National Laboratory presents: Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems.

“Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned.”

Christian Engelmann is a Senior R&D Staff Scientist in the Computer Science Research Group at Oak Ridge National Laboratory, the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory, which has an annual budget of $1.4 billion. He has 17 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems, with a strong funding and publication record. In collaboration with other laboratories and universities, Dr. Engelmann’s research solves computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability.

His primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. Dr. Engelmann is a leading expert in this area and was a member of the DOE Technical Council on HPC Resilience. He received the 2015 DOE Early Career Award for research in resilience design patterns for extreme-scale HPC. His secondary expertise is in lightweight simulation of future-generation extreme-scale supercomputers with millions of processors, studying the impact of hardware and software properties on the key HPC system design factors: performance, resilience, and power consumption.

Dr. Engelmann earned an M.Sc. in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, in 2001, an M.Sc. in Computer Science from the University of Reading, UK, also in 2001 as part of a double diploma, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and a Member of the Institute of Electrical and Electronics Engineers (IEEE), the Society for Industrial and Applied Mathematics (SIAM), and the Advanced Computing Systems Association (USENIX).

See more talks in the PASC18 Video Gallery

PASC19 takes place June 12-14, 2019 in Zurich, Switzerland.
