Easy and Efficient Multilevel Checkpointing for Extreme Scale Systems

In this video from PASC18, Leonardo Bautista from the Barcelona Supercomputing Center presents: Easy and Efficient Multilevel Checkpointing for Extreme Scale Systems.

“Extreme scale supercomputers offer thousands of computing nodes to their users to satisfy their computing needs. As the need for massively parallel computing increases in industry, computing centers are being forced to increase in size and to transition to new computing technologies. While the advantage for the users is clear, such evolution imposes significant challenges, such as energy consumption and reliability. In this talk, we will discuss how to guarantee high reliability to high performance applications running in extreme scale supercomputers. In particular, we cover the tools necessary to implement scalable multilevel checkpointing for tightly coupled applications. This includes an overview of failure types and frequency in current HPC systems. The talk will also cover the theoretical analysis necessary to achieve optimal utilization of the computing resources. Moreover, we will discuss the internals of the FTI library tool, to study how multilevel checkpointing is implemented today.”
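
The abstract does not say which theoretical analysis the talk uses. As a rough illustration of what "optimal utilization" can mean for checkpointing, the sketch below applies Young's classic first-order approximation, which puts the best compute time between checkpoints at about sqrt(2 × checkpoint cost × mean time between failures). The function name, the level names, and all of the numbers are hypothetical, and treating each checkpoint level independently is a simplification rather than the model presented in the talk or implemented in FTI.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the compute time between checkpoints.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mtbf_s: mean time between the failures this checkpoint protects
            against, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical checkpoint levels: (cost in seconds, MTBF in seconds).
# Cheap local checkpoints cover frequent failures; expensive ones written
# to the parallel file system cover rare, severe failures.
levels = {
    "node-local storage": (5.0, 4 * 3600.0),
    "parallel file system": (300.0, 7 * 24 * 3600.0),
}

for name, (cost, mtbf) in levels.items():
    interval = young_interval(cost, mtbf)
    print(f"{name}: checkpoint roughly every {interval / 60.0:.1f} minutes")
```

The point of the sketch is only the trend: the cheaper a checkpoint level is, the more often it pays to use it, which is the intuition behind combining several levels.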

Dr. Leonardo Bautista-Gomez is a Senior Researcher at the Barcelona Supercomputing Center, where he leads the European Marie Skłodowska-Curie Individual Fellowship (MSC-IF) project on Deep-memory Ubiquity, Resilience and Optimization (DURO). He was awarded the 2016 IEEE Computer Society Technical Committee on Scalable Computing (TCSC) Award for Excellence in Scalable Computing (Early Career Researcher). Before moving to BSC, he spent three years as a postdoctoral researcher at Argonne National Laboratory, where he investigated data corruption detection techniques and error propagation. Prior to that, he completed his PhD in fault tolerance for extreme scale supercomputers at the Tokyo Institute of Technology. There, he developed a scalable multilevel checkpointing library called Fault Tolerance Interface (FTI) to guarantee application resilience at extreme scale. For this work, he received an Honorable Mention for the 2011 ACM/IEEE George Michael Memorial High Performance Computing Ph.D. Fellowship at the Supercomputing Conference (SC’11). In addition, his paper “FTI: High Performance Fault Tolerance Interface for Hybrid Systems” received a Special Certificate of Recognition for achieving a perfect score at SC’11. FTI is currently one of the most popular multilevel checkpointing libraries and is the focus of multiple ongoing European research projects. Before moving to Tokyo Tech, he earned a master's degree in Distributed Systems and Applications from Pierre and Marie Curie University (Paris 6).

See more talks in the PASC18 Video Gallery
