In this video from PASC18, Yves Robert from École normale supérieure de Lyon in France presents: Recent Results and Open Problems for Resilience at Scale.
“The talk will address the following three questions: (i) fail-stop errors: checkpointing or replication or both? (ii) silent errors: application-specific detectors or plain old trustworthy replication? (iii) workflows: how to avoid checkpointing every task?”
Yves Robert is currently a professor at the Laboratory of Computer Science (LIP) at the École Normale Supérieure de Lyon. He is the author of 7 books, 130 articles published in international journals, and 190 papers in international conferences. He has directed or co-directed 25 doctoral dissertations and has served on numerous editorial and scientific committees. A Fellow of the IEEE, he was elected a senior member of the Institut Universitaire de France in 2007, and his membership was renewed in 2012. He has been a visiting scholar at the University of Tennessee, Knoxville since 2011. His main research topics are algorithms, scheduling, and resilience techniques for large-scale computing platforms.