Video: Recent Results and Open Problems for Resilience at Scale

In this video from PASC18, Yves Robert from École normale supérieure de Lyon in France presents: Recent Results and Open Problems for Resilience at Scale. “The talk will address the following three questions: (i) fail-stop errors: checkpointing or replication or both? (ii) silent errors: application-specific detectors or plain old trustworthy replication? (iii) workflows: how to avoid checkpointing every task?”

Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems

Christian Engelmann from ORNL gave this talk at PASC18. “Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned.”

High Availability HPC: Microservice Architectures for Supercomputing

Ryan Quick from Providentia Worldwide gave this talk at the Stanford HPC Conference. “Microservices power cloud-native applications to scale thousands of times larger than single deployments. We introduce the notion of microservices for traditional HPC workloads. We will describe microservices generally, highlighting some of the more popular and large-scale applications. Then we examine similarities between large-scale cloud configurations and HPC environments. Finally we propose a microservice application for solving a traditional HPC problem, illustrating improved time-to-market and workload resiliency.”