Checkpointing the Un-checkpointable: MANA and the Split-Process Approach

Gene Cooperman from Northeastern University gave this talk at the MVAPICH User Group. “This talk presents an efficient, new software architecture: split processes. The ‘MANA for MPI’ software demonstrates this split-process architecture. The MPI application code resides in ‘upper-half memory’, and the MPI/network libraries reside in ‘lower-half memory’.”

Easy and Efficient Multilevel Checkpointing for Extreme Scale Systems

Leonardo Bautista from the Barcelona Supercomputing Center gave this talk at PASC18. “Extreme scale supercomputers offer thousands of computing nodes to their users to satisfy their computing needs. As the need for massively parallel computing increases in industry, computing centers are being forced to increase in size and to transition to new computing technologies. In this talk, we will discuss how to guarantee high reliability to high performance applications running in extreme scale supercomputers. In particular, we cover the tools necessary to implement scalable multilevel checkpointing for tightly coupled applications.”

High Availability HPC: Microservice Architectures for Supercomputing

Ryan Quick from Providentia Worldwide gave this talk at the Stanford HPC Conference. “Microservices power cloud-native applications to scale thousands of times larger than single deployments. We introduce the notion of microservices for traditional HPC workloads. We will describe microservices generally, highlighting some of the more popular and large-scale applications. Then we examine similarities between large-scale cloud configurations and HPC environments. Finally we propose a microservice application for solving a traditional HPC problem, illustrating improved time-to-market and workload resiliency.”

RCE Podcast Looks at Scalable Checkpoint/Restart (SCR)

In this podcast, Brock Palen and Jeff Squyres speak with Kathryn Mohror and Adam Moody about Scalable Checkpoint/Restart (SCR).