In this podcast, Brock Palen and Jeff Squyres speak with Kathryn Mohror and Adam Moody about Scalable Checkpoint/Restart (SCR). SCR is an open-source library for implementing multilevel checkpointing in clustered systems.
The SCR framework has been successful at reducing checkpoint overhead on today’s systems, but we need to develop new methods as we move toward extreme-scale computing. In general, our research efforts focus on reducing the overhead of writing checkpoints even further. We are exploring strategies such as a fast node-local file system written especially for checkpoint I/O, checkpoint compression, asynchronous checkpoint movement, and managing the use of hierarchical storage on future machines, e.g., using burst buffers as an intermediate step before reaching the parallel file system.