MemVerge and Open Source Community Partnering to Protect Distributed HPC Apps with DMTCP

ST. LOUIS – NOVEMBER 15, 2021 – At SC21 today, MemVerge and the DMTCP Project announced a partnership designed to accelerate development and adoption of long-awaited Distributed MultiThreaded Checkpointing (DMTCP) technology.

Checkpointing is commonly used by enterprise apps to minimize downtime, but it has been almost impossible for complex distributed HPC apps with massive data sets. Under development for over a decade, DMTCP has recently made checkpointing practical for a range of workloads, including VLSI circuit simulators, circuit verification, formalization of mathematics, bioinformatics, network simulators, high energy physics, cybersecurity, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and high performance computing (HPC). DMTCP stands ready for commercialization and wider deployment.

The collaboration between the DMTCP Project and MemVerge will facilitate DMTCP’s move into the market. The partnership includes MemVerge developers joining the DMTCP Project and contributing to open-source development; MemVerge providing commercial support for the open-source DMTCP software; and MemVerge integrating the fully tested and supported version into application-specific Big Memory Solutions. MemVerge has also begun a collaboration with the National Energy Research Scientific Computing Center (NERSC) to optimize MPI-Agnostic Network-Agnostic (MANA), a plugin on top of DMTCP that has been used for transparent checkpointing of MPI on the Cori and Perlmutter supercomputers.

“Robust, performant checkpointing offers us flexibility in scheduling jobs for system maintenances and real-time data processing for experimental facilities. This feature also allows us to better backfill jobs, which ultimately leads to increased system utilization and improved job throughput for our nearly 8,000 scientific users,” said Rebecca Hartman-Baker, User Engagement Group Lead, National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory.

Gene Cooperman, Professor at Northeastern University, has led the open-source DMTCP Project for almost 20 years. He is especially excited about the recent three-way collaboration to support MANA for MPI.

According to Professor Cooperman, “The collaboration among NERSC/LBNL, MemVerge, and the DMTCP open-source community will bring reliable and efficient transparent checkpointing to MPI (and later to CUDA) for the production market. While DMTCP and MANA will always remain free and open source, the use of MemVerge technology for rapid writing of memory to stable storage will bring an important enhancement to this technology.”

“Distributed checkpointing is a perfect complement to ZeroIO In-Memory Snapshot technology that MemVerge has pioneered,” said Charles Fan, CEO of MemVerge. “We look forward to collaborating with the DMTCP community on future technology and market development.”

“Being able to seamlessly and graciously recover from system failures during complex simulation runs is critical to optimize efficiency for completing jobs with long run-times,” said Mark Nossokoff, Senior Research Analyst at Hyperion Research. “Checkpointing is a well-understood technique for saving the states of independent node memory during a failure mode and restoring that state when the machine is back up and running. Bringing checkpointing capability to big memory architectures with pooled, distributed memory across multiple nodes operating on large datasets should further enable adoption of in-memory computing techniques within the HPC and AI communities. Kudos to MemVerge for stepping up to provide the industry stewardship to make DMTCP a commercial reality.”

About DMTCP and The DMTCP/MANA Project
DMTCP (Distributed MultiThreaded Checkpointing) transparently checkpoints a single-host or distributed computation in user space, with no modifications to user code or to the operating system. It works on most Linux applications, including Python, MATLAB, R, GUI desktops, and MPI. It is robust and widely used (on SourceForge since 2007). MANA is an implementation of transparent checkpointing for MPI. MANA is under continuing development but has already demonstrated robust, transparent checkpointing for computations with 1,000 MPI processes.

About MemVerge
Memory is too small. Storage is too slow. In response, MemVerge is pioneering the new category of Big Memory Computing, which combines the nanosecond performance of memory with the massive capacity of storage. MemVerge Memory Machine Software transparently virtualizes different types of memory hardware into a pool of software-defined memory with the same performance as DRAM, but with many times the capacity. On top of the transparent memory service, Memory Machine provides the industry’s first suite of data services that can provision petabytes of capacity, performance, availability, and mobility at the speed of memory and across clouds. To learn more about MemVerge, visit www.memverge.com.