Found at HPCwire
A combustion researcher may run a huge simulation of a laboratory-scale flame experiment on a supercomputer to better understand the turbulence-chemistry interactions that affect fuel efficiency. But if the system crashes, then all the data from the run is lost and the user has no choice but to start over.
Enter the need for a safety net. One such net is checkpoint restart (though it isn’t appropriate…or even possible…in every situation)
Developed by systems engineers in the Lawrence Berkeley National Laboratory’s (Berkeley Lab) Computational Research Division (CRD), BLCR was initially released to the public in November 2003 as open source software. Since then, many developers from both academia and industry have integrated BLCR into their software packages, including the MVAPICH2, OpenMPI and Cray implementations of MPI, and the Cluster Resources batch system. The original funding for BCLR development came from the SciDAC Scalable Systems Software ISIC; it is now funded through a CS base program called Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) project.
Beyond enabling users to avoid having to start from scratch in the even of a system failure during a run, Berkeley’s software can help enable preemptive scheduling of “urgent” jobs. The new version is aimed at deployment on larger scale systems
The new layer that expands BCLR’s checkpoint footprint by allowing it to simultaneously run on thousands of compute nodes was developed by the Cray Center of Excellence (COE), which was established when the contract for NERSC-5, or Franklin, was awarded to Cray in 2006. The COE’s main goal is to develop innovative software for production-level supercomputing. This is achieved by allowing Cray employees to tap into the vast production expertise of NERSC staff by working from the Berkeley Lab’s Oakland Scientific Facility for two years. The production tools and software developed by the COE will utilize Cray’s release and update process, thus allowing Cray XT sites worldwide to benefit from the COE collaboration. Brian Welty, Terry Mallberg and their Cray colleagues ported and tuned BCLR for deployment on Cray systems as part of the COE.