So this is good news. When I wrote about Evergrid’s product release yesterday, I didn’t notice that the company had also partnered with Platform to integrate Availability Services with LSF.
Recall that the Availability Services product allows applications to be checkpointed without modification on the application side or special treatment by the operating system. This is obviously good for long-running user applications, but it also provides something very powerful for HPC center operators: preemptive scheduling.
Because jobs can be suspended and returned to execution on command, centers can now be a lot more creative with batch scheduling policies without sacrificing high utilization numbers.
From the company’s release:
By recording the state of the application, Evergrid is able to checkpoint and recover from failures at near 100-percent reliability with minimal overhead. This is especially useful in high performance technical computing environments where distributed applications may run for hours and even days. Evergrid’s AvS-Batch also provides for stateful pre-emptive scheduling, which allows users to checkpoint the entire state of lower priority jobs to disk to allow higher priority jobs to run immediately.
Once the high priority jobs complete, the checkpointed applications can resume execution on available resources. This capability ensures that no compute cycles are ever lost when a job is pre-empted. Evergrid changes the nature of application pre-emption today, which, with current commercial technologies, requires the lower priority job to be stopped and restarted from the beginning, losing all work done to that point in time. Evergrid’s AvS-Batch allows a pre-empted job to resume from the checkpoint, leveraging all work done by that application up until the point of pre-emption. Stateful pre-emptive scheduling also lets commercial applications users make more efficient use of their software licenses.
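To make the difference between the two kinds of pre-emption concrete, here is a toy sketch in Python. It has nothing to do with Evergrid's or Platform's actual software: the `Job` class, the `run`, `preempt`, and `simulate` functions, and the job sizes are all made up for illustration, and it assumes checkpointing itself is free, which a real system would not be.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    work: int       # total time units of work the job needs (hypothetical)
    done: int = 0   # progress retained so far
    spent: int = 0  # total compute time consumed, including any lost work


def run(job: Job, units: int) -> int:
    """Run a job for up to `units` time units and return the units used."""
    used = min(units, job.work - job.done)
    job.done += used
    job.spent += used
    return used


def preempt(job: Job, stateful: bool) -> None:
    """Suspend a job.

    Stateful pre-emption keeps the progress (as if checkpointed to disk);
    otherwise the progress is discarded and the job restarts from zero.
    Checkpoint overhead is ignored in this toy model.
    """
    if not stateful:
        job.done = 0


def simulate(stateful: bool) -> tuple[int, int]:
    """Return (compute spent on the low-priority job, cycles wasted)."""
    low = Job("low-priority", work=10)
    high = Job("high-priority", work=4)

    run(low, 6)              # low-priority job runs before the other arrives
    preempt(low, stateful)   # high-priority job takes over the resources
    run(high, high.work)     # high-priority job runs to completion
    run(low, low.work)       # low-priority job resumes (or restarts)

    return low.spent, low.spent - low.work


if __name__ == "__main__":
    print("stateful:", simulate(stateful=True))
    print("restart: ", simulate(stateful=False))
```

In this model the restarted job pays again for the six units it had already finished, while the checkpointed job only pays for the work it still had left. That lost work is the cost the release says stateful pre-emption avoids.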