From Grid Today we learn of a new product launch from Evergrid Inc.: the Cluster Availability Management Suite (CAMS). According to the release:
CAMS is comprised of two products, Evergrid Availability Services (AvS-Batch) and Evergrid Resource Manager (RM-Batch). Evergrid AvS-Batch captures the collective state of single or multiple nodes running distributed applications and prevents downtime by performing checkpoint, migration and recovery of the application, thus providing automatic failover across multiple nodes and tiers. Evergrid RM-Batch allows efficient allocation of resources and stateful preemptive scheduling of jobs. CAMS ensures that no compute cycle is lost by recovering, migrating or pre-empting jobs. This translates to greater flexibility, reliability and utilization of computing resources.
Evergrid has very interesting technology, which I’ve seen in person a couple of times over the past several years as it was being developed. What I’m particularly interested in is the Availability Services feature that allows automatic checkpointing of running applications with no intervention needed on the part of the programmer. This is sorely needed for long-running and/or large jobs in enterprise and HPTC, and for mission-critical jobs in enterprise HPC.
Evergrid provides transparent fault tolerance using an OS abstraction layer that loads between the operating system (OS) and the application. Without modifying either the application or the operating system, CAMS/AvS periodically captures the collective state of the application across the entire infrastructure while the application continues processing. By recording the state of an application and all of the OS and system state, Evergrid is able to checkpoint and resume from failures or interruptions rapidly with minimal overhead. Even failure of multiple servers or of software systems does not stop an application from being able to resume processing from a checkpoint.
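To make the checkpoint-and-resume idea concrete, here is a minimal Python sketch. It is not Evergrid's implementation: CAMS/AvS is described as working transparently beneath an unmodified application, whereas this toy job saves its own state explicitly. The function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the `fail_at` parameter, and the checkpoint file are all hypothetical, chosen only for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical knob that simulates a node failure at that step.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If the first run dies partway through, calling `run_job` again with the same checkpoint path picks up from the last saved step rather than from zero, so only the work done since that checkpoint is repeated. A transparent system would aim for the same effect without the application having to call anything itself.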