In this video, Michael Jennings and Jackie Scoggins from LBNL describe the Warewulf Node Health Check software.
“TORQUE, SLURM, and other schedulers/resource managers provide for a periodic “node health check” script to be executed on each compute node to verify that the node is working properly. Nodes which are determined to be “unhealthy” can be marked as down or offline so as to prevent jobs from being scheduled or run on them. This helps increase the reliability and throughput of a cluster by reducing preventable job failures due to misconfiguration, hardware failures, etc. Though many sites have created their own scripts to serve this function, the vast majority are one-off efforts with little attention paid to extensibility, flexibility, reliability, speed, or reuse. The Warewulf developers hope to change that with their Node Health Check project.”
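
To make the idea of a periodic health check concrete, here is a minimal Python sketch of the general pattern: run a list of small checks and report the first failure. It is not the Warewulf Node Health Check code or its configuration format. The names `check_path_exists`, `check_free_space`, and `run_checks` are hypothetical, and the paths and thresholds are placeholder assumptions.

```python
import os
import shutil
from typing import Callable, List, Optional

# Hypothetical convention for this sketch: a check returns None when the
# node looks healthy, or a short human-readable reason when it does not.
Check = Callable[[], Optional[str]]


def check_path_exists(path: str) -> Check:
    """Build a check that fails when `path` is missing (e.g. an unmounted filesystem)."""
    def run() -> Optional[str]:
        if not os.path.exists(path):
            return f"{path} is missing"
        return None
    return run


def check_free_space(path: str, min_free_bytes: int) -> Check:
    """Build a check that fails when `path` has less than `min_free_bytes` free."""
    def run() -> Optional[str]:
        free = shutil.disk_usage(path).free
        if free < min_free_bytes:
            return f"{path} has only {free} bytes free"
        return None
    return run


def run_checks(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the first failure reason, or None if all pass."""
    for check in checks:
        try:
            reason = check()
        except OSError as exc:
            # A check that cannot even run is treated as a failure.
            reason = f"check raised {exc}"
        if reason is not None:
            return reason
    return None
```

A wrapper script could call `run_checks([check_path_exists("/tmp"), check_free_space("/tmp", 1024**2)])` and exit non-zero, printing the reason, when the result is not `None`. How a scheduler invokes such a script and acts on its result differs between resource managers, so the exit-code convention here is an assumption rather than a description of any particular one.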