This is the third article in a series taken from the insideHPC Guide to Production Supercomputing and Systems Management. This five-part series explains how a properly managed HPC system will lower the total cost of ownership of your supercomputing programs. This article looks at the SGI Management Suite’s ability to monitor and manage your supercomputing programs.
The SGI Management Suite’s system health monitoring and management capability collects health status information on fundamental system functions such as memory, CPU, and power. It identifies changes that require action, automatically alerts the system administrator, and provides proactive solutions to correct the problem.
SGI Management Suite uses the open source tools Ganglia and Nagios to collect system health data and alert the system administrator to important changes that require attention.
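To make the collect, evaluate, and alert cycle concrete, here is a minimal sketch of a threshold check written in the style of a Nagios plugin. It is not SGI's implementation: the metric names and threshold values are hypothetical, and the only convention borrowed from Nagios is its standard plugin status codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from enum import IntEnum


class Status(IntEnum):
    """Standard Nagios plugin status codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def check_metric(name, value, warn, crit):
    """Classify one sample of a metric where higher values are worse.

    `name`, `warn`, and `crit` are illustrative; real thresholds depend
    on the site and the hardware being monitored.
    """
    if value is None:
        # No data collected: report UNKNOWN rather than guessing.
        return Status.UNKNOWN, f"{name} UNKNOWN - no data"
    if value >= crit:
        status = Status.CRITICAL
    elif value >= warn:
        status = Status.WARNING
    else:
        status = Status.OK
    return status, f"{name} {status.name} - value={value}"


def alerts(samples, thresholds):
    """Return messages for every metric that is not OK.

    `samples` maps metric name -> latest value (or None);
    `thresholds` maps metric name -> (warn, crit).
    """
    messages = []
    for name, (warn, crit) in thresholds.items():
        status, message = check_metric(name, samples.get(name), warn, crit)
        if status is not Status.OK:
            messages.append(message)
    return messages


if __name__ == "__main__":
    # Hypothetical readings for a single node.
    thresholds = {"cpu_load_pct": (80, 95), "mem_used_pct": (85, 95)}
    samples = {"cpu_load_pct": 42, "mem_used_pct": 97}
    for line in alerts(samples, thresholds):
        print(line)
```

In a real deployment the values would come from the monitoring data itself, and the non-OK messages would go to whatever notification channel the administrator has configured.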
For extra protection without system administrator intervention, SGI Remote Services provides 24-hour daily monitoring of the system logs; alerts are immediately forwarded to SGI Technical Support for analysis, action, and communication with the customer.
SGI MEMlog™, SGI’s memory error manager and logger, analyzes memory failure data and takes proactive action for jobs running on failed or failing memory.
Another major capability of the SGI Management Suite is power management, which is important for power capacity management and for proactive power limiting in response to datacenter changes.
The remainder of this article series is devoted to an overview of power management and another equally important (and highly popular) capability: the SGI MEMlog software tool. If you prefer, you can download the complete insideHPC Guide to Production Supercomputing and Systems Management, courtesy of SGI and Intel.