This is the final article in a series taken from the insideHPC Guide to Production Supercomputing and Systems Management. This five-part series explains how properly managed HPC systems lower the total cost of ownership of your supercomputing programs. This article looks at the SGI Management Suite’s ability to help you keep to a maintenance schedule for your supercomputer.
The SGI MEMlog Memory Error Manager and Logger software tool is one of the SGI Management Suite’s features that is popular with systems administrators and datacenter managers. Many of these customers say they buy SGI systems because of the SGI MEMlog tool. In particular, they appreciate the ability to maintain scheduled maintenance windows, the cost savings from fewer DIMM replacements, and happy users who experience fewer interruptions to their jobs while getting maximum performance bandwidth.
The SGI MEMlog tool is the only software-centric solution that preserves maximum performance bandwidth with a proactive resolution of memory errors while reducing unscheduled downtime and the datacenter’s budget for replacement memory.
The tool collects memory error information from the SDDC+1 memory DIMMs in the SGI servers and determines when the physical memory needs to be replaced. Key to this function is the SGI MEMlog Transient Error filter that determines whether an error is transient – meaning that the error goes away on the memory DIMM – or whether the error is persistent, in which case the error is a hard error that requires physical memory replacement.
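The guide does not describe how the Transient Error filter reaches its decision. As a rough illustration of the distinction only, the Python sketch below labels a memory location as persistent when it logs several errors within a short time span, and as transient otherwise. The `ErrorEvent` record, the `classify_errors` function, and the thresholds `min_repeats` and `window_s` are all hypothetical names and values chosen for this example, not part of the SGI MEMlog tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One logged memory error (hypothetical record layout)."""
    dimm: str         # location label, e.g. "rack1/node4/socket0/dimm3"
    address: int      # physical address reported with the error
    timestamp: float  # seconds since an arbitrary epoch


def classify_errors(events, min_repeats=3, window_s=3600.0):
    """Label each (dimm, address) location as 'transient' or 'persistent'.

    Assumed rule, for illustration only: a location is persistent if it
    logs at least `min_repeats` errors within any `window_s`-second span.
    """
    by_location = defaultdict(list)
    for ev in events:
        by_location[(ev.dimm, ev.address)].append(ev.timestamp)

    verdicts = {}
    for location, times in by_location.items():
        times.sort()
        # Slide over every run of `min_repeats` consecutive errors and
        # check whether the run fits inside the time window.
        persistent = any(
            times[i + min_repeats - 1] - times[i] <= window_s
            for i in range(len(times) - min_repeats + 1)
        )
        verdicts[location] = "persistent" if persistent else "transient"
    return verdicts
```

Under this assumed rule, a single error that never recurs is treated as transient and would not trigger a DIMM replacement, while a location that keeps failing in quick succession is flagged as a hard error.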
The SGI MEMlog tool provides predictive failure analysis for memory and insurance for running jobs. For failed memory, the SGI MEMlog tool retires the failed memory in 4K memory pages while moving the running job from bad to good memory without any interruption or shutdown of the production system.
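To give a sense of what retiring memory in 4K pages means, the short Python sketch below maps failing physical addresses to 4 KB page numbers and counts how much memory would be taken out of service. It illustrates the granularity only; it does not show how the tool migrates a running job, and the function names are hypothetical.

```python
PAGE_SIZE = 4096  # 4K pages, as stated in the text


def page_frame(address):
    """Return the number of the 4K page that contains a physical address."""
    return address // PAGE_SIZE


def pages_to_retire(bad_addresses):
    """Return the sorted, de-duplicated page numbers covering the bad addresses."""
    return sorted({page_frame(addr) for addr in bad_addresses})


def retired_bytes(bad_addresses):
    """Return the total memory removed from service, in bytes."""
    return len(pages_to_retire(bad_addresses)) * PAGE_SIZE
```

For example, failing addresses `0x1000`, `0x1FFF`, and `0x2000` fall in two pages, so only 8 KB would be retired rather than an entire DIMM.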
With the SGI system, end-users get the most memory bandwidth for their jobs because the SGI MEMlog tool manages memory errors through software, rather than with physical memory on stand-by.
Here are several customer use cases that illustrate SGI MEMlog in action:
- Alabama Supercomputer Center – The problem facing systems managers at the Center was to maintain system stability. According to David Young, the Center’s HPC Group Leader, “MEMlog played a key role in our decision to purchase an SGI UV as MEMlog addresses a stability concern for long running simulations.”
- US Research Lab – This lab wanted to identify failing memory DIMMs and determine when to replace them. The SGI MEMlog tool helps systems administrators avoid replacing DIMMs too early, and at the same time helps identify failing DIMMs more quickly than conventional solutions.
- US Space Research Center – DIMM replacements did not correct the system’s numerous memory errors. On the SGI Rackable HPC System, the SGI MEMlog tool was able to separate the transient errors from the hard failures. With the SGI MEMlog transient memory filter, the DIMM replacement rate went from 10-20 DIMMs/week to 3 DIMMs in 6 weeks.
- US Research Lab – This lab needed to easily monitor and locate failed DIMMs. In the SGI MEMlog tool reports, each entry includes the DIMM location information by rack, node, socket, DIMM, the number of errors, and the type of error. In 6 months, 30 out of 5472 DIMMs were replaced on the SGI system.
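The figures quoted in the last two use cases are easier to compare once they are put on a common footing. The Python sketch below uses only the numbers given above; the variable names are ours.

```python
def weekly_rate(dimms, weeks):
    """Average number of DIMMs replaced per week."""
    return dimms / weeks


# US Space Research Center: 10-20 DIMMs/week before, 3 DIMMs in 6 weeks after.
rate_after = weekly_rate(3, 6)          # 0.5 DIMMs per week
reduction_low = 1 - rate_after / 10     # compared with 10 DIMMs/week
reduction_high = 1 - rate_after / 20    # compared with 20 DIMMs/week

# US Research Lab: 30 of 5472 DIMMs replaced in 6 months.
replaced_fraction = 30 / 5472

print(f"Replacement rate after filtering: {rate_after} DIMMs/week")
print(f"Reduction: {reduction_low:.1%} to {reduction_high:.1%}")
print(f"Share of DIMMs replaced in 6 months: {replaced_fraction:.2%}")
```

In other words, the Space Research Center's replacement rate fell to 0.5 DIMMs per week, a reduction of roughly 95% to 97.5%, and the Research Lab replaced about 0.55% of its DIMMs over six months.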
Properly managed HPC supercomputers can deliver sustained Return on Investment for production supercomputing. To ensure ongoing operations are at peak efficiency, system management capabilities should not only include system monitoring and configuration tools, but also workload management, automated power capping, and error and fault detection capabilities.
System management software reduces the time and resources dedicated to administering systems by improving software maintenance procedures and automating repetitive tasks. This lowers total cost of ownership, increases productivity, and provides an improved return on hardware investments.
The SGI Management Suite provides a wide array of tools for provisioning, system health management, and power management of SGI computing systems. Included are powerful tools to initiate management actions, monitor essential system metrics, and improve both overall memory and power efficiency. Designed to solve problems before they become significant production issues, the Suite includes comprehensive systems management for today’s SGI systems and the production supercomputers of tomorrow.
Other benefits include:
- Fewer unscheduled shutdowns (with power management and memory error management).
- Fewer memory replacements, since not all memory failures are hard (persistent) failures.
- For scheduled shutdowns, software updates are faster with multicast provisioning, and hardware replacement is faster because the problem hardware (memory) has already been logged for replacement at the next scheduled shutdown.
- Enhanced uptime makes for happy users.