This is the fourth article in a series taken from the insideHPC Guide to Production Supercomputing and Systems Management. This five-part series explains how a properly managed HPC system will lower the total cost of ownership of your supercomputing programs. This article looks at power management for your supercomputer.
Today’s HPC supercomputers have significant power requirements that must be considered as part of their Total Cost of Ownership. In addition, efficient power management capabilities are critical to sustained return on investment.
SGI Management Suite delivers comprehensive power management on systems that support the Intel Power Node Manager on Intel Xeon servers.
The power management tool forecasts power by collecting power metrics by node and rack. The metrics are represented in watts. The sum of the power on the nodes comprises the power resources for the whole system – data that can be used for forecasting the power utilization for the entire datacenter.
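The roll-up described above can be illustrated with a minimal Python sketch. It is not SGI Management Suite code: the sample readings, the rack and node names, and the `aggregate_power` helper are all hypothetical, and only show how per-node wattage sums to rack and system totals.

```python
from collections import defaultdict

# Hypothetical sample readings as (rack, node, watts); not real SGI data.
readings = [
    ("rack1", "n001", 310.0),
    ("rack1", "n002", 295.5),
    ("rack2", "n003", 402.0),
]

def aggregate_power(readings):
    """Return (per-rack totals, system total), both in watts."""
    per_rack = defaultdict(float)
    for rack, _node, watts in readings:
        per_rack[rack] += watts
    return dict(per_rack), sum(per_rack.values())

rack_totals, system_total = aggregate_power(readings)
```

Totals collected this way over time are the kind of input a datacenter-wide power forecast would start from.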
Power is managed through a feature called power capping, which lowers server processor speeds to reduce power usage at the cost of some performance. This has proven to be a useful tool for managing power resources and avoiding unscheduled system shutdowns due to power failures.
There are a variety of reasons to use this capability. For example, power capping can be implemented before the datacenter’s maximum power resources are exceeded due to higher air conditioning use during the summer months.
Power capping can be automatic, providing a real-time proactive response to avoid system failure. The power management tool detects inlet temperature changes and caps power automatically, for example during a chiller failure, when the datacenter temperature reaches 45°C or when it registers other extreme temperature changes.
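The trigger logic can be sketched as a simple threshold check. This is an illustrative assumption, not the product's actual rule: the 45°C limit comes from the text above, while the `MAX_STEP_C` value used to define an "extreme" change between two samples and the `should_cap` function name are hypothetical.

```python
INLET_LIMIT_C = 45.0  # threshold cited in the article
MAX_STEP_C = 5.0      # assumed: a jump this large between samples counts as extreme

def should_cap(prev_temp_c, curr_temp_c):
    """Return True if the inlet temperature warrants an automatic power cap."""
    too_hot = curr_temp_c >= INLET_LIMIT_C
    sudden_change = abs(curr_temp_c - prev_temp_c) >= MAX_STEP_C
    return too_hot or sudden_change
```

A monitoring loop would call a check like this on each new inlet temperature sample and apply a cap when it returns true.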
In Japan, power capping has taken on a new urgency due to power restrictions imposed following the Fukushima nuclear power plant disaster.
The SGI Management Suite allows control of power resources through Altair PBS Professional’s power awareness feature. This capability was the result of a collaboration between Altair and SGI to develop the ability to assign power resources per job and account for the consumed energy after the job completes.
The Altair feature allows the user to define power envelopes that best match the application’s runtime performance and supports the balancing of power resources between jobs. Idle nodes can have their power capped and more power can be provided to active nodes.
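To make the balancing idea concrete, here is a minimal sketch that caps idle nodes at a low fixed value and splits the rest of a power budget evenly across active nodes. It does not use the PBS Professional API. The `rebalance` function, the even-split policy, and all figures are assumptions for illustration only.

```python
def rebalance(budget_w, nodes, idle_cap_w):
    """Assign a power cap in watts to each node.

    nodes maps a node name to True if the node is active, False if idle.
    Idle nodes get idle_cap_w; active nodes share what is left equally.
    """
    idle = [name for name, active in nodes.items() if not active]
    active = [name for name, active in nodes.items() if active]
    remaining = budget_w - idle_cap_w * len(idle)
    if remaining < 0:
        raise ValueError("budget too small to cover the idle caps")
    caps = {name: idle_cap_w for name in idle}
    if active:
        share = remaining / len(active)
        caps.update({name: share for name in active})
    return caps

caps = rebalance(1000.0, {"a": True, "b": True, "c": False, "d": False}, 100.0)
```

In this example the two idle nodes are held at 100 W each, which leaves 400 W apiece for the two active nodes.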
Power measurement and management are fully supported on SGI ICE and Rackable HPC systems; SGI UV servers support power measurement only.
Below are a few use cases that illustrate how SGI Management Suite’s power management is being used under a variety of circumstances:
• Europe – At this overseas facility, the power cost has been included in the systems management budget, limiting the amount available for hardware. The solution was to more accurately forecast the power cost for the datacenter’s SGI systems. This allowed the systems managers to procure more servers using the power resource savings.
• US Research Center – This datacenter has a 4 megawatt limit. Power management monitors power usage and automatically caps the power before the limit is exceeded.
• US Research Center – At this datacenter, the chillers failed. Power was capped on specific nodes to prevent the datacenter from overheating.
• Japan – Summer power demands exceeded available power resources, leading to brownouts at this datacenter. Power was capped for its SGI systems based on available power resources.
• Resource Management – This datacenter wanted to implement per-job power management. The SGI solution enables workload management tools to add power as another configurable resource.
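Per-job power management also implies accounting for the energy a job consumed once it completes. A minimal sketch of that calculation, assuming timestamped power samples are available for the job, integrates power over time with the trapezoidal rule. The `energy_kwh` function and the sample data are hypothetical and do not reflect how PBS Professional or SGI Management Suite compute this figure.

```python
def energy_kwh(samples):
    """Estimate energy in kWh from (time_s, watts) samples sorted by time.

    Uses trapezoidal integration between consecutive samples.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3.6e6  # 1 kWh = 3.6 million joules

# Hypothetical job: a steady 1000 W draw sampled over one hour.
job_energy = energy_kwh([(0, 1000.0), (1800, 1000.0), (3600, 1000.0)])
```

For the steady 1000 W, one-hour job above, the estimate works out to 1 kWh.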
In the next article we’ll review the SGI MEMlog Memory Error Manager and Logger software tool. If you prefer, you can download the complete insideHPC Guide to Production Supercomputing and Systems Management, courtesy of SGI and Intel.