PNNL researchers are using supercomputers to take on two of the main challenges of exascale: energy efficiency and resiliency. Their simulations show that dynamic voltage scaling, also known as undervolting, can reduce power consumption when combined at scale with existing mainstream resilience techniques that cope with system failures.
“Power sources are not unlimited nor are they free, and in a computing system, primary reasons for failures include radiation from the cosmic rays, packaging materials, and temperature fluctuation. These are crucial challenges affecting the push to extreme-scale systems,” said Shuaiwen Leon Song, a research scientist with PNNL’s High Performance Computing group and a co-author of the paper describing the research. “Our undervolting method does not modify existing hardware or require pre-production machines and has shown positive results toward achieving a cost-efficient energy-savings implementation for the HPC field.”
The undervolting research is detailed in a new paper, “Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing,” which will be presented this week at the 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) in Hyderabad, India. In addition to Song, authors include: Li Tan (University of California, Riverside, USA); Panruo Wu (University of California, Riverside, USA); Zizhong Chen (University of California, Riverside, USA); Rong Ge (Marquette University, USA); and Darren J. Kerbyson (PNNL).
For their work, the researchers had to determine whether the trade-off between power savings through undervolting and the performance overhead of coping with higher failure rates could reduce overall energy consumption. In the process, they would clarify whether future exascale systems should trend toward using low-voltage embedded architectures for energy efficiency or rely primarily on advanced software-level techniques to achieve high system resiliency and efficiency. Unfortunately, undervolting on its own often results in more hard errors (e.g., system abort from power outage) and soft errors (e.g., memory bit flips), diminishing its viability as a power-saving technique for HPC systems. So, targeting general faults on common HPC production machines at scale, they examined the performance and energy efficiency of several HPC runs with undervolting and different mainstream resilience techniques on power-aware clusters.
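The trade-off described above can be illustrated with a back-of-the-envelope calculation. The sketch below is not the researchers' model: it only assumes that energy equals average power multiplied by run time, that undervolting cuts average power by some fraction, and that resilience techniques stretch run time by some fraction. The function names and all input values are hypothetical and chosen purely for illustration.

```python
def relative_energy(power_saving: float, time_overhead: float) -> float:
    """Energy of an undervolted run relative to the baseline run.

    Assumes energy = average power * run time (a simplification).
    power_saving:  fraction of average power saved by undervolting (0.20 = 20%).
    time_overhead: fraction of extra run time spent on resilience,
                   e.g. checkpointing and recovery (0.10 = 10%).
    A result below 1.0 means the undervolted run uses less energy overall.
    """
    return (1.0 - power_saving) * (1.0 + time_overhead)


def break_even_overhead(power_saving: float) -> float:
    """Run-time overhead at which the energy saving is fully cancelled out."""
    return power_saving / (1.0 - power_saving)


if __name__ == "__main__":
    # Hypothetical inputs, not figures from the paper.
    for saving, overhead in [(0.20, 0.10), (0.20, 0.30)]:
        ratio = relative_energy(saving, overhead)
        verdict = "saves energy" if ratio < 1.0 else "costs energy"
        print(f"power saving {saving:.0%}, overhead {overhead:.0%}: "
              f"relative energy {ratio:.2f} ({verdict})")
```

Under these assumptions, undervolting pays off only while the resilience overhead stays below the break-even point, which is why the choice of resilience technique matters as much as the voltage setting.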
The undervolting model demonstrated up to 12 percent energy savings over baseline runs (with eight HPC benchmarks) and up to nine percent savings against state-of-the-art dynamic voltage and frequency scaling (DVFS) solutions, which are currently used to lower the operating frequency of hardware, with the supply voltage changing according to the frequency. Notably, these results are a conservative estimate of the total energy savings because the model applies the peak range of the failure rates.