fault tolerance Archives - High-Performance Computing News Analysis

Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems

August 5, 2018 by Doug Black

Christian Engelmann from ORNL gave this talk at PASC18. “Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned.”



