resilience Archives - High-Performance Computing News Analysis

Call for Papers: Workshop on Resiliency in High Performance Computing

March 24, 2020 by staff

The 13th Workshop on Resiliency in High Performance Computing has issued its Call for Papers. The event takes place August 24–28 in Warsaw, Poland, in conjunction with Euro-Par 2020. “Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (such as due to shrinking process technology and near-threshold voltage (NTV) operation), hardware complexity increases (such as due to heterogeneous computing) and software complexity increases (such as due to complex data- and workflows, real-time requirements and integration of artificial intelligence (AI) technologies with traditional applications).”

Filed Under: Compute, Events, HPC Hardware, HPC Software, Industry Segments, Research / Education, Resources Tagged With: AI, Euro-Par 2020, resilience, Workshop on Resiliency in High Performance Computing

Video: Recent Results and Open Problems for Resilience at Scale

August 12, 2018 by Doug Black

In this video from PASC18, Yves Robert from École normale supérieure de Lyon in France presents: Recent Results and Open Problems for Resilience at Scale. “The talk will address the following three questions: (i) fail-stop errors: checkpointing or replication or both? (ii) silent errors: application-specific detectors or plain old trustworthy replication? (iii) workflows: how to avoid checkpointing every task?”

Filed Under: Compute, Events, HPC Hardware, HPC Software, Industry Segments, Main Feature, News, Research / Education, Resources Tagged With: PASC18, resilience

Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems

August 5, 2018 by Doug Black

Christian Engelmann from ORNL gave this talk at PASC18. “Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned.”

Filed Under: Compute, Events, Government, HPC Hardware, HPC Software, Industry Segments, Main Feature, News, Research / Education, Resources, Videos Tagged With: fault tolerance, ORNL, PASC18, resilience, Weekly Newsletter Articles

High Availability HPC: Microservice Architectures for Supercomputing

February 27, 2018 by Doug Black

Ryan Quick from Providentia Worldwide gave this talk at the Stanford HPC Conference. “Microservices power cloud-native applications to scale thousands of times larger than single deployments. We introduce the notion of microservices for traditional HPC workloads. We will describe microservices generally, highlighting some of the more popular and large-scale applications. Then we examine similarities between large-scale cloud configurations and HPC environments. Finally we propose a microservice application for solving a traditional HPC problem, illustrating improved time-to-market and workload resiliency.”

Filed Under: Compute, Datacenter, Editor's Choice, Events, Exascale, Government, HPC Hardware, HPC Software, Industry Segments, Main Feature, News, Resources, Videos Tagged With: checkpoint restart, HPC AI Advisory Council, microservices, Providentia Worldwide, resilience, Stanford HPC Conference, Weekly Newsletter Articles
