Call for Papers: Workshop on Resiliency in High Performance Computing

The 13th Workshop on Resiliency in High Performance Computing has issued its Call for Papers. The event takes place August 24–28, 2020, in Warsaw, Poland, in conjunction with Euro-Par 2020.

Resilience is a critical challenge for high performance computing (HPC) systems as component counts continue to increase, individual component reliability decreases (for example, due to shrinking process technology and near-threshold voltage (NTV) operation), hardware complexity grows (for example, due to heterogeneous computing), and software complexity grows (for example, due to complex data- and workflows, real-time requirements, and the integration of artificial intelligence (AI) technologies with traditional applications). Correctness and execution efficiency, in spite of faults, errors, and failures, are essential to the success of HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services. The impact of faults, errors, and failures in such systems ranges from financial losses due to system downtime (sometimes tens of thousands of dollars per lost system-hour), to financial losses due to unnecessary overprovisioning (acquisition and operating costs), to financial losses and legal liabilities due to erroneous or delayed output.

The emergence of AI technology opens up new possibilities, but it also creates new problems. Using AI technology for operational intelligence that enables resilience in HPC systems and centers is a complex control problem, while designing resilient AI technology for HPC applications is a difficult algorithmic problem. Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, error/failure and anomaly detection, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient algorithms.

This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.

Topics of interest include, but are not limited to:

Theoretical foundations for resilience:

  • Metrics and measurement (illustrated after this list)
  • Statistics and optimization
  • Simulation and emulation
  • Formal methods
  • Efficiency modeling and uncertainty quantification
  • Experience reports
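
As a concrete illustration of the first item above (an example added for clarity, not part of the official call), a standard resilience metric is steady-state availability, defined from the mean time between failures (MTBF) and the mean time to repair (MTTR):

```latex
% Steady-state availability: the long-run fraction of time a system is up.
\[
  A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
\]
```

For example, a system with an MTBF of 24 hours and an MTTR of 1 hour achieves A = 24/25 = 0.96, i.e., 96% availability.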

Error/failure/anomaly detection and reliability/dependability modeling:

  • Statistical analyses
  • Machine learning and artificial intelligence
  • Digital twins
  • Data collection and aggregation
  • Information visualization

Monitoring and control for resilience:

  • Center, system and application monitoring and control
  • Reliability, availability, serviceability and performability
  • Tunable fidelity and quality of service
  • Automated response and recovery
  • Operational intelligence to enable resilience

End-to-end integrity:

  • Fault tolerant design of centers, systems and applications
  • Forward migration and verification
  • Degraded operation
  • Error propagation, failure cascades, and error/failure containment
  • Testing and evaluation, including fault/error/failure injection (see the sketch after this list)
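
To make the last item above concrete, here is a minimal sketch of one common software fault-injection technique (an illustration only, not part of the official call): flipping a single bit in a floating-point buffer to emulate a silent data corruption. The NumPy setting and the helper name `inject_bitflip` are assumptions made for this example.

```python
# Minimal software fault injector: flip one random bit in a float64
# buffer to emulate a silent data corruption (SDC). The NumPy setting
# and the helper name are assumptions made for this illustration.
import numpy as np

def inject_bitflip(arr, rng):
    """Flip a single random bit of a contiguous float64 array, in place."""
    bits = arr.view(np.uint64).reshape(-1)   # reinterpret the raw bits
    idx = rng.integers(bits.size)            # pick a victim element
    bit = np.uint64(rng.integers(64))        # pick a victim bit position
    bits[idx] ^= np.uint64(1) << bit         # flip it (view shares memory)
    return idx, int(bit)

# Usage: corrupt one value and observe the (possibly large) deviation.
rng = np.random.default_rng(seed=1)
x = np.ones(8)
idx, bit = inject_bitflip(x, rng)
print(f"flipped bit {bit} of x[{idx}]: now {x[idx]!r}")
```

Injection campaigns built around a tool like this are typically scripted over an application kernel to measure detection coverage and rates of silent data corruption.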

Enabling infrastructure for resilience:

  • Reliability, availability, serviceability systems
  • System software and middleware
  • Resilience extensions for programming models
  • Tools and frameworks
  • Support for resilience in heterogeneous architectures

Resilient algorithms:

  • Algorithmic detection and correction
  • Resilient solvers and algorithm-based fault tolerance (see the sketch after this list)
  • Fault tolerant numerical methods
  • Robust iterative algorithms
  • Resilient artificial intelligence
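
As an illustration of algorithm-based fault tolerance (again an example for clarity, not part of the official call), the classic Huang-Abraham scheme carries checksums through a matrix product so that a corrupted result entry can be detected arithmetically. The function name and tolerance below are assumptions for this sketch.

```python
# Minimal algorithm-based fault tolerance (ABFT) sketch in the style of
# Huang and Abraham: checksum rows/columns are carried through a matrix
# product and verified afterwards. Names and tolerance are illustrative.
import numpy as np

def abft_matmul(A, B, tol=1e-8):
    """Compute A @ B with checksum encoding to detect a corrupted entry."""
    Ac = np.vstack([A, A.sum(axis=0)])                 # add column-checksum row
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])  # add row-checksum column
    C = Ac @ Br                                        # checksummed product

    data = C[:-1, :-1]                                 # the actual result
    col_ok = np.allclose(C[-1, :-1], data.sum(axis=0), atol=tol)
    row_ok = np.allclose(C[:-1, -1], data.sum(axis=1), atol=tol)
    if not (col_ok and row_ok):
        raise RuntimeError("ABFT checksum mismatch: error detected")
    return data

# Usage: an undisturbed product passes; a corrupted entry would not.
A, B = np.random.rand(4, 3), np.random.rand(3, 5)
assert np.allclose(abft_matmul(A, B), A @ B)
```

In full ABFT schemes the row and column checksums also locate, and can correct, a single faulty entry, which is what makes the approach attractive for dense linear algebra at scale.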

Submissions are due May 8, 2020.
