Video: Debugging HPC Applications at Massive Scales

In this video, LLNL scientists discuss the challenges of debugging programs at scale on the Sequoia supercomputer, which has about 1.6 million processor cores.

Key insights:

  • Bugs in parallel HPC applications are difficult to debug because errors propagate among compute nodes, programmers must debug thousands of nodes or more, and bugs might manifest only at large scale.
  • Although conventional approaches like testing, verification, and formal analysis can detect a variety of bugs, they struggle at massive scales and do not always account for important dynamic properties of program execution.
  • Dynamic analysis tools and algorithms that scale with an application’s input and number of compute nodes can help programmers pinpoint the root cause of bugs.
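
To make the last point concrete, here is a minimal sketch of one idea such tools can build on: rather than inspecting every process individually, group processes that share the same call stack and look at the smallest groups first. This is an illustration only, not the method presented in the video. The function names, the stack contents, and the sample data (1,024 ranks with one rank stuck in a receive) are all hypothetical.

```python
from collections import defaultdict


def group_ranks_by_trace(traces):
    """Map each distinct call stack to the sorted list of ranks that share it."""
    groups = defaultdict(list)
    for rank, stack in traces.items():
        groups[tuple(stack)].append(rank)
    return {stack: sorted(ranks) for stack, ranks in groups.items()}


def smallest_groups(groups, limit=3):
    """Return up to `limit` (stack, ranks) pairs, smallest groups first."""
    return sorted(groups.items(), key=lambda item: (len(item[1]), item[1]))[:limit]


def sample_traces(num_ranks=1024, stuck_rank=517):
    """Build hypothetical per-rank call stacks; one rank waits in a receive."""
    traces = {}
    for rank in range(num_ranks):
        if rank == stuck_rank:
            traces[rank] = ("main", "exchange_halo", "MPI_Recv")
        else:
            traces[rank] = ("main", "solve", "MPI_Barrier")
    return traces


if __name__ == "__main__":
    groups = group_ranks_by_trace(sample_traces())
    for stack, ranks in smallest_groups(groups):
        path = " > ".join(stack)
        print(f"{len(ranks):5d} rank(s) in {path} (first ranks: {ranks[:5]})")
```

With the sample data, the stuck rank ends up alone in its own group, so a programmer reviews two groups instead of 1,024 separate stack traces. A real tool would also have to collect and merge the traces without funneling everything through a single node, which this sketch does not attempt.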

Read more at ACM Web

Comments

  1. Gentlemen,

    Consider trying out the Automatski AutoSIM IoT Simulator for debugging massive-scale algorithms. It's an IoT simulator, but it can easily be repurposed for HPC debugging at the scale of billions of “REPEATABLE” & “DEBUGGABLE” events/second.