Improving Supercomputer Resiliency with Containment Domains


As we approach exascale levels of computing over the next decade, the ever-increasing number of components in supercomputers means that something, somewhere, is going to break at any given time. To tackle this resiliency problem, Mattan Erez from the University of Texas at Austin and his colleagues are collaborating with Cray to develop a new approach called containment domains.

The organization of hierarchical containment domains.

“We are working to bring resilience to a footing similar to more traditional programmer concerns,” says Mattan Erez, UT associate professor of electrical and computer engineering. “Containment domains are the abstraction we came up with that satisfies these goals and can be used consistently across system and programming layers.”

A containment domain is a programming device that isolates an algorithm until all its components and iterations have been completed, checked for accuracy, and corrected, if necessary. Only after the resulting data pass these tests are they allowed to serve as inputs in subsequent algorithms.
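
To make the idea concrete, here is a minimal Python sketch of the pattern described above: preserve the inputs, run the computation in isolation, check the result, re-execute if the check fails, and release the data only once it passes. This is an illustration under stated assumptions, not the researchers' actual interface; the names `run_contained`, `ContainmentError`, `body`, `check`, and `max_attempts` are all hypothetical.

```python
import copy


class ContainmentError(Exception):
    """Hypothetical signal that a domain could not produce a verified result."""


def run_contained(body, check, inputs, max_attempts=3):
    """Run `body` on `inputs` and return its result only if `check` accepts it.

    Illustrative sketch only: the inputs are preserved so each retry starts
    from the same state, and failures are retried locally before being
    escalated to an enclosing (parent) domain.
    """
    preserved = copy.deepcopy(inputs)  # preserve state needed for re-execution
    for _ in range(max_attempts):
        try:
            # Work on a private copy so a faulty run cannot corrupt the inputs.
            result = body(copy.deepcopy(preserved))
        except ContainmentError:
            # A nested domain gave up; recover by re-executing this domain.
            continue
        if check(preserved, result):
            return result  # verified data may now feed later computation
    # Local recovery failed: escalate to whichever domain encloses this one.
    raise ContainmentError(f"no verified result after {max_attempts} attempts")


def make_flaky_sum(bad_runs):
    """Build a summation body whose first `bad_runs` calls return a wrong value."""
    state = {"calls": 0}

    def body(values):
        state["calls"] += 1
        total = sum(values)
        if state["calls"] <= bad_runs:
            return total + 1  # simulated transient fault
        return total

    return body, state
```

Because `run_contained` raises `ContainmentError` when it runs out of attempts, and also catches that same exception from its `body`, calls can be nested: an inner domain that cannot recover hands the problem to the domain that encloses it, which mirrors the hierarchical organization shown in the figure. The check used in the tests here simply recomputes the sum; a real application would presumably use a cheaper, algorithm-specific test.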

Read the Full Story.