Resilience: Another Big Obstacle to Exascale Computing

My Exascale panel discussion at the Structure:Data conference has already hit the news. Matthew Ingram from GigaOM has written up a summary of one of the key topics of the discussion: resiliency in a system with a million nodes, hundreds of millions of cores, and billions of threads.

Speaking at GigaOM’s Structure:Data conference, Los Alamos HPC deputy division leader Gary Grider said that an exascale computer will have so many parts that some element will constantly be failing. “It wouldn’t be worth building if it didn’t stay working for more than a minute,” Grider said. “Resilience is absolutely a must. The way you get answers to science is you run problems on these things for six months or more. If the machine is going to die every few minutes, that’s going to be tough sledding. We’ve got to figure out how to deal with resilience in a pretty fundamental way between now and then.”
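
To put some rough numbers behind that (my own illustrative figures, not Grider’s): if node failures are independent and each node has a mean time between failures of about five years, a million-node machine fails, on average, every couple of minutes. A minimal sketch of the arithmetic:

    # Back-of-the-envelope system MTBF, assuming independent node failures
    # with exponentially distributed lifetimes (illustrative numbers only).
    node_mtbf_hours = 5 * 8760       # assume each node fails about once every 5 years
    node_count = 1_000_000           # "a million nodes"

    # With independent exponential failures, the system failure rate is the
    # sum of the per-node rates, so the system MTBF is node MTBF / node count.
    system_mtbf_minutes = node_mtbf_hours / node_count * 60
    print(f"Expected time between failures: {system_mtbf_minutes:.1f} minutes")
    # ~2.6 minutes -- roughly the "every few minutes" Grider is worried about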

It was a fun discussion, and I got a lot of good comments from the audience. I’d also like to thank Garth Gibson from Panasas, whose insightful comments during the talk helped give the exascale discussion a rare I/O perspective.


Read the Full Story.