Resilience: Another Big Obstacle to Exascale Computing

March 23, 2012 by Doug Black

My Exascale panel discussion at the Structure: Big Data conference has already hit the news. Mathew Ingram from GigaOM has written up a summary of one of the key topics of the discussion: resilience in a system with a million nodes, hundreds of millions of cores, and billions of threads.

Speaking at GigaOM’s Structure:Data conference, Los Alamos HPC deputy division leader Gary Grider said that an exascale computer has so many parts that some element will constantly be failing. “It wouldn’t be worth building if it didn’t stay working for more than a minute,” Grider said. “Resilience is absolutely a must. The way you get answers to science is you run problems on these things for six months or more. If the machine is going to die every few minutes, that’s going to be tough sledding. We’ve got to figure out how to deal with resilience in a pretty fundamental way between now and then.”

It was a fun discussion, and I got a lot of good comments from the audience. I’d also like to thank Garth Gibson from Panasas, whose insightful comments during the talk helped to give Exascale a rare I/O perspective.


Read the Full Story.
