Highly Reliable Linux HPC Clusters: Self-Awareness Approach

Found at Intel’s SW blog

Current fault-tolerance solutions for HPC systems focus on dealing with the result of a failure. However, most cannot handle runtime system configuration changes caused by transient failures, and they require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one effort making inroads here. This paper discusses in detail how HA-OSCAR improves the high availability and serviceability of clusters through multi-head-node failover and a service-level fault tolerance mechanism.

Our solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. An analysis of HA-OSCAR's availability improvement was also conducted across various sensitivity factors. Finally, the paper details the system layering strategy, dependability modeling, and analysis of an actual experimental system using a Petri net-based model, the Stochastic Reward Net (SRN).
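
As a rough illustration of why multi-head-node failover matters, the sketch below computes steady-state availability for a single head node and for a redundant pair. It is not the paper's SRN model: it assumes independent failures and instantaneous, perfect failover, and the MTTF and MTTR figures are hypothetical placeholders chosen only for illustration.

```python
HOURS_PER_YEAR = 8760.0


def availability(mttf_hours, mttr_hours):
    """Steady-state availability of one repairable component."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(single, n_heads):
    """Availability when any one of n independent head nodes suffices.

    Assumes independent failures and perfect, instantaneous failover,
    which is optimistic compared with a real cluster.
    """
    return 1.0 - (1.0 - single) ** n_heads


def downtime_hours_per_year(avail):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures, not taken from the paper.
    single = availability(mttf_hours=5000.0, mttr_hours=4.0)
    dual = redundant_availability(single, n_heads=2)
    for label, a in (("single head node", single), ("dual head nodes", dual)):
        print(f"{label}: availability={a:.6f}, "
              f"downtime={downtime_hours_per_year(a):.3f} h/year")
```

Under these assumptions the second head node cuts expected downtime by several orders of magnitude; imperfect failure detection and failover delay, which the paper's model is meant to capture, would narrow that gap.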

Entire paper here as PDF.


  1. I’m glad to see ASH becoming more common. It really seems a no-brainer for someone to embrace it and emerge as “THE” HPC provider for time-critical projects.

  2. Not really a novel idea; we’ve had redundant head nodes and adaptive healing for a long time. Our software often detects failures before they happen. Obvious stuff.

  3. Aaron – Can you share more? What software, what applications, what size cluster, etc.?

  4. I’d be interested in hearing more, too. Isn’t this a case where, sure, reliability is easy to come by via redundancy, but that redundancy costs money and that money is often (not always, but often!) spent on faster, bigger systems to begin with? Sure, you may lose a job or two every now and then on a system without redundancy, but your throughput is higher due to the extra resources.

    Mission-critical stuff, sure, make it reliable. The other 98% of stuff? With failure rates being relatively low and the amount of stuff to get done ever-increasing, I’d rather take a bigger, faster system and just re-run things if needed. What am I missing?

  5. Brian – by the way, thanks for all the comment activity lately. Keep it coming.

    Regarding reliability, my personal interest in it is for very large scale systems. Machines beyond 100,000 cores. The issue I see is that while failure rates are relatively low for any individual component, one usually (but not always) buys large machines to run larger jobs. Using thousands of cores on individual jobs means that the odds of a hardware failure on a socket or core (or network, hard drive, RAM stick, etc.) that will impact that particular job trend toward a certainty as runtime increases. Most of the computational infrastructure (MPI, OS, etc.) is too brittle to preserve a job in the face of, say, a processor dropout, and the job therefore doesn’t complete. Sufficiently large or long jobs may never complete.

    That sort of thing. At the lower end I think it only makes sense in certain sectors.

  6. Hi John,

    I see.. that makes a bit more sense. I still think, though, that most people will follow the path of least resistance in cases like this – that is, checkpoint often, and on failure, resume from the latest. If we’re running on 100K cores and one fails every -hour- or so, then checkpoints obviously must be very frequent indeed, and the usefulness of this should rightly be called into question. But full redundancy (along the lines of RAID ‘mirror’, albeit for all components) means that you -could- just be ignoring that and running twice as fast. So in the end, the equation will be whether your job finishes faster with all this checkpointing and restarting to handle failures, or through not having to do that but running at ‘half’ speed.

    Being (as my earlier rants are evidence of) more of a ‘smart’ algorithm proponent, I’d love to see pieces of software that say, “Oh, I lost that processor? No problem.. interpolate an approximation to the values we would’ve had there, and for the next time-step redistribute our mesh over the neighboring processors.” and just carry on. Eventually, sure, enough failures will become catastrophic, but if we’re running on 100K cores, I’d like to think we can write software which is capable of proceeding when losing 1/100,000th of its data. This is the 21st century, after all. 😉

    To continue the ‘RAID’ theme, I can see a solution which uses a ‘RAID5’ type of methodology in sending parity information to nearby processes that would allow for the reconstruction of a failed node/core/link, whatever. The problem though is that (unlike full redundancy) this would still require the re-computing of whatever it was the failed core was supposed to be doing, and on 100K cores, that means you’ve got 99,998 idle while 1 recomputes this stuff, and then moves on. Inefficiencies are the bane of the HPC world, so I can’t see that being terribly viable either.

    – Brian

  7. I think your smart algorithm approach is right, and that’s the sort of thing I’m getting at when I talk generically about new algorithms and new computation support infrastructure being needed to move forward in HPC.

    Regarding the RAID idea, I think it could get interesting in certain situations. I think checkpointing is totally out for large jobs — you just can’t get enough bandwidth. Or if you can get enough bandwidth to disk this year, you can’t next year when you double your cores. And then the question becomes not one of “wasting” cores, but one of getting a big job done at all. Actually parity cores are already being used in some of the larger purpose-specific chips (I’m told that Cisco’s 188-core Metro network processor does this). So the idea is that core mirroring (for example) is managed at the socket level by the bottom of the OS or in the firmware itself and never really available for direct access. This doesn’t make sense with 2 cores, but with 200, or 2000, the economics could be correct in cases where being able to finish large scale jobs was important enough.

    I think the big conceptual point is that we have so many FLOPS, and we are usually only getting at 1% of the peak FLOPS on the commodity chips, that we should stop counting them as valuable and focus more on the work and the productivity/creativity of the humans that use the machines. I would much rather be able to use 100% of 1,000 processors than the 1% of 100,000 processors I can get at today.
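
The trade-off argued over in the comments above can be put into rough numbers. The sketch below is only a back-of-the-envelope model, not anything from the paper or the commenters: it assumes independent, exponentially distributed failures, uses Young's first-order approximation for the checkpoint interval (which is only reasonable when the checkpoint time is much smaller than the system MTBF), ignores restart time, and treats full mirroring as a flat 50% efficiency. The per-core MTBF and checkpoint times are hypothetical, picked so that a 100,000-core machine loses a core about once an hour, as in the thread.

```python
import math


def system_mtbf(component_mtbf_hours, n_components):
    """MTBF of the whole machine if any single component failure counts."""
    return component_mtbf_hours / n_components


def p_job_interrupted(component_mtbf_hours, n_components, runtime_hours):
    """Probability that at least one component fails during the job."""
    mtbf = system_mtbf(component_mtbf_hours, n_components)
    return 1.0 - math.exp(-runtime_hours / mtbf)


def young_interval(checkpoint_hours, mtbf_hours):
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def checkpoint_efficiency(checkpoint_hours, mtbf_hours):
    """Approximate fraction of machine time spent on useful work.

    Waste = time writing checkpoints + expected rework after a failure.
    Restart time is ignored, so this is an optimistic estimate.
    """
    tau = young_interval(checkpoint_hours, mtbf_hours)
    waste = checkpoint_hours / tau + tau / (2.0 * mtbf_hours)
    return max(0.0, 1.0 - waste)


MIRRORING_EFFICIENCY = 0.5  # assumption: half the cores shadow the other half

if __name__ == "__main__":
    cores = 100_000
    core_mtbf = 100_000.0  # hours per core, hypothetical
    mtbf = system_mtbf(core_mtbf, cores)
    print(f"system MTBF: {mtbf:.2f} h")
    print(f"P(24 h job interrupted): {p_job_interrupted(core_mtbf, cores, 24.0):.6f}")
    for minutes in (1.0, 10.0):
        eff = checkpoint_efficiency(minutes / 60.0, mtbf)
        better = "checkpointing" if eff > MIRRORING_EFFICIENCY else "mirroring"
        print(f"checkpoint takes {minutes:4.1f} min -> efficiency {eff:.2f}, "
              f"favors {better}")
```

In this model the answer hinges on the ratio of checkpoint time to system MTBF: once writing a checkpoint costs more than roughly one eighth of the MTBF, the estimated efficiency drops below the 50% that mirroring would give, which is the bandwidth concern raised in comment 7.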