Fault tolerance in Win HPC installations

Timothy Prickett Morgan at The Reg reports on a new move for an old server company, Stratus Technologies. He starts off by nailing HPC’ers:

Supercomputer customers are known for spending big bucks on exotic technology, but they’re also notorious cheapskates.

Ah, me. Sadly, he’s right. Anyway, here’s the deal:

…Stratus Technologies – one of the venerable vendors of fault tolerant servers for commercial applications – is now trying to get its x64-based ftServer machinery into supercomputer sites, thanks to the charge of Microsoft into the high performance computing arena with its Windows HPC Server 2008 edition.

This is fault tolerance of the kind that banks and such use. How would it work?

In terms of Windows-based clusters, Microsoft and Stratus are suggesting that ftServers should be used in what is called the head node, as well as in the broker nodes that run the Windows Communication Foundation (WCF) stack. And perhaps file systems in baby Windows clusters could also use the ftServers too, Stratus believes, now that it is coming around to Microsoft’s thinking.

…So why bother using fault tolerance with Windows HPC? “We do a lot to harden the operating system,” explains Lane. “We do a lot of work with the I/O vendors to allow Windows and Linux to ride out transient errors.” The kind of thing you don’t want to have happen to the head and broker nodes in a supercomputer cluster. No one wants to restart a job that takes days, weeks, or months to finish.

ftServer does work on Linux, but Stratus has not yet announced a plan to bring this approach to Linux clusters. So what’s all this availability going to cost you? Stratus is aiming at 50-node configurations right now, which it reckons would include one head node and three broker nodes, each of which would need to run on an ftServer node.

A two-socket ftServer running Windows costs somewhere between $20,000 and $25,000 in a reasonable configuration. This is not cheap, of course, but neither is HA clustering and neither is losing work.
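
For a rough sense of what that adds up to, here is a back-of-the-envelope sketch in Python. It assumes all four management nodes in the 50-node configuration (one head node, three broker nodes) are two-socket ftServers at the quoted price, and it ignores compute nodes, software licenses, and support. The function name is made up for illustration.

```python
# Hypothetical helper for illustration only; not a Stratus or Microsoft tool.
def ft_node_cost_range(head_nodes, broker_nodes, unit_low, unit_high):
    """Return (low, high) total cost in dollars for the fault-tolerant nodes."""
    ft_nodes = head_nodes + broker_nodes  # every head and broker node is an ftServer
    return ft_nodes * unit_low, ft_nodes * unit_high

# Figures quoted above: 1 head node, 3 broker nodes, $20,000 to $25,000 each.
low, high = ft_node_cost_range(1, 3, 20_000, 25_000)
print(f"${low:,} to ${high:,}")
```

Under those assumptions, the fault-tolerant part of the cluster comes to roughly $80,000 to $100,000.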
