On the other hand, Dean seemingly thinks clusters of 1,800 servers are pretty routine, if not exactly ho-hum. And the software the company runs on top of that hardware, enabling a sub-half-second response to an ordinary Google search query that involves 700 to 1,000 servers, is another matter altogether.
Google doesn’t reveal exactly how many servers it has, but I’d estimate it’s easily in the hundreds of thousands. It puts 40 servers in each rack, Dean said, and by one reckoning, Google has 36 data centers across the globe. With 150 racks per data center, that would mean Google has more than 200,000 servers, and I’d guess it’s far beyond that and growing every day.
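For what it’s worth, the arithmetic behind that estimate is simple to check; here’s the back-of-envelope calculation in Python, using the figures above (all of them rough outside estimates, not Google-confirmed numbers):

    # Back-of-envelope server count from the figures quoted above.
    # All three inputs are rough outside estimates, not Google-confirmed.
    servers_per_rack = 40
    racks_per_data_center = 150
    data_centers = 36

    total = servers_per_rack * racks_per_data_center * data_centers
    print(total)  # 216000 -- hence "more than 200,000 servers"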
I talked a little about the scope and scale of Google’s infrastructure in a post over at HPC Horizons that has sparked some follow-on conversation (“Should Google be in the Top500?”, here).
Far more interesting for HPC operations, though, is this observation about Google’s attitude toward the hardware it buys:
To operate on Google’s scale requires the company to treat each machine as expendable. Server makers pride themselves on their high-end machines’ ability to withstand failures, but Google prefers to invest its money in fault-tolerant software.
“Our view is it’s better to have twice as much hardware that’s not as reliable than half as much that’s more reliable,” Dean said. “You have to provide reliability on a software level. If you’re running 10,000 machines, something is going to die every day.”
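It’s worth pausing on that last sentence, because the arithmetic is what drives the design argument. A quick sketch, assuming each commodity box fails about once every three years (my assumption for illustration; the article doesn’t give an MTBF):

    # Expected daily failures in a 10,000-machine fleet.
    # The 3-year per-machine MTBF is an assumed figure, for illustration only.
    machines = 10000
    mtbf_days = 3 * 365  # assume one failure per machine every ~3 years

    failures_per_day = machines / mtbf_days
    print(round(failures_per_day, 1))  # ~9.1 -- "something is going to die every day"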
While some companies (EverGrid comes to mind) are working on this, today we still mostly act as though our machines, built of commodity (read “cheap and breakable”) components, were the custom-built ironclad supers of yesterday. If we intend to get any actual work done with apps that run on 100,000 cores, we have to stop pretending that machines don’t go down and start writing robust software that plans for component failure.
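What does “software that plans for component failure” look like in the small? A minimal sketch, using nothing beyond the standard library; the worker pool, the WorkerDied exception, and run_on are all hypothetical stand-ins for illustration, not any real scheduler’s API:

    import random

    WORKERS = ["node%d" % i for i in range(8)]  # hypothetical worker pool

    class WorkerDied(Exception):
        pass

    def run_on(worker, task):
        # Stand-in for real remote execution; fails randomly to mimic
        # the "cheap and breakable" hardware dying underneath us.
        if random.random() < 0.2:
            raise WorkerDied(worker)
        return "%s finished on %s" % (task, worker)

    def run_with_retries(task, max_attempts=5):
        # The point: a dead node costs us one retry, not the whole job.
        for attempt in range(max_attempts):
            worker = random.choice(WORKERS)
            try:
                return run_on(worker, task)
            except WorkerDied as dead:
                print("lost %s running %s; rescheduling" % (dead, task))
        raise RuntimeError("%s failed %d times in a row" % (task, max_attempts))

    print(run_with_retries("task-42"))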
The article goes on to talk more about the specific hardware and software approaches that Google takes to managing application robustness on its enormous, fragile IT infrastructure.
An interesting stat on how reliable you can make things with some focus:
And in a 2004 presentation, Dean said, one system withstood a failure of 1,600 servers in an 1,800-unit cluster.
Granted, this was a MapReduce job, and the typical HPTC workload is tremendously more diverse than Google’s, but still…we should be able to make some progress here. Today we often have to start completely over when we lose just one processor.
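To make the contrast concrete, here’s a toy sketch of the MapReduce-style recovery model: because the tasks are independent and idempotent, the scheduler simply re-runs whatever a dead node was holding rather than restarting the run. This illustrates the general technique only, not Google’s actual implementation, and flaky_square is a made-up stand-in for a task on failing hardware:

    import random

    def flaky_square(x):
        # Stand-in for a map task on a node that dies 30% of the time.
        if random.random() < 0.3:
            raise RuntimeError("worker died")
        return x * x

    def run_job(inputs, map_fn):
        # Tasks are independent and idempotent, so we re-run only the
        # failed ones -- never the whole job.
        results, pending = {}, set(range(len(inputs)))
        while pending:
            for i in list(pending):
                try:
                    results[i] = map_fn(inputs[i])
                    pending.discard(i)
                except RuntimeError:
                    pass  # task i stays pending; retried on the next sweep
        return [results[i] for i in range(len(inputs))]

    print(run_job(list(range(10)), flaky_square))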