What can nature teach us about really big clusters?

Doug Eadline has an interesting article in last week's Linux Magazine about what nature (in particular, really large colonies of organisms) can teach us about managing extremely large computing clusters:

As we approach the really big threshold, we may need to look at how nature has solved the scale problem. The first thing that hits me is redundancy and fault tolerance. Just like the ant colony, a large cluster and the software running on it will need to adapt to constant node and network failure. As clusters get bigger, nodes and networks are going to fail at increasing rates. Programs and networks may need to manage themselves without any reliance on a single point of control or at least minimizing those points as much as possible. That is, programs may need to find resources on their own and not wait for a central authority to allocate them. There is no one telling ant number 183,234 to go pick up that crumb over there. The colony adapts dynamically to its needs.

Regardless of how well we engineer our cluster, there will always be some areas that are more sensitive to failure than others. Ultimately, it comes down to a cost-benefit type of analysis. The ultimate redundant over-kill design is of course a RAID 1 kind of cluster (build two and mirror the entire cluster). And, like clusters, ant colonies are not immune to a central failure. In my experience, they seem to recover quite well from me stomping on them with my shoe, but take out the queen and it is game over. But queens actually create new queens, thus ensuring against total ant annihilation. (Note: No ants were hurt in the production of this document.)

Emphasis mine. The whole article is a recommended read.
