This is one of those stories that is getting so much air time on the interwebs that I feel compelled to share it with you. c|net carried a story last week about the results of a research paper, co-authored by U Toronto and Google researchers, that shows computer memory is less reliable than expected:
“We found the incidence of memory errors and the range of error rates across different DIMMs (dual in-line memory modules) to be much higher than previously reported,” according to the paper jointly written by Bianca Schroeder, a professor at the University of Toronto, and Google’s Eduardo Pinheiro and Wolf-Dietrich Weber. “Memory errors are not rare events.”
…Previous research, such as some data from a 300-computer cluster, showed that memory modules had correctable error rates of 200 to 5,000 failures per billion hours of operation. Google, though, found the rate much higher: 25,000 to 75,000 failures per billion hours.
According to the article, this boils down to one in three Google servers experiencing a correctable memory error each year, with one in one hundred having an uncorrectable error. In a system with 2,500 quad-core sockets in dual-socket nodes (1,250 nodes), that one-in-one-hundred rate works out to a dozen or so node failures per year, clearly an issue today for those of you in the top third of the Top500. As we approach exascale systems with a million or more nodes, the problem will be even worse than previously thought.
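For anyone who wants to redo that back-of-the-envelope estimate for their own machine, here is a minimal sketch in Python. It assumes that Google's per-server rates carry over unchanged to HPC nodes, that the 2,500-socket system uses two sockets per node, and that every uncorrectable error costs you a node; the function and variable names are mine, not anything from the paper or the article.

```python
# Rough estimate of memory-error impact per year, using the per-server
# rates quoted in the article. Assumption: these rates apply to HPC nodes.
CORRECTABLE_RATE = 1 / 3      # fraction of servers with a correctable error per year
UNCORRECTABLE_RATE = 1 / 100  # fraction of servers with an uncorrectable error per year


def expected_affected_nodes(nodes: int, annual_rate: float) -> float:
    """Expected number of nodes hit at least once per year at the given rate."""
    return nodes * annual_rate


def nodes_from_sockets(sockets: int, sockets_per_node: int = 2) -> int:
    """Node count for a system, assuming dual-socket nodes by default."""
    return sockets // sockets_per_node


if __name__ == "__main__":
    systems = {
        "2,500-socket cluster": nodes_from_sockets(2500),
        "million-node exascale system": 1_000_000,
    }
    for name, nodes in systems.items():
        uncorrectable = expected_affected_nodes(nodes, UNCORRECTABLE_RATE)
        correctable = expected_affected_nodes(nodes, CORRECTABLE_RATE)
        print(f"{name}: {nodes} nodes")
        print(f"  nodes with an uncorrectable error per year: {uncorrectable:.1f}")
        print(f"  (roughly {uncorrectable / 365:.2f} per day)")
        print(f"  nodes with a correctable error per year: {correctable:.0f}")
```

Under those assumptions the 1,250-node system lands at about 12.5 node failures a year, while a million-node machine would see on the order of 10,000, or a couple of dozen every day.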
Other interesting findings from the paper:
Temperature only had a “marginal impact” on the rate of errors.
“Hard errors” are more common than “soft errors.”
“We see a surprisingly strong and early effect of age on error rates,” the paper said. “Aging in the form of increased correctable error rates sets in after only 10 to 18 months in the field.”