
A silent SATA drive failure problem?

Here’s something interesting that showed up in my Google nets from Chris Mellor at Blocks and Files, an IT storage blog:

Yesterday RAID Inc. announced it was going to OEM NEC of America’s D-Series drive arrays because the array controller, amongst other things, carried out read integrity validation checks. This was necessary because RAID Inc. customers had reported ‘silent drive failures’ on SATA drives, with not all the data on the drive being accessible by the RAID controller.

And a little more

NEC’s release about the RAID Inc. OEM deal says this about one of the things its D-Series controller carries out: ‘SATA read verification to detect silent read errors that other arrays do not.’

Another simple statement. A ‘silent’ read error, meaning that the controller doesn’t return all the information it was asked to.

I haven’t heard anything about such problems in my immediate HPC community, but would be interested in hearing from you if you have.

Is there a general problem here, or one that is only revealed in HPC configurations with hundreds or thousands of drives and a very low occurrence rate? Certainly there has been no whisper of a similar problem from other SATA drive array suppliers.

Of course I suppose it could also be the case that RAID Inc. is creating a crisis to sell its gear. Seems unlikely that this would be the case, but I’m 40 now and less trusting than I used to be.

Comments

  1. FWIW, I like real measurements. Peter Kelemen of CERN has done this and reported results.

    This is why ZFS is interesting, as it may be able to detect and correct some of these errors. But then again, Google’s replication concepts and other designs, especially the ones that provide fast error detection and correction, may have to be the “way” of the future. That is, once drives get large enough relative to their bit error rates that even small arrays will trigger errors, your only real choice is a system that can tolerate errors. In the limit of large numbers of drives, your time between failures approaches zero. You don’t want to be continuously scanning and rescanning for the errors, as the scanning may do its own bit flipping, and it doesn’t solve the core problem.
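
    The detection mechanism ZFS uses can be sketched in a few lines. This is a toy illustration only, not the real ZFS design (real ZFS keeps Fletcher or SHA-256 checksums in the parent block pointers, not alongside the data): a checksum computed at write time is stored apart from the data and re-verified on every read, so a silent read error is caught instead of being handed back as good data.

    ```python
    import hashlib

    # Toy end-to-end checksumming, greatly simplified from what ZFS does:
    # checksums are stored separately from the blocks they describe, so a
    # block that comes back changed fails verification on read.
    class ChecksummedStore:
        def __init__(self):
            self.blocks = {}     # block id -> raw data (stands in for the disk)
            self.checksums = {}  # block id -> digest recorded at write time

        def write(self, block_id, data: bytes):
            self.blocks[block_id] = data
            self.checksums[block_id] = hashlib.sha256(data).digest()

        def read(self, block_id) -> bytes:
            data = self.blocks[block_id]
            if hashlib.sha256(data).digest() != self.checksums[block_id]:
                raise IOError(f"silent corruption detected in block {block_id}")
            return data

    store = ChecksummedStore()
    store.write(0, b"important data")
    store.blocks[0] = b"important dat\x00"  # simulate a silent bit flip on disk
    try:
        store.read(0)
    except IOError as e:
        print(e)  # the corruption is detected rather than returned silently
    ```

    A plain RAID controller that trusts whatever the drive returns has no equivalent check, which is exactly the gap the “SATA read verification” feature claims to fill.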

  2. Joe, thanks for the link. The post I pointed to also cites the work of a couple others on SATA failures. You can find that info here.

  3. Hi John,

    We have been dealing with this issue at several HPC institutions, and it is not a phantom problem. There are published papers from CERN as well as Lawrence Livermore that further define the problems they face; the intent is to acknowledge that the problem exists. ZFS is a file system that can in fact address this issue on the CPU side; however, fighting corruption in this method obviously carries overhead for the checks. I hope this helps, and if you would like the documents please let me know.

  4. Anon-a-mouse says:

    Looking at the Peter Kelemen presentation (CERN) referenced above, I’d say this is vastly more attributable to the low-end Infortrend, Promise Tech, 3-ware, RAID Inc. (sorry Marc) and other RAID controllers…as well as the real culprit — which is parity based RAID.

    Sounds like a bunch of low-end RAID remarketers scrambling around to turn their product bugs into opportunities for new features and PR.

    FYI, for those who haven’t done the math…parity-based raid (single or double parity, RAID 5, 6, etc. — it matters not) with huge capacity disk drives (sata or otherwise) is like playing Russian Roulette with extra bullets in the cylinder.

    When a drive fails, you multiply your bit-error rate by the number of disks in the parity group, and there is no telling whether the parity was calculated correctly before being written, so take that number and double it again. Low-end RAID controller manufacturers have known this for many years now. The number one “silent read” error happens when parity is incorrectly calculated by the RAID controller on read or write, and you never know until you need to reconstruct data from that parity (when a disk fails or times out).
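
    The multiplication in the comment above is easy to make concrete. Here is a back-of-the-envelope sketch (my own numbers, not from the post): the chance of hitting at least one unrecoverable read error (URE) during a rebuild, assuming independent errors at the drive’s quoted bit error rate and using the numerically stable Poisson approximation 1 − exp(−BER × bits read).

    ```python
    import math

    def rebuild_failure_probability(drive_tb: float, surviving_drives: int,
                                    ber: float) -> float:
        """Chance of >= 1 URE when every surviving drive must be read in full."""
        bits_read = surviving_drives * drive_tb * 1e12 * 8  # total bits to read
        return 1.0 - math.exp(-ber * bits_read)             # Poisson approximation

    # Example: a 7+1 RAID-5 group of 1 TB desktop-class drives (BER ~1e-14)
    # loses one drive; the rebuild must read the 7 survivors end to end.
    p = rebuild_failure_probability(drive_tb=1.0, surviving_drives=7, ber=1e-14)
    print(f"chance of a URE during rebuild: {p:.1%}")  # ~42.9%
    ```

    Even before doubling for miscalculated parity, that is already close to a coin flip, which is the “extra bullets in the cylinder” point.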

    All these RAID manufacturers are touting new “features” to correct their own design failures…and blaming the disk drives in the process. Desktop drives have better BERs and error correction than most other system components…here are the numbers (from Kelemen’s talk):

    ● NIC/link: 10^-10 (1 bit in ~1.1 GiB)
    ● DRAM memory: 10^-12 (1 bit in ~116 GiB)
    ● desktop disk: 10^-14 (1 bit in ~11.3 TiB)
    ● enterprise disk: 10^-15 (1 bit in ~113 TiB)
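
    Those conversions check out; a quick sketch (assuming binary GiB/TiB units, as the list above uses): a BER of one bit in 10^N bits means roughly 10^N / 8 bytes read per bad bit.

    ```python
    GiB = 2 ** 30
    TiB = 2 ** 40

    def bytes_per_error(ber: float) -> float:
        """Expected bytes read between single-bit errors at a given BER."""
        return (1.0 / ber) / 8.0  # bits between errors -> bytes

    print(f"NIC/link (1e-10):        {bytes_per_error(1e-10) / GiB:7.2f} GiB")
    print(f"DRAM (1e-12):            {bytes_per_error(1e-12) / GiB:7.1f} GiB")
    print(f"desktop disk (1e-14):    {bytes_per_error(1e-14) / TiB:7.2f} TiB")
    print(f"enterprise disk (1e-15): {bytes_per_error(1e-15) / TiB:7.1f} TiB")
    ```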

    Bottom line is that as disks get bigger and bigger, parity-based RAID is a bigger and bigger problem — smart storage architects stick with one of several forms of simple mirroring…especially where desktop class disks are involved. Either way I’ll bet that thorough analysis indicates that RAID parity is the culprit and this all has very little if anything to do with desktop-class vs. enterprise class disks.

