Storage pundit Henry Newman writes that running checksums for large data archives is quickly becoming an HPC problem:
Today, many preservation archives are well over 5PB and a few are well over 10PB with expectations that these archives will grow to more than 100PB. With archives this large, the requirements for HPC architectures for checksum validation are not much different than many of the standard HPC simulation problems, such as weather, crash, and other simulations.
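To put that scale in perspective, here is a rough back-of-envelope in Python. The numbers are my own illustration, not Henry's: a 100PB archive and a hypothetical quarterly scrub cycle.

```python
# Back-of-envelope: sustained read bandwidth needed to re-validate an archive
# once per scrub cycle (illustrative numbers, not from the article).
archive_bytes = 100 * 10**15      # assume a 100PB archive
scrub_days = 90                   # assume everything is revalidated quarterly
seconds = scrub_days * 24 * 3600
bandwidth_gbs = archive_bytes / seconds / 10**9
print(f"~{bandwidth_gbs:.1f} GB/s of sustained reads, before any hashing cost")
```

Even with generous assumptions, that works out to double-digit GB/s of sustained reads, which is the kind of aggregate bandwidth you normally associate with simulation workloads.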
I’ve always thought of large-scale archiving as an I/O problem, but I was talking to Henry about this a few weeks ago and he described the monumental problem of validating archive data on a regular basis:
To validate the checksum for a file, the whole file must be read from disk or tape into memory, the checksum algorithm applied to the data read, and the newly calculated checksum compared to the stored checksum, which should itself be checksummed so you are sure you have a valid checksum to compare against the file you read into memory. With large archive systems this is often an ongoing process whether the data resides on disk or tape, but checksum validation is particularly critical for disk-based archives built on consumer-grade storage.
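In code terms, the loop Henry describes looks roughly like the sketch below. This is my own illustration, not his implementation; the choice of SHA-256, the chunked read, and the idea of a checksummed manifest file are assumptions for the example.

```python
import hashlib

def validate_file(path, expected_hex, chunk_size=16 * 1024 * 1024):
    """Re-read a file from disk/tape and compare its checksum to the stored value."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream the file through the hash; archive files are far too large
        # to hold in memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex

# The stored checksums themselves should be protected (e.g. a manifest that is
# itself checksummed or signed) so a corrupt manifest is not mistaken for
# corrupt data. Assumed manifest layout: "<hexdigest>  <path>" per line.
def validate_manifest(manifest_path):
    failures = []
    with open(manifest_path) as m:
        for line in m:
            expected_hex, path = line.strip().split(None, 1)
            if not validate_file(path, expected_hex):
                failures.append(path)
    return failures
```

The point of the sketch is that every validation pass touches every byte in the archive, so the cost scales with total capacity, not with how much data changed.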
We tend to think of HPC devices as general-purpose number crunchers. It could be that the vendor who invents a better mousetrap for checksum validation will be the next company to enjoy the kind of margins the supercomputing industry saw in the ’80s. Full Story
I would never have thought that checksums could become such a big issue.