Getting reliable memory in GPUs


Michael Feldman at HPCwire posted a feature yesterday on improving the memory available for users who are taking advantage of GPUs for computation, not pixel pushing:

GPUs are becoming more like CPUs. But in the critical area of error corrected memory, graphics hardware still lags. The lack of error correction is probably the single biggest factor that makes users of GPUs for high performance computing nervous. Some HPC applications are resistant to the occasional bad data value, but many are not. The good news is that graphics chip vendors are aware of the problem and it appears to be only a matter of time before GPUs get a memory makeover.

Memory in GPUs has been less reliable by design. It’s cheaper to build the cards with memory that lacks ECC, and it doesn’t matter much if the color of one pixel is off by a bit. A small error in one pixel of a given frame is harmless when that frame will be replaced 1/30 of a second later.

The picture changes when you are worried about numerical calculations where accuracy matters and the next timestep depends upon values computed at previous timesteps. Michael talks about possible solutions being explored in the industry now for the different types of errors that can occur. It’s an interesting article.
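To make the contrast concrete, here is a minimal Python sketch of how a single flipped bit carries through a time-stepped calculation. It is an illustration only, not anything taken from Feldman's article: the update rule (multiply by 1.01 each step), the step count, and the names `flip_bit` and `simulate` are all assumptions chosen for the example.

```python
import struct
from typing import Optional


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def simulate(x0: float, steps: int,
             corrupt_at: Optional[int] = None, bit: int = 52) -> float:
    """Iterate the toy update x <- 1.01 * x (a hypothetical stand-in for a
    real solver), optionally flipping one bit of x at a single step."""
    x = x0
    for step in range(steps):
        if step == corrupt_at:
            x = flip_bit(x, bit)  # simulated memory error, no ECC to catch it
        x = 1.01 * x
    return x


clean = simulate(1.0, 100)
corrupted = simulate(1.0, 100, corrupt_at=10)  # bit 52: lowest exponent bit

print("clean result:    ", clean)
print("corrupted result:", corrupted)
print("relative error:  ", abs(corrupted - clean) / clean)
```

In this toy case the flipped bit is the lowest exponent bit, which halves the value at step 10. Because every later step builds on the previous one, the error never goes away and the final answer is off by half. A flip in a low mantissa bit would do far less damage, which is part of why some applications tolerate the occasional bad value and others do not.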