Over at the Nvidia Developer Zone, Mark Harris looks at how to efficiently access device memory, in particular global memory, from within kernels.
Global memory access on the device shares performance characteristics with data access on the host: data locality is very important. On early CUDA hardware, memory access alignment mattered as much as locality across threads, but on recent hardware alignment is not much of a concern. Strided memory access, on the other hand, can hurt performance, which can be alleviated by staging data through on-chip shared memory. In the next post we will explore shared memory in detail, and in the post after that we will show how to use shared memory to avoid strided global memory accesses during a matrix transpose.
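To make the strided-access point concrete, here is a minimal sketch (not taken from the linked post) of two CUDA kernels: one with unit-stride access, where consecutive threads in a warp read consecutive elements and the loads coalesce into a few memory transactions, and one with strided access, where each warp's loads are scattered across many cache lines. The array size and stride value are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Unit-stride copy: thread i touches element i, so a warp's 32 loads
// fall in consecutive addresses and coalesce well.
__global__ void copy_coalesced(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided copy: thread i touches element i * stride, spreading a warp's
// loads across many memory transactions as the stride grows.
__global__ void copy_strided(float *out, const float *in, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 22;               // illustrative array size
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    const int threads = 256;
    copy_coalesced<<<(n + threads - 1) / threads, threads>>>(out, in, n);

    const int stride = 32;               // worst case: one transaction per element
    copy_strided<<<(n / stride + threads - 1) / threads, threads>>>(out, in, n, stride);

    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Timing the two kernels (e.g. with CUDA events or a profiler) shows effective bandwidth dropping as the stride grows, which is the behavior shared memory staging is used to avoid.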
Read the Full Story.