Over at Dr. Dobb's, Rob Farber writes that, when used correctly, atomic operations can help implement a wide range of generic data structures and algorithms in the massively threaded GPU programming environment. There's a price to pay, though: incorrect usage can turn massively parallel GPUs into poorly performing sequential processors.
In the future, it is likely that the need for transparent data movement will almost entirely be removed when NVIDIA enables a cached form of mapped memory. Perhaps some form of the Linux madvise() API will be used. When writing the examples for this article, I observed that mapped memory ran as fast as global memory whenever all the data fit inside a single cache line. This indicates that cached mapped memory has the potential to become the de facto method of sharing memory between the host and device(s).
Read the Full Story.