When I reported on the petaflops super that the Chinese Academy of Sciences has built with NVIDIA GPUs (the new Fermi-enabled Teslas), I mentioned it was using NVIDIA’s GPUDirect technology to improve the performance of GPU-to-GPU transfers. Mellanox is NVIDIA’s lead partner in developing that technology, which, according to what NVIDIA told me in conversations yesterday, involved changes to the Linux kernel (done by NVIDIA) and to NVIDIA’s drivers, as well as to Mellanox’s HCA driver. Nota bene: NVIDIA has since told me they are working on contributing those kernel changes back to the community, but in the meantime there is a patch.
Yesterday Mellanox talked a little more about GPUDirect:
Mellanox…announced the immediate availability of NVIDIA GPUDirect technology with Mellanox ConnectX®-2 40Gb/s InfiniBand adapters that boosts GPU-based cluster efficiency and increases performance by an order of magnitude over today’s fastest high-performance computing clusters.
Today’s architecture requires the CPU to handle memory copies between the GPU and the InfiniBand network. Mellanox was the lead partner in the development of NVIDIA GPUDirect, a technology that reduces the involvement of the CPU, reducing latency for GPU-to-InfiniBand communication by up to 30 percent. This communication time speedup can potentially add up to a gain of over 40 percent in application productivity when a large number of jobs are run on a server cluster. NVIDIA GPUDirect technology with Mellanox scalable HPC solutions is in use today in multiple HPC centers around the world, providing leading engineering and scientific application performance acceleration.
This is way cool, but NVIDIA says it’s only the first step in a line of changes that will bring improved performance to devices that hang off the PCIe bus. For example, an SSD connected via PCIe could send its data directly to the GPU using this technology instead of having to go through the CPU, potentially speeding up data transfers dramatically. This is becoming ever more important as compute nodes continue to get more powerful at a faster rate than the IO channels that feed them data to work on.