We’ve all seen the flood of recent articles following the announcements of new HPC-targeted products in the realm of GPU-accelerated computing. We’ve all probably read the associated “speeds and feeds” numbers attached to said articles. What exactly do these numbers mean and how do they achieve such interesting speedups? insideHPC to the rescue.
Nvidia’s Tesla high performance computing product, the S1070, contains four of its latest pixel-crunching GPUs, the GTX280, which will be the main focus of our discussion. Just for starters, Nvidia increased the number of stream processors [SPs] from 128 on the previous generation of GPUs to 240 on the GTX280. This is not quite a doubling of the SP count, but I’ll take an 87.5% jump any day of the week.
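As a quick back-of-the-envelope check (raw core counts only; clock speeds are not factored in), the increase works out to:

```python
# SP count: previous generation (G80) vs. GTX280 (GT200).
g80_sps = 128
gtx280_sps = 240

ratio = gtx280_sps / g80_sps
print(f"{ratio:.3f}x more SPs")  # 1.875x more SPs
```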
Outside of increasing the SP count, Nvidia also modified the overall organization of the SPs. Eight stream processors are combined to form a “streaming multiprocessor,” or SM. Each SM also contains a dedicated block of on-chip memory so that SPs in the same SM can share data without fetching it from main memory. Three SM units are combined to form a “Thread Processing Cluster,” or TPC, and each GTX280 comprises ten TPC units. Alongside this:
“Special function units (SFUs) in the SMs compute transcendental math, attribute interpolation (interpolating pixel attributes from a primitive’s vertex attributes), and perform floating-point MUL instructions. The individual streaming processing cores of GeForce GTX 200 GPUs can now perform near full-speed dual-issue of multiply-add operations (MADs) and MULs (3 flops/SP) by using the SP’s MAD unit to perform a MUL and ADD per clock, and using the SFU to perform another MUL in the same clock. Optimized and directed tests can measure around 93-94% efficiency.
The entire GeForce GTX 200 GPU SPA delivers nearly one teraflop of peak, single-precision, IEEE 754, floating-point performance.” [courtesy the Nvidia whitepaper]
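The “nearly one teraflop” figure falls straight out of those numbers. Here is a sketch of the arithmetic, assuming the GTX280’s 1296 MHz shader clock (Tesla boards clock somewhat differently, so treat the clock as an assumption):

```python
# Peak single-precision throughput, per the 3 flops/SP/clock figure
# quoted above (MAD = 2 flops on the SP, plus a MUL on the SFU).
SPS = 240
FLOPS_PER_SP_PER_CLOCK = 3
SHADER_CLOCK_GHZ = 1.296  # assumed GTX280 shader clock

peak_gflops = SPS * FLOPS_PER_SP_PER_CLOCK * SHADER_CLOCK_GHZ
print(f"{peak_gflops:.0f} GFLOPS single precision")  # 933 GFLOPS
```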
The entire system is coupled via a shared L2 cache that allows different TPCs to share data. Nvidia has also widened the main memory path from 384 bits to 512 bits. So, SP to SM to TPC. Got it? Right, moving on…
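To keep the hierarchy straight, the per-unit counts multiply out as follows (all figures from the description above):

```python
# GTX280 processing hierarchy: SP -> SM -> TPC -> GPU.
SPS_PER_SM = 8     # stream processors per streaming multiprocessor
SMS_PER_TPC = 3    # SMs per thread processing cluster
TPCS_PER_GPU = 10  # TPCs on one GTX280

total_sms = SMS_PER_TPC * TPCS_PER_GPU   # 30 SMs
total_sps = SPS_PER_SM * total_sms       # 240 SPs
print(total_sms, total_sps)  # 30 240
```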
Nvidia continued to push their engineers by increasing the capability of the on-board thread scheduler. The new scheduler can support nearly 30,000 threads in flight, an increase of almost 2.5x over the previous generation of GPU.
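The “nearly 30,000” works out to 30,720 resident threads: 1,024 threads on each of the 30 SMs, versus 768 threads on each of the previous generation’s 16 SMs. The per-SM limits come from Nvidia’s CUDA documentation for these parts, so treat them as an assumption here; the arithmetic is:

```python
# In-flight thread capacity, GT200 vs. the previous generation (G80).
gt200_threads = 30 * 1024  # 30 SMs x 1,024 resident threads each
g80_threads = 16 * 768     # 16 SMs x 768 resident threads each

print(gt200_threads, gt200_threads / g80_threads)  # 30720 2.5
```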
The biggest upgrade to the GTX280 line is the addition of a double precision floating point unit. Before you get your hopes up: the 240 SPs do not each include a double precision unit. Rather, each SM contains a single 64-bit, double precision floating point unit, shared among its eight SPs.
“A very important new addition to the GeForce GTX 200 GPU architecture is double-precision, 64-bit floating point computation support. This benefits various high-end scientific, engineering, and financial computing applications or any computational task requiring very high accuracy of results. Each SM incorporates a double-precision 64-bit floating math unit, for a total of 30 double-precision 64-bit processing cores.
The double-precision unit performs a fused MAD, which is a high-precision implementation of a MAD instruction that is also fully IEEE 754R floating-point specification compliant. The overall double-precision performance of all 10 TPCs of a GeForce GTX 200 GPU is roughly equivalent to an eight-core Xeon CPU, yielding up to 90 gigaflops.”
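Working backward from the whitepaper’s 90 gigaflop figure: 30 DP units, each retiring one fused MAD (2 flops) per clock, implies a clock around 1.5 GHz. A sketch, with the clock back-solved to match the quoted peak (an assumption, not a published spec):

```python
# Peak double-precision throughput of the GeForce GTX 200 architecture.
DP_UNITS = 10 * 3       # 10 TPCs x 3 SMs, one DP unit per SM = 30 units
FLOPS_PER_FMA = 2       # fused multiply-add counts as 2 flops
CLOCK_GHZ = 1.5         # assumed, chosen to reproduce the 90 GFLOPS figure

peak_dp_gflops = DP_UNITS * FLOPS_PER_FMA * CLOCK_GHZ
print(peak_dp_gflops)  # 90.0
```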
Nvidia has most certainly taken a step in the right direction with the addition of double precision floating point arithmetic. The other features of their new GPUs look very enticing for certain applications and algorithms. For those waiting for an Nvidia-accelerated Top500 machine, we’re not quite there yet. There is still quite a bit of development to be done on the driver and on CUDA, not to mention moving toward a 1-to-1 ratio of SPs to double precision floating point units. If Nvidia plays its cards right, they are poised to make the HPC industry even more interesting.