The AMD Stream Processor

Since this was announced at SC’06 it isn’t exactly news, but it is interesting and dovetails with the GPU discussion we had earlier.

AMD is doing its own accelerated processing in the form of the Stream Processor, developed from the IP of recently acquired ATI. The board is a repurposed ATI Radeon X1900 core and functions essentially like a vector processor (with slight differences; Chris at HPC Answers has an excellent discussion of the technology). AMD also provides a low-overhead interface called CTM (for Close To Metal; cute, huh?) that lets you get to that capability quickly.

You add the Stream Processor to an existing system via PCI Express, along with up to a gigabyte of memory. Oddly, it doesn’t connect via the Torrenza interface. The chip on the board has 48 cores, and AMD claims it can deliver 375 GFLOPS.

In its incarnation as a graphics board the X1900 consumes around 120 watts, and with the additional memory in the Stream Processor configuration it can be expected to consume even more. Contrast this with the ClearSpeed accelerator board with two 96-core processors, which delivers 100 GFLOPS but consumes only 50 watts.
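To put the power numbers side by side, here is a rough back-of-the-envelope sketch in Python using only the figures quoted above. Two caveats: the 120 watts is the graphics-board figure, so it is a lower bound for the Stream Processor and the resulting ratio is an upper bound; and the GFLOPS numbers are peak vendor claims that may not be measured the same way. The dictionary layout and function name are just my own illustration.

```python
# Peak figures as quoted in this post. The Stream Processor wattage is the
# X1900 graphics-board figure, so treat it as a lower bound on power draw.
boards = {
    "AMD Stream Processor": {"gflops": 375.0, "watts": 120.0},
    "ClearSpeed accelerator": {"gflops": 100.0, "watts": 50.0},
}


def gflops_per_watt(gflops, watts):
    """Peak GFLOPS delivered per watt consumed."""
    return gflops / watts


ratios = {
    name: gflops_per_watt(b["gflops"], b["watts"]) for name, b in boards.items()
}

# Power draw at which the Stream Processor's peak GFLOPS/W would fall to the
# ClearSpeed board's ratio.
break_even_watts = (
    boards["AMD Stream Processor"]["gflops"] / ratios["ClearSpeed accelerator"]
)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f} peak GFLOPS/W")
print(f"Break-even power for the Stream Processor: {break_even_watts:.1f} W")
```

On these raw peak numbers the ClearSpeed board only wins on absolute power draw; the Stream Processor would have to pull well over its graphics-board wattage before its peak GFLOPS per watt dropped to the same level. Whether the peak numbers translate to your application is, of course, another matter.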

From Chris’ analysis (which I really recommend you read):

The Stream Processor is different from the CUDA technology in the GeForce 8800 in that the latter has cooperating cores and can therefore run multithreaded applications without stream programming. That is, AMD’s approach is a vector processor—SIMD—whereas NVIDIA’s approach is a multithreaded processor—MIMD. (To be precise, a stream processor applies a “kernel” of related instructions stored in a cache, whereas a vector processor applies a single instruction stored in a register; for our discussion, the difference is minimal.) This SIMD vs. MIMD divide also appears when comparing ClearSpeed and the Cell BE.
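The parenthetical distinction between a vector instruction and a stream kernel can be sketched in a few lines of plain Python. This is only an illustration of the two programming models, run sequentially on the CPU; it is not CTM or CUDA code, and the function names are made up for the example.

```python
def vector_add(xs, ys):
    """Vector style: a single operation (add) applied across all elements."""
    return [x + y for x, y in zip(xs, ys)]


def saxpy_kernel(a, x, y):
    """Stream style: a small kernel of related instructions for ONE element."""
    scaled = a * x      # first instruction in the kernel
    return scaled + y   # second instruction in the kernel


def run_kernel(kernel, a, xs, ys):
    """Apply the same kernel independently to every element of the streams.

    On real stream hardware these applications would run in parallel; here
    they are simply looped over to show the model.
    """
    return [kernel(a, x, y) for x, y in zip(xs, ys)]


summed = vector_add([1.0, 2.0], [10.0, 20.0])
streamed = run_kernel(saxpy_kernel, 2.0, [1.0, 2.0], [10.0, 20.0])
```

In both cases every element gets the same treatment with no cooperation between elements, which is why the two models look alike at this level of discussion.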

The whole accelerator/FPGA/GPU movement is interesting in that it marks a change in the swing of the pendulum.

We moved away from the proprietary HPC hardware of the 80s and early 90s to a pure commodity play in the late 90s and the early years of this decade.

That move was driven largely by price/performance, and in making the transition we gave up efficient access to chip performance. For a while we made up for this by riding commodity prices down and buying ever larger systems with astronomical raw performance.

But as prices continue to fall, customers are evidently starting to believe that they’ve done all they can with raw FLOPS (and they’ve bought all they can power and cool). HPTC customers now appear ready to spend a little extra to add performance that leverages the cost structure of commodity chips while delivering direct benefits to their applications.

This is a Good Thing inasmuch as it means that people are once again realizing there really is no such thing as a free lunch.