Researchers from the University of Texas at Austin plan to show a motherboard featuring four new TRIPS (Tera-op, Reliable, Intelligently adaptive Processing System) processors next week. From coverage at The Register:
“The processor core is composed of multiple copies of five different types of tiles interconnected via microarchitectural networks,” UT says on its website. “Each core may be configured in a single threaded mode or in a 4-thread multithreaded mode in which instructions from multiple threads may execute simultaneously. A TRIPS processor core is fundamentally distributed for technology scalability and to provide high bandwidth to the instruction cache, data cache, and register file through partitioning and replication.”
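Purely to illustrate what that single-threaded versus 4-thread configuration might mean in practice, here is a small Python sketch. The CoreConfig class, the mode check, and the assumption that the per-core instruction window is split evenly among threads are mine for illustration, not part of the TRIPS design as quoted; only the 1,024-entry window figure echoes the numbers reported below.

```python
# Hypothetical sketch of the two core modes described in the quote above:
# a single thread owning the whole per-core instruction window, or four
# threads sharing it. The even split and the CoreConfig class are
# assumptions for illustration, not the actual TRIPS control interface.

from dataclasses import dataclass

WINDOW_SLOTS = 1024  # per-core instruction window size reported below


@dataclass
class CoreConfig:
    threads: int  # 1 (single-threaded) or 4 (multithreaded), per the quote

    def slots_per_thread(self) -> int:
        if self.threads not in (1, 4):
            raise ValueError("core runs in single-threaded or 4-thread mode")
        return WINDOW_SLOTS // self.threads


print(CoreConfig(threads=1).slots_per_thread())  # 1024: whole window to one thread
print(CoreConfig(threads=4).slots_per_thread())  # 256: window divided among four threads
```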
The TRIPS approach differs from the Intel and AMD approaches in two respects: how many cores the architects believe it is practical for developers to deal with, and who has to do the heavy lifting to get performance out of the additional cores.
According to Steve Keckler, an associate professor at UT quoted in The Register’s piece:
“They [Intel and AMD] have made a big gamble that people writing software will figure out a way to write software that can use those processors with parallel programming…I think we will see a big wall as they try to go from 8 cores to 16 cores.”
Michael McCracken is reporting on his blog that the real innovation in this chip is its ability to execute up to 16 instructions at one time, selected from a window of 1,024 possible instructions, with 256 memory operations in flight at once [Michael wrote in to clarify some of the details I muddied. Thanks, Michael!]:
The basic idea is that instead of a single piece of control logic organizing the actions of multiple functional units, finding concurrency within a window of instructions using reordering, the TRIPS processor is distributed at the lowest level – each functional unit is a mini-processor (called a tile), and instructions executing on separate processor tiles communicate operands directly, not through a register file. Usually this is described as executing a graph of instructions instead of a single instruction at a time.
Current processors certainly don’t just execute one instruction at a time, and they do plenty of moving instructions around, so I tend to see this explicit-data-graph description as just the far end of a spectrum that starts with old superscalar designs, continues through out-of-order processors and multithreaded architectures, and currently seems to end here.
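To make the direct operand-forwarding idea concrete, here is a minimal Python sketch of executing a small instruction graph dataflow-style. It is a toy model rather than the TRIPS ISA: the Instr class, the run loop, and the example graph are invented for illustration, and only the 16-wide issue and the 1,024-entry window echo the figures quoted above.

```python
# Toy model of dataflow-style ("explicit data graph") execution.
# Each instruction names its consumers' operand slots rather than a
# destination register; it fires once all of its own operands have
# arrived. The window size and issue width mirror the figures quoted
# above; everything else is invented for illustration.

WINDOW_SIZE = 1024   # instructions the window can hold at once
ISSUE_WIDTH = 16     # instructions that may fire in one cycle


class Instr:
    def __init__(self, name, op, n_operands, consumers=()):
        self.name = name
        self.op = op                      # callable applied to the operands
        self.operands = [None] * n_operands
        self.consumers = list(consumers)  # (target Instr, operand slot) pairs
        self.done = False

    def ready(self):
        return not self.done and all(v is not None for v in self.operands)


def run(window):
    """Fire ready instructions, forwarding results operand-to-operand."""
    assert len(window) <= WINDOW_SIZE
    cycle = 0
    while any(not i.done for i in window):
        ready = [i for i in window if i.ready()][:ISSUE_WIDTH]
        if not ready:
            raise RuntimeError("deadlock: no instruction is ready")
        for instr in ready:
            result = instr.op(*instr.operands)
            instr.done = True
            for target, slot in instr.consumers:  # direct producer-to-consumer forwarding
                target.operands[slot] = result
            print(f"cycle {cycle}: {instr.name} -> {result}")
        cycle += 1


# Tiny graph: (2 + 3) * (2 + 4), with constants injected as literals.
mul = Instr("mul", lambda a, b: a * b, 2)
add1 = Instr("add1", lambda a, b: a + b, 2, consumers=[(mul, 0)])
add2 = Instr("add2", lambda a, b: a + b, 2, consumers=[(mul, 1)])
add1.operands = [2, 3]
add2.operands = [2, 4]
run([add1, add2, mul])
```

The point of the sketch is the forwarding step: each producer writes its result straight into its consumers’ operand slots, where a conventional out-of-order core would instead rendezvous through a renamed register file.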
There is also a good summary at Science Blog, and Sun’s HPC Watercooler has a pointer to this paper. IBM has also been involved in the project, so perhaps there is a robust path to market for the technology if it proves effective.
(Thanks to all the readers who also pointed me to this story.)