If you still looking for more details on the new Kepler GPUs, Nvidia has stepped up with a new GK110 Architecture whitepaper for you.
Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla and the HPC market. Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60‐65% on the prior Fermi architecture. In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.
Is the news all good? Blogger Paul Caheny writes that the K10 in particular continues a disturbing downward trend on memory capacity per FLOPs.
A couple of high level observations on how this fits into general HPC architecture trends. Firstly the ratio of memory capacity and memory bandwidth to compute is likely to continue to decrease, signifying the increasing necessity to make use of strong scaling in applications rather than the previously rich seam of weak scaling. K10 represents a more than 60% fall in Bytes/FLOPs (memory capacity per FLOPs) compared to M2090 and a reduction of 50% in Bytes/sec/FLOPs (memory bandwidth per FLOPs) compared to M2090 (both using SP FLOPs as per K10′s target market). It will be interesting to see what the corresponding numbers are for the upcoming K20.