Video: Bill Dally on Scaling Performance in the Post-Dennard Era

In this video from the HiPEAC16 conference, Bill Dally from Nvidia describes the challenges to scaling supercomputer performance in the post-Dennard era.

At the conference, Dally presented a paper he co-authored with Subhasis Das entitled “Reuse Distance-Based Probabilistic Cache Replacement.”

Abstract:

“This article proposes Probabilistic Replacement Policy (PRP), a novel replacement policy that evicts the line with minimum estimated hit probability under optimal replacement instead of the line with maximum expected reuse distance. The latter is optimal under the independent reference model of programs, which does not hold for last-level caches (LLC). PRP requires 7% and 2% metadata overheads in the cache and DRAM respectively. Using a sampling scheme makes DRAM overhead negligible, with minimal performance impact. Including detailed overhead modeling and equal cache areas, PRP outperforms SHiP, a state-of-the-art LLC replacement algorithm, by 4% for memory-intensive SPEC-CPU2006 benchmarks.”
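
To make the idea concrete, here is a minimal Python sketch of probabilistic eviction: it evicts the line whose estimated probability of a future hit is lowest, with the estimate taken from per-line reuse-distance histograms. This is a hypothetical simplification for illustration only, not the PRP mechanism from the paper; the class name, the histogram scheme, and the fixed associativity are all assumptions.

    # Hypothetical sketch of reuse-distance-based probabilistic eviction.
    # Not the PRP implementation from the paper; it only illustrates the idea of
    # evicting the line with the lowest estimated probability of a future hit.

    from collections import defaultdict

    class ProbabilisticCacheSet:
        def __init__(self, ways, max_rd=256):
            self.ways = ways          # set associativity (assumed fixed)
            self.max_rd = max_rd      # reuse distances beyond this count as "never reused"
            self.lines = {}           # tag -> accesses seen since the line's last use ("age")
            # Per-tag histogram of observed reuse distances, used to estimate
            # how likely a resident line is to be reused before it ages out.
            self.rd_hist = defaultdict(lambda: defaultdict(int))

        def _hit_probability(self, tag, age):
            """Estimate P(line is reused at some distance greater than its current age)."""
            hist = self.rd_hist[tag]
            total = sum(hist.values())
            if total == 0:
                return 0.5            # no history yet: neutral estimate
            future = sum(c for rd, c in hist.items() if age < rd <= self.max_rd)
            return future / total

        def access(self, tag):
            if tag in self.lines:
                # Hit: record the observed reuse distance and reset the line's age.
                self.rd_hist[tag][self.lines[tag]] += 1
                self.lines[tag] = 0
                hit = True
            else:
                if len(self.lines) >= self.ways:
                    # Evict the line with the lowest estimated hit probability.
                    victim = min(self.lines,
                                 key=lambda t: self._hit_probability(t, self.lines[t]))
                    del self.lines[victim]
                self.lines[tag] = 0
                hit = False
            # Age every other resident line by one access.
            for t in self.lines:
                if t != tag:
                    self.lines[t] += 1
            return hit

    # Tiny usage example on a synthetic trace:
    if __name__ == "__main__":
        cache_set = ProbabilisticCacheSet(ways=4)
        trace = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 5, 0, 1]
        hits = sum(cache_set.access(t) for t in trace)
        print(f"hits: {hits} / {len(trace)}")

A conventional policy such as LRU would instead always evict the least recently used line, regardless of how likely that line is to be reused.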

Full Transcript:

Bill Dally, Nvidia Chief Scientist and Senior Vice President of Research

“As I indicated in my keynote this morning, there are two really fundamental challenges we’re facing in the next two years in all sorts of computing – from supercomputers to cell phones. The first is energy efficiency. With the end of Dennard scaling, we’re no longer getting a big improvement in performance per watt from each technology generation. The performance improvement per generation has dropped from a factor of 2.8x, back when we used to scale supply voltage with each new generation, to about 1.3x in the post-Dennard era. With this comes a real challenge for us to come up with architecture techniques and circuit techniques for better performance per watt. One big example of that is reducing overhead: in today’s scalar processors, more than 99% of the energy goes not into payload arithmetic operations but into organizing and scheduling the work. In our GPUs, we improve that to the point where almost 50% of the energy goes into useful operations.

“The other big opportunity for energy efficiency is communication: using more efficient signaling circuits, we can improve signaling efficiency from about 200 femtojoules per bit-millimeter to about 20 femtojoules per bit-millimeter. Taken together, those can give very large improvements in performance per watt. The other big opportunity and challenge is parallelism. Processors aren’t getting faster anymore, they’re getting wider. That is, we’re getting more parallel cores working together to solve a problem, rather than one serial core solving it faster. And the real challenge of parallelism is one of programming. If we pick the right abstractions, we can make parallel programming easy, whereas too often today, by doing manual synchronization and manual mapping, people are making parallel programming very hard. We’re looking at developing target-independent programming systems and very powerful tools to help people do their mapping, so that people can get very good performance from parallel programs while still making them very easy to program.”
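
To put the scaling figures from the transcript in perspective, the short calculation below compounds the quoted per-generation gains (2.8x in the Dennard era versus about 1.3x today) over several generations and compares the quoted overhead fractions. The five-generation horizon is an arbitrary assumption chosen only for illustration.

    # Back-of-the-envelope illustration of the figures quoted above.
    # The five-generation horizon is an arbitrary assumption for the example.

    dennard_gain = 2.8       # perf/W gain per generation with supply-voltage scaling
    post_dennard_gain = 1.3  # perf/W gain per generation in the post-Dennard era
    generations = 5

    print(f"Perf/W after {generations} generations (Dennard era):      "
          f"{dennard_gain ** generations:,.0f}x")
    print(f"Perf/W after {generations} generations (post-Dennard era): "
          f"{post_dennard_gain ** generations:,.1f}x")

    # Overhead figures: >99% of scalar-processor energy goes to organizing and
    # scheduling work, versus roughly 50% useful energy in a GPU.
    scalar_useful, gpu_useful = 0.01, 0.50
    print(f"Energy-per-useful-op advantage of ~50% useful energy over ~1%: "
          f"{gpu_useful / scalar_useful:.0f}x")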
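
The signaling numbers lend themselves to a similar back-of-the-envelope estimate. The sketch below computes the energy to move one word across a die at 200 fJ/bit-mm versus 20 fJ/bit-mm; the 256-bit word width and 10 mm distance are illustrative assumptions, not figures from the talk.

    # Energy to move one 256-bit word 10 mm on-die, at the two signaling
    # efficiencies mentioned above. Word width and distance are illustrative
    # assumptions, not figures from the talk.

    FJ = 1e-15  # one femtojoule in joules

    def transfer_energy_fj(bits, distance_mm, fj_per_bit_mm):
        """Signaling energy in femtojoules for a given payload and distance."""
        return bits * distance_mm * fj_per_bit_mm

    bits, distance_mm = 256, 10
    for label, efficiency in [("conventional (200 fJ/bit-mm)", 200),
                              ("efficient    (20 fJ/bit-mm) ", 20)]:
        e_fj = transfer_energy_fj(bits, distance_mm, efficiency)
        print(f"{label}: {e_fj:,.0f} fJ ({e_fj * FJ:.2e} J) per 256-bit word")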
