Vuduc wins NSF CAREER Award to make HPC better "by any means necessary"

In early June, the NSF announced that Georgia Tech’s Richard Vuduc received an NSF CAREER Award for his work in tuning software to run on parallel systems. From the NSF website:

The Faculty Early Career Development (CAREER) Program is a Foundation-wide activity that offers the National Science Foundation’s most prestigious awards in support of junior faculty who exemplify the role of teacher-scholars through outstanding research, excellent education and the integration of education and research within the context of the mission of their organizations. Such activities should build a firm foundation for a lifetime of leadership in integrating education and research.

The name of his proposal, “Autotuning foundations for exascale systems”, attracted my attention, and Rich agreed to tell us a little about himself, his work, and this prestigious award.

insideHPC: First, can you tell the readers a little about yourself? What’s the 100 word bio of Rich Vuduc?

Rich Vuduc: I am an assistant professor at Georgia Tech in the School of Computational Science and Engineering, which is (Shameless Plug Alert) one of the country’s few full-fledged academic departments devoted to the systematic study, creation, and application of computer-based models to understand and analyze natural and engineered systems. HPC is a major research and teaching focus in this kind of department, because computational scientists often care a great deal about effective use of parallelism in large systems. My research lab, The HPC Garage, is looking at automating and simplifying the analysis, programming, tuning, and debugging of software for emerging and future parallel machines.

On a more personal note, I am Vietnamese-American and my favorite TV show is “The Wire.” For TV skeptics, The Wire is proof that a TV series can be great art!

insideHPC: Looking at your web pages, it seems like you are, well, more fun than most of the profs I remember. “HPC Garage” for example. Is that a conscious effort on your part to engage more creative people, or just a natural extension of your personality?

Vuduc: Thanks, though I don’t know if “more fun” necessarily means “better research and teaching.”

I went to grad school and did my postdoc in the Bay Area, and I am greatly inspired by the famous Hewlett-Packard Garage: my lab, too, is a small team of creative, hands-on tinkerers with limited resources and big dreams of building better, well, instruments and “calculators” for scientific advancement.

insideHPC: Your research area is tools for getting better performance out of high-end systems by software methods rather than human intervention. Can you describe your work in this area in general terms? Is any of it part of a library readers may be using? How does it fit in the context of other efforts, like ATLAS?

Vuduc: Yes, our goal is to simplify the process of achieving truly high performance, “by any means necessary,” if I may pay small tribute to my radical Berkeley roots. Accomplishing this goal might mean giving parallel programmers an auto-magic toaster that makes slow code fast. However, I would also be happy with more modest achievements, like distilling useful new performance principles or practices; making productive programming models fast; or providing more insight into what architectures work for particular interesting and important classes of applications, and why.

People who recognize my name probably know it from my early work in the area of autotuning on a library called OSKI, the Optimized Sparse Kernel Interface, which was developed while I was a graduate student “bebopper” in Jim Demmel’s and Kathy Yelick’s BeBOP group at Berkeley. (OSKI is also the Cal mascot. Go Bears!) OSKI is like Clint Whaley’s well-known ATLAS library, but is for sparse matrices rather than dense ones. The methodology is different in the sparse case, where one might not only tune the code, but also change the data structure at run-time, depending on the input matrix. Sam Williams (LBNL) greatly extended the OSKI techniques for multicore, and Jee Choi, one of my students, has some cool extensions for GPUs. As for sequential OSKI, I know Mike Heroux at Sandia has an effort to put wrappers around it for Trilinos.
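
[Editor’s note: To make the run-time data-structure tuning idea concrete, here is a toy Python sketch of our own devising (this is not OSKI’s actual interface). It benchmarks sparse matrix-vector multiply in plain CSR and in several register-blocked BSR layouts on the given input matrix, reports the fill-ratio penalty each blocking pays, and keeps the fastest.]

```python
# Toy sketch (not OSKI's actual interface): choose a sparse storage
# format at run time by benchmarking SpMV on the input matrix itself,
# in the spirit of OSKI's input-dependent data-structure tuning.
import time

import numpy as np
import scipy.sparse as sp

def time_spmv(A, x, trials=20):
    """Median wall-clock time of one sparse matrix-vector product."""
    times = []
    for _ in range(trials):
        t0 = time.perf_counter()
        A @ x
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

def tune_format(A_csr, x, block_sizes=((2, 2), (4, 4), (8, 8))):
    """Return (name, matrix, time) for the fastest measured SpMV variant.

    Candidates are plain CSR plus BSR with several register-block
    sizes. The BSR fill ratio (stored entries / true nonzeros) is the
    padding cost each blocking pays for denser inner loops.
    """
    best = ("csr", A_csr, time_spmv(A_csr, x))
    for r, c in block_sizes:
        A_bsr = A_csr.tobsr(blocksize=(r, c))
        fill = A_bsr.nnz / A_csr.nnz  # >= 1: explicit zeros added by blocking
        t = time_spmv(A_bsr, x)
        print(f"BSR {r}x{c}: fill = {fill:.2f}, SpMV = {t * 1e6:.1f} us")
        if t < best[2]:
            best = (f"bsr-{r}x{c}", A_bsr, t)
    return best

if __name__ == "__main__":
    A = sp.random(4096, 4096, density=0.01, format="csr")
    x = np.ones(A.shape[1])
    name, A_best, t = tune_format(A, x)
    print(f"chosen format: {name} ({t * 1e6:.1f} us per SpMV)")
```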

These days, my lab is looking at autotuning techniques for a broader variety of interesting irregular and highly-adaptive computations, both in statistical machine learning (jointly with Alex Gray at GT) and for tree-based n-body problems (jointly with George Biros, also at GT).

insideHPC: Thinking specifically about your CAREER award, could you briefly talk about the award: what it is, what it means for you professionally, and what it means for you personally?

Vuduc: The CAREER award is an angel investment! I am extremely grateful that there are people willing to take a chance on my lab’s work and on my teaching (probably a bigger risk, the latter). Receiving the award means I have both the duty and the privilege to do something impactful.

It’s also a nice nod to my senior faculty mentors at GT, David Bader and Richard Fujimoto. Their efforts and advice have not been lost on me.

insideHPC: Your proposal is called “Autotuning foundations for exascale systems” — can you talk about the work you plan to do?

Vuduc: In perhaps overly basic terms, we hope to simplify programming and tuning on future exascale systems using autotuning techniques.

The proposal has two major research thrusts, one that explores analytical and statistical performance models to guide tuning, and another that explores tuning in emerging dataflow-like programming models. In both cases, we want methods that work on (a) the kinds of sparse, irregular, adaptive computations that I’ve been studying for some time now and that are a particular challenge to scale; and (b) the kinds of systems we can expect to see at exascale, which I am told will have “absurdly heterogeneous manycore nodes.” Both thrusts build on collaborations with Kath Knobe and C.-K. Luk, both at Intel. If we are successful, we will contribute to a goal that folks like David Bailey (LBNL) and Robert van de Geijn (UT Austin) sometimes refer to as one of developing a “science” of performance programming and engineering. That’s what “foundations” refers to.
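
[Editor’s note: As a toy illustration of the first thrust, the sketch below (our own construction, with made-up machine parameters) uses a roofline-style analytical model to rank candidate tile sizes for a blocked matrix multiply, so that only the most promising candidates need to be benchmarked.]

```python
# Toy model-guided tuning (our construction, made-up machine numbers):
# rank tile-size candidates with a roofline-style analytical model,
# then benchmark only the most promising few instead of all of them.
PEAK_GFLOPS = 10.0     # assumed compute peak
BANDWIDTH_GBS = 5.0    # assumed memory bandwidth

def predicted_time(n, tile):
    """Crude model of an n x n blocked matrix multiply with tile x tile blocks."""
    flops = 2.0 * n ** 3
    # Classic blocked-matmul traffic estimate: ~2*n^3/tile words of
    # A/B re-reads plus n^2 words for C; smaller tiles mean more traffic.
    words = 2.0 * n ** 3 / tile + n ** 2
    compute_t = flops / (PEAK_GFLOPS * 1e9)
    memory_t = words * 8 / (BANDWIDTH_GBS * 1e9)   # 8-byte doubles
    return max(compute_t, memory_t)               # roofline: slower side wins

n = 2048
candidates = [8, 16, 32, 64, 128, 256]
ranked = sorted(candidates, key=lambda t: predicted_time(n, t))
print("model says benchmark only:", ranked[:2])   # prune the empirical search
```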

Like all CAREER proposals, there is also an integral educational thrust tied to the research. In my case, the gist is to design and implement a year-long lab practicum, called The HPC Garage Practicum, that is a true interdisciplinary team-based competition, aimed at early-stage graduate students. The competition is to develop the most scalable code that answers real-world scientific questions; think of the famed Gordon Bell and X-Prize competitions. The basic inspiration arose in conversations with Pablo Laguna, Deirdre Shoemaker, and George Biros at GT. The approach is in the style of the GT School of CSE’s mission, to train the next generation of computational scientists in interdisciplinary teamwork.

By the way, if any corporations would like to donate prizes for the winning teams in this effort, we are soliciting.

insideHPC: Is this work a “scaling up” of the earlier work you’ve done, or are there specific things that you’ll need to change to address the challenge of running on exascale class machines?

Vuduc: We are scaling up, but not just the platform; we are also working in larger “algorithmic contexts.” I mean that whereas my earlier work focused on relatively compact kernels, my lab these days is looking at autotuning progressively more complex multiple-kernel solvers, with an eye toward large applications. This requires working more closely with domain scientists and compiler people, like my former postdoc mentor, Dan Quinlan at LLNL. The work my students, Aparna Chandramowlishwaran and Aashay Shringarpure, have done for the fast multipole method on multicore- and GPU-based distributed-memory systems is a great first example.

insideHPC: How do you go about designing software for a class of machines that not only hasn’t been built, but for which there isn’t even a design consensus yet?

Vuduc: It’s always a difficult problem, but a “classical” approach is to change the program representation, as suggested, for instance, by Jeff Bilmes (UW) and Krste Asanovic (UCB) in their PHiPAC project. In particular, rather than writing a specific program, you write a program generator that can produce many different versions of the program. The generator might even generate entirely different algorithms. Perhaps the most aggressive and successful examples of this approach today are the SPIRAL (Markus Pueschel at CMU) and FLAME (van de Geijn) projects. It’s not easy to do, but in my view it is a promising way forward.
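
[Editor’s note: The sketch below is our own toy rendition of the generator idea, not PHiPAC itself. It emits several unrolled variants of the same kernel as source text, builds each one, times it on the machine at hand, and keeps the winner.]

```python
# Toy program generator (our own sketch, not PHiPAC): emit several
# unrolled variants of a dot product as source text, build each one,
# time it, and keep the fastest for this machine.
import time

import numpy as np

def generate_dot(unroll):
    """Return Python source for a dot product unrolled `unroll` times."""
    body = "\n".join(
        f"        s += x[i + {u}] * y[i + {u}]" for u in range(unroll)
    )
    return (
        f"def dot_u{unroll}(x, y):\n"
        f"    s = 0.0\n"
        f"    # NOTE: ignores any tail elements, for brevity\n"
        f"    for i in range(0, len(x) - {unroll - 1}, {unroll}):\n"
        f"{body}\n"
        f"    return s\n"
    )

def build(unroll):
    """'Compile' one generated variant and return the callable."""
    ns = {}
    exec(generate_dot(unroll), ns)
    return ns[f"dot_u{unroll}"]

x = y = np.ones(1 << 16)
best = None
for u in (1, 2, 4, 8):
    f = build(u)
    t0 = time.perf_counter(); f(x, y); t = time.perf_counter() - t0
    print(f"unroll {u}: {t * 1e3:.2f} ms")
    if best is None or t < best[1]:
        best = (f, t)
print("selected variant:", best[0].__name__)
```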

In my CAREER proposal, part of what we plan to do is work with Kath (Knobe at Intel) to use her Concurrent Collections (“CnC”) programming model as a base platform, in part because it embodies the spirit of this approach. More specifically, CnC has a nice way of representing “all possible parallel execution schedules,” from which we could then imagine tuning or searching to find an especially good one for a particular system. Aparna’s IPDPS’10 paper (Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures, Aparna Chandramowlishwaran et al.) — a “best paper” winner, by the way! — shows off some of our early and successful experiences with CnC.
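
[Editor’s note: To illustrate the “search over schedules” idea in miniature (this is our own toy, not CnC), the sketch below enumerates the legal execution orders of a four-task graph and keeps the order with the best simulated makespan on two workers.]

```python
# Toy schedule search (ours, not CnC): enumerate every legal execution
# order of a small task graph and keep the one with the best simulated
# makespan on two workers.
from itertools import permutations

deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}   # task DAG
cost = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 1.0}             # task runtimes

def is_legal(order):
    """True if every task appears after all of its dependencies."""
    seen = set()
    for t in order:
        if any(d not in seen for d in deps[t]):
            return False
        seen.add(t)
    return True

def makespan(order, workers=2):
    """List-schedule `order` onto `workers`, respecting dependencies."""
    free = [0.0] * workers   # when each worker is next available
    done = {}                # task -> finish time
    for t in order:
        ready = max((done[d] for d in deps[t]), default=0.0)
        w = min(range(workers), key=lambda i: free[i])
        start = max(free[w], ready)
        done[t] = start + cost[t]
        free[w] = done[t]
    return max(done.values())

legal = [o for o in permutations(deps) if is_legal(o)]
best = min(legal, key=makespan)
print(f"best of {len(legal)} legal orders: {best}, makespan = {makespan(best)}")
```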

It also seems clear that, in yet another ’80s comeback, vectorization is re-emerging in importance. Think much larger SIMD/SSE units. My student, Cong Hou, is thinking about the problem of autotuning in that context as well.