This month’s HPC Rock Star is Marc Snir. During his time at IBM, Snir contributed to one of the most successful bespoke HPC architectures of the past decade, the IBM Blue Gene. He was also a major participant in the effort to create the most successful parallel programming interface ever: MPI. In fact, Bill Gropp, another key person in that effort, credits Snir with helping to make it all happen: “The MPI standard was the product of many good people, but without Marc, I don’t think we would have succeeded.”
Today Snir is the Michael Faiman and Saburo Muroga Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign, a department he chaired from 2001 to 2007. With a legacy of success in his portfolio, he is perhaps busier today than ever as the Associate Director for Extreme Scale Computing at NCSA, co-PI for the petascale Blue Waters system, and co-director of the Intel and Microsoft funded Universal Parallel Computing Research Center (UPCRC). Trained as a mathematician, Snir is one of the few individuals today shaping both high end supercomputing and the mass adoption of parallel programming.
Marc Snir (Erdos number 2) finished his PhD in 1979 at the Hebrew University of Jerusalem. He is a fellow of the American Association for the Advancement of Science (AAAS), the ACM, and IEEE. In the early 1980s he worked on the NYU Ultracomputer Project. Developed at the Courant Institute of Mathematical Sciences Computer Science Department at NYU, the Ultracomputer was a MIMD, shared memory machine whose design featured N processors, N memories, and an N log N message passing switch between them. The switch would combine requests bound for the same memory address, and the system also included custom VLSI to interleave memory addresses among memory modules to reduce contention.
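The two Ultracomputer ideas mentioned above, interleaving addresses across memory modules and combining requests bound for the same address, can be sketched in a few lines. This is a toy illustration, not the actual Ultracomputer design: the low-order interleaving scheme, the choice of fetch-and-add as the combinable request, and the names `module_for` and `combine_fetch_and_add` are all assumptions made for the example.

```python
def module_for(address: int, num_modules: int) -> int:
    """Low-order interleaving (an assumed scheme): consecutive addresses
    land on consecutive memory modules, spreading out contention."""
    return address % num_modules


def combine_fetch_and_add(memory: dict, address: int, increments: list) -> list:
    """Serve several fetch-and-add requests to one address with one memory update.

    Each requester receives the value it would have seen had the requests
    been applied one after another, in the order given.
    """
    running = memory.get(address, 0)
    results = []
    for inc in increments:
        results.append(running)  # value seen by this requester
        running += inc
    memory[address] = running    # the single combined update
    return results


if __name__ == "__main__":
    print([module_for(a, 4) for a in range(8)])
    mem = {100: 10}
    print(combine_fetch_and_add(mem, 100, [1, 2, 3]), mem[100])
```

The point of combining is that the memory module sees one request instead of three, while every requester still gets a result consistent with some sequential ordering.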
Following his time at NYU, Snir was at the Hebrew University of Jerusalem from 1982 to 1986, when he joined IBM’s T. J. Watson Research Center. At Watson he led the Scalable Parallel Systems research group that was responsible for major contributions, especially on the software side, to the IBM SP scalable parallel system and the IBM Blue Gene system. From 2007 to 2008 he was director of the Illinois Informatics Institute, and he has over 90 publications that span the gamut from the theoretical aspects of computer science to public computing policy. Microsoft’s Dan Reed has said of Snir, “Marc has been one of the seminal thought leaders in parallel algorithms, programming models and architectures. He has brought theoretical and practical insights to all three.”
As readers of this series will know, a Rock Star is more than just the sum of accomplishments on a resume. We talked with Dr. Snir by email to get more insight into who he really is, what he thinks is important going forward, and what has made him so influential.
insideHPC: You have a long history of significant contributions to our community, notably including contributions to the development of the SP and Blue Gene. I was familiar with your MPI work, but not the SP and BG work while you were at IBM. Would you talk a little about that?
Marc Snir: The time is late 80’s. The main offering of IBM in HPC was a mainframe plus vector unit (not too competitive). Monty Denneau in Research had a project called Vulcan to build a large scale distributed memory system out of Intel 860 chips. His manager (Barzilai) decided to push this project as the basis for a new IBM product. This required changes in hardware (to use Power chips; ironically, the original 860 board became the first NIC for the new machine) and also required a software plan, as well as a lot of lobbying, product planning, and so on. I got to manage the software team in Research that worked on the first incarnation of this product, developing communication libraries (pre-MPI), the performance visualization tools, a parallel file system, and other aspects of the final system.
There was a lot of work to convince the powers that be in IBM to go this way, because at the time IBM mainframes were still ECL; a lot of joint work with a newly established product group to develop an architecture and a product plan; and a lot of work to do the first development in Research and transfer the code to development (Kingston and, later, Poughkeepsie). All of this work turned into the IBM SP, and the SP2 followed; it was the first real product and quickly became a strong sales driver for IBM. I continued to lead the software development in Research, where we did the first MPI prototype, created more performance tools, and did work on MPI-IO and various applications.
Blue Gene is a convoluted story (the Wikipedia entry is incorrect; I need to find time to edit it). At the time IBM had two hardware projects. One was developing Cyclops, headed up by Monty Denneau. Cyclops was a massively parallel machine with heavily multithreaded processors and all memory on chip. The other project was to develop a QCD (quantum chromodynamics) machine based on embedded PPC processors under the direction of Alan Gara.
At the time, IBM Research was looking to push highly visible, visionary projects. I proposed to take the hardware that Monty was building, with some modifications, and use it as a system for molecular dynamics simulation (if you wish, an early version of the Anton machine of D. Shaw). IBM announced, with great fanfare, a $100M project to build this system, and called it Blue Gene.
I coordinated the work on BG and directly managed the software development. In the meantime, Al Gara worked to make his system more general (basically, adding a general routing network, rather than nearest neighbor communications) and started discussing this design with Lawrence Livermore National Lab’s Mark Seager. Seager liked it and proposed to fund the development of a prototype. At that point, the previous Blue Gene became Blue Gene C while the system of Al Gara became Blue Gene L (for Light). After a year BGC was discontinued (or, to be more accurate, heavily pared down; Monty has continued his work), and BGL evolved into BGP, and then BGQ. I helped Al Gara with some of his design, in particular with the IO subsystem, and with much of the software, and my team developed the original software both for Blue Gene C and for Blue Gene L.
insideHPC: Looking at your career thus far, do you have a sense that one or two accomplishments were especially significant professionally, either in terms of meeting a significant challenge or really spurring the community in a new direction?
Snir: I have had a fairly varied career. I started by doing more theoretical research. My first serious publication in 1982 is a 55-page journal article in the Journal of Symbolic Logic on Bayesian induction and logic (probabilities over rich languages, testing and randomness). It is still being cited by researchers in logic and philosophy. This is a long-term influence on a very small community with (I believe) deep philosophical implications, but no practical value.
Some of my early theory work has been somewhat ahead of its time, and continues to be cited long after publication. A paper on how to ensure sequential consistency in shared memory systems (Shasha and Snir) has been the basis for significant work on the compilation of shared memory parallel languages. I recently learned that a paper I published in 1985 on probabilistic decision trees is quite relevant to quantum computing (indeed, it had some recent citations; I had to re-read it to remember what it was about). While many of my theory publications are best forgotten, some seem to have a long-term value.
My applied research has been (as it should be) much more of a team effort — so whatever credit I take, I share with my partners. Pushing IBM into scalable parallel systems (as we called them, i.e., clusters) was a major achievement. Basically, we needed to conceive a complete hardware and software architecture, and execute with a new product team — essentially work in startup mode. That probably was the most intensive time in my career. Pushing Blue Gene was also quite intense. I probably wrote down half of the MPI standard — that’s another type of challenge: thinking clearly about the articulations of a large standard and convincing a large community to buy into a design. As department head in CS at U of Illinois I faced quite intensive but quite different challenges: growing a top department (from 39 to 55 faculty in 6 years), improving education, and changing the culture. Getting Blue Waters up and running at NCSA (developing the proposal, nailing down the software architecture, pushing for needed collaborations with IBM, etc.) has a similar flavor. I think that I feel the need to push large projects.
I realize that’s more than two, but I like all of them. If I have to pick, I’d pick the IBM SP product, just because it was the most intensive project, and the one that required the most “design from scratch,” with little previous experience. It also was an unqualified success.
insideHPC: If you were to answer that same question about the one or two accomplishments that mean the most to you personally, are they the same?
Snir: Well, I have very successful children and very good friendships. This means a lot, personally. But I must confess that, family and friends aside, professional achievement is what I care about.
insideHPC: You’ve spent time as a manager and department head, and time as an individual contributor. Is there one of those roles that you think fits your personal style or the kind of contributions you want to make? Asked another way, some people add the most value by doing, and others by creating an environment in which other people can do: which fits you best?
Snir: Hard choice. As a manager or department head I have done much more of the latter, creating an environment where other people can do. My individual contributions, especially in theory, are the former. I would say that I prefer the latter; doing is more of a hobby, a way not to lose contact with reality. Getting others to do is a way of achieving much more.
insideHPC: You’ve talked about the need for effective parallelism to be accessible by everyone, but some argue that parallel programming is fundamentally hard and that you can either have efficient execution or ease of expression, but not both. Do you agree? Is this purely a software and tools problem, or is there a hardware component to the answer?
Snir: Processors have become more complex over the years, and software has not been too successful in hiding this complexity: It is increasingly easy to “fall off the cliff” and see a significant performance degradation due to a seemingly innocuous change in a code. Parallelism is one additional twist to that evolution; there is no doubt about it. Small changes in a code can make the difference between fully sequential and fully parallel. Also, there is no doubt that there is a tradeoff between performance and ease of programming: people who care about the last ounce of performance (cache, memory, vector units, disk) have to work hard on single core systems and slightly harder on multicore systems. On the other hand, parallelism can be used to accelerate easy to use interfaces (e.g., Matlab, or even Excel) and can be used for bleeding-edge HPC computations.
The only fundamentally new thing is that application developers who want to see a uniprocessor application run faster from one microprocessor generation to another now need to learn about parallelism. This is a new (large) population of programmers, and this is the focus of UPCRC Illinois.
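Snir’s remark that a small change in a code can make the difference between fully parallel and fully sequential can be illustrated with a minimal sketch. The function names are hypothetical, and the example only shows the dependence structure; in standard CPython the global interpreter lock means these threads will not actually speed up pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_independent(values, factor):
    """Each output depends only on its own input, so the work is a
    parallel map: iterations can run in any order, on any worker."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda v: v * factor, values))


def scale_dependent(values, factor):
    """A seemingly innocuous change: each output also adds the previous
    output. As written, this loop-carried dependence forces the
    iterations to run one after another."""
    out = []
    prev = 0
    for v in values:
        prev = v * factor + prev  # needs the result of the previous iteration
        out.append(prev)
    return out


if __name__ == "__main__":
    print(scale_independent([1, 2, 3], 2))
    print(scale_dependent([1, 2, 3], 2))
```

The two loop bodies differ by a single term, yet only the first can be handed to a pool of workers without further restructuring.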
insideHPC: Parallelism (of the kind exposed to developers) at much less than supercomputing scale is a relatively new thing for developers. For decades the majority of applications have been developed for desktop boxes, with very few people working on software for large scale parallel execution. Today we have parallelism even in handheld devices, and the high end community is contemplating O(1B) threads in a single job. Is there a chance that the work to develop tools for “commodity” parallel programming will make high end programming easier, or are these fundamentally different communities? If different, what are some of the essential differences?
Snir: The HPC software stack has always been developed by extending a commodity software stack: OS, compiler, etc. Now, the HPC software stack will be built by extending a (low-end) parallel software stack. I am inclined to believe that this will make the extension easier. There is also much cloud computing technology to be reused; i.e., system monitoring and error handling in large systems. Not much of this has happened, and I expect that the effect will be relatively marginal. As for the essential differences, this reminds me of the famous, apocryphal dialogue between Fitzgerald and Hemingway:
Fitzgerald: The rich are different than you and me.
Hemingway: Yes, they have more money.
Large machines are different because they have many more threads; HPC is different from cloud computing because its applications are much more tightly coupled. Sufficient quantitative differences become qualitative at some point.
insideHPC: I have been challenged at a couple of events where I have spoken lately about the necessity of getting to exascale, and the draining effect it is having on computational funding for other projects. Is it necessary that we push on to the exascale? If so, why not take a longer trajectory to get there? Why is the end of this decade inherently better for exascale than the middle of the next?
Snir: Good question, and a question one might have asked at any point in time. It is not for supercomputing aficionados to make the case for exascale in 2018 or 2030; it is up to different application communities to make the case for the importance of getting some knowledge earlier. Having more certainty about climate change and its effects earlier by a few years may be well worth a couple of billion dollars, but this is not a calculation I can make; similarly for other applications.
There is another interesting point: Moore’s law is close to an inflection. ITRS (the International Technology Roadmap for Semiconductors) predicts a slowdown (doubling every 3 years) pretty soon; nobody has a path to go beyond 8 nm. Given that 8 nm is only a few tens of silicon atoms, we may be hitting a real wall next decade. There is no technology waiting to replace CMOS, as CMOS was available to replace ECL. This will be a major game changer for the IT industry in the next decade: The game will no longer be finding applications that can leverage the advances of microelectronics, but getting more computation out of a fixed power and (soon) transistor budget. I call research on exascale “the last general rehearsal before the post-Moore era.” Exascale research will push on many of the research directions that are needed to increase “compute efficiency.” Therefore, I believe it is important to push this research now.
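The arithmetic behind these remarks can be checked with a rough sketch. The physical constants are approximate textbook values for crystalline silicon, and the two-year baseline doubling period and the 12-year horizon are assumptions chosen for illustration; neither comes from the interview.

```python
# Approximate textbook values for crystalline silicon (assumed here).
SI_LATTICE_CONSTANT_NM = 0.543  # edge of the cubic unit cell
SI_BOND_LENGTH_NM = 0.235       # nearest-neighbor Si-Si distance


def spacings_across(feature_nm: float, spacing_nm: float) -> float:
    """How many atomic spacings fit across a feature of the given width."""
    return feature_nm / spacing_nm


def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` at a given doubling period."""
    return 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    # An 8 nm feature is roughly 15 unit cells, or about 34 bond lengths, wide.
    print(round(spacings_across(8, SI_LATTICE_CONSTANT_NM), 1))
    print(round(spacings_across(8, SI_BOND_LENGTH_NM), 1))
    # Over 12 years: doubling every 2 years (assumed baseline) vs every 3 years.
    print(growth_factor(12, 2), growth_factor(12, 3))
```

Under these assumptions, 8 nm is indeed a few tens of atoms across, and stretching the doubling period from two years to three cuts the growth over 12 years from 64x to 16x.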
insideHPC: Thinking about exascale, there seems to be broad agreement that it isn’t practical to build such a system out of today’s parts because of energy impracticalities. But when it comes to programming models, some people seem to favor an incremental evolution of the same model we use today (communicating sequential processes with something like MPI), while others want to totally start over (e.g., Sterling’s ParalleX work). I’ve been personally surprised by how well the current model extended to the petascale. What are your thoughts about evolution versus revolution in exascale programming approaches?
Snir: When I was involved with MPI almost 20 years ago, I never dreamed it would be so broadly used 20 years down the road. Again, this is not a black and white choice: One can replace MPI with a more efficient communication layer for small, compute intensive kernels while continuing to use it elsewhere; one can take advantage of the fact that many libraries (e.g., ScaLAPACK) are hiding MPI behind their own, higher-level communication layer, to re-implement their communication layer on another substrate; one can preprocess or compile MPI calls into lower-level, more efficient communication code. One can use PGAS languages which, essentially, are syntactic sugar atop one-sided communication. We shall need to shift away from MPI for a variety of reasons that include performance (communication software overhead), an increasing mismatch between the MPI model and the evolution of modern programming languages, the difficulties of working with a hybrid model, etc. The shift can be gradual: MPI can run atop ParalleX. But we have very few proposals so far for a more advanced programming model.
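Snir’s description of PGAS languages as syntactic sugar atop one-sided communication can be made concrete with a toy, single-process sketch. Nothing here is a real MPI or PGAS API: the `Window` and `GlobalArray` classes, the block distribution, and the list-based “remote memory” are hypothetical stand-ins, and real one-sided operations (such as `MPI_Put` and `MPI_Get`) also require synchronization that this sketch omits.

```python
class Window:
    """Toy stand-in for memory that each rank exposes to the others.
    Not a real MPI window: every 'rank' is just a list in this process."""

    def __init__(self, num_ranks: int, local_size: int):
        self.local_size = local_size
        self.segments = [[0] * local_size for _ in range(num_ranks)]

    def put(self, target_rank: int, offset: int, value) -> None:
        """One-sided write: the target rank takes no action."""
        self.segments[target_rank][offset] = value

    def get(self, target_rank: int, offset: int):
        """One-sided read: the target rank takes no action."""
        return self.segments[target_rank][offset]


class GlobalArray:
    """PGAS-style view: one global index space, block-distributed over ranks.
    Indexing is translated into put/get on the owning rank."""

    def __init__(self, window: Window):
        self.window = window

    def _locate(self, index: int):
        # (owning rank, offset within that rank's segment)
        return divmod(index, self.window.local_size)

    def __setitem__(self, index: int, value) -> None:
        rank, offset = self._locate(index)
        self.window.put(rank, offset, value)

    def __getitem__(self, index: int):
        rank, offset = self._locate(index)
        return self.window.get(rank, offset)


if __name__ == "__main__":
    win = Window(num_ranks=4, local_size=3)
    a = GlobalArray(win)
    a[7] = 42                      # sugar for win.put(2, 1, 42)
    print(a[7], win.get(2, 1))
```

The array syntax hides which rank owns an element, but underneath each access is one `put` or `get` against the owner’s memory, which is the sense in which the language layer is sugar over the communication layer.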