Entries filed under “Featured Stories”

The historical archive of exclusive in-depth articles written by insideHPC’s editorial staff that you’ll find only at insideHPC.com.

Review: Introduction to Concurrency in Programming Languages

Introduction to Concurrency in Programming Languages
by Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen
Chapman and Hall/CRC Press (2009)

ISBN 1420072137

I recently finished reading Introduction to Concurrency in Programming Languages, one of the entries in CRC’s incredibly active Computational Science Series. (“Incredibly active?” Yes: the series homepage lists 7 titles applicable to HPC coming in 2011, and a similar number published in 2010.)

I picked this book out of my large-ish stack of books waiting, mostly patiently, to be reviewed because I’m working on a research project these days that has to do with new models of parallel programming. I figured I’d get a decent grounding in what’s already been done, and why, but I was also concerned I’d get lost in computer science formalism. I was right, and wrong.

The authors come from a nice mix of the theoretical and the applied: Sottile is from the University of Oregon, Mattson is from Intel, and Rasmussen is from Los Alamos National Lab. All are active in supercomputing; you may recognize Mattson from his work on OpenMP or his book adapting the patterns concept to parallel programming. They set out for themselves as a goal “the motivation and definition of language level constructs for concurrency that have been designed, studied, and found their way into languages that we use today.”

Don’t let that turn you off. The authors don’t assume you are fresh out of a computer science languages course. They go to some pain to create a gentle slope back into the languages pool, with many explanations by way of analogy. “Language level constructs,” they explain, are things like loops that let the programmer express the concept of a loop without forcing her to manage the program counter explicitly. By extension, a language-level construct for parallelism might be a for loop that executes in parallel. Of course, in theory, the benefit of this approach is that the compiler is free to pick the implementation that works best and the programmer gets to worry about higher-level tasks, like whether the science is right.
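To make the loop analogy concrete, here is a minimal sketch in Python (not taken from the book). Python has no parallel for loop in the language itself, so the standard library’s ThreadPoolExecutor stands in for one here; the names square, sequential_squares, and parallel_squares are illustrative assumptions. The point is the shape of the code rather than its speed: in both versions the programmer states what each iteration does and leaves the scheduling to something else.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Loop body: each call is independent of every other iteration."""
    return x * x


def sequential_squares(values):
    # An ordinary for loop: the language manages the iteration,
    # so the programmer never touches a program counter.
    results = []
    for v in values:
        results.append(square(v))
    return results


def parallel_squares(values, workers=4):
    # Stand-in for a "parallel for" (an assumption, not a language construct):
    # the executor decides how iterations are spread across worker threads,
    # and map() still yields results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


if __name__ == "__main__":
    data = list(range(8))
    print(sequential_squares(data))
    print(parallel_squares(data))
```

Because the loop body has no dependence between iterations, both functions return the same list. That independence is what a compiler would need to establish before it could parallelize such a loop on the programmer’s behalf.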

In practice, however, we’ve never really gotten much further than first base with this approach. The literature is littered with attempts to separate specification from implementation in HPC that worked fine for some subset of special cases, but never really panned out in the general case (for example, HPF). But there is also quite a large body of advances that have made it into general use, and the authors cover those in this text in the hope that the way forward is in understanding why these worked, and building upon them.

What won’t you find?

This is not a book about OpenMP, MPI, and the other libraries and language tools that extend or augment traditional sequential programming languages like C and Fortran so that programmers can develop code that executes concurrently. These approaches are discussed, but only to set them apart from the true focus of the book: languages that include concurrency in fundamental operators as part of the language.

I’ve already mentioned one alluring, if difficult to attain, advantage of including concurrent constructs in the language: the programmer can raise his focus from the implementation details and focus on correctness. This can be, as I’ve already discussed, a somewhat suspect motivation for further work given the dismal history of the practice. Happily there are more convincing reasons to care. Whenever the compiler encounters a library call, say to MPI_Send, it has to assume that no optimizations are possible across that call. No code reordering, no optimizations on the calling side of the function to help the function execute more effectively, no elimination of variables that are never used again, and no optimization across processors to create more efficient code (for example, by coalescing many small messages on the programmer’s behalf). Promotion of concurrent constructs to be a member of the language itself, as the authors explain, puts all of this back into play, and puts the compiler back to work on behalf of the programmer.
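The message coalescing mentioned above can be sketched by hand, which is roughly what programmers do today when the compiler cannot see through a library call. This is a minimal illustration under stated assumptions: send is a hypothetical stand-in for a message-passing call (it is not the MPI API), and send_individually, send_coalesced, and record_messages are names invented for this example.

```python
def send_individually(values, send):
    """Issue one send() per value: many small messages."""
    for v in values:
        send([v])


def send_coalesced(values, send, batch_size=4):
    """Group values into batches so fewer, larger messages go out."""
    for start in range(0, len(values), batch_size):
        send(values[start:start + batch_size])


def record_messages(sender, values, **kwargs):
    """Run a sender against a recording stub and return the messages it sent."""
    sent = []
    # sent.append plays the role of the hypothetical send() call.
    sender(values, sent.append, **kwargs)
    return sent


if __name__ == "__main__":
    data = list(range(10))
    print(len(record_messages(send_individually, data)))
    print(len(record_messages(send_coalesced, data, batch_size=4)))
```

Both senders deliver the same values in the same order; only the number of messages differs. A compiler that understood the communication construct could, in principle, make this transformation itself.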

This seems like a fairly small step that could have large payoffs in programmer productivity, and to my mind makes a convincing case for pursuing this work, no matter how jaded you are by previous grand plans to give over implementation to all-knowing compilers.

The lay of the text

As I’ve already mentioned, a clear focus in this book is on keeping the material accessible. The authors succeed at this brilliantly. I came to this book with the equivalent of a minor in computer science finished almost two decades ago, and almost no memory of language theory (other than the word “automata”, which I always liked as a word). That was enough grounding to enable me to easily keep up with the authors and still come away from the book with a deeper understanding of the concurrent world around me.

The text opens with a few chapters of introductory material on the core concepts in parallelism and concurrency. Then they move on in chapter 3 to concurrency control mechanisms, discussing the merits and demerits of techniques such as synchronization, locks, monitors, and so on. In chapter 4, “The State of the Art,” the authors cover libraries (and their limitations), along with message passing, explicitly controlled threads, and more advanced techniques such as transactional memory. Chapter 5 lays the groundwork for discussing and assessing the effectiveness of high-level language constructs for concurrency, including a nice subsection on cognitive dimensions which I found quite helpful.

Chapter 6 introduces the historical context (dataflow or ALGOL, anyone?), and chapter 7 pushes forward to modern day approaches with work on array notation, co-arrays, functional languages, and more. Chapter 8 addresses performance considerations.

Chapters 9 through 13 look at parallel algorithms and how they present themselves for effective concurrent implementation. After an introduction to parallel algorithms in general, each of the remaining chapters looks at a specific pattern of parallel processing (remember, Mattson is one of the authors of this book) and studies its implementation from several different perspectives. These chapters are quite helpful, and one can imagine them serving as focal points for practitioners expanding their knowledge or for students finishing out a semester with a big project.

The book has three appendices that provide additional material on three very different approaches to programming in parallel today: OpenMP, Erlang, and Cilk.

The last word

Ok, so you (probably) won’t be cracking this book open in front of a crackling fire at a ski lodge this winter. At least not if you want to go home with someone you didn’t know at the start of the trip. But if you are just jumping into the world of concurrent programming, or taking a more theoretical look at the approaches we’ve all been taking for granted for the past 20 years in an attempt to make things better, then this book is a great start.

The authors present a clear motivation for the relevance of continuing this work, and provide both the historical context and knowledge of present day practice that you’ll need to get off on the right foot. That they manage to do this while keeping the language clear and the text accessible is a tribute to the effort Sottile, Mattson, and Rasmussen put into the creation of the text.

Whamcloud aims to make sure Lustre has a future in HPC

Brent Gorda

insideHPC had a chance this week to sit down with the executives of the newly minted Whamcloud, Brent Gorda [CEO] and Eric Barton [CTO]. Many of you probably know Brent from his work within the US Department of Energy supercomputing circles. He’s also very active in organizing various technical and community events for the IEEE/ACM Supercomputing conference series. Eric Barton brings 25 years of development experience in supercomputing to the Whamcloud team. He has been working on Lustre since he was brought in to stabilize its network stack when the project first received DOE funding. Most recently he was a Principal Engineer at Sun/Oracle where he served as Chief Architect of the Lustre group.

As you may know, Whamcloud’s business model is centered on the Lustre parallel file system. But what exactly does this mean? Lustre is an open source project, managed and held by the Oracle Corporation via their acquisition of Sun Microsystems. Given that Oracle’s core business isn’t dependent upon Lustre, many folks with large-scale Lustre deployments have been worried about the progression of the code base. We wanted to dig a little deeper and find out exactly what Whamcloud is up to with respect to our little friend Lustre.

During the interview, Brent Gorda summed up their intentions best: “Reduce the complexity and increase the community.”  Whamcloud intends to pour their own efforts into developing, hardening and improving what has become a real asset to the high performance computing community.  They plan on doing so via code contributions to the root Lustre source tree.  Unlike many other open source efforts that have become commercial products, they will not fork the source tree for their own endeavors.  This is extremely important in building and maintaining their idea of community: Lustre is everyone’s Lustre.

Eric Barton

So how does this affect their view of development? I asked Eric Barton what their three top goals were with respect to development. First, he said that Whamcloud is committed to working to improve the quality and stability of the code. Without a stable code base to work from, scalability is simply a pipe dream. This also implies de-prioritizing several of the features requested for the initial Lustre 2.0 release.

The second major development goal is to begin preparing for exascale deployments. This one really threw me for a loop. However, Eric is very grounded in his thinking when he explains why. Given that they want to always maintain the quality and stability of the file system, they need to begin to think intelligently about how to address systems with hundreds of thousands of nodes in the future. They want to ensure that these features make it into the code base gracefully, as opposed to dropping the features in the community’s cage all at once. Finally, he wants to make sure that the proper health and monitoring features gracefully make it into the source. Exascale means nothing if the platform can’t be kept stable long enough to run an application. A healthy system is a happy system.

So where is Oracle in all of this?  Brent and Eric were very adamant that they do not intend to directly compete with Oracle.  Oracle, via their inherited Sun support contracts, receives revenue based on the service and support of the Lustre file system.  They both indicated that Whamcloud will carefully manage its relationship and impact on Oracle. Whamcloud’s focus is Lustre on Linux for HPC — particularly the high end — whereas Oracle is more focused on commercial deployments. Whamcloud would rather be good stewards of the community and garner revenue through non-recurring engineering.

All in all, Whamcloud seems to be off to a raging start. They’re growing on a daily basis [up to 10 employees at the time of the interview] and they’ve already had significant interest from partners and potential customers. Lustre, recently a damsel in distress, now has its knight in shining armor in Whamcloud.

Rock Stars of HPC: John Shalf

This series is about the men and women who are changing the way the HPC community develops, deploys, and operates the supercomputers we build on behalf of scientists and engineers around the world. John Shalf, this month’s HPC Rock Star, leads the Advanced Technology Group for Lawrence Berkeley National Lab, has authored more than 60 publications in the field of software frameworks and HPC technology, and has been recognized with three best papers and one R&D 100 award.

Among the works he has co-authored are the influential “View from Berkeley” (led by David Patterson and others), the DOE Exascale Steering Committee report, and the DARPA IPTO Extreme Scale Software Challenges report that sets DARPA’s information technology research investment strategy for the next decade.

He also leads the LBNL/NERSC Green Flash project, which is developing a novel HPC system design (hardware and software) for kilometer-scale global climate modeling that is hundreds of times more energy efficient than conventional approaches, and participates in a large number of other activities that range from the DOE Exascale Steering Committee to Program Committee Chair for the SC2010 Disruptive Technologies exhibit.

Shalf’s energy and dedication to HPC are helping to actively shape the future of HPC, and that’s what makes him this month’s HPC Rock Star.

insideHPC: How did you get started in HPC?

John Shalf: I spent a lot of time as a kid hanging out in the physics department and computing center at Randolph-Macon College (RMC) in Ashland, where I grew up. The professors there gave me (and other neighborhood kids) accounts on their IBM mainframe and Perkin-Elmer UNIX minicomputer, and access to the supply rooms behind the classrooms, where there were hundreds of computing technology artifacts such as 3D stacked core memories from old IBM systems and adders constructed using vacuum tube logic. My friends and I spent a lot of time in the summers and after school in 4th through 6th grade exploring the back rooms and having the professors patiently explain what we were looking at and how it worked. We also got our first taste of the UNIX operating system and CRT terminals, although we learned more about playing Venture (a text video game) than programming.

When I was about 11, Dr. Maddry offered to teach me how to build a computer in exchange for my help cleaning up his lab during the summer. We actually had a race where he built a computer using a Z80 chip, and I built my computer using an 8080a. Both of our computers had 128 bytes (yes bytes… not kilobytes) of memory, ran at 500 kHz (could run faster if you turned off the fluorescent lights in the room), and were programmed using a set of DIP switches on the board. I still remember the 8212 tristate latches and the TTL discrete logic chips we needed to glue everything together. I had a blast building it, and just as much fun programming it, despite the rudimentary nature of the user interface and the low resolution of the display system (12 LEDs lined up in a row to show the data and the memory addresses). After that, I became hooked on computer architecture and machine design.

In college, I took my first HPC course, where we got accounts on the HPC systems (Cray vector and IBM) at the NSF supercomputing centers, and became interested in parallel computation. I was particularly fascinated with Thinking Machines systems, but also learned a lot about dataflow computing. Around this time, I also collected many old machines through surplus auctions to learn how they worked. I had quite a collection of PDP-8s and PDP-11s, and started the Society for the Preservation of Archaic Machines (SPAM). The chemistry department maintained many PDPs for their experiments, so they became a resource for manuals, circuit diagrams, advice on machine repair, and a FORTH interpreter that ran on top of RSTS.

During this time, I also discovered Ron Kriz’s vislab, where I developed an interest in computer graphics and visualization as another way to interact with the HPC community. Whereas I had been connected to computing only through my study of computer architecture and programming, the vislab and working on programming / optimization of material science codes for the Engineering Science and Mechanics (ESM) department opened me up to direct collaboration with science groups. It was there that I learned that the interdisciplinary collaborations in HPC are where the rubber hits the road. That the pursuit of answers to scientific grand challenges requires such broad-based collaborations is what makes “supercomputing” so exciting.

insideHPC: What would you call out as one or two of the high points of your career — some of the things of which you are most proud?

Shalf: My first real job in HPC was at NCSA, where I divided my time between NCSA’s HPC consulting group (led by John Towns), Ed Seidel’s General Relativity Group, and Mike Norman’s Laboratory for Computational Astrophysics. These were the golden years for NCSA and the NSF HPC Centers program as well. NCSA Mosaic was just getting popular. I got to work on HPC codes on a variety of platforms. The LCA was developing its first AMR codes (Enzo). I got to learn how to work on virtual reality programs in the CAVE, and participated in national-scale high-performance networking test beds for the SC1995 IWAY experiment. There was such a wide variety of computer architectures: a Cray YMP, a Convex C3880, and a Thinking Machines CM5. What an amazing time!

It was also a time of great transition because it was clear that our vector machines were going to be turned off eventually and replaced by clusters of SMPs (SGIs and Convex Exemplars initially, followed by clusters). It’s very similar to what is happening to the HPC community today as we transition to multicore. It was an exciting time to start in HPC. There were new languages like HPF, messaging libraries like PVM and P4, and MPI. It was unclear what path to take to re-develop codes for these emerging platforms, so we tried all of the options using toy codes. Everyone was busily creating practice codes to try out each of these emerging alternatives for re-developing their entire code base to survive this massive transition of the hardware/software ecosystem.

The first few implementations of the parallel codes worked, but revealed serious impediments to future/collaborative code development. When Ed Seidel’s group moved to the Max Planck Institute in Potsdam, Germany, Paul Walker and Juan Masso hatched a plan to create a new code infrastructure, called Cactus, to combine what we’d learned about how to parallelize the application efficiently and hide the MPI code from the application developers with clever software engineering to support collaborative/multidisciplinary code development. Cactus was so titled by Paul because it was to “solve thorny problems in General Relativity”. I had a huge amount of fun developing components for the first versions of Cactus, which is still used today (www.cactuscode.org). We had a huge sense of purpose and dedication to the development of Cactus infrastructure: creating advanced I/O methods, solver plug-ins, remote steering/visualization interfaces, etc. I continued to work with subsequent Cactus developers (Gabrielle Allen, Tom Goodale, Erik Schnetter, and many others) for many years after leaving Max Planck to extend it for Grid computing and new computing systems.

One of the first things the group did when I came to LBNL was to run the “Big Splash” calculation, of inspiraling colliding black holes, on the NERSC “Seaborg” system. The calculation was ground-breaking in that it disproved a long-held model for initial conditions for these inspiraling mergers, and its demonstration of what you could do with large scale computing resources ultimately spawned the DOE “INCITE” program. The work with the Cactus team is one of the highlights of my career, even though there was a cast of hundreds contributing to its success.

The Green Flash project is also one of the projects that has been a lot of fun. Like Cactus, there are a large number of people working on different aspects of this multi-faceted project. I definitely love this kind of broad interdisciplinary work. We get to re-imagine computing architecture, programming models, and application design for massively parallel chip architectures that we anticipate will be the norm by 2018. Our multi-disciplinary team is on the forefront of applying co-design processes to the development of efficient computing systems for the future. There are a lot of similarities between the move towards manycore/power-constrained architectures and the massive disruptions that occurred at the start of my career when everyone was moving from vectors to MPPs. It is exciting to have such an open slate for exploration, and a time for radical concepts in computer architecture to be reconsidered.

insideHPC: What do you see as the single biggest challenge we face (the HPC community) over the next 5-10 years?

Shalf: The move to exascale computing is the most daunting challenge that the community faces over the next decade.  If we do not come up with novel solutions, then we will have to contend with a future where we must maintain our pace of scientific discovery without future improvements in computing capability.

The exascale program is not just about “exaFLOPS,” it’s about the phase transition of our entire computing industry that affects everything from cell phones to supercomputers. This is as big a deal as the conversion from vectors to MPI two decades ago. We cannot lose sight of the global nature of this disruption: it is not just about HPC. DARPA’s UHPC program strikes the right tone here. We need that next 1000x improvement for devices of all scales. Until recently we have been limited by costs and chip lithography (how many transistors we could cram onto a chip), but now hardware is constrained by power, software is constrained by programmability, and science is squeezed in between. Even if we solve those daunting challenges, science may yet be limited by our ability to assimilate results and even validate those results.

I think there is a huge problem with us conflating success in “exascale” with the idea that the best science must consume an entire exascale computing system (the same is true, to some extent, of our obsession with scale for “petascale”). The best science comes in all shapes and sizes. The investment profile should be more balanced towards scientific impact (scientific merit, whether it is measured in papers or US competitiveness). There is a role for stunts to pave the way to understand how to navigate the path to the next several orders of magnitude of scaling. But the focus should definitely be more on creating a better computing environment for everyone: more programmable, better performing, and more robust.

We do have a tendency to say that the solution to all of our programmability problems is just finding the right programming model.  This puts too much burden on language designers and underplays the role of basic software engineering for creating effective software development environments.  Dan Reed once said that our current software practices are “pre-industrial,” where new HPC applications developers join the equivalent of a “guild” to learn how to program a particular kind of application.  Languages and hardware play a role (just as the steam engine played a role in the start of the industrial revolution), but software engineering and good code structures that clearly separate the roles of CS experts from domain scientists (frameworks like Cactus, Chombo, and Vorpal) and algorithm designers are also critical areas that often get under-appreciated in the development of future apps.

insideHPC: How do you keep up with what’s going on in the community and what do you use as your own “HPC Crystal Ball?”

Shalf: For hardware design and computer science, attending many meetings to interact with the community plays an essential role in gauging the zeitgeist of the community.  Given the huge amount of conflicting information, you need to talk to a lot of people to get a more statistical view of what technology paths are actually practical and what is just wishful thinking.  Getting someone to talk over a beer is always more insightful for the “HPC Crystal Ball” than simply accepting their PowerPoint presentation or paper at face value.  You have to constantly look at what other people are doing.

I’ve always enjoyed the SIAM PP (SIAM Conference on Parallel Processing for Scientific Computing) and SIAM CSE (SIAM Conference on Computational Science and Engineering) meetings as a great source for seeing ideas that are still “in progress.” Normally, conferences have a strict vetting process for papers. The presented work is usually thoroughly vetted and mostly complete. There is little opportunity to drastically change the direction of such work. However, the SIAM meetings support people getting together through mini-symposiums to discuss work that is still in progress and, in some cases, not fully baked. This is where there is a real exchange of wild ideas and new ways of thinking about solving problems. I think there is a role for both types of meetings, but I definitely see more of the pulse of the community in the SIAM mini-symposiums.

I also find that journals that are targeted more at domain scientists have a lot of information about future directions of the community. You quickly find out what is important and why.  More importantly, you learn the vocabulary to actually communicate with scientists about their work.

insideHPC: What motivates you in your professional career?

Shalf: Scientists like to do things because they are interesting.  Engineers like to do things that are “useful”.  I’m an engineer who likes to hang out with scientists to get a bit of both the “interesting” and the “useful.”  If I can do things that are both interesting AND useful, I’m very happy.

There is a recent article in Science Magazine (Vol 329, July 16, 2010) entitled “learning pays off.” It showed research that people who went into science because they were excited by the science, and not simply because they were good at math, were the most likely to continue in the field.  This makes total sense to me.  I’m just a science geek.  I’m not a scientist or physicist by training, but I love to read Science and Nature magazine from cover to cover whenever a new one arrives.  I just love to learn new things and explore.  Supercomputing is a veritable smorgasbord of ideas and different science groups.  The deeper I dive into my professional career, the more I learn and the more people I meet who have radically different perspectives on computing and in science.  It’s so much fun to learn something new every day.

It’s also fun trying to be the man-in-the-middle to communicate between people with disparate backgrounds.  Because of my diverse interests, my career has run the gamut from Electrical Engineering and computer hardware design, to code development for a scientific applications team, to computer science, and then back again to hardware design.  I remember the perspective I had when I was in each of those different roles (when in EE, I thought the scientists were all just bad programmers, and when working for the apps group, I thought the hardware architects were just idiots who would not listen to the needs of the application developers).  All of the interesting things happening in supercomputing are happening in the communication between these fields, and I love to be there, right in the middle.  This is why co-design has become such a popular term: it’s where all of the action is today.

insideHPC: Are there any people who have been an influence on you during your years in this community?

Shalf: Many, many people. Nick Liberante, an English professor with uncompromising standards for excellence, taught me how to organize thoughts for writing, and the importance of memorization to facilitate that organization process. Ron Kriz taught me the value of persistence, collaboration across multiple disciplines, and to be undaunted by the challenges of new and rapidly evolving technologies. Ed Seidel has had a huge influence on my career by launching me into the HPC business and teaching me how far you can push yourself if you set seemingly unrealistic stretch goals. Ed and Larry Smarr, Maxine Brown, and Tom Defanti showed the power of demonstrating that the “seemingly impossible” is within our grasp through ambitious projects like the SC95 IWAY. Donna Cox taught me the magic that can result from bringing both scientists and artists together (seemingly disparate groups) to create powerful communication media. Tom Defanti taught me the importance of articulating what I want to do (either by writing, or presenting to others) by saying “It’s not a waste of time if you have the right attitude. You are writing the future.” He also showed me how we can reinvent ourselves to take on new challenges as he went from CAVE VR display environments to high performance international optical networking.

insideHPC: What type of ‘volunteer’ activities are you involved in, both professional activities within the community and personal volunteer activities?

Shalf: I would say I’ve gotten way overcommitted to SC-related volunteer activities. In the past, I’ve spent some time helping with the LBL summer high-school students program. This year, I’ve gotten completely immersed in participating in the program committees and organization of HPC-related conferences. I’m on the program committees for IPDPS, ISC, ICS, and SC. It’s fun to participate in the organization and planning of so many different conferences, but it’s a lot of work. I would like to get back to working with the high school and undergraduate students to get them excited about this field.

insideHPC: How can we both attract the next generation of HPC professionals into the community, and provide them with the experience-based training that they will need to be successful?

Shalf: Well, first we should call it “supercomputing” rather than HPC if we want to attract new talent.  It sounds interesting when a high-school kid says they want to work on supercomputers.  If they say they want to work on High Performance Computing, they’ll have their underwear pulled up around their ears by the class bully in no time.

I ended up in this field because of the patience of a few physics professors at RMC when I was growing up.  There is no degree in supercomputing (or HPC) because the field is fundamentally interdisciplinary.  So you have to catch kids early to get them excited about the breadth of experiences that supercomputing can offer.

Closing Comments from John Shalf

We are back in a transition phase for our entire hardware/software ecosystem that is much like the transition we made to MPI.  Times of disruption are also great times of opportunity for getting new ideas put into practice. The world is wide open with possibilities. It’s a great time to be involved in computing research.

Q&A with John Shalf, chair of the Disruptive Technology showcase at SC10

I’ve mentioned the Disruptive Technologies event at SC10 a few times recently, and I thought it might be helpful for you guys and gals if we dug in and explored the event, its background, and what it’s all about in a little more depth. SC10 and John Shalf, from Lawrence Berkeley National Laboratory and the chair for Disruptive Technologies at this year’s show, sat down with insideHPC over email to talk about the plans for this year’s event.


insideHPC: What is a disruptive technology in the context of this event at SC?

John Shalf: “Disruptive technology” refers to drastic innovations in current practices such that they have the potential to completely transform the high-performance computing field as it currently exists — ultimately overtaking the incumbent technologies or software tools in the marketplace. Disruptive Technologies, which has taken place as part of SC since 2006, examines new computing architectures and interfaces that will significantly impact the high-performance computing field throughout the next five to 15 years, but have not yet emerged in current systems.

insideHPC: Why is it relevant to the SC conference series?

Shalf: We examine Disruptive Technologies to bring awareness of those ideas that may be coming down the road so that feedback can be provided at critical stages resulting in the idea’s maximum value when the technology is adopted. An example of a disruptive technology is commodity cluster computing. Over time, it has transformed the hardware and software ecosystem for high performance computing.

The focus of the SC conference can often skew towards near-term considerations. However, the most exciting research ideas and boldest departures from business as usual come from groups that are looking at ideas that are at least a decade out. Disruptive Technologies creates a venue to bring early stage technologies out into the open to challenge our pre-conceived notions on how things are done, to give us the breadth of options and possibilities and foster discussion about the future of computing.

insideHPC: Can you tell us a little about the history of Disruptive Technologies as a formal part of the SC program?

Shalf: The Disruptive Technologies exhibit originated within the SC06 conference “Exotic Technologies” thrust area. A technology may start out as exotic, niche or situational, but when it provides a new and alternative option for HPC that others need to acknowledge then we have a disruption.

The exotic technologies exhibit was crafted as a venue to discuss ideas and technologies that are 10-15 years out into the future – too far out to be in a product. This is indeed where some of the most interesting ideas and boldest thinking in HPC technology are taking place. Many of the exhibits on the SC show floor have increasingly focused on the near term issues and existing product roadmaps. However, we know that the pathway for going from research to a product can be very long. Some of the most interesting ideas and technologies in development are not even on the product roadmap. In 2007, the exhibit’s name changed to “Disruptive Technologies” in homage to Clayton Christensen’s 1997 book, The Innovator’s Dilemma, where the term was first coined. The exhibit today continues to serve as a forum for examining technologies that may significantly reshape the world of high performance computing.

insideHPC: Is there a particular focus or theme that you are interested in showcasing this year?

Shalf: This year we want to highlight technologies, which can be hardware, software, communications, power or thermal management, that will be enabling exascale computing. By all accounts, the path to exascale computing will require many highly disruptive technology phase transitions. Yet there are many enormous hurdles that remain to be solved over the next decade to overcome power, performance, and costs to get to a practical exascale computing platform by 2018. Therefore, any technology that is able to overcome these hurdles will be “disruptive” by definition.

insideHPC: What technologies have been showcased in the past that were especially interesting or, with the benefit of hindsight, especially prescient?

Shalf: The early discussions of 3D chip stacking, low power non-volatile memory, and silicon photonics technology were particularly interesting in retrospect. At the time, they seemed like exotic packaging technologies, and their importance was not broadly recognized. However, after 3 years of deep investigation of hardware constraints, and a more complete understanding of the technology options required to get to exascale, these technologies have emerged as being on the critical path to overcome many of the challenges for exascale computing.

insideHPC: Is there typically good international participation?

Shalf: We are very much looking to increase international participation. Just as the SC show has grown in size and in its international scope, we hope that the Disruptive Technologies program will also attract greater international attention.

insideHPC: When is the showcase open? Where can attendees find it?

Shalf: The Disruptive Technologies exhibit will have a highly visible location on the exhibit floor, so it will open and close with the SC10 exhibits. In addition, we will have a panel on Friday on the DARPA Ubiquitous High Performance Computing (UHPC) program, which is designed to foster new innovative projects to develop radically new computer systems that overcome the challenges of efficiency, dependability, and programmability anticipated in the exascale era. The panelists, who lead the selected UHPC teams, will discuss their perspectives on how to address these challenges and the comprehensive hardware/software strategies they developed for the UHPC program.

insideHPC: How can interested folks be a part of this discussion?

Shalf: Companies or organizations that wish to participate in the Disruptive Technologies exhibit should apply at the submission website, https://submissions.supercomputing.org/, by August 5, 2010.

The submission form is very straightforward. We just need a title, contact information, a 250-word abstract describing your technology and why it is disruptive, and 50 words to describe what you need to show off your invention (e.g. power hookups, projectors, how much table space in our booth). Finally, you can upload any supporting documentation describing your invention (extra information for the committee to evaluate the disruptiveness of your technology).

Anyone considering applying but who has questions can get in touch with us at disruptive-techs@info.supercomputing.org, and we will be happy to answer any questions about the exhibit or your submissions.

Also posted in Events, SC10, SC10 Feature Stories | Leave a comment

Industry experts form new Lustre startup

Following the official acquisition of Sun Microsystems by Oracle Corporation, there have been quite a few HPC industry pundits debating the eventual fate of the famed parallel file system Lustre.  Lustre made its name by anchoring super-scale computational centers such as Oak Ridge National Lab.  Considering Oracle’s core business model does not rely on technologies such as Lustre, the many folks who depend on Lustre for their high performance parallel file system have question marks beside support and continued development. Well, the skies have cleared: let’s give a round of applause to Whamcloud.

What’s Whamcloud? Whamcloud is a new venture-backed startup that emerged from stealth mode this morning dedicated to filling the gap for future Lustre development and support.  Their business model is clear, concise and quite refreshing from a startup company in HPC.  As a company, they have three goals:

  1. Whamcloud will combine the world’s leading HPC and storage talent to evolve the state of parallel storage with a strategic focus on the most scalable applications, specifically high performance and cloud computing
  2. Whamcloud will contribute and evolve open source file storage technologies, including the Lustre file system, upon an open-source Linux foundation using Linux storage technology
  3. Whamcloud will focus on enabling open source Lustre storage technology in the industry by opening up file system support to the whole industry, with a hardware-agnostic storage certification and support program

So why the enthusiasm? Whamcloud has assembled a serious team of industry experts.  Not the kind with the typical “CEO of Foo” resumes.  These experts are real HPC gurus.  So who’s lurking in the halls of Whamcloud?  Brent Gorda will hold the title of CEO.  Those of you familiar with the Department of Energy know that Brent has been around big HPC for quite some time.  He’s also a former contributor to the Supercomputing Cluster Challenge.  Eric Barton, CTO, was most recently a Principal Engineer at Sun/Oracle and Chief Architect with the Lustre group.  Robert Read, Whamcloud’s Principal Engineer, was also formerly at Sun/Oracle leading the charge for Lustre 2.0 development.

What’s not to like? You have two of the leading visionaries behind recent development efforts in Lustre and one of the thought leaders in Lustre implementation and operations.

“There is tremendous demand for leadership from a professional engineering organization that is focused on evolving Lustre for the next 10 years of HPC and cloud storage,” said Brent Gorda, Whamcloud CEO. “History has proven that hardware-oriented purchases of open-platform file storage technologies are disruptive to the growth of scale-out storage technology. First and foremost, Whamcloud will ensure broad and continued international adoption of these technologies through a hardware-agnostic customer approach, across a broad array of data-hungry markets.”

Folks, this is one to keep an eye on.  Lustre is and will continue to be a vital piece of the HPC puzzle.  As larger systems and scalable applications begin to become the norm in HPC, the pressures of I/O and storage will continue to increase.  Whamcloud is well positioned to take Lustre to the next stage of scalability and performance.

Also posted in Business of HPC, Enterprise HPC, HPC, HPC Software | 1 Comment

Vuduc wins NSF CAREER Award to make HPC better “by any means necessary”

In early June the NSF announced that Georgia Tech’s Richard Vuduc received an NSF CAREER Award for his work in tuning software to run on parallel systems. From the NSF website:

The Faculty Early Career Development (CAREER) Program is a Foundation-wide activity that offers the National Science Foundation’s most prestigious awards in support of junior faculty who exemplify the role of teacher-scholars through outstanding research, excellent education and the integration of education and research within the context of the mission of their organizations. Such activities should build a firm foundation for a lifetime of leadership in integrating education and research.

The name of his proposal, “Autotuning foundations for exascale systems”, attracted my attention and Rich agreed to tell us a little about himself, his work, and this prestigious award.


insideHPC: First, can you tell the readers a little about yourself? What’s the 100 word bio of Rich Vuduc?

Rich Vuduc: I am an assistant professor at Georgia Tech in the School of Computational Science and Engineering, which is (Shameless Plug Alert) one of the country’s few full-fledged academic departments devoted to the systematic study, creation, and application of computer-based models to understand and analyze natural and engineered systems. HPC is a major research and teaching focus in this kind of department, because computational scientists often care a great deal about effective use of parallelism in large systems. My research lab, The HPC Garage, is looking at automating and simplifying the analysis, programming, tuning, and debugging of software for emerging and future parallel machines.

On a more personal note, I am Vietnamese-American and my favorite TV show is “The Wire.” For TV skeptics, The Wire is proof that a TV series can be great art!

insideHPC: Looking at your web pages, it seems like you are, well, more fun than most of the profs I remember. “HPC Garage” for example. Is that a conscious effort on your part to engage more creative people, or just a natural extension of your personality?

Vuduc: Thanks, though I don’t know if “more fun” necessarily means “better research and teaching.”

I went to grad school and did my postdoc in the Bay Area, and am greatly inspired by the famous Hewlett-Packard Garage—so, too, is my lab a small team of creative hands-on tinkerers with limited resources and big dreams of building better, well, instruments and “calculators” for scientific advancement.

insideHPC: Your research area is in tools for getting better performance out of high end systems by software methods rather than human intervention. Can you generally describe your work in this area? Is any of it part of a library readers may be using? How does it fit in the context of other efforts, like ATLAS?

Vuduc: Yes, our goal is to simplify the process of achieving truly high performance, “by any means necessary,” if I may pay small tribute to my radical Berkeley roots. Accomplishing this goal might mean giving parallel programmers an auto-magic toaster that makes slow code fast. However, I would also be happy with more modest achievements, like distilling useful new performance principles or practices; making productive programming models fast; or providing more insight into what architectures work for particular interesting and important classes of applications, and why.

People who recognize my name probably know it from my early work in the area of autotuning on a library called OSKI, the Optimized Sparse Kernel Interface, which was developed while I was a graduate student “bebopper” in Jim Demmel’s and Kathy Yelick’s BeBOP group at Berkeley. (OSKI is also the Cal mascot. Go Bears!) OSKI is like Clint Whaley’s well-known ATLAS library, but is for sparse matrices rather than dense ones. The methodology is different in the sparse case, where one might not only tune the code, but also change the data structure at run-time, depending on the input matrix. Sam Williams (LBNL) greatly extended the OSKI techniques for multicore, and Jee Choi, one of my students, has some cool extensions for GPUs. As for sequential OSKI, I know Mike Heroux at Sandia has an effort to put wrappers around it for Trilinos.
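Editor's sketch: the idea of changing the data structure at run-time, depending on the input matrix, can be illustrated with a toy block-size selector. This is not OSKI's API. The function names, the `ASSUMED_SPEEDUP` table, and its numbers are all hypothetical, and the heuristic shown (weigh an assumed per-machine blocked-kernel speedup against the explicit zeros that blocking would add) is only one plausible way to make such a choice.

```python
def fill_ratio(coords, r, c):
    """Stored entries (true nonzeros plus padding zeros) per true nonzero
    if the matrix were stored in r x c blocks."""
    blocks = {(i // r, j // c) for i, j in coords}
    return len(blocks) * r * c / len(coords)

# Hypothetical machine profile: assumed relative speed of an r x c blocked
# kernel on a fully dense matrix, compared with the unblocked (1 x 1) kernel.
ASSUMED_SPEEDUP = {(1, 1): 1.0, (2, 2): 1.6, (4, 4): 2.2}

def choose_block_size(coords, profile=ASSUMED_SPEEDUP):
    """Pick the block shape with the best estimated speed for this matrix."""
    def estimated_speed(shape):
        r, c = shape
        # Padding zeros are wasted work, so divide the speedup by the fill.
        return profile[shape] / fill_ratio(coords, r, c)
    return max(profile, key=estimated_speed)

# An 8 x 8 matrix made of dense 2 x 2 blocks along the diagonal.
blocky = [(i, j) for b in range(0, 8, 2) for i in (b, b + 1) for j in (b, b + 1)]
# An 8 x 8 matrix with nonzeros only on the diagonal.
diagonal = [(i, i) for i in range(8)]

print(choose_block_size(blocky))    # (2, 2): blocking adds no padding here
print(choose_block_size(diagonal))  # (1, 1): blocking would mostly store zeros
```

The same kernel interface could then dispatch to whichever storage format the selector returns, which is the sense in which the data structure, not just the code, is tuned to the input.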

These days, my lab is looking at autotuning techniques for a broader variety of interesting irregular and highly-adaptive computations, both in statistical machine learning (jointly with Alex Gray at GT) and for tree-based n-body problems (jointly with George Biros, also at GT).

insideHPC: Thinking specifically about your CAREER award, could you briefly talk about the award, what it is, what it means for you professionally, and what it means for you personally.

Vuduc: The CAREER award is an angel investment! I am extremely grateful that there are people willing to take a chance on my lab’s work and on my teaching (probably a bigger risk, the latter). Receiving the award means I have both the duty and the privilege to do something impactful.

It’s also a nice nod to my senior faculty mentors at GT, David Bader and Richard Fujimoto. Their efforts and advice have not been lost.

insideHPC: Your proposal is called “Autotuning foundations for exascale systems” — can you talk about the work you plan to do?

Vuduc: In perhaps overly basic terms, we hope to simplify programming and tuning on future exascale systems using autotuning techniques.

The proposal has two major research thrusts, one that explores analytical and statistical performance models to guide tuning, and another that explores tuning in emerging dataflow-like programming models. In both cases, we want methods that work on (a) the kinds of sparse, irregular, adaptive computations that I’ve been studying for some time now and that are a particular challenge to scale; and (b) the kinds of systems we can expect to see at exascale, which I am told will have “absurdly heterogeneous manycore nodes.” Both thrusts build on collaborations with Kath Knobe and C.-K. Luk, both at Intel. If we are successful, we will contribute to a goal that folks like David Bailey (LBNL) and Robert van de Geijn (UT Austin) sometimes refer to as one of developing a “science” of performance programming and engineering. That’s what “foundations” refers to.

Like all CAREER proposals, there is also an integral educational thrust tied to the research. In my case, the gist is to design and implement a year-long lab practicum, called The HPC Garage Practicum, that is a true interdisciplinary team-based competition, aimed at early-stage graduate students. The competition is to develop the most scalable code that answers real-world scientific questions; think of the famed Gordon Bell and X-Prize competitions. The basic inspiration arose in conversations with Pablo Laguna, Deirdre Shoemaker, and George Biros at GT. The approach is in the style of the GT School of CSE’s mission, to train the next generation of computational scientists in interdisciplinary teamwork.

By the way, if any corporations would like to donate prizes for the winning teams in this effort, we are soliciting.

insideHPC: Is this work a “scaling up” of the earlier work you’ve done, or are there specific things that you’ll need to change to address the challenge of running on exascale class machines?

Vuduc: We are scaling up, but not just the platform; we are also working in larger “algorithmic contexts.” I mean that whereas my earlier work focused on relatively compact kernels, my lab these days is looking at autotuning progressively more complex multiple-kernel solvers, with an eye toward large applications. This requires working more closely with domain scientists and compiler people, like my former postdoc mentor, Dan Quinlan at LLNL. The work my students, Aparna Chandramowlishwaran and Aashay Shringarpure, have done for the fast multipole method on multicore- and GPU-based distributed memory systems is a great first example.

insideHPC: How do you go about designing software for a class of machines that not only hasn’t been built, but for which there isn’t even a design consensus yet?

Vuduc: It’s always a difficult problem, but a “classical” approach is to change the program representation, as suggested, for instance, by Jeff Bilmes (UW) and Krste Asanovic (UCB) in their PHiPAC project. In particular, rather than writing a specific program, you write a program generator that can produce many different versions of the program. The generator might encode the generation of entirely different algorithms. Perhaps the most aggressive and successful examples of this approach today are the SPIRAL (Markus Pueschel at CMU) and FLAME (van de Geijn) projects. It’s not easy to do this but is in my view a promising way forward.
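Editor's sketch: the program-generator approach described above can be shown in miniature. The code below is not taken from PHiPAC, SPIRAL, or FLAME. It is a minimal illustration, with hypothetical names (`generate_dot`, `build`, `autotune`), of generating several versions of one kernel (a dot product with different loop-unrolling factors), timing each on the machine at hand, and keeping the fastest.

```python
import timeit

def generate_dot(unroll):
    """Return Python source for a dot product unrolled `unroll` times."""
    terms = " + ".join(f"x[i + {k}] * y[i + {k}]" for k in range(unroll))
    return (
        f"def dot_u{unroll}(x, y):\n"
        f"    n = len(x)\n"
        f"    main = n - n % {unroll}\n"
        f"    total = 0.0\n"
        f"    for i in range(0, main, {unroll}):\n"
        f"        total += {terms}\n"
        f"    for i in range(main, n):\n"  # leftover elements
        f"        total += x[i] * y[i]\n"
        f"    return total\n"
    )

def build(unroll):
    """Compile one generated variant and return it as a callable."""
    namespace = {}
    exec(generate_dot(unroll), namespace)
    return namespace[f"dot_u{unroll}"]

def autotune(x, y, candidates=(1, 2, 4, 8), repeats=3):
    """Time every candidate variant on this input and return the fastest."""
    timings = {}
    for u in candidates:
        kernel = build(u)
        timings[u] = min(timeit.repeat(lambda: kernel(x, y), number=5, repeat=repeats))
    best = min(timings, key=timings.get)
    return best, build(best), timings

x = [float(i) for i in range(1000)]
y = [2.0] * 1000
best_unroll, best_kernel, timings = autotune(x, y)
print("fastest unroll factor on this machine:", best_unroll)
print("result:", best_kernel(x, y))
```

Which variant wins depends on the machine and the interpreter, which is the point: the search, rather than the programmer, decides. A real generator would explore a much larger space, potentially including entirely different algorithms.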

In my CAREER proposal, part of what we plan to do is work with Kath (Knobe at Intel) to use her Concurrent Collections (“CnC”) programming model as a base platform, in part because it embodies the spirit of this approach. More specifically, CnC has a nice way of representing “all possible parallel execution schedules,” from which we could then imagine tuning or searching to find an especially good one for a particular system. Aparna’s IPDPS’10 paper (Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures, Aparna Chandramowlishwaran et al.) — a “best paper” winner, by the way! — shows off some of our early and successful experiences with CnC.

It also seems clear that, in yet another 80s comeback, vectorization is re-emerging in its importance. Think much larger SIMD/SSE units. My student, Cong Hou, is thinking about the problem of autotuning in that context as well.

Also posted in Computing Research, HPC Education and Training, HPC People, HPC Software, Tools | Leave a comment

Cray’s Baker pops out of the oven as company “re-learns” how to make great systems

It’s been nearly a month since Cray took the code name away from Baker and announced its official designation — the XE6 — and made it an “official” product (although it is not yet shipping). This is an important launch for the company that doesn’t have much room for error.

Cray has hundreds of millions of dollars tied up in orders for a product that isn’t scheduled to ship until Q4 of this year, when the final silicon for its new interconnect switch will be complete. Many of those orders are $40+ million deals, with substantial penalties for late delivery. Previous iterations of the XT line have mostly been refinements to earlier designs, but Baker is a significant change of technology with a lot of new system software to go with the updated silicon. Delays are obviously a big part of the risk (some bug that prevents the company from shipping anything), but in some ways an even worse scenario is a product that ships and appears to be working until it is assembled and tested at large scale, leaving customers hanging while the company scrambles to debug in the field.

Just before launch I talked with Barry Bolding, Cray’s veep for scalable systems (the high-end stuff, contrasted with the lower-end CX line), to get a feel from him about where the product is headed and what it means for the company’s bottom line.

Expecting twins

The big news with the XE6 is the interconnect. Earlier generations of Cray’s high-end, AMD-based systems were based on the SeaStar interconnect, a custom interconnect that ran on a chip of its own. That interconnect went through two major generations, through the XT6 line. The XE6 introduces a new interconnect called “Gemini” which is actually an early version of the Aries interconnect being fielded as part of Cray’s Cascade DARPA system (according to Bolding, Cray has now decided that interconnect chips are going to be named after constellations, in case you are curious about these things).

Gemini features two network ports (two, twins, “Gemini”), and supports the same 3D torus topology that the SeaStar has. This means that customers with XT5 and XT6 systems will be able to upgrade their cabinets to XE6s just by replacing the communication chip (but because of backplane limitations XT3 and XT4 customers are out of luck). If you are planning an in-place upgrade, this is the generation for you: Cascade changes the interconnect topology entirely, and at that point your only upgrade options will involve 18-wheelers and a forklift.

Current XT owners will appreciate Gemini for its added resilience, especially for its warm swap capability. If you lose a node you’ll be able to swap it out without rebooting the whole system to rebuild the network — a Good Thing. Gemini also adds a hardware-supported global memory space back into Cray’s product line. Those readers who developed for the T3E once upon a time will be right at home, with one-sided communications that don’t have to go through the OS. Bolding also says that future generations of the interconnect will include hardware support for collective operations, a feature that is becoming increasingly common in the network layer of commodity-built clusters.

According to Bolding, latency is “less than 2 microseconds,” and messaging rates on Gemini are about 100x those of SeaStar with about 150 million messages per second across the chip. Gemini also supports adaptive routing to move traffic away from congested routes, and uses a high radix router with multiple lanes per route between any two connections. This means that a hardware problem can shut down a lane but still leave the channel open.

So how’s it look?

“We already have several multiple-cabinet systems running in-house,” says Bolding when I ask him how confident he is in the new hardware. He goes on to say that while this is a good sign, it is still possible for unexpected things to happen at scale, which is consistent with the cautiously optimistic stance of the rest of Cray’s leadership about the launch.

But Cray’s solid record of innovation and phenomenal fiscal discipline must have built up a lot of goodwill with customers. “What I am most pleased about,” Bolding explains, “is the breadth of customer buy in. There is a rich diversity in both the range of workloads and the size of systems that customers have ordered.” According to Bolding, Cray has booked many mid-sized (10-70 cabinet) systems already, in addition to the high profile, high dollar awards to the likes of the DOE and the DOD. This is important because bugs in these systems sometimes only show up at large scales; shipping smaller systems that work just fine will enable the company to realize some revenue from the new system even if problems do show up in the big boys. “This puts us in a totally different situation financially than we were in in 2008 when ORNL’s 200 cabinet system was the make-or-break deal for us.”

(Re)learning how to build great systems

For Bolding there is more to this diversity than just a little risk management for the 2010 financials. He feels the shift reflects a deeper change in the market’s acceptance of Cray’s technologies. “We’ve gone from 2-3 partners driving our business to having a diverse range of true customers, people who don’t want to help create great technology, but who want to just buy and use it,” he says.

He credits the shift to hard work under the covers on the way Cray gets things done inside the corporate walls. “We designed a good system beginning with the XT3 and XT4, but we listened carefully to those customers, and for the XT5 and XT6 systems we really focused on optimizing and improving both the hardware and software in those systems in response to feedback,” Bolding says. “Cray has added a lot around Six Sigma and internal processes to make great systems; in many ways, as a company we have re-learned how to build great systems.”

As with the previous scalable systems, customers can expect a mid-sized XE6m system to be announced by Q1 2011. Gemini won’t be available right away on this system; however, Bolding explains that because the m’s top out at 6 cabinets, SeaStar is just fine for those systems. Once Gemini is released for the m series, customers will be able to upgrade their XT6m systems to XE6m.

Also posted in Business of HPC, HPC Hardware | 2 Comments

Review: Parallel Processing for Scientific Computing

Parallel Processing for Scientific Computing

edited by Michael A. Heroux, Padma Raghavan, and Horst D. Simon
SIAM (2006)

ISBN 0898716195

I just finished reading Parallel Processing for Scientific Computing, one of the most recent volumes to join SIAM’s Software, Environments, and Tools series of scientific computing books (Jack Dongarra is the editor in chief of the series). Parallel is organized around the themes and problems presented at the Eleventh SIAM Conference on Parallel Processing for Scientific Computing, held in San Francisco in 2004 (as fate would have it, I’m writing this review in a hotel in San Francisco); even though 2004 seems like a long time ago, the editors and contributors took care in the creation of this book, and it remains timely today.

The book includes 20 articles from 91 contributors organized into 4 sections. The authors are a computational who’s who (you are sure to recognize names like Simon, Gropp, Lumsdaine, Snavely, Stevens, Bader, Foster, Bailey, and more), and each section includes a mix of both theoretical and pragmatic articles.

For example the first section, Performance Modeling, Analysis, and Optimization, opens with an article by Jesús Labarta and Judit Gimenez on the changes in structure and implementation that are needed to move performance analysis from an art to a first class science. This article takes a step back and looks at the big picture, but still manages to stay grounded via the authors’ references to their attempts to implement some of their ideas in real software. This is followed up by a survey article that covers much of the state-of-the-practice in architecture-aware scientific computation, written as a collection of mini-articles on specific projects. The section is rounded out by a chapter on specific experiences getting to high performance on an early IBM Blue Gene, and then looks forward with a chapter on application performance modeling for ultra-scale systems.

The entire book follows this structure, with each section featuring a mix of the pragmatic and the theoretical, the strategic and the practical.

Section 2, Parallel Algorithms and Enabling Technologies, covers partitioning and load balancing (with a great section on partitioning in parallel contact/impact applications), non-PDE based computations, adaptive mesh refinement, multigrid, solvers, and fault tolerance. The fault tolerance chapter was of particular interest to me in this section. It is well-written, and a great place to start if you are just beginning to think about one of the main problems facing the practical use of exascale systems in the near future.

Section 3, Tools and Frameworks for Parallel Applications, opens with a well-written survey by William Gropp and Andrew Lumsdaine that would serve as an excellent primer for a scientist wanting to stand up a cluster and get busy using it to run codes. Other articles in this section include a survey of parallel linear algebra software by Eijkhout, Langou, and Dongarra, as well as two chapters that point to a possible future for HPC software development in component-based software systems and frameworks for scientific computing.

Finally the last section, Applications of Parallel Computing, walks through challenging broad categories of HPC applications. The chapters here include a treatment of PDE-constrained optimization, parallel mixed-integer programming, multicomponent simulations, and computational biology, all with an emphasis on parallel aspects.

The text closes with a capstone article by the editors that looks at the challenges and opportunities for computational science.

The last word

I think the editors have done an excellent job of herding a collective view of scientific computing from what would otherwise have been just another collection of articles. Even though the book is four years old now, and even though the conference that inspired it was held six years ago, Parallel remains quite up to date in some aspects of its outline of the state-of-the-art in computing. Even in those areas where it is beginning to show its age (for example, the Blue Gene performance tuning chapter), the book remains an excellent starting point for more research.

The clear, jargon-free writing style makes for an easy read, and the references alone make exploring this text well worth your time. They are often quite complete: I found several chapters with 100 or more citations that readers can explore to develop a fuller understanding of a topic of particular interest. If you are just starting graduate studies in HPC and want to get a broad overview of the many facets of research in our field, then this book is an outstanding starting place. And if you are a seasoned practitioner, I think you’ll find the text provides a valuable point of view on a broad range of topics, with references that should keep you busy well into many sleepless summer nights.

Be sure to check out the other book reviews we’ve done here at insideHPC.

Also posted in Book Review, Computing Research | 1 Comment

Rock Stars of HPC: Marc Snir


This month’s HPC Rock Star is Marc Snir. During his time at IBM, Snir contributed to one of the most successful bespoke HPC architectures of the past decade, the IBM Blue Gene. He was also a major participant in the effort to create the most successful parallel programming interface ever: MPI. In fact Bill Gropp, another key person in that effort, credits Snir with helping to make it all happen, “The MPI standard was the product of many good people, but without Marc, I don’t think we would have succeeded.”

Today Snir is the Michael Faiman and Saburo Muroga Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign, a department he chaired from 2001 to 2007. With a legacy of success in his portfolio, he is perhaps busier today than ever as the Associate Director for Extreme Scale Computing at NCSA, co-PI for the petascale Blue Waters system, and co-director of the Intel and Microsoft funded Universal Parallel Computing Research Center (UPCRC). Trained as a mathematician, Snir is one of the few individuals today shaping both high end supercomputing and the mass adoption of parallel programming.

Marc Snir (Erdos number 2) finished his PhD in 1979 at the Hebrew University of Jerusalem. He is a fellow of the American Association for the Advancement of Science (AAAS), the ACM, and IEEE. In the early 1980s he worked on the NYU Ultracomputer Project. Developed at the Courant Institute of Mathematical Sciences Computer Science Department at NYU, the Ultracomputer was a MIMD, shared memory machine whose design featured N processors, N memories, and an N log N message passing switch between them. The switch would combine requests bound for the same memory address, and the system also included custom VLSI to interleave memory addresses among memory modules to reduce contention.

Following his time at NYU, Snir was at the Hebrew University of Jerusalem from 1982 to 1986, when he joined IBM’s T. J. Watson Research Center. At Watson he led the Scalable Parallel Systems research group that was responsible for major contributions, especially on the software side, to the IBM SP scalable parallel system and the IBM Blue Gene system. From 2007 to 2008 he was director of the Illinois Informatics Institute, and he has over 90 publications that run the gamut from the theoretical aspects of computer science to public computing policy. Microsoft’s Dan Reed has said of Snir that, “Marc has been one of the seminal thought leaders in parallel algorithms, programming models and architectures. He has brought theoretical and practical insights to all three.”

As readers of this series will know, a Rock Star is more than just the sum of accomplishments on a resume. We talked with Dr. Snir by email to get more insight into who he really is, what he thinks is important going forward, and what has made him so influential.


insideHPC: You have a long history of significant contributions to our community, notably including contributions to the development of the SP and Blue Gene. I was familiar with your MPI work, but not the SP and BG work while you were at IBM. Would you talk a little about that?

Marc Snir: The time is the late 80s. The main offering of IBM in HPC was a mainframe plus vector unit: not too competitive. Monty Denneau in Research had a project to build a large scale distributed memory system out of Intel 860 chips called Vulcan. His manager (Barzilai) decided to push this project as the basis for a new IBM product. This required changes in hardware (to use Power chips; ironically, the original 860 board became the first NIC for the new machine) and also required a software plan, as well as a lot of lobbying, product planning, and so on. I got to manage the software team in Research that worked on the first incarnation of this product, developing communication libraries (pre-MPI), the performance visualization tools, a parallel file system, and other aspects of the final system.

There was a lot of work to convince the powers that be in IBM to go this way, because at the time IBM mainframes were still ECL, a lot of joint work with a newly established product group to develop an architecture and a product plan, and to do the first development in Research and transfer the code to development (Kingston and, later, Poughkeepsie). All of this work turned into the IBM SP, and the SP2 followed; it was the first real product and quickly became a strong sales driver for IBM. I continued to lead the software development in Research, where we did the first MPI prototype, created more performance tools, and did work on MPI-IO and various applications.

Blue Gene is a convoluted story (the Wikipedia entry is incorrect; I need to find time to edit it). At the time IBM had two hardware projects. One was developing Cyclops, headed up by Monty Denneau. Cyclops was a massively parallel machine with heavily multithreaded processors and all memory on chip. The other project was to develop a QCD (quantum chromodynamics) machine based on embedded PPC processors under the direction of Alan Gara.

At the time, IBM Research was looking to push highly visible, visionary projects. I proposed to take the hardware that Monty was building, with some modifications, and use it as a system for molecular dynamics simulation (if you wish, an early version of the Anton machine of D. Shaw). IBM announced, with great fanfare, a $100M project to build this system, and called it Blue Gene.

I coordinated the work on BG and directly managed the software development. In the meantime, Al Gara worked to make his system more general (basically, adding a general routing network, rather than nearest neighbor communications) and started discussing this design with Lawrence Livermore National Lab’s Mark Seager. Seager liked it and proposed to fund the development of a prototype. At that point, the previous Blue Gene became Blue Gene C while the system of Al Gara became Blue Gene L (for Light). After a year BGC was discontinued (or, to be more accurate, heavily pared down; Monty has continued his work), and BGL evolved into BGP, and then BGQ. I helped Al Gara with some of his design, in particular with the IO subsystem, and with much of the software, and my team developed the original software both for Blue Gene C and for Blue Gene L.

insideHPC: Looking at your career thus far, do you have a sense that one or two accomplishments were especially significant professionally, either in terms of meeting a significant challenge or really spurring the community in a new direction?

Snir: I have had a fairly varied career. I started by doing more theoretical research. My first serious publication in 1982 is a 55-page journal article in the Journal of Symbolic Logic on Bayesian induction and logic (probabilities over rich languages, testing and randomness). It is still being cited by researchers in logic and philosophy. This is a long-term influence on a very small community with (I believe) deep philosophical implications, but no practical value.

Some of my early theory work has been somewhat ahead of its time, and continues to be cited long after publication. A paper on how to ensure sequential consistency in shared memory systems (Shasha and Snir) has been the basis for significant work on the compilation of shared memory parallel languages. I recently learned that a paper I published in 1985 on probabilistic decision trees is quite relevant to quantum computing; indeed, it has had some recent citations, and I had to re-read it to remember what it was about. While many of my theory publications are best forgotten, some seem to have a long-term value.

My applied research has been (as it should be) much more of a team effort — so whatever credit I take, I share with my partners. Pushing IBM into scalable parallel systems (as we called them, i.e., clusters) was a major achievement. Basically, we needed to conceive a complete hardware and software architecture, and execute with a new product team — essentially work in startup mode. That probably was the most intensive time in my career. Pushing Blue Gene was also quite intense. I probably wrote down half of the MPI standard — that’s another type of challenge: thinking clearly about the articulations of a large standard and convincing a large community to buy into a design. As department head in CS at U of Illinois I faced quite intensive but quite different challenges: growing a top department (from 39 to 55 faculty in 6 years), improving education, and changing the culture. Getting Blue Waters up and running at NCSA (developing the proposal, nailing down the software architecture, pushing for needed collaborations with IBM, etc.) has a similar flavor. I think that I feel the need to push large projects.

I realize that’s more than two, but I like all of them. If I have to pick, I’d pick the IBM SP product, just because it was the most intensive project, and the one that required the most “design from scratch,” with little previous experience. It also was an unqualified success.

insideHPC: If you were to answer that same question about the one or two accomplishments that mean the most to you personally, are they the same?

Snir: Well, I have very successful children and very good friendships. This means a lot, personally. But I must confess that, family and friends aside, professional achievement is what I care about.

insideHPC: You’ve spent time as a manager and department head, and time as an individual contributor. Is there one of those roles that you think fits your personal style or the kind of contributions you want to make? Asked another way, some people add the most value by doing, and others by creating an environment in which other people can do: which fits you best?

Snir: Hard choice. As a manager or department head I have done much more of the latter, creating an environment where other people can do. My individual contributions, especially in theory, are the former. I would say that I prefer the latter; doing is more of a hobby, a way not to lose contact with reality. Getting others to do is a way of achieving much more.

insideHPC: You’ve talked about the need for effective parallelism to be accessible by everyone, but some argue that parallel programming is fundamentally hard and that you can either have efficient execution or ease of expression, but not both. Do you agree? Is this purely a software and tools problem, or is there a hardware component to the answer?

Snir: Processors have become more complex over the years, and software has not been too successful in hiding this complexity: It is increasingly easy to “fall off the cliff” and see a significant performance degradation due to a seemingly innocuous change in a code. Parallelism is one additional twist to that evolution; there is no doubt about it. Small changes in a code can make the difference between fully sequential and fully parallel. Also, there is no doubt that there is a tradeoff between performance and ease of programming: people that care about the last ounce of performance (cache, memory, vector units, disk) have to work hard on single core systems and slightly harder on multicore systems. On the other hand, parallelism can be used to accelerate easy to use interfaces (e.g., Matlab, or even Excel) and can be used for bleeding-edge HPC computations.

The only fundamentally new thing is that application developers that want to see a uniprocessor application run faster from one microprocessor generation to another now need to learn about parallelism. This is a new (large) population of programmers, and this is the focus of UPCRC Illinois.
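
Snir’s remark that a small change can make the difference between fully sequential and fully parallel code can be illustrated with a minimal Python sketch. The function name `run` and the toy data are hypothetical, chosen only for illustration. A loop whose iterations are independent gives the same answer in any execution order, so its iterations could in principle run concurrently; adding a single dependence on the previous element makes the order matter.

```python
def run(a, c, order, carried):
    """Compute out[i] = c * a[i] (+ out[i - 1] if carried),
    visiting the indices in the given order."""
    out = [0.0] * len(a)
    for i in order:
        # The "small change": reading the previous output creates a
        # loop-carried dependence between iterations.
        prev = out[i - 1] if (carried and i > 0) else 0.0
        out[i] = c * a[i] + prev
    return out


a = [1.0, 2.0, 3.0, 4.0]
forward = list(range(len(a)))
backward = list(reversed(forward))

# Independent iterations: any order gives the same result.
independent_ok = run(a, 2.0, forward, False) == run(a, 2.0, backward, False)

# Carried dependence: the loop as written is only correct in order.
carried_forward = run(a, 2.0, forward, True)
carried_backward = run(a, 2.0, backward, True)

print(independent_ok)
print(carried_forward)
print(carried_forward == carried_backward)
```

The second loop is not hopeless (a running sum like this can be restructured as a parallel scan), but that takes a deliberate change of algorithm, not something that falls out of the loop as written.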

insideHPC: Parallelism (of the kind exposed to developers) at much less than supercomputing scale is a relatively new thing for developers. For decades the majority of applications have been developed for desktop boxes, with very few people working on software for large scale parallel execution. Today we have parallelism even in handheld devices, and the high end community is contemplating O(1B) threads in a single job. Is there a chance that the work to develop tools for “commodity” parallel programming will make high end programming easier, or are these fundamentally different communities? If different, what are some of the essential differences?

Snir: The HPC software stack has always been developed by extending a commodity software stack: OS, compiler, etc. Now, the HPC software stack will be built by extending a (low-end) parallel software stack. I am inclined to believe that this will make the extension easier. There is also much cloud computing technology to be reused; e.g., system monitoring and error handling in large systems. Not much of this has happened, and I expect that the effect will be relatively marginal. As for the essential differences, this reminds me of the famous, apocryphal dialogue between Fitzgerald and Hemingway:

Fitzgerald: The rich are different than you and me.
Hemingway: Yes, they have more money.

Large machines are different because they have many more threads; HPC is different from cloud computing because its applications are much more tightly coupled. Sufficient quantitative differences become qualitative at some point.

insideHPC: I have been challenged at a couple of events where I have spoken lately about the necessity of getting to exascale, and the draining effect it is having on computational funding for other projects. Is it necessary that we push on to the exascale? If so, why not take a longer trajectory to get there? Why is the end of this decade inherently better for exascale than the middle of the next?

Snir: Good question, and a question one might have asked at any point in time. It is not for supercomputing aficionados to make the case for exascale in 2018 or 2030; it is up to different application communities to make the case of the importance of getting some knowledge earlier. Having more certainty about climate change and its effects earlier by a few years may be well worth a couple of billion dollars — but this is not an arithmetic I can make; similarly for other applications.

There is another interesting point: Moore’s law is close to an inflection. ITRS (the International Technology Roadmap for Semiconductors) predicts a slowdown (doubling every 3 years) pretty soon; nobody has a path to go beyond 8 nm. Given that 8 nm is only a few tens of silicon atoms, we may be hitting a real wall next decade. There is no technology waiting to replace CMOS, as CMOS was available to replace ECL. This will be a major game changer for the IT industry in the next decade: The game will no longer be finding applications that can leverage the advances of microelectronics, but getting more computation out of a fixed power and (soon) transistor budget. I call research on exascale “the last general rehearsal before the post-Moore era.” Exascale research will push on many of the research directions that are needed to increase “compute efficiency.” Therefore, I believe it is important to push this research now.

insideHPC: Thinking about exascale, there seems to be broad agreement that it isn’t practical to build such a system out of today’s parts because of energy impracticalities. But when it comes to programming models, some people seem to favor an incremental evolution of the same model we use today (communicating sequential processes with something like MPI), while others want to totally start over (e.g., Sterling’s ParalleX work). I’ve been personally surprised by how well the current model extended to the petascale. What are your thoughts about evolution versus revolution in exascale programming approaches?

Snir: When I was involved with MPI almost 20 years ago, I never dreamed it would be so broadly used 20 years down the road. Again, this is not a black and white choice: One can replace MPI with a more efficient communication layer for small, compute intensive kernels while continuing to use it elsewhere; one can take advantage of the fact that many libraries (e.g., ScaLAPACK) are hiding MPI behind their own, higher-level communication layer, to re-implement their communication layer on another substrate; one can preprocess or compile MPI calls into lower-level, more efficient communication code. One can use PGAS languages which, essentially, are syntactic sugar atop one-sided communication. We shall need to shift away from MPI for a variety of reasons that include performance (communication software overhead), an increasing mismatch between the MPI model and the evolution of modern programming languages, the difficulties of working with a hybrid model, etc. The shift can be gradual: MPI can run atop ParalleX. But we have very few proposals so far for a more advanced programming model.

Also posted in HPC People, Rock Stars of HPC | 1 Comment

Video: Supermicro Showcases Twin Server Line

Coming from a hardware background, I love to see the latest gear at ISC. Chips, sockets, heat sinks, and blades: this is the kind of stuff I geek out on. So last week I dropped in on the Supermicro booth with my camera because they always fill entire walls with mother boards from their latest server products.

In this video, Supermicro VP of Marketing Don Clegg gives us a tour of their new Twin family of servers.

Also posted in Compute, Events, GPUs, HPC, HPC Hardware, ISC'10 Feature Stories, ISC10, Video | Leave a comment

Between the ISCs: looking back over the past 12 months of HPC accomplishment

Contribution by regular readers Thomas Sterling (an HPC Rock Star) and Chirag Dekate of Louisiana State University. This article follows Sterling’s review of the past 12 months in HPC, given each year at ISC in Germany.

As the field of HPC enters its second decade of the 21st Century, new directions in system structure, operation, and programming are being driven by the technical trends and application needs at extreme scale.

Thomas Sterling

As never before, even with the expectation of the continuance of Moore’s law, the opportunities for performance gains are threatened by the second turning of the decades-long S-curve HPC has been traversing. This last year has seen dramatic evidence of the initial flattening with the imposition of power and complexity constraints as well as innovative approaches and market products to address them. At ISC 2010 in Hamburg, Germany, the authors were afforded the opportunity to review the events that best reflect the trends, directions, and accomplishments of the last year by the international supercomputing community: industry, academia, and national facilities and programs. The chosen theme highlighted for this year’s presentation on the state of the field in HPC was “Igniting Exaflops” to underscore and acknowledge the major steps that have been taken over the intervening 12 months to prepare the international community for a future of Exascale computing before the end of this decade. But first, let’s summarize some of the recent key achievements in HPC and their impact as taken from the 7th annual ISC retrospective.

In brief, hex-core multicore chips in 32 nanometer fabrication technology have become mainstream, replacing last generation quad core chips in new product offerings based on commodity clusters, which continue to gain market share with respect to MPPs. Sockets combining multiple dies are becoming available with up to 12 cores in cache-coherent SMP structures. Heterogeneous system structures are gaining traction with the increased integration of GPUs for floating point intensive applications.

Equally important in this direction are the advances in programming methodologies, with improved system software merging conventional APIs and CUDA or OpenCL making this emergent class of HPC systems of greater utility to the technical computing end users.

A major competition has been waged in the field of networking for clusters between Ethernet and InfiniBand, with Ethernet representing the larger deployed base, but InfiniBand dominating the high end systems as well as the total aggregate performance across the Top500 list. Many applications of scientific and technical importance have been developed, pushing new discovery forward with the first significant Petaflops scale applications running on such machines as Jaguar at Oak Ridge National Laboratory and recognized by the Gordon Bell Prize. Green computing has continued to gain attention with advanced designs and techniques being applied to reduce overall energy requirements and limit the upward surge of peak power demand.

The Top500 and the race to the top

Jaguar at Oak Ridge National Laboratory is still the fastest supercomputer in the world as measured by the Linpack Benchmark (some systems are not similarly rated or reported). With a sustained performance of 1.76 delivered Petaflops (and higher on some applications), this integration of Cray XT4 and XT5 subsystems based on AMD Opterons runs the SUSE Linux operating system and an array of compilers from multiple software vendors, and offers support for diverse programming models.

But a new contender for second place comes from Shenzhen, China, and with a Linpack performance of 1.27 Petaflops it handily exceeds the coveted 1 Petaflops threshold. This system uses a heterogeneous system architecture of Dawning TC3600 blades with Intel X5650 processors and Nvidia Tesla C2050 (Fermi) GPUs. Indeed, the system’s peak performance of nearly 3 Petaflops exceeds that of Jaguar itself.

Roadrunner at LANL, the first Petaflops computer, is now entering its third year of operation using a heterogeneous architecture that incorporates IBM Cell processors with conventional AMD Opterons. Also at Oak Ridge is another Cray system, “Kraken”, that just breaks a Petaflops peak capability using dual hex core Opterons and the advanced Cray SeaStar2+ router. Germany’s Jugene IBM BG/P at Julich also exhibits Petaflops peak performance with almost 300,000 PowerPC 450 cores. China retains its Tianhe system that also peaks above a Petaflops with a cluster combining Intel Xeon and AMD GPUs. Other systems worth noting are Russia’s Lomonosov and Shaheen in Saudi Arabia, with both providing hundreds of Teraflops. It should be noted that this year it was Hewlett-Packard that deployed the largest number of HPC systems, beating out IBM for the top slot. No other supplier even comes close in this market to these two giants.

The year in cores

The foundation of each of these super systems is its processor cores, and this year has seen significant advances from the semiconductor component manufacturers.

Intel dominates HPC system deployment and total aggregate performance with a number of slightly different offerings. The Westmere 2-core and 6-core X5600 processors are implemented in 32 nanometer technology. The IBM Power7 architecture is in 45 nanometer, with one of the largest processor dies ever, and pushes clock speed to above 4 GHz. This 8-core package will deliver a maximum of 265 Gigaflops and incorporates advanced pre-fetching of data and instructions. It will be integrated in the Blue Waters machine to be delivered to UIUC next year. The 8- and 12-core AMD Magny-Cours processor (in 45 nanometer technology) uses HyperTransport 4 inter-core communication technology for more efficient cache coherence.

But what of Itanium? In the keynote address by Intel representatives at ISC 2010, no mention was made of its role, although it is known that a future roadmap exists with targets of Poulson in 2012 and Kittson in 2014. However, this year both Microsoft and Red Hat have announced that they will stop supporting this architecture. HP, one of the originators of much of the Itanium design, is expected to continue to deliver products based on the platform.

Also of note: Rock, Sun’s next-generation processor architecture, was terminated during the last year.

Accelerators

Nvidia has delivered its new GPU, Fermi, for improved double precision performance, and is making major strides in releasing improved CUDA and OpenCL software for programmer support. AMD has also advanced its ATI accelerator with the release of Cypress (RV 870) with better than half a Teraflops double precision peak performance.

HPC people

Individual achievements are acknowledged. Ken Miura of Fujitsu was given the Cray Award for his work in vector computing. The Fernbach Award went to Roberto Car and Michele Parrinello for their joint method in molecular dynamics. And the inaugural Kennedy Award was presented to Francine Berman of Rensselaer Polytechnic Institute for her pioneering work in building a national grid based cyberinfrastructure in the US. William Gropp of UIUC was awarded this year’s IEEE TCSC Medal for Excellence in Scalable Computing. He was also recently elected to the US National Academy of Engineering.

With sadness we also note the passing of John Mucci, formerly of Digital Equipment Corporation and a cofounder of SiCortex in 2002.

Getting to exascale

This year also saw the inauguration of the first sponsored programs in Exascale computing.

The International Exascale Software Project (IESP) has involved participants from North America, Europe, and Asia to establish a world-wide coordinated activity to develop the software infrastructure needed in preparation for Exaflops computer architectures targeted for deployment by the end of this decade. Major IESP technical congresses were held over the last year in France, Japan, and the UK to develop a joint international roadmap.

It is recognized by many (there is controversy on this point) that methods and means for realizing Exaflops scale computing will of necessity prove very different from those which have successfully brought the field into the Petaflops era. It has been well understood that historically software has always lagged behind hardware, but this time software must precede hardware both so that we will be ready to use such systems when they are developed, and to inform that development through understanding of software needs.

A second initiative that will lead to technologies applicable to Exascale system deployment is the US DARPA UHPC (Ubiquitous High Performance Computing) program. Although not explicitly established for this purpose, UHPC will produce prototypes of Petaflops racks within a power budget of 60 kilowatts that could be integrated into full Exascale systems by the end of this decade.

Proposals have been submitted, and DARPA should be announcing the winners before next month. This is a very exciting program with a very real prospect of reinventing how future scalable computing will be achieved.
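
To put the UHPC target in perspective, the arithmetic behind it can be sketched in a few lines of Python. This is a back-of-the-envelope estimate using only the figures quoted above (one Petaflops and 60 kilowatts per rack). It assumes perfect scaling and ignores interconnect, storage, and cooling overheads, and the variable names are purely illustrative.

```python
PETAFLOPS = 10**15   # floating point operations per second in one UHPC rack
EXAFLOPS = 10**18    # target for a full Exascale system
RACK_POWER_KW = 60   # per-rack power budget quoted for UHPC

# Racks needed if performance simply adds up (an idealized assumption).
racks_needed = EXAFLOPS // PETAFLOPS

# Total power for those racks, converted from kilowatts to megawatts.
total_power_mw = racks_needed * RACK_POWER_KW / 1000

print(racks_needed, total_power_mw)
```

Under these assumptions an Exaflops system is on the order of a thousand such racks drawing roughly 60 megawatts, which gives a sense of why the per-rack power budget is central to the program.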

The US DOE has also launched some new programs relevant to Exascale computing, including one to realize the goal of an X-Stack, the software infrastructure that will be required for Exaflops computing. This program has already received proposals, and will be announcing selected investigators in the near future.

Together, these and other programs begun this year, along with many technical workshops that have also been conducted within the last twelve months, are rapidly putting the world on track to aggressively and effectively move all aspects of system development forward towards the performance goals of the year 2020.

This year has been one of significant product advances, application accomplishments, and initiation of important pathfinding work. The coming year is anticipated to be even more valuable.

Thomas Sterling

Dr. Thomas Sterling is a Professor of Computer Science at Louisiana State University, a Faculty Associate at California Institute of Technology, a CSRI Fellow for Sandia National Laboratories, and a Distinguished Visiting Scientist at Oak Ridge National Laboratory. He has also been recognized as an HPC Rock Star by insideHPC.

Chirag Dekate

Chirag Dekate is pursuing a PhD at LSU; his topic is resource management and scheduling of dynamic data driven graph executions.

Also posted in Events, ISC'10 Feature Stories, ISC10 | 1 Comment

Video: Life is Random, So is Your Storage

I am fascinated with the CGI effects in today’s feature films and it seems like HPC infrastructure has become a competitive weapon for the movie studios. So when I heard that Weta Digital (Lord of the Rings) was using BlueArc, I decided to take a closer look.

In this video, BlueArc’s Director of HPC Marketing Bjorn Andersson talks about why storage loads are so random and how the company’s storage solutions are built for optimal performance.

Also posted in Events, HPC, HPC Hardware, ISC'10 Feature Stories, ISC10, Storage, Video | Leave a comment

Video: Why is Everyone Talking About GPUs for HPC?

I think that NVIDIA was a big winner this year at ISC, and it wasn’t just because China’s new Tesla-powered Nebulae Supercomputer came in at number 2 on the TOP500. The reason for me was rapid adoption; I visited at least half a dozen booths that featured the latest NVIDIA GPUs in a variety of configurations. Clearly, the company is getting traction in the market.

In this video, I interview Andy Keane, NVIDIA General Manager of the Tesla Business Unit and discuss the advantages of GPUs for HPC. He also gives us his views on power efficiency and Exascale computing.

Also posted in Events, GPUs, HPC, HPC Hardware, ISC'10 Feature Stories, ISC10, Video | 2 Comments

Video: New Modeling and Simulation Leadership Panel Seeks Members

In this video, Addison Snell, CEO of Intersect360 Research, describes the formation of the Modeling and Simulation Leadership Panel, a “worldwide panel of organizations using computational modeling, simulation, and analytics to advance their cutting-edge positions in engineering development and research.”

It looks like a sweet deal to me. End users who join the panel will receive free access to research from Intersect360 Research. In return, they agree to do a 30 minute survey on a quarterly basis.

For more information, check out the Modeling and Simulation Leadership Panel site.

Also posted in Events, HPC, ISC'10 Feature Stories, ISC10, Video | Leave a comment

Video: Dynamic GPU Reassignment with NextIO vCORE Appliance

I had grand designs to tape a bunch of demos at ISC10, but time just wouldn’t allow. So after scouting around a bit, I decided to film the best demo that I could find.

In this video, Kyle Geisler from NextIO demonstrates the company’s vCORE Appliance, the “world’s first flexible platform for GPU reassignment in the HPC datacenter.” Watch as he moves GPU resources around on-the-fly, even as they continue to run applications. Consider me impressed, and I can just imagine how powerful this capability will be for putting GPUs to work in the cloud.

Also posted in Datacenter operations, Events, GPUs, HPC, HPC Hardware, ISC'10 Feature Stories, ISC10, System Management, Video | Leave a comment

insideHPC.com is a production of insideHPC, LLC. © 2006-2011