The Violent Waters of HPC


Navigating the waters of HPC has never been easy. We’ve seen more than our share of HPC shipwrecks over the years. Majestic schooners with parallel sails were lost at sea. Exotic, powerful motor yachts, despite their high performance engines, ran aground when they failed to chart safe courses.

What led to these failed efforts? Poor technology? Weak sales management? Or a bigger issue tied to government bootstrapping that puts too much political priority on launching the boat by a certain date so an agency can brag about the program’s success? In almost all cases, these ships failed to stay afloat for one reason: they ran out of money.

Most recently, IBM decided to pull the plug on the Blue Waters 10-petaflops supercomputer that was contracted for NCSA – the National Center for Supercomputing Applications at the University of Illinois.

This action is much bigger than just a black eye for IBM. The sinking of this program has serious implications for the entire global HPC community, and for the U.S. No matter how you try to slice and dice, the sinking of Blue Waters comes down to money.

And clearly NCSA and NSF have to share some of the blame when it comes to project management and risk analysis. Trying to recover from this embarrassment, NCSA is forced into a position of trying to find a vendor to build an aircraft carrier with a budget adequate for a small destroyer. Perhaps the root of this problem is government funding agencies approaching exascale R&D as a procurement challenge and not a research challenge.

Intel CTO Justin Rattner highlighted this problem in his keynote at SC09 (November 2009), and again a year later (November 2010) in his feature interview with The Exascale Report – calling out a government funding model that expects private industry to take huge risks with no reciprocal rewards.

My Opinion: Supercomputers should be considered an important strategic asset, and funded as such. If the USA really wants to demonstrate world technological leadership, then adequate resources need to be committed, with private industry elevated to that of a partner, not a whipping boy who works out of fear. The failure of Blue Waters should be seen not just as a failure for IBM, but as a failure for the leadership of this country.

The U.S. is demonstrating it is not willing to pay the price of technology leadership in supercomputing

Perhaps the program should never have been launched. Obviously the IBM engineering team underestimated (by a lot) what it would take to build this system. Rumor has it that the program required at least another $100 million on top of the $200 million already allocated.

A number of community leaders say we have to give IBM credit for making the tough decision to say “enough is enough” and walk away from the program. Well, that’s one way of looking at it. But wasn’t this a fixed-price contract? They chose to bid on this program – and won the contract based on their agreement to deliver something at a certain price and within a certain timeframe. They screwed up in estimating what would be involved. That’s a part of doing business. Always has been. If anyone could weather this storm and fulfill this contract, it’s IBM. They have the resources to make this happen and see this program through to completion. Walking away from Blue Waters is nothing short of abandoning an important U.S. research institution and the U.S. HPC community.

I feel the need to say this: I have a great amount of respect for IBM, but not for their decision to pull out of Blue Waters.

Private industry can’t be expected to fund applied research and development programs at this level. And I purposely use the word “research” when it comes to programs and systems of this size to differentiate from product development. Adequate funding of research programs at all levels is vitally important to investigating new technologies. But not all research is going to result in systems development. IBM knows this better than most companies.

So when a company such as IBM wins a bid based on their commitment to deliver, and then backs out saying, “Sorry, we didn’t know it was going to cost this much” – it destroys the stability of planning and technology roadmaps that an entire scientific community depends on.

We have a history in this country of pushing the vessels and crews beyond their capabilities. It’s something we seem to be doing again. Have we not learned anything from the past?

Remember the Intel Paragon? How about the CM-5 from Thinking Machines? Their legends and pieces of their technologies live on, but the wisdom of what we learned from those journeys seems to be forgotten. They were tremendous financial drains during their development stages, and yet Intel and Thinking Machines pushed the envelope – and their budgets – as far as they could for fear of falling out of grace with the U.S. government. Some people believe they were pushed too far – particularly in the case of Thinking Machines, which didn’t have the deep pockets of Intel to fall back on.

With many funded programs, businesses are often held at arm’s length because the government funding groups don’t want to be associated with helping a business turn a profit. “We want you to build this system – but we don’t want you to make any money on it.” And, when government funds are involved, why the need for so much secrecy?

In order to get to exascale, we need to fund the research element – without putting the pressure on any single company to build these massive systems at the risk of huge financial losses. This is a lesson for all of us. What more evidence do we need to convince us it’s time for a different approach?

A Pirate’s Chest of Booty or a Trip to Davy Jones’ Locker?

So we hear that NCSA has a chest full of booty – $200 million it’s anxious to spend. If NCSA doesn’t get the opportunity to continue down this path and bring this system to market – in whatever flavor or color it turns out to be – HPC stands to lose some important momentum. While the ultimate finish line was not reached for Blue Waters, there undoubtedly has been much learned in the process. But forced, artificial dates and unreasonable, aggressive development schedules with woefully inadequate budgets are just a formula for more disappointment.

If NCSA is forced to seek a re-compete of the Blue Waters program, the funds could end up disappearing from supercomputing research altogether. That would be a huge loss for HPC and the scientific community.

So, what is the best option? Sadly, it seems the best option is for NCSA to bring in another vendor who will take on the burden of trying to salvage this program. It’s a high-risk venture. We can only hope that whoever steps in to work with NCSA has the guts and determination to complete this important program – and not just see it as short-term revenue. We, the global HPC community, and the nation’s scientific computing research infrastructure need this program, and others like it, to be successful.

Will the Blue Waters Turn Red for HPC?

Sharks are circling and they smell blood in the water. But a re-compete is not practical if we don’t adjust the timelines – take time to absorb what we’ve learned here – and put the proper funding and resources in place to do this properly. The government, by which we mean NSF in this case, is not likely to change its original timeline expectations, so, in my mind, this is not a good solution.

One solution would be to hold IBM to their contract and force them to deliver. But IBM doesn’t like bleeding red, and based on the responses I’ve heard so far, that ship has sailed.

The remaining options are to re-assign or re-compete: select a vendor from a short list, or open up bidding to the entire community. Just to state it again – for our money, re-compete is not a good option.

The decisions surrounding Blue Waters will have a significant impact on the HPC community, on scientific computing, and on U.S. competitiveness. We can only hope for calm seas and smooth sailing – and perhaps some open and honest communication with the community would help us all to learn – so we can avoid repeating disasters like this in the future.

For related stories, visit The Exascale Report Archives.
