Ranger giving some users fits


I read an interesting post on Stephen Skory’s blog. Stephen is a grad student in UCSD’s Physics department. For the purposes of this post, he is interesting because he’s reporting actual experiences on some recent HPC machines.

First, Ranger

    A few months ago, Ranger was turned on. It is a Sun cluster in Texas with 63,000 Intel CPU cores. It is currently ranked fourth fastest in the world. Datastar has only 2528 CPUs (but those are real CPUs, while Ranger has multi-core chips which in reality aren’t as good). By raw numbers, Ranger is an order of magnitude better than Datastar, except that Ranger doesn’t work very well. Many different people are seeing memory leaks using vastly different codes. These codes work well on other machines. I have yet to be able to run anything at all on Ranger. For all intents and purposes, Ranger is useless to me right now.

Now, as an HPC center guy, I’ve deployed a lot of big systems, ranging from 10 to 25 on the Top500 list. And I know that big systems ALL require a shakedown period. But that period doesn’t get reported on very often by real users, so I wanted to point to it here.

Then, some interesting secondhand (“friend of a friend”) information on Roadrunner

    Having two kinds of chips adds a layer of complexity, which makes the machine less useful. The Cell processor is a vector processor, which is only awesome for very specially written code. The machine is fast, except it’s also highly unusable. I don’t have access to it because it’s a DOE machine, but a colleague has tried it and says he got under 0.1% of theoretical peak speed out of it. Other people were seeing similar numbers. No one ever gets 100% from any machine, but 0.1% is terrible.

A little iffy on the facts, but it’s a blog post, not an NYT article. Always interesting to get that real user experience out there to remind us HPC people that the users don’t really care what’s new, sexy, or cool. They want fast that works. And a little slower that works is always better than fast that doesn’t work. Always.
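
For readers wondering how a machine can sit at 0.1% of peak: vector processors like the Cell only approach their rated speed when inner loops are written so the hardware can operate on many elements per instruction. Here is a toy sketch in C of the kind of restructuring involved; it is illustrative only, nothing in it is Cell-specific, and the function names and sizes are made up:

    /* Toy illustration, not Cell-specific code: the same computation
     * written two ways. A vectorizing compiler can turn the second
     * loop into SIMD instructions that process several elements at a
     * time; the data-dependent branch in the first often forces
     * scalar execution, leaving most of the peak rate unused. */
    #include <stdio.h>

    #define N 1000000

    static float a[N], b[N], c[N];

    /* Vector-unfriendly: each iteration may branch differently. */
    static void update_branchy(void) {
        for (int i = 0; i < N; i++) {
            if (b[i] > 0.0f)
                a[i] = 2.0f * b[i] + c[i];
            else
                a[i] = c[i];
        }
    }

    /* Vector-friendly: the same result as straight-line arithmetic;
     * the conditional becomes a SIMD "select" rather than a branch. */
    static void update_vectorizable(void) {
        for (int i = 0; i < N; i++) {
            float scale = (b[i] > 0.0f) ? 2.0f : 0.0f;
            a[i] = scale * b[i] + c[i];
        }
    }

    int main(void) {
        update_branchy();
        update_vectorizable();
        printf("a[0] = %f\n", a[0]);
        return 0;
    }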

Comments

  1. Michael L. says

    As an HPC sysadmin working in a university setting, I am all too familiar with the negative opinions of outspoken graduate students. Oftentimes, new or unfamiliar systems are panned by researchers as poor-performing or “broken” when the problem actually lies with their lack of knowledge of how to properly use the system in question. Parallel code suffers from a terrible lack of portability, but this fact does not seem to sink in for a certain class of users who assume that their codes should work the same across all systems/compilers/library versions (a sketch of one classic trap of this sort appears at the end of this thread). In my shop, entire platforms have been smeared as “poor performing” simply because our user base did not know how to properly use a compiler suite other than GCC. Mere technical misunderstandings may not be worth getting worked up over, but, more seriously, I’ve seen careers damaged by these misinformed accusations.

    Cutting-edge systems will tend to suffer from their own novelty: hardware combinations never seen before tend to produce unique hardware problems. However, the tendency of researchers to mistake their own lack of understanding for a broken system should not be underestimated.

  2. Michael, a rebuttal of sorts,

    Yes, every new hybrid, exotic, bleeding-edge system will go through a rough patch as people learn to use it. This is a fact of life. But why should a physics professor trying to study some strange state of matter need to become a programming expert on a different exotic machine every time there is a system refresh? He/she just wants their code to run in a timely manner.

    I’m a systems analyst on large systems myself, so this is what I do for a living, and I love the new and unusual. For me the new and exotic is a real kick in the shorts. I read somewhere, correct me if I get this wrong, that the average physics PhD student gets about 250 hours of computer instruction during his or her education. That’s it. Out of a six-to-eight-year program, only 250 hours.

    I can fully understand the frustration of a user when it comes to porting their code to a different machine. It can be a bloody nightmare. We joke here at HECToR that all of our new users are going to be the 50-and-over crowd. We just added a vector Cray X2 to our 11k+ processor XT4. None of the new crowd of graduates has ever played in a vector arena like this, so it’s big-learning-curve time. It’s going to turn off some of our users; it’s just a fact of life.

    Long rambling reply to try and say that I can see both sides of this fence. The new hybrid Roadrunner ROCKS. But on the other hand, I’m glad I don’t have to port code to it. (-:

    Rich

  3. Michael L. says

    Rich,

    I actually agree with just about everything you write. The HPC space is dense and inaccessible, and I do sympathize with my users, who have to learn and deal with complicated software stacks that have nothing to do with their research area. The issue of the inaccessibility of the scientific computing arena is separate from, although related to, my original comment. Given the tone of that comment, it’s probably obvious that I’ve had personal experiences with researchers who are quick to call something broken simply because they do not understand how to use it correctly. My problem lies not with the lack of understanding but rather with the propensity of some to assign blame to others when their codes do not work correctly. I’ve seen these allegations turn into assumptions that the administrative staff are not doing their jobs correctly, something that I took very personally.

    One could accuse me of making my own assumptions here, as I am conflating my own experiences with those of the greater HPC realm. We could also launch from here into discussions of the absurdity of the flops arms race, or of the need for better tools, both of which exacerbate the situation that I describe. Let me just say that we should take a user’s claim that a system is “broken” with a chunk of salt.

  4. I’ve also had the pleasure of dealing with angry users from time to time. Unfortunately, I would have to disagree with the users that this is our [the HPC community’s] fault. Indeed, the latest round of HPC platforms doesn’t bode well for the C++ 101-level programmer. The hardware and software have become much more complex; such is the natural progression of improving performance. The latest hybrid architectures are going to require some very clever application analysts in order to be utilized efficiently. Even then, “it’s application-specific…”

  5. “The average physics PhD student gets about 250 hours of computer instruction during his or her education.”

    I presume you are talking about total hours, not semester hours. I only had 7 semester hours (3 classes), or about 280 wall-clock hours. Your stat sounds about right. However, prior to 2006, there were no Cell processors for universities to use, much less teach a class on how to program them.

    Is programming these machines hard? Yes. I shudder to think about having to program Roadrunner. If I had hours on Roadrunner, I would take advantage of any books on the Cell architecture, any classes that the staff taught, &c.

    Just because a user says a machine is broken does not actually mean that it is broken. The user has to continually learn how to get the most out of the latest hardware, from papers, books, staff, tools, and anything else they can get their hands on.

  6. The original poster points out that “Many different people are seeing memory leaks using vastly different codes.” Having talked to many Ranger users at the TeraGrid meeting last month, I would say that that is a fair statement.

    This points to a problem with the MACHINE. The knee-jerk reaction of blaming the user does not seem to apply in this case. Perhaps the poster up there ^^^ should first look at his prejudices before blaming users. Unless 20+ users, all of whose codes work on other machines, are all wrong.

  7. Michael L. says

    To anon:

    What I’m saying is that we should not assume that the system is broken simply because a user reports it as such. I’m not blaming the user for anything other than overstatement and oversimplification.

    So there are memory leaks in some users’ codes: why is this? Is the problem in the hardware or the software? Is the old code still applicable to this system? Are they linking against the same MPI libraries as on other systems? If so, are the versions the same? (A minimal sanity check along these lines appears at the end of this thread.) I seriously doubt that the problem lies in the “MACHINE” (I’m assuming you mean hardware here) but rather in the complex relationship of the software components running on said hardware.

    Every new cluster comes with a period of growing pains as the users become familiar with it and the educational resources grow. Does the user have a right to complain? Sure, Ranger is useless to him. However, statements like “unlike Ranger, it actually works” or “supercomputers aren’t getting better; in some cases, they’re getting worse” are patently false and constitute the sort of hyperbole that I warned against taking seriously above. Remember that I said careers can be, and are, harmed by this sort of overstatement, so it’s more than just a case of sysadmin sensitivity.

    Am I prejudiced by my own experiences? Definitely, but so is the grad student in question and I believe that I’m being a bit more fair to him than he was to Ranger/TACC.

  8. Stephen Skory says

    To all:

    Wow! I never thought my little post aimed at my family would spark such a discussion. I should be more careful in the future when I write publicly. Lesson learned.

    I did write the post in a frustrated mood, but I don’t think I was too unkind. I didn’t name any names, nor get into specifics. Nothing I said about Ranger is untrue, as far as I know. I should have pointed out (as some of you have) that in the future Ranger will probably be a great machine for everyone, but it isn’t one yet for me.

    The part about Roadrunner is pure scuttlebutt, and should be taken as such. I’m pretty sure that the code that was run wasn’t at all optimized for the Cell, so that’s why it ran so poorly.

    I’ll readily admit I am not an expert programmer. I am a physicist first, so a functional computer is what matters most to me.

    I hope I haven’t offended anyone!

  9. Stephen – not a bit. I hope you keep writing pure, unvarnished reactions to the tools that HPC gives you to get your job done. Everything that everyone has said on this thread is true. HPC machines do often perform terribly after commissioning, and users are often quick to blame a machine for their own problems.

    A key meta point here that the HPC community is ENTIRELY responsible for is that the machines are often so hard to use that it can be hard to tell where the truth lies in every particular instance.

    And before someone flames me, remember I’m not some rookie mouthing off – I’ve been fielding Top 20 HPC machines for 15 years for a user community of about 1,000, so I have actual experience on this.
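
A footnote to the portability point in comment 1. The classic way a code “works” on one system and fails on another, with neither machine at fault, is undefined behavior that one platform happens to paper over. A hedged sketch in C (the bug is deliberate and the numbers are made up):

    /* The bug is deliberate: "sum" is read before it is ever written,
     * which is undefined behavior in C. On a platform where the stack
     * happens to contain zeros this prints 10.0 and the code is called
     * "working"; on another it prints garbage and the new machine gets
     * the blame. Neither machine is broken. */
    #include <stdio.h>

    int main(void) {
        double sum;                 /* BUG: never initialized */
        double x[4] = {1.0, 2.0, 3.0, 4.0};

        for (int i = 0; i < 4; i++)
            sum += x[i];            /* adds to whatever was already there */

        printf("sum = %f\n", sum);
        return 0;
    }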
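
And a footnote to the diagnostic questions in comment 7. Before arguing about whose fault a leak is, the first thing worth running on a new machine is a trivial MPI program that confirms the toolchain compiles, links, and launches, and that reports which MPI version and which node each rank actually sees. A minimal sketch using only MPI-1 calls, so it should build wherever an mpicc wrapper exists; nothing in it is Ranger-specific:

    /* A minimal MPI sanity check: confirms the code compiles, links,
     * and launches on the new machine, and reports which MPI version
     * and which node each rank actually sees. Uses only MPI-1 calls. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, major, minor, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_version(&major, &minor);         /* MPI standard version */
        MPI_Get_processor_name(name, &namelen);  /* node this rank landed on */

        printf("rank %d of %d on %s (MPI %d.%d)\n",
               rank, size, name, major, minor);

        MPI_Finalize();
        return 0;
    }

If the versions or node names that come back are not what you expected, that is worth chasing before anything subtler.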