I read an interesting post on Stephen Skory’s blog. Stephen is a grad student in UCSD’s Physics department. For the purposes of this post, he is interesting because he’s reporting actual experiences on some recent HPC machines.
A few months ago, Ranger was turned on. It is a Sun cluster in Texas with 63,000 Intel CPU cores. It is currently ranked fourth fastest in the world. Datastar has only 2528 CPUs (but those are real CPUs, while Ranger has multi-core chips which in reality aren’t as good). By raw numbers, Ranger is an order of magnitude better than Datastar, except that Ranger doesn’t work very well. Many different people are seeing memory leaks using vastly different codes. These codes work well on other machines. I have yet to be able to run anything at all on Ranger. For all intents and purposes, Ranger is useless to me right now.
Now, as an HPC center guy, I’ve deployed a lot of big systems, ranging from 10 to 25 on the Top500 list. And I know that big systems ALL require a shakedown period. But real users don’t often report on that period publicly, so I wanted to point to it here.
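About those memory leaks: when lots of unrelated codes start leaking on one machine, the first step is usually just confirming that resident memory really is growing over the run. Here’s a minimal, Linux-specific sketch of the kind of check I mean (my illustration, not from Stephen’s post; do_timestep is a hypothetical stand-in for whatever the real code does each iteration):

```python
# Minimal leak-spotting sketch (illustration only).
# Linux-specific: samples this process's resident set size (VmRSS)
# from /proc each iteration, so steady growth shows up in the output.

def rss_kib():
    """Return this process's VmRSS in KiB, parsed from /proc/self/status."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def do_timestep():
    pass  # hypothetical placeholder for the application's real work

def main():
    baseline = rss_kib()
    for step in range(100):
        do_timestep()
        print(f"step {step:4d}  RSS growth: {rss_kib() - baseline} KiB")

if __name__ == "__main__":
    main()
```

If that number climbs steadily even though the code’s working set should be constant, something is leaking, whether it’s in your code, the MPI stack, or the system libraries underneath.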
Then, some interesting “friend of” information on RoadRunner:
Having two kinds of chips adds a layer of complexity, which makes the machine less useful. The Cell processor is a vector processor, which is only awesome for very specially written code. The machine is fast, except it’s also highly unusable. I don’t have access to it because it’s a DOE machine, but a colleague has tried it and says he got under 0.1% of theoretical peak speed out of it. Other people were seeing similar numbers. No one ever gets 100% from any machine, but 0.1% is terrible.
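To put that 0.1% in perspective before I quibble with the facts: RoadRunner’s theoretical peak is roughly 1.4 petaflops. The arithmetic below is my own back-of-envelope sketch, not Stephen’s, and the peak figure is approximate:

```python
# Back-of-envelope percent-of-peak arithmetic (approximate figures).

PEAK_TFLOPS = 1400.0  # RoadRunner's theoretical peak, roughly 1.4 petaflops

def percent_of_peak(achieved_tflops):
    """Achieved performance as a percentage of theoretical peak."""
    return 100.0 * achieved_tflops / PEAK_TFLOPS

# 0.1% of a ~1.4 PF machine is only ~1.4 TF of delivered work.
print(f"{percent_of_peak(1.4):.2f}%")     # ~0.10%
# For contrast, the machine's tuned Linpack run lands around 1026 TF,
# i.e. roughly 73% of peak.
print(f"{percent_of_peak(1026.0):.1f}%")  # ~73.3%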
A little iffy on the facts, but it’s a blog post, not an NYT article. Always interesting to get that real user experience out there to remind us HPC people that the users don’t really care what’s new, sexy, or cool. They want fast that works. And a little slower that works is always better than fast that doesn’t work. Always.