Andrew's Corner: My Supercomputer Lied to Me!

Andrew Jones has once again published his monthly insight into all things HPC on ZDNet UK.  This month, he touches on a widely known but seldom discussed topic in high performance computing: supercomputers lie!  Given the progression of technology in HPC, we can easily get caught up in the “speeds and feeds” speak.  Our new machine runs the same algorithm some N times faster than our old machine.  Indeed, the value of `$> time my_app` is most likely correct, but what about the numerical results?
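One reason the numbers can shift even when the wall-clock time is beyond reproach is that floating-point addition is not associative: a new compiler flag, a different reduction order, or a new parallel decomposition can legitimately change the trailing digits.  Below is a minimal sketch in C; the values are contrived purely to make the effect visible.

```c
#include <stdio.h>

int main(void) {
    /* Contrived data: one large term swamps the small ones. */
    float x[4] = {1.0e8f, 1.0f, 1.0f, -1.0e8f};

    /* Left-to-right accumulation, as a simple serial loop would do it. */
    float forward = 0.0f;
    for (int i = 0; i < 4; i++)
        forward += x[i];

    /* Same data, different order, as a reordered or parallel sum might do it. */
    float reordered = (x[0] + x[3]) + (x[1] + x[2]);

    printf("forward   = %g\n", forward);   /* prints 0: the 1.0f terms are absorbed */
    printf("reordered = %g\n", reordered); /* prints 2: the big terms cancel first */
    return 0;
}
```

Neither answer is a bug; they are simply two different rounding paths through the same arithmetic.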

Many users of models are rigorous about validating their predictions, especially those users with a strong link to the advancement of the model or its underpinning science. But, unfortunately, not all users of models are so scrupulous.

They think the model must be right — after all, it is running at a higher resolution than before, or with physics algorithm v2.0, or some other enhancement, so the answers must be more accurate. Or they assume it is the model supplier’s job to make sure it is correct. And yes, it is — but how often do users check that their prediction relies on a certified part of parameter space? [Andrew Jones]

Before you hit ‘send’ on that flaming comment, we realize that there are probably very few applications analysts who are less than reputable.  Given the complexity of many modern high performance computing applications, bugs will exist and precision errors can occur without deliberate provocation.  In mentoring the younger additions to our community, I always offer up two important pieces of advice: first, always check your results and precision; and second, never eat the yellow snow.
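For floating-point output, “check your results” rarely means bit-for-bit equality; it usually means agreement within a stated tolerance.  Here is a hedged sketch of the kind of comparison one might wire into a regression test; the function name and tolerance values are ours, chosen for illustration, not taken from any particular code.

```c
#include <math.h>
#include <stdio.h>

/* Return 1 if a and b agree to within an absolute or a relative tolerance.
 * The tolerances below are placeholders; a real code would pick them from
 * the known error bounds of its algorithm. */
static int roughly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff  = fabs(a - b);
    double scale = fmax(fabs(a), fabs(b));
    return diff <= abs_tol || diff <= rel_tol * scale;
}

int main(void)
{
    double old_result = 1.99999;   /* say, from the validated machine */
    double new_result = 2.00001;   /* say, from the shiny new machine */

    if (roughly_equal(old_result, new_result, 1e-4, 1e-12))
        printf("results agree within tolerance\n");
    else
        printf("results differ: investigate before trusting either one\n");
    return 0;
}
```

Choosing those tolerances is exactly the homework Andrew is asking users to do.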

As always, Andrew’s article is a great read.  Head over to ZDNet UK and read it here.

Comments

  1. While there might be very few ‘applications analysts’ that are less than reputable, the fact is very few people in HPC have that title. Computational methods are being applied to more and more domains all the time, and to my sincere dismay, very little educational time is being spent on the nature of these computations. A great number of younger users believe quite firmly that a computer gives THE answer, not just an answer within a certain realm of numerical error. As people spend more on hardware to get faster results but ignore spending money on training people in numerical methods and computational methodologies, this becomes more and more rampant.

    I have seen someone running an O(N^2) algorithm when an O(N) one exists, and when they were encouraged to switch, they expressed some dismay that the new algorithm, while statistically equal, gave them slightly different results (a small sketch of this effect appears after the comments). That is, they took the number their code gave them as gospel, and to hell with different compilers, optimization levels or, worst of all, algorithms. Even when those newer algorithms sped things up thousands of times.

    The pointy-haired bosses like to spend millions on the shiny new system, but if one were to examine the data, I’d quite easily wager that in many cases, spending a bit less on hardware and putting those funds into training would deliver more science per buck.

  2. Brian, you’ve hit it right on the head. Quite often I’ve also experienced the ‘NIH’ syndrome, i.e., Not Invented Here. “If I didn’t write it, it most certainly couldn’t be correct.”
    The moral of the story is, do your homework and check your results. There may, in fact, be a better solution available to the same problem. You might also be able to find a similar solution to a related problem, which is more often the case.

  3. Sometimes it’s not even a problem with the algorithms, models, or code. Sometimes there is a hidden bug deep in a system that can give erroneous results.

    A friend and ex-co-worker, Guy R., about drove me to drink when I was a support person for a large three-letter company. It seems that if you sent a random number around with MPI a few million times and then checked the number against the original, well, sometimes they were different. And the system was happy with that. It turns out there was a buffer refresh issue that hit only occasionally. It was pure chance and Guy’s attention to detail that even let us know there was a problem, let alone find the cause (a sketch of that kind of round-trip check appears after the comments).

    This raises the question: how many codes and models got the wrong answer? Who knows. This happened to be on a well-established platform doing large government research projects. Everyone pretty much assumed it was giving the right answers.

    Moral of the story: check the results and don’t believe everything the computer tells you. Sometimes 1 + 1 = 1.99999 and that’s just fine. Other times 1 + 1 = 1.99998 and the lander smacks the ground.

  4. John Leidel says

    Rich, as an aside, you should probably define what “wrong” is. Many codes, especially those based on theoretical science, should have proper bounds checking in place.

  5. Chaos theory. (And I’m sure I’m using the definition wrong.) Small changes can cascade into large ones.

    I’m fairly sure the occasional wrong number being passed wasn’t all that critical. If it wasn’t caught as an outright error by the code’s bounds checking, then it probably was just buried in the noise. That, and the percentage of erroneous values was so low that in a normal (define normal) model run it probably didn’t occur all that often, if at all.

    I’m just paranoid.

  6. While I was working as an applications analyst, a user called in and said that their code was nondeterministic on a certain vendor’s equipment. “Of course, that could not be the case,” I thought. “It must be the user’s error.” It turns out that the user was right. Depending on the CPU, the code either always gave the right answer or sometimes gave the wrong answer. By pinning the program to a specific CPU, we were able to certify which ones were giving incorrect answers and prove it to the vendor. The vendor then traced it to improper voltage going to a certain chip. A firmware update fixed everything.

    I wonder how many researchers touted their results in publications when those results were incorrect. Since no other researcher complained, no other researcher was told. After all, it must have been a unique combination of instructions that caused it.

    Right?

    By the way, this was not an isolated case.
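To make the algorithm-swap point in comment 1 concrete, here is a small sketch of how an O(N^2) and an O(N) formulation of the same quantity can be mathematically identical yet disagree in the trailing digits.  The quantity is the sum of all pairwise products, computed once as a double loop and once via the identity ((sum x)^2 - sum x^2) / 2; the data is invented for the demo.

```c
#include <stdio.h>

int main(void)
{
    enum { N = 10000 };
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0 / (i + 1.0);            /* arbitrary test data */

    /* O(N^2): visit every pair i < j explicitly. */
    double slow = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            slow += x[i] * x[j];

    /* O(N): the same quantity via ((sum x)^2 - sum x^2) / 2. */
    double s = 0.0, s2 = 0.0;
    for (int i = 0; i < N; i++) {
        s  += x[i];
        s2 += x[i] * x[i];
    }
    double fast = (s * s - s2) / 2.0;

    /* Mathematically identical, numerically only close. */
    printf("slow = %.15g\nfast = %.15g\ndiff = %.3g\n", slow, fast, slow - fast);
    return 0;
}
```

Both numbers are right answers; treating either one as gospel down to the last bit is the mistake.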
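And for the round-trip test described in comment 3, here is a minimal sketch of that kind of check: pass a value around an MPI ring many times and verify it comes back unchanged.  This is our reconstruction for illustration, not the original diagnostic code; run it with at least two ranks.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "run with at least two ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int    laps     = 1000000;      /* how many times around the ring */
    const double original = 0.123456789;  /* stand-in for the "random number" */
    double value = original;
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    for (int lap = 0; lap < laps; lap++) {
        if (rank == 0) {
            /* Rank 0 starts each lap, then waits for the value to come back. */
            MPI_Send(&value, 1, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
            MPI_Recv(&value, 1, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (value != original) {
                fprintf(stderr, "lap %d: value corrupted in transit\n", lap);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
        } else {
            /* Everyone else just forwards whatever arrives. */
            MPI_Recv(&value, 1, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&value, 1, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("value survived %d laps intact\n", laps);

    MPI_Finalize();
    return 0;
}
```

On healthy hardware the check never fires; the point of the anecdote is that, every once in a great while, it did.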