<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Andrew&#8217;s Corner: My Supercomputer Lied to Me!</title>
	<atom:link href="http://insidehpc.com/2009/03/24/andrews-corner-my-supercomputer-lied-to-me/feed/" rel="self" type="application/rss+xml" />
	<link>http://insidehpc.com/2009/03/24/andrews-corner-my-supercomputer-lied-to-me/</link>
	<description>HPC News Without the Noise for Supercomputing Professionals &#124; insideHPC</description>
	<lastBuildDate>Sun, 09 Jun 2013 01:54:13 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.1</generator>
	<item>
		<title>By: PaulAdams</title>
		<link>http://insidehpc.com/2009/03/24/andrews-corner-my-supercomputer-lied-to-me/#comment-157107</link>
		<dc:creator>PaulAdams</dc:creator>
		<pubDate>Tue, 24 Mar 2009 23:13:09 +0000</pubDate>
		<guid isPermaLink="false">http://insidehpc.com/?p=4098#comment-157107</guid>
		<description>While I was working as an applications analyst, a user called in and said that their code was nondeterministic on a certain vendor&#039;s equipment. &quot;Of course, that could not be the case,&quot; I thought. &quot;It must be the user&#039;s error.&quot; It turns out that the user was right. Depending on the CPU, the code either always gave the right answer or sometimes gave the wrong answer. By pinning the program to individual CPUs, we were able to identify which ones were giving incorrect answers and prove it to the vendor. The vendor then traced it to improper voltage going to a certain chip. A firmware update fixed everything.

I wonder how many researchers touted their results in publications when those results were incorrect. Since no other researcher complained, no other researcher was told. After all, it must have been a unique combination of instructions that caused it.

Right?

By the way, this was not an isolated case.</description>
		<content:encoded><![CDATA[<p>While I was working as an applications analyst, a user called in and said that their code was nondeterministic on a certain vendor&#8217;s equipment. &#8220;Of course, that could not be the case,&#8221; I thought. &#8220;It must be the user&#8217;s error.&#8221; It turns out that the user was right. Depending on the CPU, the code either always gave the right answer or sometimes gave the wrong answer. By pinning the program to individual CPUs, we were able to identify which ones were giving incorrect answers and prove it to the vendor. The vendor then traced it to improper voltage going to a certain chip. A firmware update fixed everything.</p>
<p>I wonder how many researchers touted their results in publications when those results were incorrect. Since no other researcher complained, no other researcher was told. After all, it must have been a unique combination of instructions that caused it.</p>
<p>Right?</p>
<p>By the way, this was not an isolated case.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Richard Hickey</title>
		<link>http://insidehpc.com/2009/03/24/andrews-corner-my-supercomputer-lied-to-me/#comment-157076</link>
		<dc:creator>Richard Hickey</dc:creator>
		<pubDate>Tue, 24 Mar 2009 21:06:18 +0000</pubDate>
		<guid isPermaLink="false">http://insidehpc.com/?p=4098#comment-157076</guid>
		<description>Chaos theory. (And I&#039;m sure I&#039;m using the definition wrong.) Large changes are built on small ones.

I&#039;m fairly sure the occasional wrong number being passed wasn&#039;t all that critical. If it wasn&#039;t caught as an outright error by the code&#039;s bounds checking, then it probably was just buried in the noise. That, and the percentage of erroneous changes was so low that in a normal (define normal) model run it probably didn&#039;t occur all that often, if at all.

I&#039;m just paranoid.</description>
		<content:encoded><![CDATA[<p>Chaos theory. (And I&#8217;m sure I&#8217;m using the definition wrong.) Large changes are built on small ones.</p>
<p>I&#8217;m fairly sure the occasional wrong number being passed wasn&#8217;t all that critical. If it wasn&#8217;t caught as an outright error by the code&#8217;s bounds checking, then it probably was just buried in the noise. That, and the percentage of erroneous changes was so low that in a normal (define normal) model run it probably didn&#8217;t occur all that often, if at all.</p>
<p>I&#8217;m just paranoid.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Leidel</title>
		<link>http://insidehpc.com/2009/03/24/andrews-corner-my-supercomputer-lied-to-me/#comment-157073</link>
		<dc:creator>John Leidel</dc:creator>
		<pubDate>Tue, 24 Mar 2009 20:53:56 +0000</pubDate>
		<guid isPermaLink="false">http://insidehpc.com/?p=4098#comment-157073</guid>
		<description>Rich, as an aside, you should probably define what &quot;wrong&quot; is.  Many codes, especially those based on theoretical science, should have proper bounds checking involved.</description>
		<content:encoded><![CDATA[<p>Rich, as an aside, you should probably define what &#8220;wrong&#8221; is.  Many codes, especially those based on theoretical science, should have proper bounds checking involved.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Richard Hickey</title>
		<link>http://insidehpc.com/2009/03/24/andrews-corner-my-supercomputer-lied-to-me/#comment-157060</link>
		<dc:creator>Richard Hickey</dc:creator>
		<pubDate>Tue, 24 Mar 2009 19:08:33 +0000</pubDate>
		<guid isPermaLink="false">http://insidehpc.com/?p=4098#comment-157060</guid>
		<description>Sometimes it&#039;s not even a problem with the algorithms, models, or code. Sometimes there is a hidden bug deep in a system that can give erroneous results.

A friend and ex-co-worker, Guy R., about drove me to drink when I was a support person for a large three-letter company. It seems that if you sent a random number around with MPI a few million times and then checked the number against the original, well, sometimes they were different. And the system was happy with that. It turns out there was a buffer refresh issue that hit only occasionally. It was pure chance and Guy&#039;s attention to detail that even let us know there was a problem, let alone find the cause.

This raises the question: how many codes and models got the wrong answer? Who knows. This happened to be on a well-established platform doing large government research projects. Everyone pretty much assumed it was giving the right answers.

Moral of the story: check the results and don&#039;t believe everything the computer tells you. Sometimes 1 + 1 = 1.99999 and that&#039;s just fine. Other times 1 + 1 = 1.99998 and the lander smacks the ground.</description>
		<content:encoded><![CDATA[<p>Sometimes it&#8217;s not even a problem with the algorithms, models, or code. Sometimes there is a hidden bug deep in a system that can give erroneous results.</p>
<p>A friend and ex-co-worker, Guy R., about drove me to drink when I was a support person for a large three-letter company. It seems that if you sent a random number around with MPI a few million times and then checked the number against the original, well, sometimes they were different. And the system was happy with that. It turns out there was a buffer refresh issue that hit only occasionally. It was pure chance and Guy&#8217;s attention to detail that even let us know there was a problem, let alone find the cause.</p>
<p>This raises the question: how many codes and models got the wrong answer? Who knows. This happened to be on a well-established platform doing large government research projects. Everyone pretty much assumed it was giving the right answers.</p>
<p>Moral of the story: check the results and don&#8217;t believe everything the computer tells you. Sometimes 1 + 1 = 1.99999 and that&#8217;s just fine. Other times 1 + 1 = 1.99998 and the lander smacks the ground.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Leidel</title>
		<link>http://insidehpc.com/2009/03/24/andrews-corner-my-supercomputer-lied-to-me/#comment-157032</link>
		<dc:creator>John Leidel</dc:creator>
		<pubDate>Tue, 24 Mar 2009 16:21:02 +0000</pubDate>
		<guid isPermaLink="false">http://insidehpc.com/?p=4098#comment-157032</guid>
		<description>Brian, you&#039;ve hit the nail right on the head.  Quite often I&#039;ve also experienced the &#039;NIH&#039; syndrome, i.e., Not Invented Here: &quot;If I didn&#039;t write it, it most certainly couldn&#039;t be correct.&quot;
The moral of the story is, do your homework and check your results.  There may, in fact, be a better solution available to the same problem.  You might also be able to find a similar solution to a related problem, which is more often the case.</description>
		<content:encoded><![CDATA[<p>Brian, you&#8217;ve hit the nail right on the head.  Quite often I&#8217;ve also experienced the &#8216;NIH&#8217; syndrome, i.e., Not Invented Here: &#8220;If I didn&#8217;t write it, it most certainly couldn&#8217;t be correct.&#8221;<br />
The moral of the story is, do your homework and check your results.  There may, in fact, be a better solution available to the same problem.  You might also be able to find a similar solution to a related problem, which is more often the case.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian</title>
		<link>http://insidehpc.com/2009/03/24/andrews-corner-my-supercomputer-lied-to-me/#comment-157018</link>
		<dc:creator>Brian</dc:creator>
		<pubDate>Tue, 24 Mar 2009 14:57:47 +0000</pubDate>
		<guid isPermaLink="false">http://insidehpc.com/?p=4098#comment-157018</guid>
		<description>While there might be very few &#039;applications analysts&#039; who are less than reputable, the fact is very few people in HPC have that title.  Computational methods are being applied to more and more domains all the time, and to my sincere dismay, very little educational time is being spent on the nature of these computations.  A great number of younger users believe quite firmly that a computer gives THE answer, not just an answer within a certain realm of numerical error.  As people spend more on hardware to get faster results, but neglect to spend money on training people in numerical methods and computational methodologies, this belief becomes more and more rampant.

I have seen someone running an O(N^2) algorithm when an O(N) one exists, and when they were encouraged to switch, they expressed some dismay that the new algorithm, while statistically equivalent, gave them slightly different results.  That is, they took the number their code gave them as gospel, and to hell with different compilers, optimization levels or, worst of all, algorithms, even when those newer algorithms sped things up thousands of times.

The pointy-haired bosses like to spend millions on the shiny new system, but if one were to examine the data, I&#039;d quite easily wager that in many cases, spending a bit less on hardware and putting those funds into training would deliver more science per buck.</description>
		<content:encoded><![CDATA[<p>While there might be very few &#8216;applications analysts&#8217; who are less than reputable, the fact is very few people in HPC have that title.  Computational methods are being applied to more and more domains all the time, and to my sincere dismay, very little educational time is being spent on the nature of these computations.  A great number of younger users believe quite firmly that a computer gives THE answer, not just an answer within a certain realm of numerical error.  As people spend more on hardware to get faster results, but neglect to spend money on training people in numerical methods and computational methodologies, this belief becomes more and more rampant.</p>
<p>I have seen someone running an O(N^2) algorithm when an O(N) one exists, and when they were encouraged to switch, they expressed some dismay that the new algorithm, while statistically equivalent, gave them slightly different results.  That is, they took the number their code gave them as gospel, and to hell with different compilers, optimization levels or, worst of all, algorithms, even when those newer algorithms sped things up thousands of times.</p>
<p>The pointy-haired bosses like to spend millions on the shiny new system, but if one were to examine the data, I&#8217;d quite easily wager that in many cases, spending a bit less on hardware and putting those funds into training would deliver more science per buck.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
