For those interested in a little more detail on AMD’s latest shot at its own foot, The DailyTech has a nice analysis of the TLB bug.
The bug, which has been documented since at least early November, can cause a deadlock during recursive or nested cache writes.
How does the TLB erratum occur? All AMD quad-core processors utilize a shared L3 cache. In instances where the software uses nested memory pages, this processor will experience a race condition.
Impact on HPC? Unclear. We do know that Cray was allowed its allotment (part of which is scheduled for my center…the ERDC MSRC…and part of which may be going to ORNL):
AMD partners tell DailyTech that all bulk Barcelona shipments have been halted pending application screening based on the customer. Cray, for example, was allowed its latest allocation for machines that will not use these nested virtualization techniques. Other AMD corporate customers were told to use Revision F3 (K8) processors in the meantime.
It does appear that if you’re intending to run virtualization software on the chips, you aren’t going to be getting Barcelona for a while.
In the software world, a typical memory race condition occurs when the memory arbiter is instructed to overwrite an older block of memory, but write the old block of memory to somewhere else in cache. In the instance where two arbiters follow this same rule set, it’s easy to see how a race condition can occur: both arbiters attempt to overwrite the same blocks of information, resulting in a deadlock.
From what AMD engineers would tell DailyTech, this example is very similar to what occurs with nested memory pages in virtualized machines on these K10 processors.
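For readers who want to see the general shape of that failure, here is a minimal software sketch of the analogy in Python. It is not a model of the K10 hardware or of the actual erratum: the two "arbiters" are ordinary threads, the two "blocks" are ordinary locks, and every name in it (`simulate_conflicting_writers`, `arbiter_0`, `block_a`, and so on) is made up for illustration. Each arbiter holds one block and then tries to claim the one the other is holding. The timeout is there only so the demo reports the stall instead of hanging forever.

```python
import threading


def simulate_conflicting_writers(timeout=0.2):
    """Two hypothetical 'arbiters' each hold one block and want the other's.

    Returns a dict mapping arbiter name to whether it obtained its second block.
    """
    block_a = threading.Lock()
    block_b = threading.Lock()
    barrier = threading.Barrier(2)  # keeps the two arbiters in lockstep
    results = {}

    def arbiter(name, first, second):
        with first:
            # Wait until both arbiters are holding their first block.
            barrier.wait()
            # Try to claim the block the other arbiter holds. Without the
            # timeout this call would block forever: a deadlock.
            got_second = second.acquire(timeout=timeout)
            results[name] = got_second
            # Keep holding the first block until both attempts are over,
            # so neither arbiter can win by simply outlasting the other.
            barrier.wait()
            if got_second:
                second.release()

    threads = [
        threading.Thread(target=arbiter, args=("arbiter_0", block_a, block_b)),
        threading.Thread(target=arbiter, args=("arbiter_1", block_b, block_a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for name, ok in sorted(simulate_conflicting_writers().items()):
        print(name, "got its second block" if ok else "stalled waiting")
```

Both arbiters report a stall, because each is waiting on something only the other can release. Software usually sidesteps this by making everyone claim resources in the same order.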
The Register is also covering the fun, and I’m of a similar mind regarding AMD’s evasion of responsibility with the language here. The company is denying this is a “stop ship”:
“We haven’t changed the shipping pattern,” AMD man Phil Hughes told InternetNews. “It’s only a stop ship if it’s shipping in volume, and we’re only shipping Barcelona for specific customer commitments, like larger volume deployments.”
AMD seems to be fiddling with language, as far as we’re concerned.
Look AMD: you screwed up. Take it like a man, fix the problem, and try to move on.
Cray XT4 quad-core systems use Budapest, not Barcelona, right? Are these Cray-bound Barcelonas for the XT5?
Jay – you are correct; the culprit is Budapest in the XT4, not the Barcelona. It appears that both flavors share the TLB bug.
Why are all these reports not giving proper credit to Scott Wasson of The Tech Report, who followed this story from the outset and was the first to get AMD to publicly admit the erratum? See the following articles for the original story:
http://techreport.com/discussions.x/13721
http://techreport.com/discussions.x/13724
http://techreport.com/discussions.x/13742
http://techreport.com/articles.x/13741
Steve: they are as far as I’ve seen. All of the coverage I’ve read (probably not all of the coverage out there, but all I’ve read), refers to Scott’s original, and if you link through to the DailyTech’s coverage, they mention and link to the Tech Report. I don’t because I excerpted the DT, not TTR.