Intel doesn’t seem to be in a hurry to get its own line of Knights coprocessors for HPC applications into the field, and maybe it doesn’t have to be.
To be sure, Nvidia is stealing most of the oxygen in the conversation about coprocessors for accelerating supercomputer applications with its Tesla family of GPU accelerators, most recently with the launch of the Tesla M2090 fanless coprocessors, which are based on its “Fermi” GPUs. Advanced Micro Devices is getting some, but not very much, traction with its FireStream coprocessors, built from its “Cypress” GPU chips, and is even retrofitting its FirePro workstation graphics cards for servers to support GPU acceleration with the fanless FirePro V7800P card, announced last month to try to blunt Nvidia’s attack on the HPC centers of the world.
Thus far, Intel’s Many Integrated Core (MIC) architecture is little more than a research project. Intel picked up the remnants of the failed “Larrabee” graphics card project and rechristened it Knights and put it solely in the service of the king of computing, the CPU. The Larrabee graphics coprocessor made a brief debut at the SC09 supercomputing trade show in November 2009, hitting one teraflops running the SGEMM single precision, dense matrix multiply benchmark when Intel overclocked it. A few weeks later, Intel said that Larrabee was being relegated to research status, and last May, ahead of the International Supercomputing Conference 2010 event in Hamburg, the chip maker said it was not going to enter the discrete graphics card business to take on Nvidia and AMD after all. And thus was born the “Knights” family of Many Integrated Core (MIC) discrete coprocessors for HPC applications.
A year ago, at ISC, Intel was talking up the MIC architecture, and at this year’s ISC, which takes place this week, Intel has taken a few baby steps toward getting the first MIC coprocessors to market. In a briefing with the press ahead of ISC, Anthony Neal-Graves, general manager of workstations and MIC computing at Intel, confirmed that the first MIC device, the “Knights Ferry” development platform, is being ramped as planned this year and that the first MIC commercial coprocessor, called “Knights Corner”, will be launched using Intel’s 22 nanometer Tri-Gate process technology. That same process is being used for future Xeon and Itanium processors, including the “Poulson” Itaniums and the “Ivy Bridge” Xeons, both due in 2012. It is not clear if Intel will roll out the MIC coprocessors ahead of the Poulson or Ivy Bridge chips, or get them out behind them, and the chip giant had no intention of clearing this up at ISC this week.
As the Top 500 list of supercomputers has shown for the past two years, advances in hybrid computing tools and the desire to do more computing for less money, using less electricity, and generating less heat are calling into question the use of CPU-only parallel machines to run giant simulations. But this market is far from mature yet, so Intel probably has enough time to get MIC coprocessors into the field before Nvidia gets a nearly insurmountable lead in coprocessors. And that is for two reasons: Intel has an economic advantage with its vast chip-making operations, and unlike GPU coprocessors the Knights coprocessors run plain old x64 code.
“At the end of the day, folks will go to wherever they can get maximum performance for the least amount of effort,” Neal-Graves explained on the call.
You don’t have to have a supercomputer and run a big simulation to see that things tend toward the lowest energy state in the universe. Intel is counting on the combination of its Fortran, C, and C++ compilers, which are popular among HPC shops, the MIC coprocessors, with their x64 instruction set, and its manufacturing prowess to allow it to at least catch up to Nvidia with its Tesla coprocessors and CUDA programming environment. Given all of this, maybe Intel doesn’t have to be in a hurry. Or, to say it another way, to get the kind of performance Intel needs to demonstrate with MIC coprocessors to pull even with Nvidia next year, maybe it cannot go any faster because of the work it needs to do in its compilers and in perfecting the 22 nanometer processes.
Conceptually, here’s what a MIC coprocessor looks like:
The MIC chips put multiple processing units consisting of an x64 core, a vector processor, and some cache memory into a module, and then cookie-cutter them onto the chip with a fast ring interconnect that keeps the caches for each core coherent (so they can share data quickly and function more or less like a baby parallel supercomputer). The MIC chip has a superscalar x64 core (without the out-of-order execution of Xeons, so akin to the Atom chip in some respects) and a 512-bit vector math unit that can do 16 floating point operations per clock with single precision math.
The Knights Ferry software development platform is based on a 32-core MIC design code-named “Aubrey Isle,” shown below:
The Aubrey Isle chip is implemented using Intel’s 45 nanometer processes and puts 32 cores running at 1.2GHz on the die; each core has four instruction threads, giving 128 execution threads per die. The chip has 8MB of shared L2 cache plus either 1GB or 2GB of GDDR5 graphics memory. (Yes, it is a graphics card, even if Intel says it isn’t, although good luck trying to find a video driver for it.) Each core has 32KB of L1 instruction and 32KB of L1 data cache, and the cores can talk to the GDDR5 memory through PCI direct memory access operations with virtual addressing.
As for the production chip, all Intel has said, and which it reiterated for the ISC briefing, is that the future Knights Corner coprocessor would have more than 50 cores and would use the 22 nanometer process. The rumored delivery for the chip is the second half of 2012, which means Knights Corner follows Ivy Bridge PC and server chips and probably Poulson Itaniums, too. If I had to guess, I would say that the chip used in the Knights Corner coprocessor will have 64 cores and vector units on a ring interconnect – or perhaps multiple rings within rings – and that the number of activated cores and vectors will depend on how many bits of gunk kill off computing units on a chip as the 22 nanometer processes ramp.
This is perfectly normal in the GPU space. For instance, with the Fermi GPUs from Nvidia, the design has 512 cores, but the first year of shipments for the M2050 and M2070 units had GPUs with only 448 cores working; only with the just-announced M2090s and the future X2090s (for embedding right onto server system boards) are all 512 cores working. Similarly, CPUs that are designed with a high number of cores are sold with SKUs with fewer cores because of boogers on the chips; using these semi-dud chips increases yields and hurts no one.
PCI today, on-package tomorrow
Intel is not confirming whether the Knights Corner coprocessor will plug into servers using the PCI-Express 2.0 interface or the faster PCI-Express 3.0 interface that will debut in the third quarter of this year with Intel’s “Sandy Bridge” Xeon E5 processors and Advanced Micro Devices’ “Interlagos” Opteron 6200s. Intel could go either way – or both ways depending on what prospective customers are telling it. The PCI bus is the bottleneck for coprocessors and networks alike at this point, so you have to believe that Intel is tempted to only support PCI-Express 3.0 slots given the low volume of MIC coprocessors it will ship and its desire to pair these with Sandy Bridge server upgrades.
Over time, it is perfectly reasonable to expect that Intel will package up a Knights coprocessor on the chip package, or perhaps on the die itself, for its HPC customers. It has to do something with all of those transistors on future 14 nanometer processes after all.
Neal-Graves said that Intel initially had 10 partners helping it work with the Knights Ferry development platform last year, and ramped that up to 25 partners by the end of 2010. By the end of June, the plan is to have 50 partners, and the goal is to expand that to 100 by the end of this year. At the ISC event in Hamburg, Germany, this week, Intel is showing off servers and workstations using the Knights Ferry development platform, working with Silicon Graphics, Hewlett-Packard, IBM, Dell, Super Micro, and Colfax International. The Forschungszentrum Juelich (FZJ) and Leibniz Rechenzentrum (LRZ) labs in Germany, the Conseil Européen pour la Recherche Nucléaire (CERN) in Switzerland, the National Center for Supercomputing Applications (NCSA) in the United States, and the Korea Institute of Science and Technology Information in South Korea are all playing around with prototype Knights Ferry machines, testing their codes.
In one test, Colfax took one of its CXT8000 servers with two Intel Xeon X5690 chips (six cores running at 3.46GHz) and slapped in 24GB of memory and eight of the Knights Ferry coprocessors with the chips running at 1.2GHz. Using the Larrabee 1.6.197 kernel driver (see, it really is a Larrabee GPU), this 4U rack server was able to deliver 7.4 teraflops running the aforementioned SGEMM matrix multiply benchmark. This was using alpha levels of future compilers and drivers from Intel.
Intel is not talking about the performance of the Knights Corner coprocessors, but if you do the math, a chip with somewhere between 50 and 64 cores running at between 1.2GHz and 1.5GHz should get you around 1 teraflops at single precision floating point math and half that on double precision. Crank it up to 2GHz, which the 22 nanometer process should allow before flames shoot out of the device, and then a 64-core MIC coprocessor gets you 2 teraflops single precision and the magic 1 teraflops at double precision, we estimate. ®
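For anyone who wants to check the back-of-the-envelope math, here is a rough Python sketch. It assumes every core retires the 16 single precision operations per clock quoted above for the 512-bit vector unit, and that double precision runs at half that rate. The function name and the list of core counts and clock speeds are ours, picked to bracket the guesses in this story; these are theoretical peaks, not anything Intel has published or measured.

```python
# Back-of-the-envelope peak throughput for a MIC-style chip.
# Assumptions (ours, not Intel's): each core retires 16 single precision
# floating point operations per clock, and double precision runs at half that.

SP_FLOPS_PER_CLOCK = 16  # per core, from the 512-bit vector unit description


def peak_teraflops(cores: int, ghz: float, double_precision: bool = False) -> float:
    """Return the theoretical peak in teraflops for a core count and clock speed."""
    flops_per_clock = SP_FLOPS_PER_CLOCK / 2 if double_precision else SP_FLOPS_PER_CLOCK
    # cores * flops per clock * GHz gives gigaflops; divide by 1,000 for teraflops
    return cores * flops_per_clock * ghz / 1000.0


if __name__ == "__main__":
    # Hypothetical configurations spanning the ranges guessed at in the text
    for cores, ghz in [(50, 1.2), (64, 1.2), (64, 1.5), (64, 2.0)]:
        sp = peak_teraflops(cores, ghz)
        dp = peak_teraflops(cores, ghz, double_precision=True)
        print(f"{cores} cores at {ghz:.1f}GHz: {sp:.2f} TF single, {dp:.2f} TF double")
```

Under those assumptions, 50 cores at 1.2GHz works out to 0.96 teraflops single precision and 64 cores at 1.5GHz to about 1.54 teraflops, which brackets the "around 1 teraflops" figure. At 2GHz, 64 cores come to roughly 2.05 teraflops single precision and 1.02 teraflops double precision.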
This article originally appeared in The Register.