Exascale Exasperation: Why DOE Gave Intel a 2nd Chance; Can Nvidia GPUs Ride to Aurora’s Rescue?

The most talked-about topic in HPC these days – i.e., another Intel chip delay and therefore delay of the U.S.’s flagship Aurora exascale supercomputer – is something no one directly involved wants to talk about. Not Argonne National Laboratory, where Intel was to install Aurora in 2021; not the Department of Energy’s Exascale Computing Project, guiding development of a “capable exascale ecosystem”; and not DOE itself. As for Intel, a spokesperson earlier this week promised to “circle back shortly,” but hadn’t as of press time.

In lieu of information (other than rote answers from public relations Q&A sheets issued three weeks ago by Intel and Argonne), the HPC community is left to speculate on how those responsible for Aurora got into this fix, what it means for the U.S. exascale strategy and what DOE and Intel can do about it.

By way of background, on July 23, Intel disclosed that its 7nm Ponte Vecchio GPU, planned to be integrated with Intel Xeon CPUs in Aurora, will be delayed six months. The general reaction could be described as shocked but not surprised.

Shocked due to the high-stakes, strategic importance of Aurora for the U.S. exascale effort, a multi-year, multi-billion dollar project critical to the technological prowess of the U.S. relative to its chief geopolitical rival, China, which is scheduled to stand up an exascale system next year, along with the EU and Japan.

Not surprised due to Intel’s woeful record of 10nm and 7nm misses, along with its failure two years ago to deliver the pre-exascale Aurora (A18) system to Argonne (for more background see “Another Intel 7nm Chip Delay – What Does it Mean for Aurora Exascale?”).

“Over the past five years, Intel has unleashed a plague of delays and decommitments on HPC users,” said industry analyst Addison Snell, CEO of Intersect360 Research. “Aurora was originally supposed to be a pre-exascale system in 2018, but it had to be redefined after Intel’s cancellations of the Xeon Phi processor and the OmniPath interconnect. Intel needed to deliver Ponte Vecchio and Aurora on-time and on-spec (using revised definitions of both terms), just to save face. If Argonne has to wait until 2022 for delivery, this exascale supercomputer will become an embarrassing afterthought in the shadow of the Frontier and El Capitan systems elsewhere within the DOE, to say nothing of whatever gets done outside the U.S.”

Swap Nvidia for Intel GPUs?

Snell suggested that, in the absence of Intel’s 7nm Ponte Vecchio GPU, Nvidia GPUs be coupled with Intel CPUs within Aurora.

“At this point, the best solution for Argonne, for U.S. exascale efforts, and for the American taxpayer, may be for Intel to eat crow and open the door to Nvidia to provide the GPU components of Aurora,” Snell told us. “Nvidia is the leading GPU provider by miles. Nvidia has proven success on the pre-exascale CORAL systems, and it would give the DOE a potential opportunity to preserve optimizations in CUDA without relying on alternate GPUs from Intel or AMD.”

“Intel placed a big bet in becoming the prime vendor on this contract,” Snell added, “and with that comes the responsibility to deliver something. The DOE should insist on delivery in mid-2021, even if it means Intel has to put in someone else’s GPUs in order to fulfill the terms of the contract. Whatever Argonne and the DOE decide to do, it should be clear that Intel has used up its second chances with Aurora.”

Snell argued there’s no compelling technical reason why both CPUs and GPUs have to be Intel parts.

“AMD offers coherency benefits in combining AMD CPUs and GPUs over a common Infinity Fabric, but we have not heard the same from Intel with regard to its own processors and its planned Ponte Vecchio GPU,” he said. “If Intel is serious about OneAPI (cross-architecture programming model), it shouldn’t matter from a programming perspective what GPU it is.”

But Karl Freund, senior analyst, machine learning & HPC, at Moor Insights & Strategy, doubted whether swapping Nvidia for Intel GPUs could work.

“Here’s the problem, the common thread across all three U.S. DOE exascale deployments is that they’re tightly integrated CPU-GPU complexes,” he said. “So the GPU is not sitting on a PCIe card… It’s not like an APU that you’d find in a laptop where the CPU-GPU is actually on the same package, but it is using the same concept, the native SMP fabric of the CPU is talking directly to the GPU. That’s true for both Ponte Vecchio-Xeon and for AMD’s next generation Radeon (GPU)-Epyc (CPU).”

Substituting Nvidia for Intel GPUs, Freund said, would mean higher latencies and slower performance because Xeon “only speaks PCIe to an Nvidia link.”

The result: DOE and Argonne don’t have much choice but to stick with their prime contractor, Intel.

“They just have to stay the course and recognize they’re (Argonne) not going to be the first exascale,” said Freund.  “And larger exascales (Frontier at Oak Ridge National Lab and El Capitan at Lawrence Livermore National Lab) are going to beat them with AMD technology. So they (Argonne) could say, ‘Well, okay, do we want to step back and redesign Aurora to be larger?’ Because it’s not very interesting when you’re third – let’s assume Aurora will be DOE’s third exascale – and it’s slower than your first two. That’s not cool.”

Questions about Intel

Intel’s failures have called into question not only the company’s ability to execute but also the stability of the management team involved in HPC technologies. Departures among senior managers include:

  • Alan Gara, who led development of Intel’s heavily promoted OmniPath high performance fabric, which was abandoned last year
  • Raj Hazra, former corporate VP/GM, Enterprise & Government, Data Center Group, who left Intel in November and is now at Micron
  • Charles Wuischpard, former VP of Intel’s Datacenter Group, who is now CEO of Ayar Labs
  • Daniel McNamara, formerly Intel’s president/GM of the Network and Custom Logic Group and SVP of the Programmable (i.e., FPGA) Solutions Group, who is now at AMD
  • Diane Bryant, who left her role as group president of the Data Center Group more than three years ago

Speculation is spreading that more changes could be on the way.

“They seem to be disbanding, maybe reconstructing, but disbanding, what was their flagship team for developing trans-exascale machines,” said a leading HPC authority, “so they seem to be drifting backwards. There are some lower level people, and I don’t know if they’ve been announced or not. And I think there are a couple of cases up in the air.”

A more lenient view of Intel’s Aurora-related challenges comes from industry analyst firm Hyperion Research, whose Senior Adviser, HPC Market Dynamics, Steve Conway, told us in July that if shipment of Aurora is moved to late 2021 or into 2022 “it’s not a major delay – but a delay all the same.” He also downplayed Intel’s difficulty in fixing the 7nm process problems behind the Ponte Vecchio delay.

Conway’s colleague Bob Sorensen, Hyperion Senior VP of Research, argued that, to a degree, delays should be expected when building leading-edge supercomputers. While he agreed that “it’s pretty clear that no one at Intel wants to be the guy to write the press release that says, guess what, we’re showing some delays,” he also said “we’re at a place in semiconductor manufacturing where this is no longer simply engineering where you’re turning the crank to move down to the next node.”

“There are new technologies required every step of the way now, so it’s hard, really hard,” Sorensen said. “I’m not defending Intel and I’m not saying that nobody messed up here. But to me, these are pretty aggressive systems, new architectures. There should be some allowance for slips in the schedule. If every machine came off exactly like clockwork, like it was supposed to, you’d have to ask the question: are we sufficiently pushing the state of the art here? Or have we become too risk averse in our new architecture? …that’s the nature of the beast when you’re pushing the envelope.”

Conway’s and Sorensen’s view of Intel, however, seems to be a minority one. One source told us that a senior DOE official involved in exascale has told colleagues he’s “never been so angry.” In part this is due to the second chance Intel was given after the failure of pre-exascale Aurora. But here, too, there’s a countervailing view that Intel was granted Aurora A21 because AMD, in partnership with Cray, won the other two of DOE’s three initial exascale contracts.

“DOE has never liked to put all of our eggs in one basket,” Moor Insights’ Freund said, “just like back in the days when they’d switch back and forth between IBM and Intel and then Nvidia…. I think some of that was in play here where they already foresaw that because AMD would get the other exascales, even though Argonne had already been awarded to Intel, they said we can’t put all our eggs in the AMD basket, that’s just too risky. And it’s not good for U.S. industry to stifle competition like that. So I think they probably had both practical and altruistic objectives by keeping that (A21) with Intel.”

Indigenous Technology

Another Intel advantage: the growing value placed by leading technological countries on “indigenous technology.” While Intel is a domestic fab, AMD outsources chip production to Taiwan-based TSMC. But that advantage was undermined last month by Intel’s disclosure that it may partially outsource production of 7nm chips to a third-party semiconductor foundry – assumed to be TSMC (Samsung is another possibility), whose 7nm CPUs and GPUs for AMD have given that company price/performance leads enabling it to take HPC and data center processor market share from Intel.

But in the latest twist, TSMC announced in May it would build a fab in Arizona, which could have implications for the growing U.S.-China rivalry.

“There is a concern about the lack of indigenous competitive advanced node process,” Freund said. “I mean, Intel can’t do it. Samsung can, their fab’s in Austin, and now with TSMC building a plant in Arizona there could be indigenous production capability from TSMC as well, though it’s still out there a ways.

“But there’s real concern as U.S.-Chinese relations continue to deteriorate,” said Freund. “Should China make a significant blunder and try to isolate Taiwan, which would require major military action, should they do that, the U.S. is at risk. We are absolutely exposed right now. And Chinese leaders are smart guys. They know that. And so if we push hard enough, they might just push back.”

Comments

  1. There is most certainly coherency between CPU and GPU for Ponte Vecchio. One of the key points for the Aurora Supercomputer on the Argonne page is “Unified memory architecture across CPU and GPU”. Probably the main reason Sapphire Rapids (the CPU in A21) is so early to PCIe 5.0 is because CXL, Intel’s coherent fabric for accelerators on their bus, is tied to PCIe 5.0.

    NVIDIA is a CXL member, but NVIDIA will be mid-generation in 2021. Intel would need to use NVIDIA’s A100 GPUs and the A100 doesn’t have CXL. NVIDIA would need to make a special version of the chip, if they had the time and inclination to do so. And, being a prior generation chip, it would use a lot of power to reach Intel’s promised performance targets. It was always a risky bet that Intel could reach such a target by themselves, anyway, when they had no experience with such high performance GPUs. They were relying on some cutting-edge process (Intel’s 7 nm is beyond TSMC’s 7 nm, 7+, or 6 nm) to make up for the immaturity of their architecture. And even so they were promising to be ready a year earlier than NVIDIA or AMD. But Intel was already struggling mightily for years with their previous generation process (10 nm). I don’t see why the DOE should be so mad at Intel suddenly. Sure Intel made some promises they couldn’t follow through on, but they had been doing that for years with Xeon Phi and Omnipath. The DOE is the one who made the bet, knowing all of the above.

    Let’s hope their bets with the AMD systems turn out better. They are less risky bets but there is still some risk. The risky thing is the DOE is spending over a billion and a half dollars on a surge of compute capacity and trying to do all of it on the cheap with less-proven components. They didn’t leave themselves any escape valve at all. I suppose they could still procure the A22 system under Coral-2, but that seems like a choice between letting Aurora slip to 2022 or canceling Aurora and getting A22, unless they can convince Congress to give them the money for a 4th $500M+ system. But, hey, these days you go up a trillion, we go down a trillion, what’s half a billion?

  2. “By way of background, on July 23, Intel disclosed that its 7nm Ponte Vecchio GPU, planned to be integrated with Intel Xeon CPUs in Aurora, will be delayed six months.”

    Actually, Intel didn’t say that. Here is the quote from their earnings conference call. It only references their 7nm CPUs, which were already scheduled for 2022. Looks like they have a plan to meet the schedule for the GPUs.

    “We are seeing an approximate six-month shift in our seven-nanometer-based CPU product timing relative to prior expectations.”

  3. “‘AMD offers coherency benefits in combining AMD CPUs and GPUs over a common Infinity Fabric, but we have not heard the same from Intel with regard to its own processors and its planned Ponte Vecchio GPU,’”

    It is the exact opposite. Intel’s Sapphire Rapids CPU is planned to connect to the GPUs, or to their Rambo Cache, via CXL. The main feature of that plan is to use the asymmetric cache coherency of CXL. Intel announced yesterday that Sapphire Rapids is in the lab and will be sampling soon.

    AMD’s 2021 roadmap shows interconnects between CPU and GPU via PCIe 4.0. The GPUs interconnect via Infinity Fabric, but there is no cache coherency between CPU and GPU on the roadmap with IF until the third generation. This info is from the FAD2020 slides. See the Anandtech article named “amd-moves-from-infinity-fabric-to-infinity-architecture-connecting-everything-to-everything”.

  4. Perhaps a stupid question, but why did ANL pick this solution instead of an Nvidia Ampere-based one?

    • Politics. Intel promised that both CPU and GPU will be made in the USA. A promise that could not be kept.