Hyperion HPC User-Buyer Study: Demand for Sim-Analytics Systems, a Throughput Boom, FPGAs and AMD GPUs on the Move and Other Findings

Industry analyst firm Hyperion Research has completed its latest study of high performance computing buyers and users, its first since 2017, and the report reveals a quickly evolving and innovating industry in which, among other findings, end users are figuring out how to leverage a variety of compute architectures while also calling for HPC systems that can, under one roof, bridge the simulation-data analytics gap.

The study examines the resources, practices and purchasing plans of 194 government, academic and industrial HPC data centers hosting nearly 2,000 systems in 26 countries, Hyperion said, and it emphasizes private-sector sites to enable deeper dives into specific industries.

Findings include:

Accelerating Processor Heterogeneity: 75 percent of the sites employ accelerators/coprocessors, according to Hyperion. The firm said Nvidia GPUs have a sizeable lead in this category, but AMD GPUs are gaining momentum.

“Nvidia is still dominant, there’s no doubt about it,” Steve Conway, Hyperion senior advisor for HPC market dynamics, told us, “but AMD, from a small starting point, has been gaining pretty well and actually taken some business away from Nvidia… And one of their advantages is that their CPU and the GPU are both on the same silicon” (sparing users much of the explicit data management the GPU programming model otherwise requires) “so you don’t have as much back and forth (of data), and for some situations, that can be a very important advantage,” Conway said.
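
To put rough numbers on the data back-and-forth Conway mentions, here is a minimal back-of-envelope sketch in Python. The PCIe 4.0 x16 bandwidth is a published theoretical peak; the working-set size and step count are hypothetical assumptions, not figures from the study:

```python
# Back-of-envelope sketch of host-device copy overhead for a discrete GPU.
# The ~32 GB/s figure is the theoretical peak of a PCIe 4.0 x16 link; the
# 16 GB working set and 10,000 solver steps are hypothetical assumptions.

PCIE4_X16_GBPS = 32  # GB/s, theoretical peak for PCIe 4.0 x16

def copy_overhead_seconds(working_set_gb: float, steps: int) -> float:
    """Seconds spent shuttling the working set to the GPU and back each step."""
    return steps * 2 * working_set_gb / PCIE4_X16_GBPS  # 2 = to device + back

# 16 GB moved both ways on each of 10,000 steps:
print(f"{copy_overhead_seconds(16, 10_000):,.0f} s of pure copy time")
# ~10,000 s; on a shared-silicon (APU) design this traffic largely disappears.
```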

On the CPU side, AMD is gaining versus Intel for two reasons. “One is that the prices are generally lower, and second, the memory bandwidth is very, very good,” Conway said. “So AMD is in a good position right now but battling a very dominant x86 player and a very dominant GPU player.”

FPGA accelerators also are gaining HPC market acceptance, up from 5 percent adoption in 2017 to 18 percent now. Conway said users in some sectors, such as financial services, will take the trouble to optimize the performance of their most important applications on this lightning-fast architecture.

“This is a typical FPGA scenario: they have one application that’s more important than any other, it’s the application that nobody else has, and they run it 24/7/365,” he said. “So what some investment banks have told us is they’ll port it to the GPU and get a 3X to 4X speedup. And that gives them confidence to take the time to port it to an FPGA, and then get a 30X to 40X speedup. It’s more work to port to an FPGA, but if the application is really important, and you run it all the time, in a lot of cases, it’s worth it.”
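
As a rough illustration of that calculus, the sketch below uses the 3X and 30X speedups from Conway’s example; the assumption that the application occupies a machine around the clock mirrors his 24/7/365 scenario, and everything else is hypothetical:

```python
# Hypothetical illustration of the porting calculus Conway describes for an
# application running 24/7/365. The 3X and 30X speedups come from his
# example; everything else here is an assumption for illustration only.

HOURS_PER_YEAR = 24 * 365  # the app runs around the clock

def hours_freed_per_year(speedup: float) -> float:
    """Compute-hours per year freed for extra runs by a given speedup."""
    return HOURS_PER_YEAR * (1 - 1 / speedup)

for label, speedup in [("GPU port (3X)", 3), ("FPGA port (30X)", 30)]:
    print(f"{label}: ~{hours_freed_per_year(speedup):,.0f} hours/year freed")

# GPU port (3X): ~5,840 hours/year freed
# FPGA port (30X): ~8,468 hours/year freed
```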

Compute Power Boom: Hyperion said its 2017 study showed peak performance for the sites’ largest HPC systems averaged 3.9 petaFLOPS; the new study shows the average almost quadrupled, to 15.4 petaFLOPS.


“That one is surprising, that it went up that fast and in just a few years,” Conway said. “The normal rate of performance increase in HPC has been about 1000 fold every 10 years. So to have systems that are almost four times as big in just a few years…is very surprising.”

Vendor Variety: While 51 percent of the surveyed end-user applications still run on a single node, 28 percent of the sites have more than 15 HPC systems, highlighting not only the quantity but also the variety of systems at HPC sites.

Conway said there’s a common misconception among HPC vendors, particularly newer ones, that buyers purchase a new system only every few years and that they stick to one vendor.

“No, if you go to their data center, they probably have six different companies’ systems in their HPC data center,” said Conway, “so it doesn’t quite work like that with the big guys.”

All-in-One Sim-Analytics: Conway said there’s growing market demand for HPC systems efficient at both compute-intensive simulation and data-intensive analytics.

“We’re in an interesting period,” Conway said, “because right now, most people have no choice except to run (analytics workloads) on their existing HPC systems, which over the past couple of decades have become increasingly compute-friendly and data-unfriendly. That’s part of why GPUs have been so successful as kind of a plugin to help deal with that problem… Because if you bought a system today that is good at analytics, it’s way overkill, you’re paying way too much for your simulation.”

From the user perspective, HPC workloads increasingly require both simulation and analytics capabilities.

“It turns out that most of the important AI use cases – whether it’s precision medicine or research on automated vehicles – benefit from concurrent simulation and analytics runs on the same workload,” Conway said, “so you really want an HPC system that is efficient (at both).”

This need is being addressed by new interconnect technologies, such as HPE Cray’s Slingshot fabric, the Intel-initiated Compute Express Link (CXL) standard, and the switchless fabric from Canada’s Rockport Networks.

“You need a pretty good way of moving data around inside the system,” Conway said. “It’s about designing for data.”

Cloud Considerations: Data locality has surpassed data security as the most frequently cited barrier to exploiting cloud computing, Hyperion said.

Moving and storing large volumes of data in clouds is a major limitation, said Conway, citing a big oil and gas company that some years ago built a new HPC center within 200 feet of its old one, “and it took them almost three months to move the data 200 feet.”
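
A back-of-envelope calculation shows how a move like that can stretch into months. The 10-petabyte volume and 10 Gb/s sustained rate below are illustrative assumptions rather than figures from the study or the anecdote, but they land in the same ballpark:

```python
# Back-of-envelope data-migration timescale. The 10 PB volume and the
# 10 Gb/s sustained rate are illustrative assumptions, not figures from
# the study or the anecdote; they simply land in the same ballpark.

PETABYTE = 10**15  # bytes

def transfer_days(volume_bytes: float, rate_gbps: float) -> float:
    """Days needed to move volume_bytes at a sustained rate in gigabits/s."""
    seconds = volume_bytes * 8 / (rate_gbps * 1e9)
    return seconds / 86_400

print(f"~{transfer_days(10 * PETABYTE, 10):.0f} days")  # ~93 days, roughly three months
```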

“Data locality might be the last important barrier to HPC in the cloud,” said Conway. “Even the cloud services providers do not advise companies – if they have huge piles of data on premises close to some on-premises HPC resources – they don’t advise them to move that data to the cloud.” He said cloud vendors often pitch a multi-cloud services strategy in which they are “the point of control” between the customer’s on-prem compute resources and other clouds.

Cloud HPC Growth: That said, Hyperion’s 2017 study showed that only 4 percent of the HPC sites ran more than half their HPC workloads in public/external clouds; in the new study, that figure grew to 12 percent.

Other findings:

  • C++ and Python are the leading parallel programming languages, but nearly half of the sites still use Fortran.
  • InfiniBand continues to be the leading storage-system backbone protocol, but Ethernet is rising quickly.

More information on the study can be found here.