Third-party performance benchmarks show CPUs with HBM2e memory now have sufficient memory bandwidth and computational capabilities to match GPU performance on many HPC and AI workloads.
By Rob Farber
Recent Intel and third-party benchmarks now provide hard evidence that the upcoming Intel® Xeon® processors codenamed Sapphire Rapids with high bandwidth memory (HBM2e) and Intel® Advanced Matrix Extensions can match the performance of GPUs for many AI and HPC workloads. For many memory bandwidth bound applications, GPUs no longer outperform CPUs simply because they have a faster memory system. HBM2e memory for CPUs levels the playing field, making architectural differences the deciding factor in choosing which of the two devices is the preferred platform for a given workload.
The numbers tell the story: adding 64GB of on-package HBM2e memory raises the memory bandwidth available to the forthcoming Intel Xeon processors codenamed Sapphire Rapids with high bandwidth memory to approximately 1.22 TB/s, roughly four times that of a similar Intel Xeon CPU with eight DDR5-4800 channels. [1]
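The "four times" figure can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: the function name and constants are illustrative, and it assumes each DDR5 channel moves 8 bytes (64 data bits) per transfer, which the article does not state.

```python
# Back-of-the-envelope check of the bandwidth ratio quoted in the article.
# Assumption (not from the article): each DDR5 channel carries 64 data bits,
# i.e. 8 bytes per transfer.

def ddr_peak_bandwidth_gbs(channels: int, megatransfers_per_s: int,
                           bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * megatransfers_per_s * 1e6 * bytes_per_transfer / 1e9

DDR5_CHANNELS = 8            # from the article
DDR5_MT_PER_S = 4800         # DDR5-4800
HBM_BANDWIDTH_GBS = 1220.0   # approximately 1.22 TB/s, from the article

ddr5_gbs = ddr_peak_bandwidth_gbs(DDR5_CHANNELS, DDR5_MT_PER_S)
ratio = HBM_BANDWIDTH_GBS / ddr5_gbs

print(f"DDR5 peak: {ddr5_gbs:.1f} GB/s")
print(f"HBM2e / DDR5 ratio: {ratio:.2f}x")
```

Under these assumptions the DDR5 peak comes to about 307 GB/s, which puts the ratio just under 4×. These are theoretical peaks; sustained bandwidth on real hardware is lower for both memory types.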
Achieving GPU Performance on AI Workloads
The performance of bandwidth-limited workloads is dictated by the bandwidth of the memory subsystem. Historically, GPU memory systems could deliver significantly more bytes per second to their computational units, which gave GPUs the ability to run many (but not all) memory bandwidth limited workloads much faster than CPUs. (The “not all” is important to HPC workloads and will be discussed below.)
This history made AI workloads a de facto performance win for GPUs.
GPUs are fantastic for running data parallel workloads where every thread executes the same instruction in lockstep. This is referred to as the single instruction, multiple data (SIMD) execution architecture. The very time-consuming training of AI models is an excellent example of the trade-offs between computation, memory performance, and the amount of parallelism in the problem. [2] Bulk inference operations (e.g., using the trained model to make predictions) exhibit the same trade-offs.
At the same time, the performance of floating-point units has significantly outpaced memory performance for decades. [3] This explains why the training and evaluation of many AI models can very quickly become memory bandwidth bound: the computational units can process data faster than the memory system can fetch it.
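The relationship between compute rate and memory bandwidth is often summarized with the roofline model, in which attainable performance is the lesser of peak compute and bandwidth multiplied by the kernel's arithmetic intensity (floating-point operations per byte moved). The sketch below is a minimal illustration, not a measurement: the peak compute figure and the two arithmetic intensities are hypothetical, and the DDR5 bandwidth is a theoretical peak that assumes 8 bytes per transfer on each of eight DDR5-4800 channels.

```python
# Minimal roofline-model sketch. All hardware numbers except the ~1.22 TB/s
# HBM2e figure are assumptions chosen for illustration.

def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Performance is capped either by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

PEAK_GFLOPS = 3000.0   # hypothetical peak compute rate
DDR_GBS = 307.2        # theoretical peak, 8 x DDR5-4800 at 8 bytes/transfer (assumed)
HBM_GBS = 1220.0       # approximately 1.22 TB/s, from the article

# Hypothetical kernels: one streams data, one reuses data heavily.
kernels = {"streaming (0.25 flop/byte)": 0.25, "dense (50 flop/byte)": 50.0}

for name, intensity in kernels.items():
    ddr = attainable_gflops(PEAK_GFLOPS, DDR_GBS, intensity)
    hbm = attainable_gflops(PEAK_GFLOPS, HBM_GBS, intensity)
    print(f"{name}: DDR5 {ddr:.1f} GFLOP/s, HBM2e {hbm:.1f} GFLOP/s, "
          f"gain {hbm / ddr:.2f}x")
```

In this toy model the streaming kernel is limited by memory on both systems, so its attainable performance scales with bandwidth, while the dense kernel is already limited by compute and gains nothing from faster memory.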
The advent of deep learning quickly increased the complexity of AI models and the types of computations performed inside them. In response, GPU vendors added special instructions to their instruction set architectures (ISA) to speed the processing of popular deep-learning models. Convolutional neural networks are one example. The advent of hardware accelerated reduced-precision arithmetic is another. Such ISA changes, along with the unavoidable fact that training more complex models generally requires big increases in the training set size [4], again made memory bandwidth the gating performance factor for many AI workloads.
The forthcoming Intel Xeon Processors codenamed Sapphire Rapids with high bandwidth memory make all of this past history, as Intel has incorporated these technology advantages into a CPU solution. As can be seen in the AlphaFold2 benchmark results below, the combination of HBM2e memory and Intel® Advanced Matrix Extensions (Intel® AMX) provides a massive performance uplift on an important problem. According to the National Library of Medicine, “AlphaFold2 was the star of CASP14, the last biannual structure prediction experiment. Using novel deep learning, AF2 predicted the structures of many difficult protein targets at or near experimental resolution.” [5]
The AlphaFold2 benchmark results also show that the Intel Xeon processors codenamed Sapphire Rapids with high bandwidth memory outpace other CPUs. Vikram Saletore (director and principal engineer in the Super Compute Group at Intel) explains, “The AlphaFold2 results reflect how good these processors will be for both cloud and HPC users.”
The DeepCAM training performance reported by Intel at ISC’22 demonstrates faster performance compared to an NVIDIA A100 80GB GPU. Overall, the DeepCAM results (bottom graph below) show an up to 3.6× improvement over the baseline EPYC 7763. [6] The benchmark’s modified DeepLab v3 AI model has 57 million trainable parameters and is trained on batches of very large 2D images, 64MB each. These benchmarks have a significant PCIe and I/O component during training that affects GPU performance. For these reasons, the top graph (below) illustrates the many hours of machine time that can be saved.
An Across-the-Board 2× to 3× Increase in HPC Performance
In his ISC’22 keynote, Intel’s Jeff McVeigh (vice president and general manager of the Super Compute Group at Intel Corporation) cited significant performance improvements across key HPC use cases. “When comparing 3rd Gen Intel Xeon Scalable processors to the upcoming Intel Xeon Processors codenamed Sapphire Rapids with high bandwidth memory,” McVeigh observed, “we are seeing two- to three-times performance increases across weather research, energy, manufacturing, and physics workloads.” [7] [8]
Commercial HPC is Included – Ansys Fluent
Ansys Fluent is a particularly high-profile use case of interest to many commercial HPC users. Fluent is industry-leading fluid simulation software known for its advanced physics modeling capabilities and high accuracy.
At ISC’22, Ansys CTO Prith Banerjee summarized the Ansys results by noting that the Intel Xeon Processors codenamed Sapphire Rapids with high bandwidth memory delivered up to a 2× performance increase on real-world workloads from Ansys Fluent and ParSeNet. [10] [11] This speedup is reflected across a broad spectrum of Fluent benchmarks.
This is only a beginning, as Wim Slagter, Strategic Partnerships Director at Ansys, explains: [13] “Engineers are continuously challenged to innovate better and faster. To address these challenges, we are excited to see Intel driving HPC to new heights with the Intel Xeon Processors codenamed Sapphire Rapids with high bandwidth memory. In early testing of our Fluent CFD software on these processors, we are seeing up to 2.2× performance gains over the previous generation of Intel Xeon Platinum processors due to extremely high memory bandwidth from HBM as well as AVX-2 support and high core frequency.” [14]
The ParSeNet performance challenges GPU AI dominance with a 1.8× speedup in training performance compared to an NVIDIA A100 GPU. ParSeNet is a parametric surface fitting network for 3D point clouds.
Looking Ahead to Integrated AI and HPC Workloads
The combination of HBM2e memory and Intel’s ISA improvements can make these processors the preferred AI and HPC platform. Based on industry trends, Saletore takes this further when he predicts, “After a lot of talk and little movement over recent years, we’re starting to see real movement towards the integration of AI into HPC workloads. With the upcoming Intel Xeon Processors codenamed Sapphire Rapids with high bandwidth memory, Intel makes a compelling argument that our CPUs are not just viable for this new integrated class of workloads, but these processors are the preferred platform for these workloads.”
Saletore’s belief is solidly grounded in seminal research by CERN and other research organizations. CERN, for example, demonstrated that AI-based models can act as orders-of-magnitude-faster replacements for computationally expensive tasks in an HPC simulation, while still maintaining a remarkable level of accuracy. [16]
CosmoFlow results using TensorFlow shown in Figure 5 (and ParSeNet shown previously in Figure 4) support Saletore’s prediction.
The CosmoFlow training application benchmark is part of the MLPerf HPC benchmark suite. It involves training a 3D convolutional neural network on N-body cosmology simulation data to predict physical parameters of the universe. [17] The benchmark, built on top of the TensorFlow framework, makes heavy use of convolution and pooling primitives. [18] In total, the model contains approximately 8.9 million trainable parameters and is trained on batches of large 3D images of about 3MB each.
Match the Parallelism of the Problem to the Device
A huge challenge with GPUs lies in mapping HPC applications and numerical algorithms to the GPU SIMD architecture. The fundamental problem is that SIMD architectures (e.g., GPUs) impose limitations that don’t exist when programming the general-purpose multiple instruction, multiple data (MIMD) architecture implemented by CPUs. Further, many HPC and AI workloads do not need the massive parallelism of GPUs. With Intel Xeon processors codenamed Sapphire Rapids with high bandwidth memory, users need only match the parallelism and architecture of the device to the workload.
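One commonly cited limitation of lockstep execution is branch divergence. The toy cost model below is a hedged illustration only, not a description of any specific GPU or CPU: the function names and cycle costs are hypothetical, and it assumes a lockstep group must execute every branch path taken by any of its lanes, while independent threads each run only their own path on their own core.

```python
# Toy cost model contrasting lockstep (SIMD-style) and independent
# (MIMD-style) execution of an if/else. Costs are arbitrary "cycles"
# chosen for illustration; both functions expect a non-empty list.

def simd_lockstep_cost(takes_if, cost_if, cost_else):
    """Lanes share one instruction stream: if they disagree on the branch,
    the group runs both paths with the inactive lanes masked off."""
    cost = 0
    if any(takes_if):
        cost += cost_if
    if not all(takes_if):
        cost += cost_else
    return cost

def mimd_cost(takes_if, cost_if, cost_else):
    """Each thread runs only its own path (one core per thread assumed);
    elapsed time is set by the slowest thread."""
    return max(cost_if if t else cost_else for t in takes_if)

COST_IF, COST_ELSE = 10, 30           # hypothetical path costs

uniform = [True, True, True, True]    # every lane takes the same path
divergent = [True, False, True, False]

for label, pattern in (("uniform", uniform), ("divergent", divergent)):
    simd = simd_lockstep_cost(pattern, COST_IF, COST_ELSE)
    mimd = mimd_cost(pattern, COST_IF, COST_ELSE)
    print(f"{label}: lockstep {simd} cycles, independent {mimd} cycles")
```

When all lanes agree, the two models cost the same. When they diverge, the lockstep group pays for both paths, which is one reason irregular, branch-heavy algorithms can be harder to map efficiently onto SIMD hardware.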
Conclusion
For many cloud and on-premises datacenters, giving users access to GPU-class performance without requiring specific GPU-enabled software is a very attractive and cost-effective solution. Third-party benchmarks independently confirm that the upcoming Intel Xeon processors codenamed Sapphire Rapids with high bandwidth memory and Intel Advanced Matrix Extensions can match the performance of GPUs for many HPC and AI workloads. At the same time, Intel benchmarks demonstrate that these processors are an attractive solution because they can deliver an across-the-board 2× to 3× increase in HPC and AI workload performance compared to the previous 3rd Gen Intel Xeon Scalable processors.
Rob Farber is a global technology consultant and author with an extensive background in high-performance computing and machine learning technology.
