GPU-powered HPC Workloads in the Cloud with AWS and NVIDIA

If you’re considering running your high performance computing (HPC) workloads in the cloud, you’re not alone. HPC in the cloud used to be seen as a complement to on-premises infrastructure, used mainly when HPC users needed to burst workloads to the cloud to handle demand spikes. Today, however, the cloud is a more integral GPU-powered computing environment for scientists and engineers.

A recent study by Hyperion Research indicates that organizations are increasingly choosing the cloud to run their HPC workloads. In fact, nearly every organization adopting HPC resources is either using or investigating the cloud to accelerate HPC workloads, enhance networking performance, and reduce operational costs. Hyperion projects that in 2026 the cloud market for HPC will reach $11.5 billion, nearly half the size of the on-premises HPC server market.

Running HPC workloads in the cloud gives HPC users on-demand access to scalable GPU-accelerated compute infrastructure, along with tools and services that accelerate HPC workload performance and can potentially lower costs.

Convergence of cloud, HPC, and AI/ML

A new category of HPC workloads is emerging. HPC users are adopting and integrating artificial intelligence (AI) and machine learning (ML) at increasingly higher rates. 

Large language models (LLMs) and a growing number of foundation models (FMs) offer multiple methods and models to choose from, and they are drawing broad interest from organizations worldwide.

Hyperion Research found that nearly 90 percent of HPC users are currently using or plan to use AI to enhance their HPC workloads. The considerations involved span hardware (processors, networking, data access), software (data management, queueing, developer tools), AI expertise (procurement strategy, maintenance, troubleshooting), and regulations (data provenance, data privacy, legal concerns).

As a result, organizations are experiencing a convergence of cloud, HPC, and AI/ML. Two simultaneous shifts are occurring: one toward workflows, ensembles, and broader integration; and another toward tightly coupled, high-performance capabilities. The outcome is closely integrated, massive-scale computing accelerating innovation across industries from automotive and financial services to healthcare, manufacturing, and beyond.

The evolution of HPC

The HPC market is expected to grow at a compound annual growth rate (CAGR) of over 5% between 2023 and 2032, reaching a projected value of $90 billion in 2032. This growth is attributed to advancements in HPC technology and its increased application across industries.
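To put the growth figure in perspective, the short sketch below works backward from the $90 billion projection to the 2023 market size it implies. It is a back-of-the-envelope calculation, not a figure from the forecast: it assumes a CAGR of exactly 5% (the forecast says "over 5%", so the true starting value would be somewhat lower) and nine compounding years, and the function name implied_base_value is purely illustrative.

```python
def implied_base_value(future_value: float, cagr: float, years: int) -> float:
    """Invert the compound-growth formula: future = base * (1 + cagr) ** years."""
    return future_value / (1.0 + cagr) ** years


# Assumptions (not from the forecast itself): CAGR of exactly 5%,
# compounding over the nine years from 2023 to 2032.
base_2023 = implied_base_value(future_value=90.0, cagr=0.05, years=2032 - 2023)

print(f"Implied 2023 market size: about ${base_2023:.0f} billion")
```

Under these assumptions the starting point comes out at roughly $58 billion; any rate above 5% would imply a smaller 2023 market.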

Aligned with this growth is a move of HPC to the cloud as organizations look to address challenges such as compute infrastructure capacity limitations, power constraints, HPC workload fluctuation, and lengthy product development cycles.

On-demand GPU-accelerated computing with full-stack optimization empowers organizations to run more complex HPC simulations and gain insights faster while maximizing resource efficiency.

Access to scalable compute capacity and GPU-accelerated AI and ML tools, along with a purpose-built HPC platform and services, means organizations can run HPC workloads more cost- and energy-efficiently in the cloud, helping to overcome bursting constraints.

How AWS and NVIDIA are solving HPC challenges

Amazon Web Services (AWS) and NVIDIA recognize the challenges that HPC users are facing. Together, they bring highly performant and easy-to-use solutions to enable organizations to accelerate HPC workloads and optimize costs in an energy-efficient way. 

Organizations can overcome many HPC challenges by accessing flexible GPU-accelerated compute capacity and a variety of purpose-built tools. AWS offers the elasticity to scale from 100 instances to 1,000 instances and beyond in minutes, reducing the time spent waiting in queues.

Scientists and engineers can dynamically scale their HPC infrastructure according to demands and budgets. Organizations can use Amazon Elastic Compute Cloud (Amazon EC2) instances powered by NVIDIA GPUs to speed up their workloads and perform more computing in the same amount of time. Compute, networking, and storage capacity are architected for a diverse set of HPC workloads.

NVIDIA GPU-powered instances on AWS are capable of running the largest models and datasets. Combined with resources such as the NVIDIA HPC SDK GPU-Optimized AMI and developer libraries supporting all major AI frameworks, they give organizations a stronger foundation for developing highly performant HPC infrastructure.

Organizations can adapt quickly to changing business demands and budgets with AWS on-demand infrastructure and pay-as-you-go pricing models. HPC users can choose to pay only for the GPU-accelerated compute capacity they need, with flexibility for short- or long-term commitments.

Organizations can maximize their uptime and cloud resources by defining jobs and environments with AWS Batch. They can also gain flexibility by running their largest HPC applications anywhere, using NICE DCV for remote visualization.
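As a rough illustration of what defining a job with AWS Batch can look like, the sketch below builds the request for a GPU-enabled container job definition and, in a separate function, registers it and submits a job through boto3. It is a minimal sketch under stated assumptions, not a recommended configuration: the job name, container image, command, queue name, and resource sizes are all hypothetical placeholders, and the submission step assumes boto3 is installed, AWS credentials are configured, and a job queue already exists.

```python
def build_gpu_job_definition(name, image, command, gpus=1, vcpus=4, memory_mib=16384):
    """Build keyword arguments for the AWS Batch RegisterJobDefinition call.

    All default sizes here are illustrative placeholders, not tuned values.
    """
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            "command": list(command),
            # Batch expects resource values as strings.
            "resourceRequirements": [
                {"type": "GPU", "value": str(gpus)},
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    }


def register_and_submit(job_definition, job_name, job_queue):
    """Register the job definition and submit one job to an existing queue.

    Requires boto3 and configured AWS credentials, so it is not called here.
    """
    import boto3  # imported lazily so the builder above works without it

    batch = boto3.client("batch")
    batch.register_job_definition(**job_definition)
    response = batch.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition["jobDefinitionName"],
    )
    return response["jobId"]


# Hypothetical names throughout: replace with your own image, command, and queue.
job_definition = build_gpu_job_definition(
    name="example-hpc-simulation",
    image="example.com/hpc/simulation:latest",
    command=["python", "run_simulation.py"],
    gpus=2,
)
```

With credentials in place, a call such as register_and_submit(job_definition, "example-run-1", "example-gpu-queue") would submit the job to that queue, where the queue name is again a placeholder.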

In addition, AWS and NVIDIA announced a strategic collaboration to offer new supercomputing infrastructure, software, and services to supercharge HPC, design and simulation workloads, and generative AI. This includes NVIDIA DGX Cloud coming to AWS, and Amazon EC2 instances powered by the NVIDIA GH200 Grace Hopper Superchip and by H200, L40S, and L4 GPUs.

Future-ready, with NVIDIA on AWS

With a full range of purpose-built solutions and services provided by AWS and NVIDIA, HPC users can run large-scale applications and simulations to meet intensive compute demands. As AI technologies continue to create more complex and compute-intensive HPC workloads, AWS and NVIDIA can help organizations overcome the compute challenges they face today and in the decades to come.

Learn more about how AWS and NVIDIA accelerate HPC workloads