AWS Announces GA of EC2 Trn1 Instances for ML Model Training

SEATTLE — Oct. 10, 2022 — Amazon Web Services today announced the general availability of Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances powered by AWS-designed Trainium chips.

Trn1 instances are built for high-performance training of machine learning models in the cloud. AWS said the offering delivers up to 50 percent cost-to-train savings over comparable GPU-based instances, “enabling customers to reduce training times, rapidly iterate on models to improve accuracy and increase productivity for workloads like natural language processing, speech and image recognition, semantic search, recommendation engines, fraud detection and forecasting.”

The company said there are no minimum commitments or upfront fees to use Trn1 instances; customers pay only for the amount of compute used.

The instances join AWS’s compute offerings with hardware for machine learning, including Inf1 instances with AWS-designed Inferentia chips, G5 instances, P4d instances, and DL1 instances. AWS Neuron, the software development kit (SDK) for Trn1 instances, is designed to help customers get started with minimal code changes and is integrated into ML frameworks such as PyTorch and TensorFlow.

Trn1 instances provide up to 16 AWS Trainium accelerators for deploying deep learning models. Trn1 instances deliver up to 800 Gbps of networking bandwidth (lower latency and 2x faster than the latest EC2 GPU-based instances) using the second generation of AWS’s Elastic Fabric Adapter (EFA) network interface for scaling efficiency, according to AWS.

Trn1 instances also use NeuronLink, an intra-instance interconnect designed for faster training. Customers can deploy Trn1 instances in Amazon EC2 UltraClusters consisting of tens of thousands of Trainium accelerators “to rapidly train complex deep learning models with trillions of parameters,” the company said. “With EC2 UltraClusters, customers will be able to scale the training of machine learning models with up to 30,000 Trainium accelerators interconnected with EFA petabit-scale networking, which gives customers on-demand access to supercomputing-class performance to cut training time from months to days.”

Trn1 instances support up to 8 TB of local NVMe SSD storage for access to large datasets. AWS Trainium supports a range of data types (FP32, TF32, BF16, FP16, and configurable FP8) and stochastic rounding, a way of rounding probabilistically that enables high performance and higher accuracy. AWS Trainium also supports dynamic tensor shapes and custom operators to deliver a flexible infrastructure designed to evolve with customers’ training needs.
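The idea behind stochastic rounding can be illustrated with a small, self-contained Python sketch. It is not AWS's or the Neuron SDK's implementation. It assumes a hypothetical fixed rounding grid (`step`) as a stand-in for a low-precision format, whereas real formats such as BF16 have spacing that grows with magnitude. The function names are illustrative only. The sketch repeatedly adds an update that is smaller than half a grid step, the situation in which round-to-nearest discards every update while stochastic rounding keeps the running sum correct on average.

```python
import math
import random


def nearest_round(x: float, step: float) -> float:
    """Round x to the nearest multiple of `step` (conventional rounding)."""
    return round(x / step) * step


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a neighbouring multiple of `step`.

    The upper neighbour is chosen with probability equal to the fractional
    distance of x above the lower neighbour, so the result equals x in
    expectation.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower  # value in [0, 1)
    return (lower + (1 if rng.random() < frac else 0)) * step


def accumulate(update: float, n: int, step: float, rounder) -> float:
    """Add `update` n times, rounding the running total after every addition."""
    total = 0.0
    for _ in range(n):
        total = rounder(total + update, step)
    return total


# Hypothetical settings chosen only for illustration.
rng = random.Random(0)   # fixed seed so the run is repeatable
step = 1.0 / 256         # assumed coarse grid standing in for low precision
update = step / 8        # smaller than half a grid step
n = 8000

exact_total = update * n
nearest_total = accumulate(update, n, step, nearest_round)
stochastic_total = accumulate(
    update, n, step, lambda x, s: stochastic_round(x, s, rng)
)

print("exact     :", exact_total)
print("nearest   :", nearest_total)
print("stochastic:", stochastic_total)
```

With round-to-nearest, each small update is rounded away and the total never leaves zero. With stochastic rounding, roughly one addition in eight moves the total up by a full grid step, so the result lands close to the exact sum.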

“Over the years we have seen machine learning go from a niche technology used by the largest enterprises to a core part of many of our customers’ businesses, and we expect machine learning training will rapidly make up a large portion of their compute needs,” said David Brown, vice president of Amazon EC2 at AWS. “Building on the success of AWS Inferentia, our high-performance machine learning chip, AWS Trainium is our second-generation machine learning chip purpose built for high-performance training. Trn1 instances powered by AWS Trainium will help our customers reduce their training time from months to days, while being more cost efficient.”

Trn1 instances are built on the AWS Nitro System, a collection of AWS-designed hardware and software innovations that streamline the delivery of isolated multi-tenancy, private networking, and fast local storage. The AWS Nitro System offloads the CPU virtualization, storage, and networking functions to dedicated hardware and software, delivering performance that is nearly indistinguishable from bare metal.

Trn1 instances will be available via additional AWS services including Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS Batch. Trn1 instances are available for purchase as On-Demand Instances, with Savings Plans, as Reserved Instances, or as Spot Instances. Trn1 instances are available today in US East (N. Virginia) and US West (Oregon), with availability in additional AWS Regions coming soon.

Amazon’s product search engine indexes billions of products and serves billions of customer queries daily. “We are training large language models that are multi-modal, multilingual, multi-locale, pre-trained on multiple tasks, and span multiple entities (products, queries, brands, reviews, etc.) to improve the customer shopping experience,” said Trishul Chilimbi, senior principal scientist at Amazon Search. “Amazon EC2 Trn1 instances provide a more sustainable way to train large language models by delivering the best performance/watt compared to other accelerated machine learning solutions and offers us high performance at the lowest cost. We plan to explore the new configurable FP8 datatype and hardware accelerated stochastic rounding to further increase our training efficiency and development velocity.”

PyTorch is an open source machine learning framework that accelerates the path from research prototyping to production deployment. “At PyTorch, we want to accelerate taking machine learning from research prototyping to production ready for customers. We have collaborated extensively with AWS to provide native PyTorch support for new AWS Trainium-powered Trn1 instances. Developers building PyTorch models can start training on Trn1 instances with minimal code changes,” said Geeta Chauhan, Applied AI, engineering manager at PyTorch. “Additionally, we have worked with the OpenXLA community to enable PyTorch Distributed libraries for easy model migration from GPU-based instances to Trn1 instances. We are excited about the innovation that Trn1 instances bring to the PyTorch community, including more efficient data types, dynamic shapes, custom operators, hardware-optimized stochastic rounding, and eager debug mode. All these capabilities make Trn1 well suited for wide adoption by PyTorch developers, and we look forward to future joint contributions to PyTorch to further optimize training performance.”

Helixon builds next-generation artificial intelligence (AI) solutions for protein-based therapeutics, developing AI tools that empower scientists to decipher protein function and interaction, interrogate large-scale genomic datasets for target identification, and design therapeutics such as antibodies and cell therapies. “Today, we use training distribution libraries like Fully Sharded Data Parallel to parallelize model training over many GPU-based servers, but this still takes us weeks to train a single model,” said Jian Peng, CEO at Helixon. “We are excited to utilize Amazon EC2 Trn1 instances featuring the highest networking bandwidth available on AWS to improve the performance of our distributed training jobs and reduce our model training times, while also reducing our training costs.”

Money Forward, Inc. serves businesses and individuals with an open and fair financial platform. “We launched a large-scale AI chatbot service on the Amazon EC2 Inf1 instances and reduced our inference latency by 97% over comparable GPU-based instances while also reducing costs. As we keep fine-tuning tailored natural language processing models periodically, reducing model training times and costs is also important,” said Takuya Nakade, CTO at Money Forward. “Based on our experience from successful migration of inference workload on Inf1 instances and our initial work on AWS Trainium-based EC2 Trn1 instances, we expect Trn1 instances will provide additional value in improving end-to-end machine learning performance and cost.”

Magic is an integrated product and research company developing AI that feels like a colleague to make the world more productive. “Training large autoregressive transformer-based models is an essential component of our work. AWS Trainium-powered Trn1 instances are designed specifically for these workloads, offering near-infinite scalability, fast inter-node networking, and advanced support for 16-bit and 8-bit data types,” said Eric Steinberger, co-founder and CEO at Magic. “Trn1 instances will help us train large models faster, at a lower cost. We are particularly excited about the native support for BF16 stochastic rounding in Trainium, increasing performance while [maintaining] numerical accuracy indistinguishable from full precision.”
