Nov. 26, 2024: AMD today announced the release of the ROCm 6.3 open-source platform, introducing tools and optimizations for AI, ML, and HPC workloads on AMD Instinct GPU accelerators.
ROCm 6.3 is engineered for a broad range of organizations, from AI startups to HPC-driven industries, and is designed to enhance developer productivity.
Features of this release include SGLang integration for AI inferencing, a re-engineered FlashAttention-2 for AI training and inference, the introduction of multi-node Fast Fourier Transform (FFT) for HPC workflows, and more:
1. SGLang in ROCm 6.3: Inferencing of Generative AI (GenAI) Models
GenAI is transforming industries, but deploying large models often means grappling with latency, throughput, and resource utilization challenges. Enter SGLang, a new runtime supported by ROCm 6.3, purpose-built for optimizing inference of cutting-edge generative models such as LLMs and VLMs on AMD Instinct GPUs.
Why It Matters to You:
- 6X Higher Throughput: Achieve up to 6X higher throughput on LLM inferencing compared to existing systems, as researchers have found1, enabling your business to serve AI applications at scale.
- Ease of Use: Python-integrated and pre-configured in ROCm Docker containers, SGLang lets developers accelerate deployment of interactive AI assistants, multimodal workflows, and scalable cloud backends with reduced setup time.
Whether you’re building customer-facing AI solutions or scaling AI workloads in the cloud, SGLang delivers the performance and ease of use needed to meet enterprise demands. Discover the powerful features of SGLang and learn how to set up and run models on AMD Instinct GPU accelerators. Get started now!
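An SGLang server exposes an OpenAI-compatible REST API. As a minimal sketch of how a client talks to it (the endpoint URL, port, and model name below are assumptions for a hypothetical local deployment, not details of this release):

```python
import json

# Assumed local endpoint for an SGLang server launched with
# `python -m sglang.launch_server`; the port is an assumption --
# adjust to match your deployment.
SGLANG_URL = "http://localhost:30000/v1/chat/completions"

def build_chat_request(prompt: str, model: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completion payload for an SGLang server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# The model name here is a placeholder, not a recommendation.
payload = build_chat_request(
    "Summarize ROCm 6.3 in one sentence.",
    model="meta-llama/Llama-3.1-8B-Instruct",
)
body = json.dumps(payload)  # POST this to SGLANG_URL with any HTTP client
```

Because the API follows the OpenAI wire format, existing client libraries and tooling can point at the SGLang endpoint with no code changes beyond the base URL.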
2. Transformer Optimization: Re-Engineered FlashAttention-2 on AMD Instinct™
Transformer models are at the core of modern AI, but their high memory and compute demands have traditionally limited scalability. With FlashAttention-2 optimized for ROCm 6.3, AMD addresses these pain points, enabling faster, more efficient training and inference2.
Highlights:
- 3X Speedups: Achieve up to 3X speedups on the backward pass and a highly efficient forward pass compared to FlashAttention-1,2 accelerating model training and inference to reduce time-to-market for enterprise AI solutions.
- Extended Sequence Lengths: Efficient memory utilization and reduced I/O overhead make handling longer sequences on AMD Instinct GPUs seamless.
Optimize your AI pipelines with FlashAttention-2 on AMD Instinct GPU accelerators today, seamlessly integrated into existing workflows through ROCm’s PyTorch container with Composable Kernel (CK) as the backend.
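The core idea behind FlashAttention is tiling with an online softmax: attention is computed block-by-block over the key/value sequence, keeping only per-row running maxima and normalizers, so the full N×N score matrix is never materialized. A minimal NumPy sketch of that algorithm (illustrative only; this is not AMD's Composable Kernel implementation):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference implementation: materializes the full score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention(Q, K, V, block=4):
    # Online-softmax attention: process K/V in blocks, keeping a running
    # row max (m) and softmax denominator (l), as FlashAttention does.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, V.shape[-1]))
    m = np.full(n, -np.inf)   # running row max
    l = np.zeros(n)           # running softmax denominator
    for j in range(0, K.shape[0], block):
        S = (Q @ K[j:j + block].T) * scale      # partial scores for this block
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)               # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ V[j:j + block]
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
```

The two functions are numerically equivalent; the tiled version simply trades the O(N²) score matrix for O(N) running statistics, which is what makes longer sequence lengths feasible on-GPU.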
3. AMD Fortran Compiler: Bridging Legacy Code to GPU Acceleration
Enterprises running legacy Fortran-based HPC applications can now unlock the power of modern GPU acceleration on AMD Instinct™ accelerators, thanks to the new AMD Fortran compiler introduced in ROCm 6.3.
Benefits:
- Direct GPU Offloading: Leverage AMD Instinct GPUs with OpenMP offloading, accelerating key scientific applications.
- Backward Compatibility: Build on existing Fortran code while taking advantage of AMD’s next-gen GPU capabilities.
- Simplified Integrations: Seamlessly interface with HIP Kernels and ROCm Libraries, eliminating the need for complex code rewrites.
Enterprises in industries such as aerospace, pharmaceuticals, and weather modeling can now future-proof their legacy HPC applications, realizing the power of GPU acceleration without the extensive code overhauls previously required. Get started with the AMD Fortran compiler on AMD Instinct GPUs through this detailed walkthrough.
4. New Multi-Node FFT in rocFFT: For HPC Workflows
Industries relying on HPC workloads—from oil and gas to climate modeling—require distributed computing solutions that scale efficiently. ROCm 6.3 introduces multi-node FFT support in rocFFT, enabling high-performance distributed FFT computations.
Why It Matters for HPC:
- Built-in Message Passing Interface (MPI) Integration: Simplifies multi-node scaling, helping reduce complexity for developers and accelerating the enablement of distributed applications.
- Leadership Scalability: Scale seamlessly across massive datasets, optimizing performance for critical workloads like seismic imaging and climate modeling.
Organizations in industries like oil and gas and scientific research can now process larger datasets with greater efficiency, driving faster and more accurate decision-making.
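Under the hood, a multi-node FFT typically uses a slab decomposition: each rank runs local 1D FFTs along one axis, the ranks exchange data in a global transpose (the MPI all-to-all step), and local 1D FFTs then run along the other axis. The NumPy sketch below simulates the ranks in-process to show the general technique; it is not rocFFT's actual API:

```python
import numpy as np

def distributed_fft2(x, nranks=4):
    # Slab decomposition: each simulated rank owns a contiguous block of
    # rows and runs local 1D FFTs; the transpose stands in for the MPI
    # all-to-all exchange a real multi-node run would perform.
    slabs = np.split(x, nranks, axis=0)
    stage1 = np.concatenate([np.fft.fft(s, axis=1) for s in slabs], axis=0)
    t = stage1.T                      # global transpose (MPI all-to-all)
    slabs2 = np.split(t, nranks, axis=0)
    stage2 = np.concatenate([np.fft.fft(s, axis=1) for s in slabs2], axis=0)
    return stage2.T                   # transpose back to the original layout

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
```

The result matches a single-node 2D FFT exactly; in a real deployment the transpose is the communication-bound step, which is why built-in MPI integration in rocFFT matters for scaling.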
5. Computer Vision Libraries: AV1, rocJPEG, and Beyond
AI developers working with modern media and datasets require efficient tools for preprocessing and augmentation. ROCm 6.3 introduces enhancements to its computer vision libraries, rocDecode, rocJPEG, and rocAL, empowering enterprises to tackle diverse workloads from video analytics to dataset augmentation.
Why It Matters:
- AV1 Codec Support: Cost-effective, royalty-free decoding for modern media processing via rocDecode and rocPyDecode.
- GPU-Accelerated JPEG Decoding: Seamlessly handle image preprocessing at scale with the built-in fallback mechanisms of the rocJPEG library.
- Better Audio Augmentation: Improved preprocessing for robust model training in noisy environments with the rocAL library.
From media and entertainment to autonomous systems, these features enable developers to build more advanced AI solutions for real-world applications.
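As an illustration of the kind of audio augmentation rocAL accelerates, the sketch below mixes white Gaussian noise into a waveform at a target signal-to-noise ratio. It is plain NumPy math, not the rocAL API:

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Mix white Gaussian noise into a waveform at a target SNR in dB.

    CPU/NumPy sketch of the augmentation math; rocAL performs this kind
    of preprocessing on the GPU with its own (different) API.
    """
    rng = rng or np.random.default_rng()
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)   # 1 s of a 440 Hz tone at 16 kHz
noisy = add_noise_at_snr(clean, snr_db=10, rng=np.random.default_rng(2))
```

Training on such noisy variants of clean recordings is a standard way to make speech and audio models robust to real-world environments.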
Beyond these standout features, it’s worth highlighting that Omnitrace and Omniperf, introduced in ROCm 6.2, have been rebranded as ROCm System Profiler and ROCm Compute Profiler. The rebranding brings enhanced usability and stability, along with seamless integration into the current ROCm profiling ecosystem.
AMD ROCm has been making strides with every release, and version 6.3 is no exception. It delivers cutting-edge tools to simplify development while driving better performance and scalability for AI and HPC workloads. By embracing the open-source ethos and continuously evolving to meet developer needs, ROCm empowers businesses to innovate faster, scale smarter, and stay ahead in competitive industries.
More information is at: ROCm Documentation Hub
Contributors:
Jayacharan Kolla – Product Manager
Aditya Bhattacharji – Software Development Engineer
Ronnie Chatterjee – Director Product Management
Saad Rahim – SMTS Software Development Engineer
1 https://arxiv.org/pdf/2312.07104, p. 8
2 Based on informal internal testing conducted for specific customers, FlashAttention-2 has demonstrated a 2-3X performance uplift over FlashAttention-1. Please note that performance can vary depending on individual system configurations, workloads, and environmental factors. This information is provided solely for illustrative purposes and should not be interpreted as a guarantee of future performance in all use cases.