PCIe 7.0: Enabling Next Gen AI Accelerator Interconnects in HPC Data Centers

By Priyank Shukla, Principal Product Manager, Synopsys

Data centers face growing challenges in processing increasingly complex, compute-intensive workloads. Large language models (LLMs) demand immense computational power, supplied by thousands of accelerators working in tandem with processors to handle the calculations and massive datasets involved in training. A key challenge remains: powerful computing capabilities fall short if the data gets stuck in bottlenecks.

Scaling data centers for evolving AI models will help alleviate data bottlenecks by distributing workloads across thousands of GPUs. This scaling extends beyond the capabilities of individual accelerators or processors to encompass the entire data center architecture, including memory subsystems, switching fabrics, and interconnect technologies.

This is where PCIe 7.0 comes into play as the standard of choice for moving data within the data center compute fabric.

Challenges for PCIe 7.0 System Architects

The development of PCIe 7.0 is progressing steadily, with the PCI-SIG recently releasing version 0.5 of the specification to its members. This milestone represents a significant step forward in the evolution of PCIe technology, as it incorporates feedback on the previous version 0.3, released in June 2023. The PCIe 7.0 specification is on track for full release in the second half of 2025, enabling a raw bit rate of 128 GT/s per lane and up to 512 GB/s of bi-directional bandwidth in a 16-lane (x16) configuration.
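The headline bandwidth figure follows directly from the per-lane rate. The short Python sketch below reproduces the arithmetic, assuming each transfer carries one bit per lane and ignoring all protocol overhead, so the result is a raw figure rather than usable throughput. The function name is illustrative and does not come from any specification.

```python
def raw_bandwidth_gb_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw one-direction bandwidth in GB/s.

    Assumption: one transfer moves one bit per lane, so 1 GT/s is
    treated as 1 Gb/s per lane. Protocol overhead is ignored.
    """
    return gt_per_s * lanes / 8  # 8 bits per byte


# PCIe 7.0 figures quoted above: 128 GT/s per lane, 16 lanes.
per_direction = raw_bandwidth_gb_per_s(128, 16)  # 256.0 GB/s each way
bidirectional = 2 * per_direction                # 512.0 GB/s both ways combined

print(f"{per_direction:.0f} GB/s per direction, {bidirectional:.0f} GB/s bi-directional")
```

The 512 GB/s figure therefore counts traffic in both directions at once; each direction of a 16-lane link carries half of it.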

The semiconductor industry operates on long design cycles, typically ranging from nine to 12 months. This creates a unique challenge for companies aiming to bring PCIe 7.0-compatible products to market as soon as the specification is ratified. While the finalization of the PCIe 7.0 specification is still more than a year away, design work needs to begin now in order to make products available at the time of base specification release.

Interoperability

The successful implementation of the next PCIe generation requires a coordinated effort across the entire computing ecosystem, including processors, accelerators, retimers, switches, NICs, DPUs, and SSDs. For PCIe 7.0 designs to achieve first-pass silicon success, every part of that ecosystem needs to be available and interoperable when the new standard launches. This synchronization is crucial for ensuring seamless interoperability and maximizing the benefits of the new specification.

No Errors

Reliability is a cornerstone for any designer working on PCIe 7.0 designs. PCIe 7.0 incorporates pattern mending and advanced diagnostic capabilities for real-time error detection and correction. These features significantly enhance overall system reliability, which is crucial for fleet management because it enables better monitoring and maintenance of complex systems. SoC designers must integrate these diagnostic features seamlessly while keeping their impact on performance and silicon area to a minimum.

Confidential Compute and Security

HPC system architects need to consider all of their internal interfaces as possible attack vectors. PCI Express 7.0 includes a feature called Integrity and Data Encryption (IDE), which allows PCIe devices to perform hardware encryption and integrity checking on packets transferred across PCIe links. Fundamentally, IDE protects against hardware-level attacks conducted by skilled attackers who use sophisticated tools to gain direct access to their victim systems. PCIe packets are individually encrypted and authenticated with an AES-GCM cryptographic algorithm to provide data confidentiality and integrity. IDE must be implemented hand in hand with the PCIe controller to get the full benefit of its protection mechanisms. PCIe links secured by IDE also gain one more layer of reliability checking, since even a non-malicious modification of an IDE-protected PCIe packet will trigger a system-level response.
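The last point, that any modification of a protected packet is detected, can be illustrated with a small Python sketch. It is not an IDE implementation: IDE uses AES-GCM in hardware, while this example substitutes HMAC-SHA256 from the Python standard library as a stand-in for the authentication tag and performs no encryption at all. The key, payload, and function names are hypothetical.

```python
import hashlib
import hmac
import os


def make_tag(payload: bytes, key: bytes) -> bytes:
    """Compute an integrity tag over a payload (HMAC-SHA256 stand-in, not AES-GCM)."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(payload: bytes, tag: bytes, key: bytes) -> bool:
    """Return True only if the payload still matches the tag sent with it."""
    return hmac.compare_digest(make_tag(payload, key), tag)


key = os.urandom(32)             # hypothetical key shared by both link partners
packet = b"example packet payload"
tag = make_tag(packet, key)      # the sender attaches this tag to the packet

# Simulate a non-malicious single-bit error introduced in transit.
corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]

print(verify(packet, tag, key))     # the intact packet is accepted
print(verify(corrupted, tag, key))  # the modified packet is rejected
```

Because the tag depends on every bit of the payload, the receiver cannot distinguish an accidental bit flip from deliberate tampering. Both fail verification, which is why an integrity-protected link also acts as an additional reliability check.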

Summary

PCIe 7.0 is poised to play a pivotal role in AI data center scaling. From high-speed, low-latency data transfer to enhanced power efficiency and robust security features, PCIe 7.0 offers a comprehensive solution for next-generation AI applications. However, the design and implementation challenges for SoC designers are substantial. By addressing issues such as signal integrity, power efficiency, reliability, and security, and by leveraging proven expertise in the field, SoC designers can successfully integrate PCIe 7.0 IP into their designs. Ongoing innovations and collaborations in this field promise to drive significant advancements, enabling more efficient and scalable AI solutions.

Priyank Shukla is Principal Product Manager at Synopsys, which offers silicon to systems design solutions, from electronic design automation to silicon IP and system verification and validation.