In-Network Computing Technology to Enable Data-Centric HPC and AI Platforms


In this sponsored post, Mellanox Technologies’ Gilad Shainer explores one of the biggest tech transitions over the past 20 years: the transition from CPU-centric data centers to data-centric data centers, and the role of in-network computing in this shift. 


The never-ending demand for higher performance has led us through two major technology transitions over the past two decades, and is now driving an ongoing third. These transitions have aimed to overcome performance bottlenecks and thereby accelerate research and product development. The first was the move from SMP systems to clusters; the second was the move from single-core CPUs to multi-core compute elements (CPUs, GPUs). The ongoing third is the transition from CPU-centric data centers to data-centric data centers.

The old CPU-centric data center architecture was based on the idea that the CPU is the heart of the data center, and that one must move the data to the CPU in order to analyze it or to simulate real-world phenomena. With the growth in the amount of data we wish to analyze, the increasing complexity of research simulations, and the emergence of deep learning algorithms, the old CPU-centric approach has reached its end. Consider, as an example, any simulation based on data reduction or aggregation, operations which are also the basis of deep learning applications: it cannot be accelerated any further by adding more CPUs. On the contrary, adding more CPUs degrades the performance of such applications. Furthermore, the cost of moving ever greater amounts of data becomes the major obstacle in this approach.
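To make the pattern concrete, the sketch below shows the reduction at the heart of such workloads as a minimal MPI program (a generic illustration, not Mellanox code): every rank contributes a partial result and the runtime combines them, and in a CPU-centric design both the combining work and the data movement behind this call are carried out by the host CPUs.

/* Minimal illustration of the reduction pattern discussed above:
 * every rank holds a partial result, and MPI_Allreduce combines them
 * so that all ranks see the global value. In a CPU-centric design,
 * the combining work and the data movement behind this call are
 * performed by the host CPUs and grow with the number of ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank's local contribution (e.g., a partial simulation result). */
    double local = (double)(rank + 1);
    double global = 0.0;

    /* The reduction: conceptually a tree of partial sums across all ranks. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum over %d ranks = %f\n", size, global);

    MPI_Finalize();
    return 0;
}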

The latest technology transition is the result of a co-design approach: a collaborative effort to reach Exascale performance by taking a holistic, system-level approach to fundamental performance improvements. As the CPU-centric approach has reached its limits of performance and scalability, the data center architecture focus has shifted to the data, and to bringing the compute to the data instead of moving the data to the compute. This approach has driven the creation of smart components outside the CPU, forming the key elements of what is now commonly known as In-Network Computing technologies. In-Network Computing transforms the data center interconnect into a "distributed co-processor" that can handle and accelerate various data algorithms, such as reductions.

In-Network Computing based interconnects are the heart of data-centric data centers. These interconnects provide, first, the ability to offload all network functions from the CPU to the network, and second, the ability to offload various data algorithms. Illustrating this is the new generation of HDR 200 gigabit-per-second InfiniBand, which includes several In-Network Computing technologies: an enhanced version of the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) and interconnect-based MPI Tag Matching and Rendezvous protocol offloads, which accelerate both HPC and deep learning applications, along with additional capabilities for NVMe storage, security and more.
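The MPI Tag Matching and Rendezvous offloads target the communication pattern sketched below: tagged, nonblocking receives overlapped with computation, where matching incoming messages to the posted receives, and staging large transfers, can be handled by the network adapter instead of the host CPU. The snippet is a generic MPI illustration of that pattern, not a Mellanox-specific API.

/* Generic MPI sketch of the pattern that hardware MPI Tag Matching and
 * Rendezvous offloads accelerate: tagged, nonblocking receives overlapped
 * with computation. With the offload, matching incoming messages to these
 * posted receives can be done by the adapter rather than the host CPU. */
#include <mpi.h>
#include <stdlib.h>

#define N   (1 << 20)   /* 1M doubles (~8 MB): large enough for a rendezvous transfer */
#define TAG 42

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {          /* the pattern needs a sender and a receiver */
        MPI_Finalize();
        return 0;
    }

    double *buf = malloc(N * sizeof(double));
    MPI_Request req;

    if (rank == 0) {
        /* Post the tagged receive early, then keep the CPU busy with
         * application work; matching the incoming message to this posted
         * receive is what the Tag Matching offload moves off the CPU. */
        MPI_Irecv(buf, N, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD, &req);
        /* ... computation overlapped with the transfer would go here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        for (int i = 0; i < N; i++) buf[i] = (double)i;
        MPI_Send(buf, N, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}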

Figure 1 and Figure 2 below demonstrate the performance advantages of SHARP, using the MPI AllReduce collective operation. Testing was conducted on the InfiniBand-accelerated Dragonfly+ Niagara supercomputer, the fastest supercomputer in Canada, owned by the University of Toronto and operated by SciNet. Niagara is intended to enable large parallel jobs and was designed to optimize throughput for a broad range of scientific codes, with energy efficiency and network and storage performance and capacity at scale. Niagara consists of 1,500 nodes, each with 40 Intel Skylake cores at 2.4 GHz (60,000 cores in total) and 202 GB of RAM, all connected through an EDR InfiniBand network in a Dragonfly+ topology.

Figure 1 – MPI AllReduce performance comparison – software based versus SHARP with 1 process per node, and overall 1,500 MPI ranks. (Chart: Courtesy of Mellanox)

Figure 2 – MPI AllReduce performance comparison – software based versus SHARP with 40 processes per node, and overall 60,000 MPI ranks. (Chart: Courtesy of Mellanox)
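The results in Figures 1 and 2 follow the familiar collective micro-benchmark methodology; a simplified version is sketched below. The published numbers come from standard benchmark runs, so treat this as an illustration of what is being measured (average MPI AllReduce latency for a small message) rather than the exact test code. The same program exercises either the software-based path or the SHARP-offloaded path, depending on how the MPI collective library is configured at launch.

/* Simplified MPI_Allreduce latency loop in the style of common collective
 * micro-benchmarks. Whether the software-based or the SHARP-offloaded path
 * is used depends on the MPI/collective library configuration at launch. */
#include <mpi.h>
#include <stdio.h>

#define WARMUP 100
#define ITERS  1000
#define COUNT  8   /* small message: 8 doubles = 64 bytes */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in[COUNT], out[COUNT];
    for (int i = 0; i < COUNT; i++) in[i] = (double)rank;

    /* Warm up to exclude one-time setup costs from the measurement. */
    for (int i = 0; i < WARMUP; i++)
        MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average MPI_Allreduce latency: %.2f us\n",
               (t1 - t0) * 1e6 / ITERS);

    MPI_Finalize();
    return 0;
}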

HDR InfiniBand includes an enhanced version of SHARP that supports more algorithms and larger data sizes, which are expected to further increase performance and better support deep learning frameworks.
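Data-parallel deep learning frameworks issue the same collective over much larger buffers to average gradients across workers, which is the large-message case the enhanced SHARP targets. The sketch below illustrates such a gradient allreduce; the 64 MB buffer size and data type are illustrative assumptions, not framework defaults.

/* Sketch of the large-message allreduce that data-parallel deep learning
 * frameworks issue to average gradients across workers. The 64 MB payload
 * is an illustrative assumption, not a framework default. */
#include <mpi.h>
#include <stdlib.h>

#define GRAD_ELEMS (16 * 1024 * 1024)   /* 16M floats = 64 MB of gradients */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *grads = malloc(GRAD_ELEMS * sizeof(float));
    for (int i = 0; i < GRAD_ELEMS; i++) grads[i] = 1.0f;   /* stand-in gradients */

    /* Sum gradients across all workers in place, then scale to the average. */
    MPI_Allreduce(MPI_IN_PLACE, grads, GRAD_ELEMS, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    for (int i = 0; i < GRAD_ELEMS; i++) grads[i] /= (float)size;

    free(grads);
    MPI_Finalize();
    return 0;
}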

In August 2018, it was announced that HDR InfiniBand had been selected to accelerate the new large-scale Frontera supercomputer to be deployed at the Texas Advanced Computing Center (TACC). Frontera will enjoy the latest generation of In-Network Computing technologies, equipping it to deliver the highest application performance, scalability and efficiency. We can expect to see more HPC and AI platforms built on In-Network Computing and data-centric architectures to serve the growing needs of research, exploration, and product development.

Gilad Shainer is VP of Marketing at Mellanox Technologies.