Seeking an Ethernet Alternative to InfiniBand? Start with Performance!


[SPONSORED GUEST ARTICLE]
When it comes to AI and HPC workloads, networking is critical.

While this is already well known, the magnitude is often underestimated: the impact your networking fabric's performance has on parameters like job completion time can swing your cost structure by more than 30 percent.

Most of the cost of large GPU clusters, like those used for AI training, goes to the compute elements (the GPUs), and only about 10 percent of the overall cost is typically allocated to networking gear. The catch is that a suboptimal networking fabric can leave those very expensive GPUs idle for a significant amount of time, critically impacting the platform's economics.
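To get a feel for those economics, here is a back-of-the-envelope sketch; the cluster size, hourly GPU cost, and idle fractions are illustrative assumptions rather than figures from any real deployment:

```python
# Back-of-the-envelope illustration of how network-induced GPU idle time
# inflates the effective cost of a training job. All numbers are assumptions.

GPU_COUNT = 1024              # illustrative cluster size
GPU_COST_PER_HOUR = 2.50      # assumed hourly cost per GPU (USD)

def job_cost(ideal_hours: float, idle_fraction: float) -> float:
    """Total GPU cost of a job whose runtime is stretched by idle time."""
    actual_hours = ideal_hours / (1.0 - idle_fraction)
    return GPU_COUNT * GPU_COST_PER_HOUR * actual_hours

IDEAL_HOURS = 100.0           # hypothetical job length on a perfect fabric
for idle in (0.0, 0.2, 0.3):  # assumed GPU idle fractions caused by the fabric
    print(f"idle {idle:.0%}: job cost ${job_cost(IDEAL_HOURS, idle):,.0f}")

# At 30% idle time the job costs roughly 43% more than the ideal case --
# far more than the ~10% of overall spend that goes to networking gear.
```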

Should you go with the “default choice”?

Given the importance of networking to the overall economics of AI infrastructure, you should go with the highest-performing networking fabric. For years, the benchmark solution was InfiniBand. Built with HPC use cases in mind, InfiniBand provided optimal performance and was considered the default choice for anyone who needed a high-performing networking fabric.

In the last couple of years, though, InfiniBand solutions have lost some of their appeal. First, the ecosystem for this solution has shrunk over the years (particularly since Nvidia's acquisition of market leader Mellanox), with all of the other players discontinuing their InfiniBand offerings. This leaves InfiniBand as a practically proprietary solution from a single vendor (Nvidia). Second, the skill set required to bring up and maintain this rather complex solution has become an obstacle to the fast-paced adoption and deployment of AI infrastructure.

And there is also the high-cost issue.

What is the alternative to InfiniBand?

The usual suspect, when it comes to InfiniBand alternatives, is Ethernet.

Ethernet is used in data centers (and everywhere else, for that matter), is practically commoditized, yields better economics, and is available from a very long list of vendors.

The issue with Ethernet is its performance. Ethernet was designed as a lossy technology, with basic congestion mitigation measures that work well for the statistically scattered traffic patterns of your typical data center. It does not cope well, to say the least, with the unique traffic patterns of an HPC or AI cluster backend fabric.

The backend fabric is the network that connects GPUs to one another, and its traffic is dominated by elephant flows. These are usually too much to handle for the hashing mechanisms that distribute traffic across the fabric, or even for more advanced mechanisms like Explicit Congestion Notification (ECN) and Priority-based Flow Control (PFC).
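A toy simulation illustrates the elephant-flow problem; the link count, flow count, and flow sizes below are arbitrary assumptions, chosen only to show how hash-based load balancing behaves when there are few, very large flows:

```python
import random
from collections import Counter

# Toy model of ECMP-style hashing: each elephant flow is pinned to one link
# by a per-flow hash and stays there for its whole lifetime.
# All numbers are illustrative assumptions, not measurements.

NUM_LINKS = 8
NUM_FLOWS = 8          # few, very large flows, as in an AI backend fabric
FLOW_SIZE_GB = 100     # assumed volume per flow

random.seed(42)        # the random draw stands in for the ECMP hash
placement = [random.randrange(NUM_LINKS) for _ in range(NUM_FLOWS)]
load = Counter(placement)

for link in range(NUM_LINKS):
    print(f"link {link}: {load[link] * FLOW_SIZE_GB:4d} GB")

# With so few flows, the hash almost never spreads them evenly: typically a
# couple of links carry 2-3 elephants while others sit empty -- exactly the
# hot spots that ECN and PFC then have to fight after the fact.
```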

These factors cause a significant decrease in performance: typically 30-50 percent lower than the InfiniBand benchmark.

How do you measure networking fabric performance?

But how do you even measure the performance of the networking fabric?

The best way is simply to look at the job completion time (JCT) for the desired workload: the shorter the JCT, the higher the networking performance. If you are looking for more tangible parameters, you can go down a level and look at collective communication metrics. For instance, look at the NCCL bus-bandwidth figures and compare them across the different fabric solutions.
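As a concrete sketch, the bus-bandwidth number reported by the nccl-tests benchmarks normalizes the application-level (algorithm) bandwidth so it can be compared across GPU counts; for all-reduce the scaling factor is 2*(n-1)/n. The fabric names and timings below are made up for illustration:

```python
# Sketch of the bus-bandwidth metric used by NCCL benchmarks (nccl-tests).
# For all-reduce, bus bandwidth = algorithm bandwidth * 2*(n-1)/n, which makes
# the figure comparable across different GPU counts and fabrics.

def allreduce_bus_bw_gbps(message_bytes: int, elapsed_s: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for one all-reduce of message_bytes over n_ranks GPUs."""
    alg_bw = message_bytes / elapsed_s / 1e9           # GB/s seen by the application
    return alg_bw * 2 * (n_ranks - 1) / n_ranks        # normalize to per-link traffic

MESSAGE = 8 * 1024**3      # an 8 GiB all-reduce
RANKS = 64
# Hypothetical timings on two fabrics (made up for illustration):
for fabric, seconds in [("fabric A", 0.180), ("fabric B", 0.260)]:
    bw = allreduce_bus_bw_gbps(MESSAGE, seconds, RANKS)
    print(f"{fabric}: {bw:6.1f} GB/s bus bandwidth")

# The fabric with the higher bus bandwidth -- and, ultimately, the shorter
# JCT -- is the better-performing one for this workload.
```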

You can also look at specific networking parameters such as tail latency or even packet drops, but it is better to measure performance as high up the protocol stack as possible.

Is there an Ethernet-based alternative with InfiniBand-level performance?

The billion-petaflop question is whether you can move away from InfiniBand and switch to Ethernet without paying this huge ‘fine’ in performance.

To scale Ethernet performance up to the level set by InfiniBand, congestion, which is an integral part of Ethernet’s behavior, needs to be managed, mitigated, or even avoided altogether. This requires some kind of scheduling mechanism to augment the protocols, like ECN and PFC, that are currently used for congestion management.

There are two main approaches to scheduling. The first is endpoint scheduling: packets entering the fabric from the endpoints (the network interface cards in the servers) are sprayed across the fabric based on these endpoint devices’ awareness of congestion hot spots in the fabric, an awareness built from network telemetry. This mechanism is implemented in the new Ultra Ethernet standard, as well as in proprietary solutions like Nvidia’s Spectrum-X.
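A minimal sketch of the idea, purely conceptual; the telemetry values and the least-congested-path rule below are assumptions for illustration, not how Ultra Ethernet or Spectrum-X actually encode or act on telemetry:

```python
# Conceptual sketch of endpoint (NIC-based) scheduling: the sender sprays
# packets across fabric paths and steers away from paths that telemetry
# reports as congested. Names and data structures are assumptions.

class EndpointScheduler:
    def __init__(self, num_paths: int):
        # Telemetry-reported congestion per path: 0.0 (idle) .. 1.0 (saturated).
        self.congestion = [0.0] * num_paths

    def update_telemetry(self, path: int, level: float) -> None:
        """Ingest a congestion report for one path (e.g. from in-band telemetry)."""
        self.congestion[path] = level

    def pick_path(self) -> int:
        """Spray the next packet onto the currently least-congested path."""
        return min(range(len(self.congestion)), key=self.congestion.__getitem__)

sched = EndpointScheduler(num_paths=4)
sched.update_telemetry(0, 0.9)     # path 0 reported as a hot spot
sched.update_telemetry(2, 0.4)
print([sched.pick_path() for _ in range(3)])   # steers packets away from path 0
```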

The second approach is fabric-based scheduling. In this method, there are no requirements on the endpoints. Any packet entering the fabric is split into evenly sized cells, which are distributed (sprayed) across the cell-based fabric. Here, the load balancing is perfect and, with the addition of VoQs (Virtual Output Queues) and a credit-based mechanism, congestion is avoided altogether. This method is implemented successfully in the DriveNets Network Cloud-AI solution, which is the only solution to date that achieves InfiniBand-level performance with standard Ethernet. It is described in WhiteFiber’s recent blog post sharing their experience with the technology.
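A highly simplified sketch of the cell-spraying idea follows; this is a conceptual model rather than DriveNets’ implementation, and the cell size, link count, and credit scheme are assumptions made for illustration:

```python
# Conceptual sketch of fabric-based scheduling: every packet is chopped into
# equal-size cells and sprayed round-robin across all fabric links, and a cell
# is sent only while the destination's virtual output queue (VoQ) has granted
# credits. Cell size, link count, and credit counts are illustrative assumptions.

from itertools import cycle

CELL_BYTES = 256                      # assumed cell size

def to_cells(packet: bytes) -> list[bytes]:
    """Split a packet into evenly sized cells (last one padded)."""
    padded = packet + b"\x00" * (-len(packet) % CELL_BYTES)
    return [padded[i:i + CELL_BYTES] for i in range(0, len(padded), CELL_BYTES)]

class Fabric:
    def __init__(self, num_links: int, credits_per_dest: int):
        self.links = cycle(range(num_links))   # perfect round-robin spraying
        self.credits = {}                      # dest -> credits granted by its VoQ
        self.default_credits = credits_per_dest

    def send(self, packet: bytes, dest: str) -> list[tuple[int, int]]:
        """Send a packet as cells; each cell consumes one credit for dest."""
        sent = []
        self.credits.setdefault(dest, self.default_credits)
        for cell in to_cells(packet):
            if self.credits[dest] == 0:        # no credit: hold back, don't congest
                break
            self.credits[dest] -= 1
            sent.append((next(self.links), len(cell)))
        return sent                            # (link, bytes) per transmitted cell

fabric = Fabric(num_links=4, credits_per_dest=6)
print(fabric.send(b"x" * 1000, dest="gpu-17"))   # 4 cells sprayed over links 0..3
```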