Offload vs Onload: How Mellanox Champions Flexibility and Scalability


In this video, Gilad Shainer from Mellanox discusses how the company’s off-load model of InfiniBand reduces overhead on the CPU and provides maximum application performance.


insideHPC: Hi, I’m Rich with insideHPC. We’re here in Frankfurt at ISC 2015 at Champions Sports Bar and you guys are certainly the champions of low-latency interconnect. I wanted to ask you, what are the most important things to focus on when it comes to system interconnects?

Gilad Shainer: I think interconnect’s becoming more and more important. In the world of more and more data, it’s important to be able to move the data faster so you can analyze the data faster; you’re doing things in real time. It’s important for HPC, but it’s also become more and more important in other areas of data analytics, machine learning and other things. They all depend on moving the data faster and faster and faster.

Now, in the interconnect world, there are two architectures that people choose to implement. One is what is called an on-load architecture. An on-load architecture takes all the operations out of the interconnect and pretty much runs everything on the CPU, runs everything in software. Why? Because it’s easier to build network devices when they don’t need to be smart, and because you can change protocols, create things and modify them from generation to generation, since everything is done on the CPU.

The other approach is what’s called an off-load architecture, where you actually move things that are traditionally done on the CPU or in software and run them on the network devices, in the network, in hardware. When you look into performance and scalability, that’s what matters: how to increase application performance. How do I scale? How do I enable further efficiencies?

On-load doesn’t give you that, because in on-load you’re still running things on the CPU, running things in software, and you create a lot of overhead on the CPU. You reduce the CPU capacity that is left to run applications, because the CPU needs to manage the network, the CPU needs to create data packets, the CPU needs to check reliability and so on, so there are a lot of things the CPU does that are not related to the application at all. So you reduce CPU efficiency and you end up running things much slower.

QLogic, for example, took this approach in the days when they did InfiniBand, and that’s why QLogic didn’t really manage to take market share, because it didn’t provide efficiency or performance at this scale.

Intel acquired the QLogic InfiniBand parts and now they are maintaining the same on-load architecture. Why? Because Intel makes CPUs, so it’s easier for them to do that, but it doesn’t give value to the user, it doesn’t give value on the application side. Okay?

On the other side, we believe in offloads. Moving things from the CPU to the network enables you to increase the performance of the application. It enables you to move things faster, it enables you to scale, and we see multiple pieces of evidence for that. We see it, for example, on the TOP500 list, where we connect most of the systems with InfiniBand. And the reason is that InfiniBand does give them the ability to scale, to achieve the performance to run what they want to run, and that is the architecture we’ll continue to invest in. This is where we think the path to exascale goes through.

insideHPC: Okay. So you guys have been using offload from the beginning at Mellanox and you’ve proven the scalability. I wanted to ask you about integration, because that’s where the focus seems to be. Is that the target?

Gilad Shainer: Yeah, it’s a good question, because there are other people talking about integration or not integration. But integration is not a target at the end of the day, right? We are not here to do integration, we are here to enable exascale. So taking a NIC device and a CPU device, still connected with PCI Express, putting them in one package and calling it an integrated device is not a target; it’s not what you want to focus on. You want to focus on, how do I enable the next level of performance? And that means you don’t look at integration as the target; you need to look at the data center as a whole.

You need to look at the applications running in the data center, and at the data center itself, and ask how you take the data center to the next level of performance. And when you look at it on that basis, you’ll understand that the only way to do it is with co-design. So working with partners and doing co-design between hardware and hardware, between hardware and software, and between software and software. For example, the project that we did with GPUDirect RDMA was a co-design between hardware and hardware to enable the next level of performance. So we’ve been able to reduce application latencies by 60-70%, and actually achieved 2X performance on different applications. That’s one example.

Now, in a software and hardware co-design, the idea is to move things from the software to the network, and when you move things from the software to the network you can actually make the next big thing. A few years back, the big thing was moving from a single core to multi-core. Why? Because with a single core, the most you could do was increase the frequency a little bit, and that was not the big deal. The big deal was moving to multi-core, because now you can enable many more processes to run. You look at that from an application perspective.

The same happens here. Network devices today run in nanoseconds. The Mellanox switch runs in 90 nanoseconds, faster than anything else that exists out there. My adapter card runs in less than 150 nanoseconds. So we’re talking about nanoseconds here, and with packaging or not packaging, you’re going to cut what, two or three nanoseconds? There is not much you can cut by moving into the package, by integration or packaging. So packaging or integration is not the target. The target is, how do you go to the next level? The next level is what we are doing today with a large ecosystem of partners: looking into the data center, application-wise, and looking into complete communication scenarios, taking those complete communication scenarios that take tens of microseconds on the CPU today. Integration or packaging will not help you get those any faster.

We’re moving those and mapping them onto all the network devices, and once you move and map them onto all the network devices in an off-load way, you can take operations that take tens of microseconds and run them in single-digit microseconds, one or two microseconds.
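To put the figures above side by side, here is a rough back-of-the-envelope sketch. The 90 ns, 150 ns, two-to-three ns and one-to-two microsecond values come from the interview; the 20 microsecond value is only an assumed stand-in for "tens of microseconds", so the resulting percentages are illustrative, not measured results.

```python
# Figures quoted in the interview, in nanoseconds.
SWITCH_NS = 90             # switch latency
ADAPTER_NS = 150           # adapter latency ("less than 150 ns", so an upper bound)
PACKAGING_SAVING_NS = 3    # "two or three nanoseconds", upper end
OFFLOAD_OPERATION_NS = 2_000   # "one or two microseconds", upper end

# ASSUMPTION: an illustrative value for "tens of microseconds" on the CPU.
ONLOAD_OPERATION_NS = 20_000


def saving_fraction(saved_ns, total_ns):
    """Return the share of total_ns removed by saving saved_ns."""
    return saved_ns / total_ns


# Share of one communication scenario removed by tighter packaging alone.
packaging_gain = saving_fraction(PACKAGING_SAVING_NS, ONLOAD_OPERATION_NS)

# Share removed by running the whole scenario on the network devices.
offload_gain = saving_fraction(
    ONLOAD_OPERATION_NS - OFFLOAD_OPERATION_NS, ONLOAD_OPERATION_NS
)

# How many times faster the offloaded scenario is under these assumptions.
speedup = ONLOAD_OPERATION_NS / OFFLOAD_OPERATION_NS

print(f"hardware path (switch + adapter): {SWITCH_NS + ADAPTER_NS} ns")
print(f"packaging saves {packaging_gain:.3%} of the scenario")
print(f"offload saves {offload_gain:.0%} of the scenario ({speedup:.0f}x faster)")
```

Under these assumed numbers, packaging trims a few hundredths of a percent, while offloading the full communication scenario removes most of its time, which is the contrast being drawn in the answer above.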

insideHPC: So, in these terms, when we’re thinking about the whole system, the whole cluster, flexibility and topology seem to me to really matter, because my application’s not going to be the same as Brian’s over there. His topology might not work for me. I wanted to ask you about flexibility in this offload approach.

Gilad Shainer: Right, that’s a good question. Flexibility is important. Standards, flexibility and open source enable people to innovate, so that’s why it’s important. InfiniBand was designed by a large ecosystem to be the most flexible network out there. InfiniBand was the first SDN network. If we look at it, InfiniBand switches, for example, support different topologies by implementing routing tables. Those are tables that map incoming data to where it needs to go out, and since those are open tables that are exposed, you can support any topology that you want with InfiniBand switches. Whether it’s a Fat Tree, or islands of Fat Trees, or Toruses, or Dragonflies, or meshes, or anything that is going to come in the future, you can support that. And it’s not only that, it’s also exposed to the user, so you can go and change the routing yourself and optimize it for your application. This is where innovation comes from, this is where performance comes from.
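The routing-table idea described above can be sketched in a few lines. This is a simplified illustration, not how any real subnet manager or switch firmware works: the switch names, port numbers and the helpers `build_forwarding_tables` and `route` are all hypothetical. Each switch holds a table mapping a destination to an output port; the tables are computed from whatever topology is supplied (here a four-switch ring), and a user can overwrite an entry to steer traffic differently.

```python
from collections import deque


def build_forwarding_tables(links):
    """Compute per-switch tables {destination: output_port} by shortest hop count.

    links maps each switch name to {port_number: neighbor_name}.
    """
    tables = {}
    for src in links:
        table = {}
        visited = {src}
        queue = deque()
        # Seed the search with direct neighbors, remembering the first-hop port.
        for port, neighbor in sorted(links[src].items()):
            if neighbor not in visited:
                visited.add(neighbor)
                table[neighbor] = port
                queue.append((neighbor, port))
        # Breadth-first search: every node reached inherits its first-hop port.
        while queue:
            node, first_port = queue.popleft()
            for _, neighbor in sorted(links.get(node, {}).items()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    table[neighbor] = first_port
                    queue.append((neighbor, first_port))
        tables[src] = table
    return tables


def route(links, tables, src, dst, max_hops=16):
    """Follow the tables hop by hop and return the list of switches visited."""
    path = [src]
    while path[-1] != dst:
        if len(path) > max_hops:
            raise RuntimeError("routing loop or unreachable destination")
        out_port = tables[path[-1]][dst]
        path.append(links[path[-1]][out_port])
    return path


# Hypothetical topology: four switches in a ring (a one-dimensional torus).
ring = {
    "s0": {1: "s1", 2: "s3"},
    "s1": {1: "s2", 2: "s0"},
    "s2": {1: "s3", 2: "s1"},
    "s3": {1: "s0", 2: "s2"},
}

tables = build_forwarding_tables(ring)
print(route(ring, tables, "s0", "s2"))  # default path via s1

# Because the table is exposed, a user can re-point one entry,
# for example to send s0 -> s2 traffic the other way around the ring.
tables["s0"]["s2"] = 2
print(route(ring, tables, "s0", "s2"))  # now via s3
```

Nothing in the table format depends on the ring; passing a different `links` dictionary (a tree, a mesh) yields tables for that topology, which is the flexibility point made above.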

With the proprietary protocols people try to invent comes a lack of flexibility. We see new proprietary protocols coming out, protocols that cannot map arbitrary routing onto their switches, and they are forcing users to use a specific topology because of that lack of flexibility, because they cannot support anything else, and because of various issues they need to deal with. That’s something we don’t believe will take us to the future or enable the next level of applications.

So Mellanox will continue to invest in what we do: being open, supporting those multiple protocols, having those routing tables, exposing that interface to the user, and enabling the user not just to think about how to optimize applications but also how to build the next exascale systems, which might use a different topology or support other things. And that enables not just the ability to optimize now, but also to innovate in the future.
