At the recent SC14 conference in New Orleans, Rich Brueckner of insideHPC met up with Gilad Shainer to learn more about the latest InfiniBand technology advancements from Mellanox Technologies and the next generation of GPUDirect.
insideHPC: This is the first night [of SC14] and it’s going to be a great week. Speaking of that Gilad, what’s new at Mellanox?
Gilad Shainer: Well, there are many new things. From our perspective, the most interesting thing here is our announcement of end-to-end 100-gigabit per second. It's a major achievement. We have worked on 100-gigabit per second for the last three years. We waited for it; other people waited for it. And now we've announced the end-to-end 100-gigabit per second solution, which includes the cables, and we're demonstrating 100-gigabit per second over four meters, six meters and eight meters of copper, which is amazing to be able to drive at that distance, right? We're also demonstrating five meters and 100 meters on fiber, covering both copper and fiber. The switch enables latencies of less than 90 nanoseconds, so with the 100-gigabit per second switch we double the throughput and cut the latency by half. Amazing achievement. We start shipping those for revenue this quarter. This week we also announced ConnectX-4, which is the 100-gigabit per second adapter supporting both InfiniBand and Ethernet. That completes the 100-gigabit per second end-to-end, and with that we'll enable building the next generation of high performance computing infrastructures running at amazing speeds.
insideHPC: Speaking of that, I wanted to ask you about CORAL. This is a pair of 150 petaflop type machines coming to Oak Ridge and Livermore, right? So these things are going to be POWER9, NVIDIA accelerators and Mellanox?
Gilad Shainer: Mellanox EDR 100-gigabit per second.
insideHPC: Oh, EDR. Okay, that’s the detail I missed. So three years from now, are you going to be ready?
Gilad Shainer: We are ready now. We have the EDR switches on the show floor, and we have already announced the adapter, so of course we'll be ready. We were very proud that the DOE selected the combination of IBM POWER, NVIDIA GPUs, and Mellanox EDR for building the next generation of leadership systems: as we said, one at Oak Ridge and the second at Livermore. You know, the DOE reviewed all the technology that exists today and the road maps of multiple companies going forward, and decided that this combination is the best technology they can get. So we feel very proud; we work very hard and drive the technology as fast as we can. In the 2017 time frame, we plan to go out with HDR, at a 200-gigabit per second speed.
insideHPC: Okay, so HDR is what comes after EDR?
Gilad Shainer: That’s what comes after EDR. We were proud again to be selected by the DOE and continue to drive the technology as fast as we can.
insideHPC: Well, I’ve got to tell you I was excited, because I see this as a major step towards exascale. Would you agree?
Gilad Shainer: I agree. I think those systems are paving the path to exascale. The technology being awarded now is there to make sure the Department of Energy will have it for the 150 petaflop systems, which are getting very close to exascale. So the technology will continue on and be used in different future systems.
insideHPC: Great. Gilad, I understand that you have an announcement with NVIDIA this week. Can you tell me more?
Gilad Shainer: So, we continue to work with the ecosystem. We're working with IBM as part of OpenPOWER; we announced cache coherency capabilities, a coherency protocol, as part of ConnectX-4 that connects to the POWER8 from IBM. There's more work we are doing on the ARM platform: we are doing a demonstration with 64-bit ARM as part of the HP Moonshot 64-bit ARM enterprise platform. There's work we are doing on x86 on the software side. With NVIDIA this week, we announced the next generation of GPUDirect. We've been working with NVIDIA for a few years already. We released GPUDirect 1.0, 2.0, and 3.0, which is also called GPUDirect RDMA, and now we're announcing GPUDirect 4.0, which is GPUDirect [?]. The major difference is that GPUDirect RDMA enabled, for the first time, the data path to go directly from the GPU to the network, achieving great performance benefits; the control path still went through the CPU. Now, with GPUDirect 4.0, both the data path and the control path go directly between the GPU and the network. So, if with GPUDirect 3.0 we achieved about a 2x performance improvement, now, bringing in the control path, we see another 20 to 30 percent performance improvement for applications.
insideHPC: Lower latency again, right?
Gilad Shainer: Lower latency, a better connection between those elements, and as a result, higher application performance.
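[Editor's note: For readers unfamiliar with how GPUDirect RDMA appears in application code, the core idea Shainer describes is that a GPU buffer can be registered directly with the InfiniBand adapter, so the NIC reads and writes GPU memory without staging through host RAM. The sketch below uses the CUDA runtime and verbs APIs; the protection domain `pd`, queue pair `qp`, buffer size, and driver-side peer-memory support are assumed to exist already. It is an illustrative sketch of the data-path registration pattern, not a complete runnable program.]

```c
/* Illustrative sketch: registering GPU memory for RDMA (GPUDirect RDMA).
 * Assumes an existing struct ibv_pd *pd and struct ibv_qp *qp, plus the
 * kernel-side peer-memory support that lets the HCA map GPU pages. */
#include <stdint.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

void post_gpu_send(struct ibv_pd *pd, struct ibv_qp *qp, size_t buf_size)
{
    void *gpu_buf;
    cudaMalloc(&gpu_buf, buf_size);      /* allocate directly on the GPU */

    /* With GPUDirect RDMA, the device pointer itself can be registered
     * with the HCA; without it, data would first have to be copied into
     * a host bounce buffer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, buf_size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* The work request points the HCA at GPU memory, so the data path
     * bypasses the CPU entirely. Note that the CPU still posts this
     * request: that is the control path Shainer says the GPUDirect 4.0
     * generation moves to the GPU as well. */
    struct ibv_sge sge = { .addr   = (uintptr_t)gpu_buf,
                           .length = (uint32_t)buf_size,
                           .lkey   = mr->lkey };
    struct ibv_send_wr wr = { .sg_list = &sge,
                              .num_sge = 1,
                              .opcode  = IBV_WR_SEND };
    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);
}
```

In the GPUDirect 3.0 model above, only the `ibv_post_send` call (and completion polling) keeps the CPU in the loop; the payload itself never touches host memory.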
insideHPC: Sounds like an exciting week in store for Mellanox. Congratulations.
Gilad Shainer: Thank you very much, Rich.