Mellanox intros 120 Gbps switch, application offloading of MPI into the adapter

Today Mellanox Technologies announced two new additions to their InfiniBand technology offering from the show floor at SC09.

120 Gbps InfiniBand switch

First up is a 120 Gbps InfiniBand switch. From the release:

Based on InfiniScale IV, Mellanox’s 4th generation of InfiniBand switch silicon, the IS5000 switch system family delivers the highest networking bandwidth per port to enable the next generation of high-performance computing, cloud infrastructures and enterprise data centers. The new switch solutions reduce network congestion and the number of network cables by a factor of three, providing customers with the optimal combination of cost-effective, proven performance and efficiency enhancements to address next-generation, Petascale computing demands.

This switch is actually getting some air time on the show floor this year as the hardware enabling the 120 Gbps IB network that exhibitors can connect to as part of SCinet. Mellanox’s John Monson told me in a conversation ahead of the announcement that the switch itself is ready, but it won’t reach general availability until Q1 of next year, giving the company time to develop the ecosystem of products that go with it.

As I was talking to Monson, we got sidetracked into a discussion of where Mellanox’s business is, and I was fascinated to learn that China alone accounted for 40% of Mellanox’s revenue last quarter (the recently announced Tianhe, for example, is Mellanox end-to-end according to Monson). Russia is also significant in terms of revenue, an external indicator that Russia’s stated interest in HPC is turning into action. Spotting a pattern, I asked about the other half of the BRICs: India and Brazil. India is “on the radar,” according to Monson, but Brazil is still developing.

MPI communication offloads

The other significant technology move the company announced from SC09 this week is application offloading as part of their ConnectX-2 InfiniBand adapters, which were announced in August of this year.

Of course you are probably familiar with the idea of offloading network protocol overhead onto adapter cards. An example is the TCP/IP offload engine (TOE), which moves protocol management (adding headers, forming packets, and so on) away from the processor and onto the network card itself, freeing the processor to do more application work. Mellanox’s IB cards already do this. “Application offload” is the same idea, only now extended to things like collective operations in MPI.
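To make that concrete, here is a minimal, entirely generic MPI program using the kind of collective call that application offload targets. Nothing in it is Mellanox-specific; any offload would happen transparently underneath the MPI library, with no change to code like this.

```c
#include <mpi.h>
#include <stdio.h>

/* A plain MPI collective: every rank contributes a value, every rank
 * receives the combined result. Today the host CPUs do the packing,
 * sending, receiving and reducing; application offload moves that
 * processing onto the adapter. */
int main(int argc, char **argv)
{
    int rank, size;
    double local_sum, global_sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local_sum = (double)rank;   /* stand-in for real per-rank work */

    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum across %d ranks: %f\n", size, global_sum);

    MPI_Finalize();
    return 0;
}
```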

Broadly speaking, applications do two kinds of work: compute and communicate. In MPI applications, process-to-process communication starts in the CPU, where MPI messages are packed up and passed down to the NIC, which then uses its protocol (IB in this case) to send them to the receiving process. On the receiving side, once the NIC reconstructs the data stream, it passes that data back up to the processor, where it is reassembled into MPI messages and handed off to the parallel application. Having the sending and receiving processors so intimately involved in processing MPI data for inter-process communication introduces noise and jitter and reduces the ability to keep the system fully synchronized, all of which hurts application scalability in the 20-40% range, according to the company.
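If you want to see that host-side involvement for yourself, one rough way is to time the same collective many times and look at the spread between the fastest and slowest iterations; the spread, rather than the minimum, is what OS noise and CPU-side message processing inflate. The sketch below is standard MPI, nothing vendor-specific.

```c
#include <mpi.h>
#include <stdio.h>

/* Time a small broadcast repeatedly and report min/max on rank 0.
 * On a quiet, well-synchronized system min and max are close; host-side
 * MPI processing and OS noise show up as a widening gap. */
int main(int argc, char **argv)
{
    const int iters = 1000;
    double t, tmin = 1e9, tmax = 0.0;
    int rank, i, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < iters; i++) {
        MPI_Barrier(MPI_COMM_WORLD);      /* line everyone up first */
        t = MPI_Wtime();
        MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        t = MPI_Wtime() - t;
        if (t < tmin) tmin = t;
        if (t > tmax) tmax = t;
    }

    if (rank == 0)
        printf("bcast latency: min %.3f us, max %.3f us\n",
               tmin * 1e6, tmax * 1e6);

    MPI_Finalize();
    return 0;
}
```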

Mellanox was part of a team funded by the Department of Energy to find a solution to this problem. The result is application offload, which moves the MPI processing portion of application communication down to the NIC as well, leaving the CPU free to spend its cycles on application compute.

From the release:

Mellanox ConnectX-2 InfiniBand adapters introduce a new offloading architecture that provides the capability to offload application communications frequently used by scientific simulation for data broadcast, global synchronization and data collection. By offloading these collective communications, ConnectX-2 adapters help to reduce simulation completion time by accelerating the synchronization process and freeing up CPU cycles to work on the simulation, and enable greater scalability by eliminating system jitter and noise — the biggest issues for performance at scale.
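There is no new API for the application programmer here; the offload sits underneath the MPI library. As a rough illustration of the compute/communication overlap this is meant to enable, the sketch below uses the MPI-3 nonblocking collective interface (which postdates this announcement) to let the CPU keep working on the application while the collective is in flight; with the collective progressed on the adapter, that overlap becomes real rather than notional.

```c
#include <mpi.h>
#include <stdio.h>

/* Illustration only: start a collective, keep computing, then wait.
 * MPI_Iallreduce is the MPI-3 nonblocking interface; hardware collective
 * offload is what lets the interval between start and wait be spent on
 * simulation work instead of host-side message processing. */
static double do_local_work(int n)
{
    double acc = 0.0;
    for (int i = 1; i <= n; i++)
        acc += 1.0 / i;                 /* stand-in for simulation compute */
    return acc;
}

int main(int argc, char **argv)
{
    int rank;
    double local = 1.0, global = 0.0, extra;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Start the collective... */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ...and keep the CPU busy with application work while it runs. */
    extra = do_local_work(1000000);

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("reduced: %f (overlapped work: %f)\n", global, extra);

    MPI_Finalize();
    return 0;
}
```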

The technology was developed in collaboration with Oak Ridge (which is being recognized this week with the inaugural insideHPC HPC Community Leadership Award), and is delivered as a firmware change supported only in the ConnectX-2 line of adapters. The company expects beta users in Q1 of next year.