OFA Workshop in Austin to Put Spotlight on InfiniBand and RoCE

In this special guest feature, Bill Lee from IBTA writes that the upcoming OpenFabrics Workshop in Austin will feature a number of talks on the latest in InfiniBand and RoCE technologies.

Bill Lee is Chair of the Marketing Working Group at IBTA

The OpenFabrics Alliance (OFA) Workshop is an annual event devoted to advancing the state of the art in networking. The workshop is known for showcasing a broad range of topics all related to network technology and deployment through an interactive, community-driven event. The comprehensive event includes a rich program made up of more than 50 sessions covering a variety of critical networking topics, which range from current deployments of RDMA to new and advanced network technologies.

Update: View the Final Agenda with a full list of abstracts.

This year’s workshop program will also feature some notable sessions that showcase the latest developments in InfiniBand and RoCE technology. Below is the collection of OFA Workshop 2017 sessions that we recommend you check out:

Developer Experiences of the First Paravirtual RDMA Provider and Other RDMA Updates, presented by Adit Ranadive, VMware. VMware’s Paravirtual RDMA (PVRDMA) device is a new NIC in vSphere 6.5 that allows VMs in a cluster to communicate using Remote Direct Memory Access (RDMA), while maintaining latencies and bandwidth close to those of physical hardware. Recently, the PVRDMA driver was accepted as part of the Linux 4.10 kernel and our user-library was added as part of the new rdma-core package. In this session, we will provide a brief overview of our PVRDMA design and capabilities. Next, we will discuss our development approach and challenges for joint device and driver development. Further, we will highlight our experience upstreaming the driver and library with the new changes to the core RDMA stack. We will provide an update on the performance of the PVRDMA device along with upcoming updates to the device capabilities. Finally, we will provide new results on the performance achieved by several HPC applications using VM DirectPath I/O. This session seeks to engage the audience in discussions on: 1) new RDMA provider development and acceptance, and 2) hardware support for RDMA virtualization.

Experiences with NVMe over Fabrics, presented by Parav Pandit, Mellanox. NVMe is an interface specification to access non-volatile storage media over PCIe buses. The interface enables software to interact with devices using multiple, asynchronous submission and completion queues, which reside in memory. Consequently, software may leverage the inherent parallelism and low latency of modern NVMe devices with minimal overhead. Recently, the NVMe specification has been extended to support remote access over fabrics, such as RDMA and Fibre Channel. Using RDMA, NVMe over Fabrics (NVMe-oF) provides the high-bandwidth and low-latency characteristics of NVMe to remote devices. Moreover, these performance traits are delivered with negligible CPU overhead as the bulk of the data transfer is conducted by RDMA. In this session, we present an overview of NVMe-oF and its implementation in Linux. We point out the main design choices and evaluate NVMe-oF performance for both InfiniBand and RoCE fabrics.
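
To make the queue model in this abstract more concrete, here is a toy Python sketch of paired submission and completion queues held in memory. It is only an illustration: the class and method names are hypothetical, the "device" is simulated in-process, and nothing here reflects the actual NVMe data layout or the Linux NVMe-oF implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    """Toy model of one submission/completion queue pair (hypothetical names)."""
    depth: int
    submission: deque = field(default_factory=deque)
    completion: deque = field(default_factory=deque)

    def submit(self, command_id: int, opcode: str) -> bool:
        # Host side: enqueue a command if the submission queue has room.
        if len(self.submission) >= self.depth:
            return False
        self.submission.append((command_id, opcode))
        return True

    def device_process(self) -> None:
        # Simulated device side: consume each queued command and
        # post one completion entry for it.
        while self.submission:
            command_id, _opcode = self.submission.popleft()
            self.completion.append((command_id, "ok"))

    def reap(self) -> list:
        # Host side: collect whatever completions are available, without blocking.
        done = list(self.completion)
        self.completion.clear()
        return done


def demo():
    # Two independent queue pairs, for example one per CPU core
    # (an assumption of this sketch, not something the abstract states).
    pairs = [QueuePair(depth=2), QueuePair(depth=2)]
    pairs[0].submit(1, "read")
    pairs[0].submit(2, "write")
    accepted = pairs[0].submit(3, "read")  # queue is full, so this is refused
    pairs[1].submit(10, "read")
    for qp in pairs:
        qp.device_process()
    return accepted, pairs[0].reap(), pairs[1].reap()
```

Because each queue pair is independent, several of them can be filled and drained in parallel, which is the property the abstract credits for NVMe's low overhead.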

Validating RoCEv2 for Production Deployment in the Cloud Datacenter, presented by Sowmini Varadhan, Oracle. With the increasing prevalence of Ethernet switches and NICs in Data Center Networks, we have been experimenting with the deployment of RDMA over Converged Ethernet (RoCE) in our DCN. RDMA needs a lossless transport, and, in theory, this can be achieved on Ethernet by using Priority-based Flow Control (PFC, IEEE 802.1Qbb) and Explicit Congestion Notification (ECN, IETF RFC 3168). We describe our experiences in trying to deploy these protocols in a RoCEv2 testbed running at 100 Gbit/s and consisting of a multi-level Clos network. In addition to addressing the documented limitations around PFC/ECN (livelock, pause-frame storms, memory requirements for supporting multiple priority flows), we also hope to share some of the performance metrics gathered, as well as some feedback on ways to improve the tooling for observability and diagnosability of the system in a vendor-agnostic, interoperable way.

Host Based InfiniBand Network Fabric Monitoring, presented by Michael Aguilar, Sandia National Laboratories. Synchronized host-based monitoring of InfiniBand network counters on local connections at 1 Hz can provide a reasonable system snapshot of traffic injection into and ejection from the fabric. This type of monitoring is currently used to understand the data flow characteristics of applications and to infer congestion from application performance degradation. It cannot, however, enable identification of where congestion occurs or how well adaptive routing algorithms and policies react to and alleviate it. Without this critical information the fabric remains opaque and congestion management will continue to be largely handled through increases in bandwidth. To reduce fabric opacity, we have extended our host-based monitoring to include internal InfiniBand fabric network ports. In this presentation we describe our methodology along with preliminary timing and overhead information. Limitations and their sources are discussed along with proposed solutions, optimizations, and planned future work.

IBTA TWG – Recent Topics in the IBTA, and a Look Ahead, presented by Bill Magro, Intel on behalf of InfiniBand Trade Association. This talk discusses some recent activities in the IBTA including recent specification updates. It also provides a glimpse into the future for the IBTA.

InfiniBand Virtualization, presented by Liran Liss, Mellanox on behalf of InfiniBand Trade Association. InfiniBand Virtualization allows a single Channel Adapter to present multiple transport endpoints that share the same physical port. To software, these endpoints are exposed as independent Virtual HCAs (VHCAs), and thus may be assigned to different software entities, such as VMs. VHCAs are visible to Subnet Management, and are managed just like physical HCAs. This session provides an overview of the InfiniBand Virtualization Annex, which was released in November 2016. We will cover the Virtualization model, management, addressing modes, and discuss deployment considerations.

IPoIB Acceleration, presented by Tzahi Oved, Mellanox. The IPoIB protocol encapsulates IP packets over InfiniBand datagrams. As a direct RDMA Upper Layer Protocol (ULP), IPoIB cannot support HW features that are specific to the IP protocol stack. Nevertheless, RDMA interfaces have been extended to support some of the prominent IP offload features, such as TCP/UDP checksum and TSO. This provided reasonable performance for IPoIB. However, new network interface features are one of the most active areas of the Linux kernel. Examples include TSS and RSS, tunneling offloads, and XDP. In addition, the basic IP offload features are insufficient to cope with the increasing network bandwidth. Rather than continuously porting IP network interface developments into the RDMA stack, we propose adding abstract network data-path interfaces to RDMA devices. In order to present a consistent interface to users, the IPoIB ULP continues to represent the network device to the IP stack. The common code also manages the IPoIB control plane, such as resolving path queries and registering to multicast groups. Data path operations are forwarded to devices that implement the new API, or fall back to the standard implementation otherwise. Using the foregoing approach, we show how IPoIB closes the performance gap compared to state-of-the-art Ethernet network interfaces.

Packet Processing Verbs for Ethernet and IPoIB, presented by Tzahi Oved, Mellanox. As a prominent user-level networking API, the RDMA stack has been extended to support packet processing applications and user-level TCP/IP stacks, initially focusing on Ethernet. This allowed delivering low latency and high message-rate to these applications. In this talk, we provide an extensive introduction to both current and upcoming packet processing Verbs, such as checksum offloads, TSO, flow steering, and RSS. Next, we describe how these capabilities may also be applied to IPoIB traffic. In contrast to Ethernet support, which was based on Raw Ethernet QPs that receive unmodified packets from the wire, IPoIB packets are sent over a “virtual wire”, managed by the kernel. Thus, processing selective IP flows from user-space requires coordination with the IPoIB interface.

The Linux SoftRoCE Driver, presented by Liran Liss, Mellanox. SoftRoCE is a software implementation of the RDMA transport protocol over Ethernet. Thus, any host can conduct RDMA traffic without requiring a RoCE-capable NIC, allowing RDMA development anywhere. This session presents the Linux SoftRoCE driver, RXE, which was recently accepted to the 4.9 kernel. In addition, the RXE user-level driver is now part of rdma-core, the consolidated RDMA user-space codebase. RXE is fully interoperable with HW RoCE devices, and may be used for both testing and production. We provide an overview of the RXE driver, detail its configuration, and discuss the current status and remaining challenges in RXE development.

Ubiquitous RoCE, presented by Alex Shpiner, Mellanox. In recent years, the usage of RDMA in datacenter networks has increased significantly, with RoCE (RDMA over Converged Ethernet) emerging as the canonical approach to deploying RDMA in Ethernet-based datacenters. Initially, RoCE required a lossless fabric for optimal performance. This is typically achieved by enabling Priority Flow Control (PFC) on Ethernet NICs and switches. The RoCEv2 specification introduced RoCE congestion control, which allows throttling the transmission rate in response to congestion. Consequently, packet loss may be minimized and performance is maintained even if the underlying Ethernet network is lossy. In this talk, we discuss the latest developments in RoCE congestion control. Hardware congestion control reduces the latency of the congestion control loop; it reacts promptly in the face of congestion by throttling the transmission rate quickly and accurately; when congestion is relieved, bandwidth is immediately recovered. The short control loop also prevents network buffers from overfilling in many congestion scenarios. In addition, fast hardware retransmission complements congestion control in heavy congestion scenarios, by significantly reducing the penalty of packet drops.
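
The idea of throttling the transmission rate in response to congestion can be sketched with a generic additive-increase, multiplicative-decrease loop. This is a minimal Python illustration under stated assumptions, not the RoCEv2 congestion control algorithm or any vendor's hardware implementation; the function names, the 100 Gbit/s line rate, the halving factor, and the 5 Gbit/s recovery step are all made up for the example.

```python
def adjust_rate(rate_gbps, congested, line_rate_gbps=100.0,
                decrease_factor=0.5, increase_step_gbps=5.0,
                min_rate_gbps=1.0):
    """Return the next sending rate given one congestion signal.

    All parameter values are illustrative assumptions.
    """
    if congested:
        # Congestion reported: cut the rate multiplicatively, but never below the floor.
        return max(rate_gbps * decrease_factor, min_rate_gbps)
    # No congestion: recover bandwidth additively, capped at line rate.
    return min(rate_gbps + increase_step_gbps, line_rate_gbps)


def simulate(signals, start_rate_gbps=100.0):
    """Apply a sequence of congestion signals and record the rate after each one."""
    rates = []
    rate = start_rate_gbps
    for congested in signals:
        rate = adjust_rate(rate, congested)
        rates.append(rate)
    return rates
```

For instance, `simulate([True, True, False, False])` gives `[50.0, 25.0, 30.0, 35.0]`: two congestion signals halve the rate twice, and the rate then climbs back in fixed steps. The shorter the loop between a congestion signal and the rate adjustment, the less time buffers have to fill, which is the benefit the abstract attributes to doing this in hardware.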

Keep an eye out as videos of the OFA Workshop 2017 sessions will be published on both the OFA website and insideHPC.

Register now for the OFA Workshop

Check out our insideHPC Events Calendar
