Network Co-design as a Gateway to Exascale

Achieving Exascale performance (10^18 FLOPS) will not be a simple additive process (i.e., add more nodes, more accelerators, more switches, etc.). As mentioned, a fundamental change in the design of HPC systems is needed. In one sense, the balkanized software/hardware nature of many HPC systems has reached its limits. Software developers will need deeper access into the cluster so that codes can take advantage of hardware capabilities that were previously hidden. In addition, hardware designers will need to better understand specific software needs so hardware can be tweaked in new directions. In terms of network performance, there are four promising technologies that offer co-design capabilities at several levels.

This is the fourth article in a series from the insideHPC Guide to Co-Design Architectures.


Achieving better scalability and performance at Exascale will also require full data reach (i.e. users and designers must be able to analyze data wherever it is in the cluster). Without this capability, onload architectures force all data to move to the CPU before allowing any analysis. The ability to analyze data everywhere means that every active component in the cluster will contribute to the computing capabilities and boost performance. In effect, the interconnect will become its own “CPU” and provide in-network computing capabilities.

FCA and CORE-Direct

Mellanox Fabric Collective Accelerator™ (FCA) provides offload capability for collective operations on a cluster. Collectives include operations like MPI broadcasts for sending around initial input data, reductions for consolidating data from multiple sources, and barriers for global synchronization. These types of operations require synchronization of all the cores involved with an application and can create bottlenecks and reduce performance on clusters of moderate size.

FCA reduces these bottlenecks by offloading the collective communications to the InfiniBand host channel adapters. In addition to MPI, SHMEM/PGAS and UPC based applications are supported. The underlying technology, named CORE-Direct® (Collectives Offload Resource Engine), reduces CPU overhead and provides the capability to overlap communication operations with computation, allowing applications to maximize asynchronous communication opportunities.

In addition to CPU cycle overhead, OS jitter can increase latencies across collective operations. OS jitter is caused by small, randomly occurring (and necessary) OS interrupts taking place during the progression of collective operations. At scale, OS jitter can significantly reduce collective performance. By moving the collectives off the CPU and main OS, jitter can be drastically reduced.

Currently, FCA 3.0 supports the following MPI collectives (a short overlap sketch follows the list):

  • Allgather/Iallgather
  • Allreduce/Iallreduce
  • Barrier/Ibarrier
  • Bcast/Ibcast
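
To illustrate the communication/computation overlap that CORE-Direct offload makes possible, the following minimal sketch starts a non-blocking MPI_Iallreduce, continues with independent work, and then waits for completion. This is standard MPI-3 code; whether the collective is actually progressed by FCA/CORE-Direct on the HCA depends on the MPI library and cluster configuration rather than on the code itself.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = rank + 1.0, global = 0.0;
        MPI_Request req;

        /* Start the reduction; with collective offload the HCA can progress it. */
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* ...independent computation can proceed here while the collective runs... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 0)
            printf("global sum = %g\n", global);

        MPI_Finalize();
        return 0;
    }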

Scalable Hierarchical Aggregation Protocol (SHArP)

FCA and CORE-Direct technology are actually the first steps in bringing co-design to the HPC network. The recently introduced SHArP (Scalable Hierarchical Aggregation Protocol) technology from Mellanox moves support for collective communication from the network edges (on the hosts) to the core of the network (the switch fabric). Processing of collective communication is performed on dedicated silicon within the Mellanox InfiniBand switch (Switch-IB).

As an example, MiniFE is a finite-element mini-application that implements kernels representative of implicit finite-element applications. The results for a SHArP cluster vs. a non-SHArP cluster are shown in Figure Eleven. Note the large number of nodes used for the benchmark and the improved performance as the number of nodes increases.
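
One simple way to observe this effect on a given cluster is to time a blocking MPI_Allreduce across an increasing number of nodes, much like the MiniFE comparison above; whether SHArP handles the reduction depends on the switch hardware and the MPI/collective library configuration rather than on the code. A minimal timing sketch:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int iters = 1000;
        double in = rank, out = 0.0;

        /* Synchronize, then measure the average latency of a small allreduce. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d ranks: average allreduce latency = %g us\n",
                   size, (t1 - t0) / iters * 1e6);

        MPI_Finalize();
        return 0;
    }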

Unified Communication X Framework: UCX

UCX is a framework (a collection of libraries and interfaces) that provides an efficient and relatively easy way to construct widely used HPC communication protocols, including MPI tag matching, RMA operations, rendezvous protocols, streams, fragmentation, and remote atomic operations.

UCX was developed based on the needs and overlapping interests of three separate projects: Mellanox Messaging (MXM), IBM Parallel Active Messaging Interface (PAMI), and the Universal Common Communication Substrate (UCCS) from the University of Tennessee. Each of these projects needed a generalized framework with which to use high-performance networks.

The UCX founding members include DOE’s Oak Ridge National Laboratory (ORNL), IBM, the University of Tennessee (UTK), and NVIDIA. Additional members include the University of Houston (UH), Pathscale, SGI, ARM Holdings, and Los Alamos National Laboratory. The project combines the expertise and contributions of the following members:

  • Mellanox will co-design the network interface and contribute MXM technology, infrastructure, transport, shared memory, protocols, and integration with OpenMPI/SHMEM/MPICH
  • ORNL will co-design the network interface and contribute items from the UCCS project: InfiniBand optimizations, Cray devices, and shared memory
  • NVIDIA will co-design high-quality support for GPU devices, GPUDirect, GDR copy, etc.
  • IBM will co-design the network interface and contribute ideas and concepts from PAMI
  • UH/UTK will focus on integration with their research platforms

The goals of UCX are to spur collaboration between industry, laboratories, and academia; to create an open-source (BSD 3-Clause license), production-grade communication framework for data-centric and HPC applications; and to enable the highest performance through co-design of software/hardware interfaces. The key UCX components include:

  • UC-S for Services. A basic infrastructure for component-based programming, memory management, and useful system utilities. Functionality: platform abstractions and data structures.
  • UC-T for Transport. A low-level API that exposes basic network operations supported by the underlying hardware. Functionality: work request setup and instantiation of operations.
  • UC-P for Protocols. A high-level API that uses the UC-T framework to construct protocols commonly found in applications (see the sketch after this list). Functionality: multi-rail, device selection, pending queue, rendezvous, tag matching, software atomics, etc.
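
As a concrete illustration, the sketch below initializes the UC-P layer: it reads the UCX configuration from the environment, creates a context that requests tag-matching support, and creates a worker (the progress engine). The calls follow the public UCP C API, though exact structure fields and constants may differ between UCX releases:

    /* Minimal UCP initialization sketch; assumes a UCX installation that
     * provides <ucp/api/ucp.h>. */
    #include <ucp/api/ucp.h>
    #include <stdio.h>

    int main(void)
    {
        ucp_config_t *config;
        ucp_context_h context;
        ucp_worker_h worker;
        ucp_params_t ctx_params = {0};
        ucp_worker_params_t worker_params = {0};

        /* Read transport and device selection from UCX_* environment variables. */
        if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
            return 1;

        /* Request tag matching, e.g., to build an MPI-style messaging layer. */
        ctx_params.field_mask = UCP_PARAM_FIELD_FEATURES;
        ctx_params.features   = UCP_FEATURE_TAG;

        if (ucp_init(&ctx_params, config, &context) != UCS_OK)
            return 1;
        ucp_config_release(config);

        /* A worker is the progress engine bound to the selected transports. */
        worker_params.field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE;
        worker_params.thread_mode = UCS_THREAD_MODE_SINGLE;

        if (ucp_worker_create(context, &worker_params, &worker) != UCS_OK) {
            ucp_cleanup(context);
            return 1;
        }

        printf("UCP context and worker initialized\n");

        ucp_worker_destroy(worker);
        ucp_cleanup(context);
        return 0;
    }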

The UCX framework offers a unique co-design methodology that will allow software designers to use high-performance aspects of the interconnect in a portable fashion. The UCX team clearly states that UCX is not a driver; rather, it takes advantage of a close-to-hardware API layer providing standard access to the hardware’s capabilities. The hardware vendors will supply UCX drivers. The UCX framework will serve as a high-performance, low-latency communication layer that will help application developers provide productive, extreme-scale programming languages and libraries, which may include Partitioned Global Address Space (PGAS) APIs such as Fortran coarrays and OpenSHMEM, as well as OpenMP across multiple memory domains and on heterogeneous nodes.

Cache Coherent Interconnect for Accelerators: CCIX

A new project to create an interconnect that would allow different CPUs and accelerators to communicate while sharing main memory is called CCIX. Specifically, Advanced Micro Devices, ARM Holdings, Huawei Technologies, IBM, Mellanox, Qualcomm, and Xilinx jointly announced that they would collaborate to build a cache-coherent fabric to interconnect their CPUs, accelerators, and networks. Past efforts that provided similar functionality include the IEEE Standard for Scalable Coherent Interface (SCI).

The important (and difficult) aspect is cache coherency across the various devices. In current systems, GPU-derived accelerators and FPGAs are accessed through the local PCIe bus. This design effectively makes these devices slaves to the main processor, each with its own private memory domain. There is no memory coherence between these domains. While convenient, the PCIe bus does not provide an ideal interconnect between two high-performance devices. For many applications, the transfer of data from the processor domain to the accelerator domain can be a slow step in the computation.

CCIX provides a shared memory architecture so that each device would have equal access to one large memory domain. In order to make the shared memory work, each device would need to be cache-coherent with main memory (i.e., changes to the values stored in a local cache need to be visible to all devices).
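
To make the contrast concrete, the following conceptual sketch models both approaches in plain C; the "accelerator" here is simulated on the host to show the data movement and is not a real CCIX, CAPI, or GPU API. The PCIe-style path stages data into a private buffer and copies results back, while the coherent path lets the accelerator work directly on the memory the CPU sees:

    /* Conceptual sketch only: the "accelerator" is simulated on the host to
     * illustrate data movement; this is not real CCIX, CAPI, or GPU code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Simulated accelerator kernel: scales a buffer in place. */
    static void kernel_scale(double *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] *= 2.0;
    }

    /* Today's PCIe model: copy in, compute in private device memory, copy out. */
    static void pcie_style(double *host_data, size_t n)
    {
        double *dev_private = malloc(n * sizeof *dev_private); /* private device memory */
        memcpy(dev_private, host_data, n * sizeof *host_data); /* host -> device copy */
        kernel_scale(dev_private, n);
        memcpy(host_data, dev_private, n * sizeof *host_data); /* device -> host copy */
        free(dev_private);
    }

    /* CCIX-style model: one coherent memory domain, no explicit staging copies. */
    static void coherent_style(double *shared_data, size_t n)
    {
        kernel_scale(shared_data, n); /* operates on the same memory the CPU sees */
    }

    int main(void)
    {
        double a[4] = {1, 2, 3, 4}, b[4] = {1, 2, 3, 4};
        pcie_style(a, 4);
        coherent_style(b, 4);
        printf("pcie result: %g, coherent result: %g\n", a[0], b[0]);
        return 0;
    }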

Previous efforts from both IBM and NVIDIA have produced similar technologies to address this issue. Starting with POWER8 systems, IBM has released the Coherent Accelerator Processor Interface (CAPI), which is also used by Xilinx to improve performance. CAPI is specific to IBM and will only work on their hardware. At the same time, NVIDIA has developed its own solution, called NVLink, that moves data between NVIDIA GPUs and CPUs faster than the current PCIe bus. At this point in time NVIDIA is not part of the CCIX project. It is reasonable to expect CAPI and CCIX to converge at some point.

If CCIX materializes, it will provide a co-design platform where system designers can mix and match different types of computing devices (CPUs, GPUs, FPGAs, DSPs) to provide optimal performance for specific workloads.

Co-Design Architecture in practice: Oak Ridge and Lawrence Livermore National Labs

The co-design model is more than a clever idea. The recent supercomputing refresh in the United States was announced as part of the Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) procurement. As part of the project, three systems are to be delivered in the 2017/2018 time frame. The procurement was limited to processor makers and required that the three systems not all share the same architecture. The procurement was awarded to IBM (Oak Ridge, Lawrence Livermore) and Intel (Argonne). The machine at Oak Ridge National Laboratory (ORNL) has been named Summit, and the Lawrence Livermore (LLNL) machine has been named Sierra.

Two centers of excellence have been established by IBM at LLNL and ORNL as part of the procurement. The goal of the centers is to get IBM system engineers close to the application developers so collaboration is possible. The centers will focus on application software as the machine hardware is being developed. The interaction is intended to generate feedback between the system developers and the application writers. Instead of delivering a system in a box, the co-design teams will develop the machine as it is built. The centers will bring together the people who know the science, the people who know the code, and the people who know the machines.

Applications developed at the centers will take advantage of innovations from the OpenPOWER community of developers, while work at the centers will also benefit general-purpose OpenPOWER-based commercial systems.

The Summit system is expected to deliver 120-150 petaFLOPS of performance using approximately 3,400 nodes when it arrives in 2017. Like previous ORNL systems, Summit will have a hybrid architecture. Each node will contain multiple IBM POWER9 processors and NVIDIA Volta GPUs connected together with NVIDIA’s high-speed NVLink. Each node will have over half a terabyte of memory addressable by both the CPUs and GPUs. There will also be 800 GB of non-volatile RAM that can be used as a burst buffer or extended memory. To provide the best messaging rate, the nodes will be connected in a non-blocking fat tree using a dual-rail Mellanox EDR InfiniBand interconnect, with all the co-design capabilities mentioned previously.

Similar to Summit, LLNL’s Sierra is expected to provide 120-150 petaFLOPS of performance. The Sierra compute nodes will include IBM POWER9 processors, NVIDIA Volta GPUs, an 800 GB NVMe-compatible PCIe SSD, and greater than 512 GB of coherent shared memory. The global parallel file system will provide 120 PB of usable storage and 1.0 TB/s of bandwidth. Like the Summit system, all nodes will be connected in a non-blocking fat tree using a dual-rail Mellanox EDR InfiniBand interconnect, with all the co-design capabilities mentioned previously.

The modeling and simulation applications that will be designed or adapted to use these supercomputers include applications from cosmology, climate research, biophysics, astrophysics, and other big science projects at ORNL. LLNL will develop applications for the US nuclear weapons program and other national security areas.

Over the previous and next few weeks we will explore each of these topics in detail.

If you prefer you can download the insideHPC Guide to Co-Design Architectures from the insideHPC White Paper Library.