In this guest feature, QLogic’s Joe Yaworski writes about improving collective performance on InfiniBand.
Today’s HPC clusters tend to be larger in terms of node count, and each node now uses faster processors with denser core counts. As a result, performance at scale is critical to application performance on these larger, faster clusters, and the performance of the interconnect is the key factor that determines how well an HPC cluster performs at scale. Several factors determine the performance of the interconnect, including the following:
- Scalable latency
- High non-coalesced message rate performance
- Optimized collective performance
This article focuses on the performance of MPI collective operations. Collective performance is critical to scaling the performance of an MPI application on an HPC cluster.
About Collective Operations
In High Performance Computing (HPC), MPI is the standard for communication among the processes of a parallel program running on a cluster. A collective operation is a concept in parallel computing in which data is simultaneously sent to or received from many nodes. Collective functions in the MPI API involve communication among all processes in a process group (which can mean the entire process pool or a program-defined subset). These calls are often useful at the beginning or end of a large distributed calculation, where each processor operates on a part of the data and the pieces are then combined into a result. Common examples of collective operations are “gather” (in which data is collected from all nodes), “scatter” (in which a set of data is broken into pieces and a different piece is sent to each node), and “broadcast” (in which the same data is sent to all nodes).
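The data movement of these three collectives can be sketched in a few lines of plain Python. This is not MPI code — a real program would call MPI_Bcast, MPI_Scatter, and MPI_Gather (or their mpi4py equivalents) — but modeling ranks as list indices makes the semantics easy to inspect:

```python
# Illustrative sketch of collective semantics; ranks are list indices,
# and the "root" is implicit. Real MPI code would use MPI_Bcast,
# MPI_Scatter, and MPI_Gather across separate processes.

def broadcast(data, n_ranks):
    """Broadcast: every rank receives a copy of the root's data."""
    return [data for _ in range(n_ranks)]

def scatter(chunks, n_ranks):
    """Scatter: the root splits its data; rank i receives chunks[i]."""
    assert len(chunks) == n_ranks, "one chunk per rank"
    return list(chunks)

def gather(per_rank_data):
    """Gather: the root collects one item from every rank, in rank order."""
    return list(per_rank_data)

print(broadcast(42, 4))        # every rank holds 42
print(scatter([1, 2, 3], 3))   # rank i holds chunk i
print(gather([10, 20, 30]))    # root holds all contributions
```

In a real MPI job, each of these calls involves every process in the communicator, which is why their cost grows with the number of participating ranks.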
The performance of collective communication operations is known to have a significant impact on the scalability of most MPI applications. The nature of some collectives means that they can become a bottleneck when scaling to thousands of ranks (where a rank is an MPI process, typically running on a single core).
Forms of Collective Acceleration
There are three very different forms of collective acceleration. The first two revolve around add-on acceleration, because the conventional thinking is that obtaining reasonable collective performance at scale, especially with InfiniBand, requires some sort of special assist. The first form is Host Channel Adapter-based collective acceleration, where an additional service runs on a conventional Host Channel Adapter’s processor and memory. The second form is fabric-based collective acceleration, which offloads collective processing to the InfiniBand fabric, where it runs in each of the InfiniBand switches. The third approach is an InfiniBand architecture that natively incorporates collective acceleration.
Adapter-based Collective Acceleration
One way to perform collective acceleration is to use the Host Channel Adapter to process specific collective operations. However, the HCA has limited processing capability and memory, and therefore must read and write host buffers over the PCI bus for every operation and message. As a result, the HCA resources required increase with the size of the MPI job and the scale of the cluster. Memory consumption also increases, causing higher latency for the collective operation.
Fabric-based Collective Acceleration
Fabric-based acceleration offloads the computation of collectives onto the fabric switches. Using this approach requires the use of a vendor-specific SDK. This form of collective acceleration is currently only integrated with OpenMPI and Platform MPI. Fabric-based collective acceleration does improve performance, but it has a higher incremental cost.
InfiniBand-based Collective Acceleration
The third form of collective acceleration is built directly into the InfiniBand architecture. Rather than being retrofitted to work with MPI, an InfiniBand fabric solution with built-in collective acceleration allows standard collective algorithms to work as intended, with support for all MPIs and all MPI collective algorithms. This type of acceleration requires no special adapter-based or fabric-based collective acceleration to achieve optimal performance and scale.
The following tests provide representative comparisons of collective performance, including some application-level benchmark comparisons. The results come from published data, actual customer deployments, and QLogic tests.
Collective Test – Barrier
The barrier collective synchronizes all processes within a communicator. A process calling it blocks until every process in the group has also called it, which is why optimal barrier collective performance is key to maintaining HPC cluster performance at scale.
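One common way to implement a barrier without a central coordinator is the dissemination algorithm, which completes in a logarithmic number of communication rounds. The sketch below is a minimal pure-Python simulation of that round structure (not MPI code, and not necessarily the algorithm any particular vendor ships); it tracks which ranks each rank has transitively heard from, and shows that all ranks are synchronized after roughly log2(P) rounds — which is why barrier latency should grow slowly, not linearly, with rank count:

```python
# Minimal simulation of a dissemination barrier. In round k, rank i
# signals rank (i + 2**k) mod n; after ceil(log2(n)) rounds, every rank
# has (transitively) heard from every other rank, so all may proceed.

def dissemination_barrier_rounds(n):
    """Return the number of rounds until every rank has heard from all."""
    known = [{i} for i in range(n)]  # known[i]: ranks rank i has heard from
    rounds = 0
    while any(len(k) < n for k in known):
        step = 1 << rounds           # signaling distance doubles each round
        nxt = [set(k) for k in known]
        for i in range(n):
            nxt[(i + step) % n] |= known[i]  # rank i signals rank i+2^k
        known = nxt
        rounds += 1
    return rounds

print(dissemination_barrier_rounds(8))     # 3 rounds for 8 ranks
print(dissemination_barrier_rounds(2048))  # 11 rounds for 2048 ranks
```

The jump from 8 ranks to 2048 ranks costs only 8 extra rounds in this model, which is the scaling behavior a well-implemented barrier should approach.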
The following information is from a Voltaire white paper on collectives acceleration (see References).
The benchmark results in Table 1 are based on the IMB Pallas collectives test. The adapter row in the table is representative of a conventional InfiniBand adapter, without any collective acceleration. The performance of the conventional InfiniBand Adapter is rather poor in comparison to the fabric-based and InfiniBand-based acceleration. The conventional InfiniBand Adapter shows latency as high as 3638 μs at 2048 cores, which is 168 times higher than the natively-accelerated InfiniBand. It is important to point out that as the size of the HPC cluster increases, so does the relative performance advantage of the accelerated InfiniBand architecture.
Figure 1: Performance Results – Collective Barrier Test
The natively-accelerated InfiniBand offers very good collective barrier performance without any special acceleration code or hardware assist. Figure 1 shows that it offers better latency than fabric-accelerated InfiniBand.
Collective Test – AllReduce
In an AllReduce, each process contributes its send buffer, and the pair-wise reduction of all of those buffers (for example, an element-wise sum) is delivered into every process’s receive buffer, including the contributor’s own. The operation cannot complete until the contributions of all participating processes, across all nodes and cores in the cluster, have been combined. This dependency on every rank is why AllReduce performance is another collective operation that is key to an application’s ability to scale, especially on a large cluster.
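One standard AllReduce algorithm is recursive doubling: in round k, each rank exchanges its partial result with the rank whose ID differs in bit k, so after log2(P) rounds every rank holds the full reduction. The sketch below simulates that exchange pattern in pure Python (it is illustrative only, assumes a power-of-two rank count, and is not the algorithm of any specific MPI implementation):

```python
# Simulation of recursive-doubling allreduce. Each round, rank i combines
# its partial result with that of partner rank (i XOR step); after
# log2(n) rounds, every rank holds the reduction over all send buffers.

def recursive_doubling_allreduce(values, op=lambda a, b: a + b):
    """Return the per-rank results of an allreduce over `values`."""
    n = len(values)
    assert n & (n - 1) == 0 and n > 0, "sketch assumes power-of-two ranks"
    vals = list(values)
    step = 1
    while step < n:
        # every rank exchanges with its partner and combines, in parallel
        vals = [op(vals[i], vals[i ^ step]) for i in range(n)]
        step *= 2
    return vals

print(recursive_doubling_allreduce([1, 2, 3, 4]))          # sum: all 10
print(recursive_doubling_allreduce([5, 1, 7, 3], op=max))  # max: all 7
```

As with the barrier, the round count grows logarithmically with rank count, but every round involves real data movement and reduction work, which is why AllReduce latency at scale is so sensitive to interconnect quality.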
The following analysis is based on information from the Voltaire white paper on collective acceleration that was previously referenced.
Once again, the conventional InfiniBand Adapter line shows an extremely high latency at scale—3467 μs at 2048 cores. The fabric-accelerated solution offers better performance at 24 μs at 2048 cores. However, the best performance is provided by the natively-accelerated InfiniBand-based implementation at 22.6 μs at 2048 cores.
Figure 1: Performance Results – Collective AllReduce Test
The natively-accelerated InfiniBand implementation with AllReduce collectives once again offers excellent performance without any special acceleration code or hardware assist.
Collective Performance at Scale
Collective performance is one of the major factors that determine the ability of a cluster, and of the applications running on it, to scale. Collective performance testing of the natively accelerated InfiniBand shows near-perfect scaling of collective performance on a cluster of more than 14,000 cores.
The ANSYS® FLUENT® computational fluid dynamics application is designed to scale on HPC clusters, and ANSYS publishes one of the industry’s best benchmark suites showing performance on different types of clusters and interconnects. ANSYS benchmark tests report a “Rating” result, where higher is better. The following analysis uses information provided in the Voltaire collective acceleration white paper that was previously referenced.
Figure 2: Eddy 417k Cell Model
The Eddy 417K model is a relatively small simulation, but it is an excellent test for showing off the potential performance of an interconnect: when a model of this size is divided over the nodes and cores of a cluster, each core spends very little time in computation and a disproportionate amount of time in communication. The more powerful the interconnect, the better the performance in the Eddy 417K test. In this test, conventional InfiniBand with no collective acceleration is normalized to 1.0. Fabric-accelerated InfiniBand rates 1.32, or 32 percent faster than conventional InfiniBand. Natively-accelerated InfiniBand rates 1.73, or 73 percent faster than conventional InfiniBand and more than 30 percent faster than fabric-accelerated collectives.
Figure 3: Aircraft 2M Cell Model
The Aircraft 2M test is a small-to-medium size benchmark model. In this case, natively-accelerated InfiniBand achieves an 87 percent performance advantage over the conventional InfiniBand and a 62 percent advantage over fabric-based acceleration.
Figure 4: Truck 111M Cell Model
The Truck 111M model is a relatively large benchmark test and is less dependent on the interconnect: with a large number of cells to process per node and core, proportionally less time is spent in communication.
Even in this case, the natively-accelerated InfiniBand achieves a 26 percent advantage over the conventional InfiniBand and an 8 percent advantage over the fabric-based approach.
Collective performance, along with scalable latency and non-coalesced message rate, determines a cluster’s ability to scale. An interconnect that is properly designed for HPC should not need special add-on collective acceleration. TrueScale InfiniBand was designed from the ground up for the HPC market and offers standard, built-in collective acceleration that achieves near-perfect collective scaling from a few nodes and cores to thousands — all without any special collective acceleration implementations.
About the Author:
Joe Yaworski is Director of Global Alliance and Solution Marketing for QLogic. Within his Global Alliance responsibilities, he manages QLogic’s strategic partnerships and alliances in the High Performance Computing market, and he has helped build one of the industry’s broadest HPC ecosystems, which now includes alliances with over 70 companies. In his Solution Marketing role, he helps channel and alliance partners create solution marketing programs that combine their offerings with QLogic’s HPC technologies. He also directs the QLogic NETtrack Developer Center, which is used to test and certify partner applications and to conduct performance benchmarking.