In this guest feature, QLogic’s Joe Yaworski writes about improving collective performance on InfiniBand.
Today’s HPC clusters tend to be larger in terms of node count, and each node now uses faster processors with denser core counts. Performance at scale is therefore critical to optimizing application performance on these larger, faster clusters, and the performance of the interconnect is the key factor that determines how well an HPC cluster performs at scale. Several factors determine interconnect performance, including the following:
- Scalable latency
- High non-coalesced message rate performance
- Optimized collective performance
This article focuses on the performance of MPI collective operations. Collective performance is critical to an MPI application’s ability to scale, especially on a large HPC cluster.
About Collective Operations
In High Performance Computing (HPC), MPI is the standard for communication among processes that model a parallel program running on an HPC cluster. A collective operation is a concept in parallel computing in which data is simultaneously sent to or received from many nodes. Collective functions in the MPI API involve communication between all processes in a process group (which can mean the entire process pool or a program-defined subset). These types of calls are often useful at the beginning or end of a large distributed calculation, where each processor operates on a part of the data and then combines it into a result. Common examples of collective operations are “gather” (in which data is collected from all nodes), “scatter” (in which a set of data is broken up into pieces, and a different piece is sent to each of the nodes), and “broadcast” (in which the same data is sent to all nodes).
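To make these operations concrete, the following minimal C/MPI sketch (not taken from the article) exercises broadcast, scatter, and gather in the pattern just described; the buffer sizes and values are arbitrary placeholders.

/* Minimal illustration of the collectives described above: broadcast,
 * scatter, and gather. Build with an MPI compiler wrapper, e.g.:
 *   mpicc collectives.c -o collectives && mpirun -np 4 ./collectives */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Broadcast: rank 0 sends the same parameter to every rank. */
    double param = (rank == 0) ? 3.14 : 0.0;
    MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Scatter: rank 0 breaks an array into pieces, one per rank. */
    double *full = NULL, piece;
    if (rank == 0) {
        full = malloc(nprocs * sizeof(double));
        for (int i = 0; i < nprocs; i++) full[i] = (double)i;
    }
    MPI_Scatter(full, 1, MPI_DOUBLE, &piece, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    piece *= param;   /* each rank operates on its part of the data */

    /* Gather: rank 0 collects the partial results from all ranks. */
    MPI_Gather(&piece, 1, MPI_DOUBLE, full, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("first gathered value: %f\n", full[0]);
        free(full);
    }
    MPI_Finalize();
    return 0;
}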
The performance of collective communication operations is known to have a significant impact on the scalability of most MPI applications. The nature of some collectives means that they can become a bottleneck when scaling to thousands of ranks (where a rank is an MPI process, typically running on a single core).
Forms of Collective Acceleration
There are three very different forms of collective acceleration. The first two rely on some form of special-purpose acceleration, because the conventional thinking is that obtaining reasonable collective performance at scale, especially with InfiniBand, requires add-on acceleration. The first form is Host Channel Adapter-based collective acceleration, in which an additional service runs on a “conventional” Host Channel Adapter’s processor and memory. The second form is fabric-based collective acceleration, which offloads collective processing to the InfiniBand fabric, where it runs in each of the InfiniBand switches. The third approach is an InfiniBand architecture that natively incorporates collective acceleration.
Adapter-based Collective Acceleration
One way to perform collective acceleration is to use the Host Channel Adapter to process specific collective operations. However, the HCA has limited processing capability and memory, and it must therefore read and write host buffers across the PCI bus for every operation and message. As a result, the HCA resources required grow with the size of the MPI job and the scale of the cluster. Memory consumption also increases, driving up the latency of the collective operation.
Fabric-based Collective Acceleration
Fabric-based acceleration offloads the computation of collectives onto the fabric switches. This approach requires a vendor-specific SDK, and it is currently integrated only with OpenMPI and Platform MPI. Fabric-based collective acceleration does improve performance, but it comes at a higher incremental cost.
InfiniBand-based Collective Acceleration
The third form of collective acceleration is built into the InfiniBand architecture itself. Rather than being retrofitted to work with MPI, an InfiniBand fabric with built-in collective acceleration allows standard collective algorithms to work as intended, with support for all MPIs and all MPI collective algorithms. This type of acceleration requires no special adapter-based or fabric-based add-on to achieve optimal performance and scale.
Performance Comparisons
The following tests provide representative comparisons of collective performance, including some application-level benchmark comparisons. The results come from published data, actual customer deployments, and QLogic tests.
Collective Test – Barrier
The barrier collective synchronizes all processes within a communicator. A process calling it blocks until every process in the group has called it, which is why optimal barrier performance is key to maintaining HPC cluster performance at scale.
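To illustrate what a barrier micro-benchmark measures, here is a rough C/MPI sketch in the spirit of such tests; it is not the IMB code, and the iteration count and timing details are illustrative assumptions.

/* Rough sketch of a barrier latency micro-benchmark: time many barriers
 * and report the average per-barrier latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 1000;   /* illustrative iteration count */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* warm up and synchronize before timing */

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average barrier latency: %.2f microseconds\n",
               (t1 - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}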
The following information is from a Voltaire white paper on collectives acceleration (see References).
The benchmark results in Table 1 are based on the IMB Pallas collectives test. The adapter row in the table represents a conventional InfiniBand adapter without any collective acceleration. Its performance is rather poor in comparison to the fabric-based and InfiniBand-based acceleration: the conventional InfiniBand adapter shows latency as high as 3638 μs at 2048 cores, 168 times higher than that of the natively-accelerated InfiniBand. It is important to point out that as the size of the HPC cluster increases, so does the relative performance advantage of the accelerated InfiniBand architecture.
Table 1: Performance Results – Collective Barrier Test
The natively-accelerated InfiniBand offers very good collective barrier performance without any special acceleration code or hardware assist. Table 1 shows that it also offers better latency than the fabric-accelerated InfiniBand.
Collective Test – AllReduce
After an AllReduce completes, every process holds in its receive buffer the result of the element-wise reduction of the send buffers of all processes, including its own. The operation finishes only once the contributions of all participating processes, across the nodes and cores of the cluster, have been combined and returned. This is why AllReduce is another collective whose performance is key to an application’s ability to scale, especially on a large cluster.
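The following minimal C/MPI sketch, again not taken from the benchmark suite, illustrates the AllReduce semantics described above; each rank contributes an arbitrary partial value.

/* Minimal AllReduce illustration: every rank contributes a send buffer and
 * every rank receives the fully reduced result, with no separate broadcast. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double partial = (double)rank;   /* each rank's local contribution */
    double total = 0.0;

    /* After this call, every rank's receive buffer holds the same global sum. */
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d sees total = %f (expected %f)\n",
           rank, total, (double)nprocs * (nprocs - 1) / 2.0);

    MPI_Finalize();
    return 0;
}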
The following analysis is based on information from the Voltaire white paper on collective acceleration that was previously referenced.
Once again, the conventional InfiniBand adapter line shows extremely high latency at scale: 3467 μs at 2048 cores. The fabric-accelerated solution performs better, at 24 μs at 2048 cores. However, the best performance comes from the natively-accelerated InfiniBand implementation, at 22.6 μs at 2048 cores.
Figure 1: Performance Results – Collective AllReduce Test
The natively-accelerated InfiniBand implementation with AllReduce collectives once again offers excellent performance without any special acceleration code or hardware assist.
Collective Performance at Scale
Collective performance is one of the major factors determining the ability of a cluster, and of the applications running on it, to scale. Collective performance testing of the natively-accelerated InfiniBand shows near-perfect scaling on a cluster of more than 14,000 cores.
Application Performance
The ANSYS® FLUENT® computational fluid dynamics application is designed to scale on HPC clusters, and ANSYS maintains one of the industry’s best benchmark suites for showing performance on different types of clusters and interconnects. ANSYS benchmark tests report a “Rating” result, where higher is better. The following analysis uses information provided in the Voltaire collective acceleration white paper referenced previously.
Figure 2: Eddy 417k Cell Model
The Eddy 417K model is a relatively small simulation, but it is an excellent test for showing off the potential performance of an interconnect: when a model of this size is divided up over the nodes and cores of a cluster, each core spends very little time in processing and a disproportionate amount of time in communication, so the more powerful the interconnect, the better the Eddy 417K result. In this test, conventional InfiniBand with no collective acceleration is normalized to 1.0. The fabric-accelerated InfiniBand scores 1.32, or 32 percent faster than conventional InfiniBand. The natively-accelerated InfiniBand scores 1.73, or 73 percent faster than conventional InfiniBand and more than 30 percent faster than the fabric-accelerated collectives.
Figure 3: Aircraft 2M Cell Model
The Aircraft 2M test is a small-to-medium size benchmark model. In this case, natively-accelerated InfiniBand achieves an 87 percent performance advantage over the conventional InfiniBand and a 62 percent advantage over fabric-based acceleration.
Figure 4: Truck 111M Cell Model
The Truck 111M model is a relatively large benchmark test and is less dependent on the interconnect: with a large number of cells to process per node and core, proportionally less time is spent in communication.
Even in this case, the natively-accelerated InfiniBand achieves a 26 percent advantage over the conventional InfiniBand and an 8 percent advantage over the fabric-based approach.
Conclusion
Collective performance, along with scalable latency and non-coalesced message rate performance, determines a cluster’s ability to scale. An interconnect that is properly designed for HPC should not need special add-on collective acceleration. TrueScale InfiniBand was designed from the ground up for the HPC market and offers standard, built-in collective acceleration that achieves near-perfect collective scaling from a few nodes and cores to thousands. Again, this is achieved without any special collective acceleration implementations.
About the Author:
Joe Yaworski is director of Global Alliance and Solution Marketing for QLogic. Within his Global Alliance responsibilities, he manages QLogic’s strategic partnerships and alliances in the High Performance Computing market and has helped build one of the industry’s broadest HPC ecosystems, which now includes alliances with more than 70 companies. In his Solution Marketing role, he helps channel and alliance partners create solution marketing programs that combine their offerings with QLogic’s HPC technologies. He also directs the QLogic NETtrack Developer Center, which is used to test and certify partner applications and to run performance benchmarks.
Comments
Are these new verbs API function calls?
I too would like an explanation of just what this is. What does QLogic have that other vendors don’t? The marketing speak above says that this is standard and doesn’t require a special adaptor, so why doesn’t this work on all vendors’ products?
Also, the notion of using the switch to accelerate collective communication is something that Quadrics implemented many many years ago.
This article is no more than comparing apples to shoes. ORNL already demonstrated flat latency using adapter-based collective offloads, and there were definitely no bottlenecks reported in their publications, so the numbers presented here are very questionable. The FLUENT benchmarks lack any platform, setup, and version information and are therefore not credible at all. This article is no more than marketing propaganda, and I would refer to papers published by credible organizations for a real comparison.
One big thing missing here, for example, is the purpose of offloading the collective communications – to take advantage of non-blocking collectives and achieve nearly optimal communication overlap. I would suggest the author do some homework by reading some of the publications made in the past by Quadrics and more recently by ORNL to learn more on the subject.
The conventional IB and Voltaire FCA results mentioned in the article are verbs-based. The QLogic TrueScale IB collective results are PSM-based. PSM is a streamlined, lightweight, message-based interface for MPI. As a result, PSM performance with MPI, and with collectives in particular, does not require any special acceleration hardware or code. This also means that PSM’s native, built-in acceleration is available to all major MPIs and for all collective functions. PSM is a new capability in OFED that QLogic TrueScale IB is designed to take advantage of.
The ORNL research is very, very limited in scale, given that it was done on just 8 nodes. Otherwise, that information would have been used. The collective performance numbers at scale came directly from the Mellanox/Voltaire white paper on FCA. The TrueScale collective performance numbers come from QLogic’s runs on a ~2,000-node Westmere cluster at LLNL. As for the benefits of “offloading,” here is a direct quote from that same Mellanox/Voltaire white paper: “A common approach to this challenge improves the implementation of MPI collective operations by using intelligent or programmable network interfaces to offload the burden of communication activities from the host processor(s) within the NIC. Such implementations have shown significant improvement for micro-benchmarks that isolate collective communication performance, but these results have not translated to significant increases in performance for real applications.”
BTW, the TrueScale FLUENT numbers are official results published on the ANSYS FLUENT benchmarks site (http://www.ansys.com/Support/Platform+Support/Benchmarks+Overview), which gives the platform information. The Mellanox and Voltaire FLUENT numbers come from the Mellanox/Voltaire white paper, and there are also results published on the ANSYS FLUENT site.
It is wonderful to see how you use the previous Voltaire marketing against the Mellanox solution to bash the Mellanox offloading … marketing work at its best…. I would not use those statements if you want to write something meaningful. It really detracts from your credibility. By the way, ORNL did publish much higher node-count results – check it out.
Now to the serious discussion – using light or lighter software stacks is not something new; Cray has its own version, as does IBM. It is no more than creating proprietary software interfaces that connect to the hardware interface of the NIC. The verbs interface can work great on the Mellanox/Voltaire solutions and badly on QLogic, and PSM can work great on QLogic and badly on Mellanox. The question is what is proprietary and what is not, what is part of an open specification and what is not, and whether this might go the same way Quadrics went.
By the way, you did not answer my FLUENT questions – are you comparing the same platforms, the same OS, etc.? Saying that your numbers are on the FLUENT web site does not help us understand whether you are comparing apples to shoes… Please also note that the Eddy benchmark is no longer important to us – it is too small and no one uses it.
Ben, please do not get upset at this article because of the work that Voltaire did in testing and analyzing the collective performance of Mellanox technology. I would suggest that you discuss your concerns with the Voltaire personnel who did this testing and came to these conclusions. I would hope that the Voltaire results are factual and not “marketing work at its best”.
As for open standards, PSM is as open as Verbs. Both are published in OFED and require a specific adapter architecture to run.
If you go to the FLUENT benchmark site (link provided above), you will find that the performance of systems with TrueScale IB, comparing single-rail to single-rail IB implementations, is best in class at 16 nodes and above for FLUENT 12.1. In fact, in most cases these systems with TrueScale IB also provide better performance than dual-rail IB (non-TrueScale) implementations. The reason for this, in part, is the native collective performance of TrueScale with PSM that was covered in this article.
Ben, we could go on forever commenting about this. The conclusion is that the performance results (Voltaire and ANSYS) are published and accessible to everyone.
Yes, you are right, we can argue forever, but as long as the data is presented correctly, I will not have any issues. I did review the data on the FLUENT web site, and it is full of different platforms and settings. However, I encourage you to visit the same page, since there are other InfiniBand numbers (I assume that whatever is not marked as TrueScale is either Voltaire or Mellanox) that are better than the numbers you listed in your article, so I assume that you carefully picked numbers that are not the best ones listed on the page. I have also noticed that with the Aircraft benchmark, QLogic did NOT scale beyond 16 nodes – the performance at 16 nodes was 9573.4 and the performance at 32 nodes was 3712.1 (higher is better) – so performance actually dropped from 16 nodes to 32 nodes, while all the other Linux-based solutions did scale. This info you did not capture in your nice article. So I guess that if I want to build more than 16 nodes with InfiniBand, I should shop somewhere else.
This discussion on PSM is giving me PMS
Digging around the inter-webs, Performance Scaled Messaging (PSM) is indeed in OFED, but only QLogic appears to support it. It looks similar to Myricom’s MX library.
It’s a shame most 10 GigE vendors aren’t as obsessive over software as Myricom. Do other InfiniBand vendors plan to support PSM?
If there isn’t wider support for PSM, then non-MPI users will probably just stick to verbs. (And if the goal is really just to have TCP, most customers will stick with vanilla 10 Gig E. Keep in mind who your *real* competitor is before even bothering with something as superficial as performance.)
From what I have learned after checking around, there are no plans from Voltaire/Mellanox to adopt PSM but to stick with the IB standard verbs specification. As you said Chris, PSM looks more like Myricom MX, and probably will stay like that.
Joe – I have looked at the Mellanox web site but did not see the document that you referred to. Can you point to the URL?
I’m confused that people are confused about PSM. PSM has always been a part of PathScale/QLogic’s InfiniBand offering, and PSM has been supported by all the major MPI implementations for several years now. PSM is how InfiniBand should have been designed in the first place.
Greg, is PSM the InfiniPath API? I was at an early PathScale customer (Australian National) years ago. From what I remember at the time, PathScale only offered MPI to customers initially and then rolled out verbs support (back when OpenFabrics was known as “OpenIB”) just before the QLogic merger. Has PSM always been offered for customer use?
PSM is the library needed to implement MPI using InfiniPath’s (now TrueScale’s) InfiniBand extension. It wasn’t exposed to outside users for the first year or so that InfiniPath was shipping. Last I looked, PSM was used to enable a bunch of MPI implementations, including OpenMPI, Platform MPI, and MVAPICH (part of OFED).
Christian Bell (now of Myricom) gets credit for making PSM a useful API, we did a terrible job of designing the API before he showed up.
So the comparison between PSM and Myrinet makes sense now. When will people finally learn that proprietary is the wrong way???
Matt, PSM is *optional*. You can also use QLogic HCAs as a normal InfiniBand adaptor. The only difference is that PSM has a lot higher performance for small packets and collective operations. You can not achieve this higher performance using stock InfiniBand.
The standard that all HPC people care about the most is MPI. The most popular MPI implementations support PSM.