Yesterday, we posted a summary of and link to an article arguing for the general acceptance of 10GbE as a valid cluster interconnect platform. We asked readers to respond in the comments with their own independent research and opinions, hoping to get three or four responses. Well, ask and ye shall receive: of the twelve comments, eleven were from readers. There was so much great information that we felt it necessary to post the comments up by themselves. Thanks to all who responded with such great info!
I think the $500 is the cost of the NIC alone. You still need to buy the switch and the cables. So he’s a bit off on the per-port costs.
Here’s a blog I wrote that might help (or might not).
Back in May I did a quick survey of costs on the switch side. FWIW, here are the results from then; of course things will have changed since then, but it’s a start.
Force10 S2410 10GE: 24 ports for $18K or $750/port
Fujitsu XG700 10GE: 12 ports for $7K or $583/port
HP 2900-24G 10GE: 24 ports for $16.5K or $687/port
Myricom 10G-SW16LC-6C2ER 10GE: 16 ports for $12K or $750/port
Voltaire ISR9024 IB: 24 ports for $16.5K or $687/port
Mellanox mumble IB: 24 ports for $16.5K or $687/port
SBS EIS-4024 IB: 24 ports for $7.5K or $312/port (SDR, others are DDR)
In short, DDR IB seems to come in slightly less than 10GE per port, at twice the data rate, while SDR IB is less than half the cost for the same data rate. I suspect the per-packet CPU utilization is also better on the IB side, though the software complexity is greater.
Disclaimer: I work for an HPC vendor (SiCortex) which has little interest in 10GE/MX/IB as a cluster interconnect since we have our own built in.
Thanks to the Jeffs for a great series of comments! Like I said, it’s been several years since I had the pleasure of quantifying the costs of various interconnect technologies [when I did this, 10GbE was $1200+ per port].
Jeff [Layton], I read your blog post w/ the associated perf tests. Very cool. I’m terribly interested in your GAMMA-MPI runs.
Jeff [Darcy], did you happen to survey the current costs of NICs as well?
The IB switch pricing for 24 port DDR switches is now sitting around $4k +/- some. So it is roughly (today) $167/port. Add in a 2m CX4 cable at $60-ish, and a DDR NIC at $500-ish, and connecting 1 server to one port on a 24 port switch will run you under $750/port in total.
Do the same analysis for 10 GbE. NICs about $500-ish, same CX4 cable. But the switches are still not cheap. The best price we have seen/heard anywhere per port (not CX4, so you have added transceiver costs, not a wise move IMO) is about $750. Rumors abound of $500/port switches somewhere.
So from a pure price play, 10 GbE still costs more, and will until inexpensive switches start coming out. Once that happens, we would expect 10 GbE to start taking over from IB; until then, I don’t expect to see much change. (A quick back-of-envelope comparison is sketched below.)
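To make the comparison concrete, here is a minimal back-of-envelope sketch in Python using the rough figures quoted above (the ~$4k 24-port DDR IB switch, the $750/port 10 GbE switch pricing, $60 CX4 cables, and $500 NICs). These are the commenter’s 2008-era numbers, not current quotes:

```python
# Back-of-envelope cost to connect one server to one switch port,
# using the rough figures quoted in the comment above (2008-era).

def cost_per_connected_port(switch_cost, switch_ports, cable, nic):
    """Switch cost amortized per port, plus one cable and one NIC."""
    return switch_cost / switch_ports + cable + nic

ib_ddr = cost_per_connected_port(switch_cost=4_000, switch_ports=24,
                                 cable=60, nic=500)
ten_gbe = cost_per_connected_port(switch_cost=750 * 24, switch_ports=24,
                                  cable=60, nic=500)

print(f"DDR IB : ${ib_ddr:,.0f} per connected port")   # ~ $727
print(f"10 GbE : ${ten_gbe:,.0f} per connected port")  # ~ $1,310
```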
The 10 GbE stack is much easier to deal with than IB. Building OFED has been, until very recently, a crap-shoot on anything but a small range of specific distro kernels. This was an unfortunate outgrowth of how OFED developed, but the situation has been improving. Unfortunately, you won’t get good things like NFS-over-RDMA without modern kernels, which are not officially supported by OFED (cf. http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-docs/README.txt).
What we have found in testing is that single-thread 10 GbE performance is ok, though multi-thread is quite good. Latency in our testing is comparable with normal IB latencies. But it is ethernet, so NFS just works, without any appeal to RDMA to get it going. And TCP/IP just works, and works well, again without suffering the (major) performance degradation of doing IP over IB.
But is all this worth the significantly higher price?
That decision we must leave to the consumer. 10 GbE pricing is a problem for HPCC systems, and hopefully someone is working on a way to lower the costs to something reasonable. Because the price of IB is already reasonable, and without a good reason to switch, a move to 10 GbE likely won’t happen (the stack pain is annoying but livable, and the entire process of support is automatable).
Just my thoughts. We support everything. The Delta-V we showed in Pervasive Software’s booth was running iSCSI over 10 GbE as a target. We got a sustained 500 MB/s and 1,800 IOPS out of it for their use. It works; it just costs more.
Hmmm … I must be missing something here. How is $167/port (IB) higher than $400/port (10GbE)? Motherboards also ship with onboard IB HCAs (we are working on bids with units like this now); onboard 10 GbE is relatively recent compared to onboard IB.
Also, the Arista switches need a transceiver, which the onboard NICs (10 GbE or IB HCA) don’t, nor do the CX4-based IB switches. The SFPs do add to the cost per port.
Again, they aren’t needed on IB, so the cost is higher on the 10 GbE side. If I am wrong, please, by all means, show the analysis.
Yeah, the cost is still an issue for some, and this is an interesting take on alternatives. My company actually has a product for expanding ports on testing equipment only (SPANs and TAPs). It seems we need a similar product for production traffic, though, and at a lower cost than these.
A few random points, in no particular order:
1. 10GbE and SDR IB are *NOT* the same data rate! This is a common marketing misconception. With 10GbE, you can actually push darn close to a rate of 10Gb on the wire for large messages. IB uses 8b/10b encoding, so you automatically lose 20% of the bits on the wire to protocol overhead: SDR is down to 8Gb of delivered data, DDR is really only 16Gb, and QDR is really 32Gb. I believe this point was also made in the original article. (The arithmetic is sketched below.)
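Just to spell out the encoding math from the point above, nothing more:

```python
# Delivered data rate after IB's 8b/10b encoding: 20% of the signaling
# bits go to the encoding, so delivered = signaling * 8/10.
for name, signaling_gb in [("SDR IB", 10), ("DDR IB", 20), ("QDR IB", 40)]:
    print(f"{name}: {signaling_gb} Gb signaling -> "
          f"{signaling_gb * 8 // 10} Gb delivered")
# 10GbE delivers close to the full 10 Gb for large messages.
```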
2. RDMA actually gets you very little in terms of MPI (regardless of whether it’s IB or iWARP or …). What an MPI implementation really wants is hardware offload/assistance for message-passing progress, particularly of large messages. If that hardware assistance comes in the form of RDMA, ok, fine. But to be blunt, MPI’s semantics are better matched to other forms of hardware offload.
3. Indeed, with today’s OpenFabrics MPI implementations (including Open MPI), RDMA’ing an entire large message all at once can be quite expensive in terms of resource usage. Open MPI is capable of sending large messages either as a single large RDMA or as a number of smaller sends and/or RDMAs. Which way works best for you is likely application-specific (a toy cost model follows this list); it depends on factors such as (but not limited to):
- how much registered memory your application is using in other places
- what the frequency of your communication is
- how many peers you’re sending to
- what communication/computation overlap you need
- how often you invoke MPI functions that trip the internal progress engine
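As a purely illustrative toy model (the constants below are made up and do not describe any real NIC or MPI), here is one reason a single large RDMA is not automatically the winner: the whole buffer must be registered before the one big transfer can begin, whereas a pipeline of smaller transfers can hide the registration of the next chunk behind the wire time of the current one:

```python
# Toy model only: made-up costs, not measurements of any real hardware.
REG_US_PER_MB  = 100.0  # hypothetical memory-registration cost
WIRE_US_PER_MB = 125.0  # wire time at ~8 Gb/s of delivered bandwidth

def single_rdma_us(size_mb):
    # Register the entire buffer, then do one big transfer.
    return size_mb * REG_US_PER_MB + size_mb * WIRE_US_PER_MB

def pipelined_us(size_mb, chunk_mb=1.0):
    # Pay registration for the first chunk only; later registrations
    # overlap with the wire time of the previous chunk (REG < WIRE here).
    return chunk_mb * REG_US_PER_MB + size_mb * WIRE_US_PER_MB

for size in (1, 16, 64):
    print(f"{size:3d} MB: single RDMA {single_rdma_us(size):7.0f} us, "
          f"pipelined {pipelined_us(size):7.0f} us")
```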
So don’t get hung up on specific technologies like RDMA. RDMA is not the be-all/end-all technology for HPC. In some cases, it’s not even a very good technology (!). Hardware offload is what is key (IMNSHO), and there are many different flavors to choose from.
But at the end of the day, what you want to know is what will perform well *for your application*. For example, here’s a very, very coarse-grained set of questions that may start you down an analysis path for your needs: for your application(s)…
- is 1Gb/high latency sufficient?
- is 8Gb/low latency sufficient?
- is 10Gb/low latency sufficient?
- is 10Gb/medium latency sufficient?
- is 10Gb/high latency sufficient?
- is 16Gb/low latency sufficient?
- is 32Gb/low latency sufficient?
…and be sure to multiply that out if you plan to put more than one active network port in each server — especially as core counts go up! It’s insane (IMNSHO) to have one network port for 16 cores and assume that you won’t drop off overall network performance when all 16 MPI processes (or even 8… or possibly even 4!) are simultaneously pushing either large messages or large numbers of small messages.
Also make sure you get the math and server topology right such that you can actually push (N x one_port_bandwidth) with the desired latency into your fabric, yadda yadda yadda…
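The per-process arithmetic is trivial but worth staring at; a quick sketch of the best case (real contention effects only make it worse):

```python
# Best-case bandwidth per MPI process when all processes on a node
# share a single network port (contention will make reality worse).
port_gb = 10.0  # one 10 GbE port
for nprocs in (4, 8, 16):
    print(f"{nprocs:2d} processes on one 10Gb port -> "
          f"{port_gb / nprocs:.2f} Gb each, best case")
```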
My point of this long ramble: it’s not about RDMA. It’s not even [entirely] about price. Look into exactly what you’re going to use your HPC resources for — what problems you are going to solve and how *EXACTLY* you are going to solve them. Which MPI will you use? What application(s)? What communication pattern(s) do they use? What network topology fits that? Do you *need* low latency? Do you *need* high bandwidth? In short: find the best technologies that fit your needs, not the coolest/hottest/bigger-than-your-rival’s technologies. Spend a little time on a quantitative analysis of your needs; you’ll save lots of money over the long run because you’ll get a solution that works best for exactly what you’re trying to do.
GAMMA is neat, but the best-kept secret in HPC is Open-MX (www.open-mx.org). It’s a software implementation of the MX protocol that Myricom’s NICs implement in hardware — it runs on top of the Linux ethernet driver, so it works with whatever ethernet NIC you have (1Gb or 10Gb).
Specifically, MX is just frames over ethernet, regardless of whether they are being pushed via software or hardware. In a datacenter (i.e., HPC cluster), frames over ethernet is all you need — you don’t need the huge/complex TCP stack (or other network stacks).
I *STRONGLY* encourage everyone to give Open-MX a whirl; let’s get the bugs shaken out and get people using it. I saw some *very* promising latency numbers out of Open-MX and “reasonable” 10Gb NICs (I’m not going to quote numbers because I’m a vendor and I don’t want my ran-it-in-the-lab numbers to be taken authoritatively); I’ve even heard anecdotal stories of real-world MPI apps getting nice speedup over *1* (yes, *one*) GbE!
Note that Open-MX and MX are both API- and wire-compatible, so you can have Open-MX on one side and MX on the other (nifty!). Therefore, Open MPI natively supports Open-MX because — well, it’s just MX, and we’ve supported that for a long time.
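If you want your own latency numbers rather than anyone’s lab results, a minimal ping-pong microbenchmark is only a few lines with mpi4py (an assumption here: mpi4py is installed, and the underlying transport, whether TCP, Open-MX, or IB, is whatever your MPI installation selects):

```python
# pingpong.py: one-way latency microbenchmark. Run with exactly 2 ranks,
# e.g. "mpirun -np 2 python pingpong.py". The transport (TCP, Open-MX,
# IB, ...) is whatever your MPI installation is configured to use.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly 2 ranks"

reps = 10_000
buf = bytearray(1)      # 1-byte message to expose latency, not bandwidth
peer = 1 - rank

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=peer)
        comm.Recv(buf, source=peer)
    else:
        comm.Recv(buf, source=peer)
        comm.Send(buf, dest=peer)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # Half the average round-trip time is the one-way latency.
    print(f"one-way latency: {elapsed / reps / 2 * 1e6:.2f} us")
```

With an Open MPI built against MX/Open-MX, the same benchmark exercises that path with no source changes.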
Arista Networks switches can be used with SFP+ 1X twinax copper cables, which are less expensive than Infiniband 4X CX4 cables. The SFP+ twinax copper cables have an SFP+ connector on each end, so no additional transceiver is required.
Kudos to Jeff Squyres for a well thought out response. Thanks for sharing that viewpoint.
Since nobody seems to have linked to it here yet, here’s a big list of 10GbE products, with all the pricing information you could want.