User Response: 10GbE Cluster Interconnect


Yesterday, we posted a summary and link to an article arguing for the general acceptance of 10GbE as a valid cluster interconnect platform. We asked readers to respond in the comments with their own independent research and observations. We had hoped to get three or four responses. Well, ask and ye shall receive: of the twelve comments, eleven were from readers. There was so much great information that we felt it necessary to post the comments by themselves. Thanks to all who responded with such great info!

Jeff Layton

I think the $500 is the cost of the NIC alone. You still need to buy the switch and the cables. So he’s a bit off on the per port costs.

Here’s a blog I wrote that might help (or might not).

http://www.delltechcenter.com/page/12-01-2008+-+10GigE+in+HPCC

Jeff Darcy

Back in May I did a quick survey of costs on the switch side. FWIW, here are the results from then; of course things will have changed since then, but it’s a start.

Force10 S2410 10GE: 24 ports for $18K or $750/port
Fujitsu XG700 10GE: 12 ports for $7K or $583/port
HP 2900-24G 10GE: 24 ports for $16.5K or $687/port
Myricom 10G-SW16LC-6C2ER 10GE: 16 ports for $12K or $750/port

Voltaire ISR9024 IB: 24 ports for $16.5K or $687/port
Melrow mumble IB: 24 ports for $16.5K or $687/port
SBS EIS-4024 IB: 24 ports for $7.5K or $312/port (SDR, others are DDR)

In short, DDR IB seems to come in slightly less than 10GE per port, at twice the data rate, while SDR IB is less than half the cost for the same data rate. I suspect the per-packet CPU utilization is also better on the IB side, though the software complexity is greater.

Disclaimer: I work for an HPC vendor (SiCortex) which has little interest in 10GE/MX/IB as a cluster interconnect since we have our own built in.

John Leidel

Thanks to the Jeffs for a great series of comments! Like I said, it’s been several years since I had the pleasure of quantifying the costs of various interconnect technologies [when I did this, 10GbE was $1200+ per port].

Jeff [Layton], I read your blog post w/ the associated perf tests. Very cool. I’m terribly interested in your GAMMA-MPI runs.

Jeff [Darcy], did you happen to survey the current costs of NICs as well?

Joe Landman

The IB switch pricing for 24 port DDR switches is now sitting around $4k +/- some. So it is roughly (today) $167/port. Add in a 2m CX4 cable at $60-ish, and a DDR NIC at $500-ish, and connecting 1 server to one port on a 24 port switch will run you under $750/port in total.

Do the same analysis for 10 GbE. NICs about $500-ish, same CX4 cable. But the switches are still not cheap. The best price we have seen/heard anywhere per port (not CX4, so you have added transceiver costs, not a wise move IMO) is about $750. Rumors abound of $500/port switches somewhere.

So from a pure price play, 10 GbE still costs more, and will until inexpensive switches start coming out. Once that happens, we would expect that they would start taking over for IB. Until that happens, I don’t expect to see much change.
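
As a sanity check on the price argument above, here is a minimal back-of-the-envelope sketch in C. The inputs are the rough numbers quoted in this thread (circa 2008: a ~$4K 24-port DDR IB switch, ~$750/port 10 GbE switching, a ~$60 CX4 cable, a ~$500 NIC or HCA); they are illustrative assumptions, not current quotes.

    /* Per-connected-node cost: switch port share + cable + NIC/HCA.
     * Prices are the rough 2008-era figures from the comments above. */
    #include <stdio.h>

    struct interconnect {
        const char *name;
        double switch_per_port;   /* switch cost divided by port count */
        double cable;             /* 2m CX4 cable */
        double nic;               /* NIC or HCA */
    };

    int main(void)
    {
        struct interconnect options[] = {
            { "IB DDR (24-port switch at ~$4K)", 4000.0 / 24, 60.0, 500.0 },
            { "10 GbE (best seen ~$750/port)",   750.0,       60.0, 500.0 },
        };

        for (int i = 0; i < 2; i++) {
            double total = options[i].switch_per_port + options[i].cable + options[i].nic;
            printf("%-35s roughly $%.0f per connected node\n", options[i].name, total);
        }
        return 0;
    }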

The 10 GbE stack is much easier to deal with than IB. Building OFED has been, up until very recently, a crap-shoot on anything but a small range of specific distro kernels. This was an unfortunate outgrowth of how OFED developed, but the situation is/has been improving. Unfortunately, you won’t get good things like NFS-over-RDMA without the modern kernels, which are not officially supported by OFED (cf. http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-docs/README.txt).

What we have found in testing is that single-thread 10 GbE performance is OK, though multi-thread is quite good. Latency from our testing is comparable with normal IB latencies. But it is Ethernet, so NFS does just work, without any appeal to RDMA to get it going. And TCP/IP just works, and works well, again, without suffering the (major) performance degradation of doing IP over IB.

But is all this worth the significantly higher price?

That decision we must leave to the consumer. 10 GbE pricing is a problem for HPCC systems, and hopefully someone is working on a way to lower the costs to a reasonable level. The price of IB is already reasonable, and without a good reason to switch, a move likely won’t happen (the stack pain is annoying, but livable, and the entire support process is automatable).

Just my thoughts. We support everything. The Delta-V we showed in Pervasive Software’s booth was running iSCSI over 10 GbE as a target. We got a sustained 500 MB/s and 1800 IOPs out of it for their use. It works, it just costs more.

Joe Landman

Hmmm …. I must be missing something here. $167/port (IB) is higher than $400/port (10GbE)? Motherboards (MBs) also have IB HCAs (we are working on bids with units like this now). 10 GbE on MBs is relatively recent compared to IB on MBs.

Also, the Arista switches need a transceiver, which the on-board MB NICs (10 GbE/IB HCA) don’t, nor do the CX4-based IB switches. The SFPs do add to the cost per port.

Again, they aren’t needed on IB, so the 10 GbE cost is higher. If I am wrong, please, by all means, show the analysis.

Tommy Landry

Yeah, the cost is still an issue for some, and this is an interesting take on alternatives. My company actually has a product for expanding ports on testing equipment only (SPANs and TAPs). Seems we need a similar product for throughput traffic though, and at a lower cost than these.

Jeff Squyres

A few random points, in no particular order:

1. 10GbE and SDR IB are *NOT* the same data rate! This is a common marketing misnomer. With 10GbE, you can actually push darn close to a rate of 10Gb on the wire for large messages. IB uses 8/10 encoding, so you automatically lose 20% of the bits on the wire to protocol overhead — you’re down to 8Gb. Similarly, DDR is really only 16Gb of delivered data performance; QDR is really 32Gb. I believe that this point was also made in the original article.
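
For readers who want the arithmetic behind point 1 spelled out, here is a small sketch of the 8b/10b math, assuming standard 4X links with per-lane signaling rates of 2.5, 5, and 10 Gb/s for SDR, DDR, and QDR respectively:

    /* Effective (delivered-bit) rate of a 4X InfiniBand link:
     * per-lane signaling rate x 4 lanes x 8/10 for the 8b/10b
     * line encoding. 10GbE has no such 20% encoding penalty at
     * the data-rate level, which is the point being made above. */
    #include <stdio.h>

    static double ib_4x_data_rate_gbps(double lane_signal_gbps)
    {
        return lane_signal_gbps * 4.0 * 8.0 / 10.0;
    }

    int main(void)
    {
        printf("SDR 4X: %.0f Gb/s delivered\n", ib_4x_data_rate_gbps(2.5));  /*  8 */
        printf("DDR 4X: %.0f Gb/s delivered\n", ib_4x_data_rate_gbps(5.0));  /* 16 */
        printf("QDR 4X: %.0f Gb/s delivered\n", ib_4x_data_rate_gbps(10.0)); /* 32 */
        return 0;
    }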

2. RDMA actually gets you very little in terms of MPI (regardless of whether it’s IB or iWARP or …). What an MPI implementation really wants is hardware offload/assistance for message passing progress, particularly of large messages. If that hardware assistance comes in the form of RDMA, ok, fine. But to be blunt, MPI’s semantics are better matched to other forms of hardware offload.

3. Indeed, with today’s OpenFabrics MPI implementations (including Open MPI), RDMA’ing an entire large message all at once can be quite expensive in terms of resource usage. Open MPI is capable of sending large messages either as a single large RDMA or a number of smaller sends and/or RDMAs. Which way works best for you is likely application-specific: it depends on factors such as (but not limited to):

– how much registered memory your application is using in other pending communications
– what the frequency of your communication is
– how many peers you’re sending to
– what communication/computation overlap you need
– how often you invoke MPI functions that trip the internal progression engine
– …etc.

So don’t get hung up on specific technologies like RDMA. RDMA is not the be-all/end-all technology for HPC. In some cases, it’s not even a very good technology (!). Hardware offload is what is key (IMNSHO), and there are many different flavors to choose from.

But at the end of the day, what you want to know is what will perform well *for your application*. For example, here’s a very, very coarse-grained set of questions that may start you down an analysis path for your needs: for your application(s)…

– is 1Gb/high latency sufficient?
– is 8Gb/low latency sufficient?
– is 10Gb/low latency sufficient?
– is 10Gb/medium latency sufficient?
– is 10Gb/high latency sufficient?
– is 16Gb/low latency sufficient?
– is 32Gb/low latency sufficient?

…and be sure to multiply that out if you plan to put more than one active network port in each server — especially as core counts go up! It’s insane (IMNSHO) to have one network port for 16 cores and assume that you won’t drop off overall network performance when all 16 MPI processes (or even 8… or possibly even 4!) are simultaneously pushing either large messages or large numbers of small messages. Also make sure you get the math and server topology right such that you can actually push (N x one_port_bandwidth) with the desired latency into your fabric, yadda yadda yadda…
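
To make the “multiply that out” advice concrete, here is a tiny sketch of the per-process share of a single port. The 10 Gb/s port and the core counts are illustrative assumptions, not measurements of any particular NIC:

    /* Per-process share of one network port when every core on a node
     * is pushing data at once (hypothetical port speed and core counts). */
    #include <stdio.h>

    int main(void)
    {
        const double port_gbps = 10.0;        /* one 10Gb-class port per node */
        const int core_counts[] = { 4, 8, 16 };

        for (int i = 0; i < 3; i++) {
            printf("%2d busy cores -> %.2f Gb/s of port bandwidth per MPI process\n",
                   core_counts[i], port_gbps / core_counts[i]);
        }
        return 0;
    }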

My point of this long ramble: it’s not about RDMA. It’s not even [entirely] about price. Look into exactly what you’re going to use your HPC resources for — what problems are you going to solve and how *EXACTLY* you are going to solve them. Which MPI will you use? What application(s)? What communication pattern(s) do they use? What network topology fits that? Do you *need* low latency? Do you *need* high bandwidth? In short: find the best technologies that fit your needs, not the coolest/hottest/bigger-than-your-rival’s technologies. Spend a little time on a quantitative analysis of your needs; you’ll save lots of money over the long run because you’ll get a solution that works best for exactly what you’re trying to do.
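
As one possible starting point for that quantitative analysis, here is a deliberately minimal MPI ping-pong sketch in C. The message size, iteration count, and two-rank setup are placeholders to adapt to your own application’s communication pattern; established MPI benchmark suites will give far more rigorous numbers.

    /* Minimal MPI ping-pong: run with exactly two ranks, one per node,
     * over the interconnect being evaluated, and vary the message size
     * to match what your application actually sends. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 1000, len = 1 << 20;   /* 1 MiB messages */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(len);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double elapsed = MPI_Wtime() - t0;
        if (rank == 0) {
            printf("avg round trip: %.1f us, two-way bandwidth: %.2f Gb/s\n",
                   elapsed / iters * 1e6,
                   2.0 * len * iters * 8.0 / elapsed / 1e9);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }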

Jeff Squyres

GAMMA is neat, but the best kept secret in HPC is OpenMX (www.open-mx.org). It’s a software implementation of Myricom’s MX protocol (the same protocol their hardware MX NICs speak) — it uses the Linux Ethernet driver, so it works with whatever Ethernet NIC you have (1Gb or 10Gb).

Specifically, MX is just frames over ethernet, regardless of whether they are being pushed via software or hardware. In a datacenter (i.e., HPC cluster), frames over ethernet is all you need — you don’t need the huge/complex TCP stack (or other network stacks).

I *STRONGLY* encourage everyone to give OpenMX a whirl; let’s get the bugs shaken out and get people using it. I saw some *very* promising latency numbers out of OpenMX and “reasonable” 10Gb NICs (I’m not going to quote numbers because I’m a vendor and I don’t want my ran-it-in-the-lab numbers to be taken authoritatively); I’ve even heard anecdotal stories of real-world MPI apps getting nice speedup over *1* (yes, *one*) GbE!

Note that OpenMX and MX are both API and wire-line compatible, so you can have OpenMX on one side and MX on the other (nifty!). Therefore, Open MPI natively supports OpenMX because — well, it’s just MX, and we’ve supported that for a long time.

http://www.open-mx.org/

Nathan Schrenk

Joe,

Arista Networks switches can be used with SFP+ 1X twinax copper cables, which are less expensive than Infiniband 4X CX4 cables. The SFP+ twinax copper cables have an SFP+ connector on each end, so no additional transceiver is required.

Tommy Landry

Kudos to Jeff Squyres for a well thought out response. Thanks for sharing that viewpoint.

Jeff Darcy

Since nobody seems to have linked to it here yet, here’s a big list of 10GbE products, with all the pricing information you could want.

http://www.10gbe.net/

Comments

  1. I used to do technical marketing @ Woven, and during that time did some benchmarks comparing 10GbE and DDR Infiniband.

    Running HPL over 10GbE (NetEffect NICs + Woven Switch), and HPL over 4X DDR produced results that were indistinguishable from each other, at a reasonable node count.

    That’s not to say that Infiniband isn’t faster than 10GbE, but it does suggest that when you take everything into account (PCI-E bus, driver overhead/efficiency, etc…), with real or near-real applications, using moderate packet sizes, the actual difference between the two is minimal for most real-world cases.

    Of course, that should change somewhat as PCI-Express 2.0 and QDR Infiniband gain market traction, but I’m guessing that in real applications, unless you’re running something like Amber, it won’t make much difference, performance wise.

  2. Serge Polevitzky says

    … isn’t Infiniband (IB) full duplex? … so even if you have to pay the 10-to-8 reduction penalty (and there are plenty of preamble and postamble ethernet overhead bits you need to push, too), you can potentially get data flowing simultaneously in both directions. So this would seem to be a plus for IB. Also, aren’t the latencies for even 10GbE considerably higher than IB? — FWIW, Serge

  3. Scott Atchley says

    “Also, aren’t the latencies for even 10GbE considerably higher than IB?”

    You are confusing Ethernet and TCP/IP over Ethernet. With MX over Ethernet and a good low-latency switch, you can get 2 us.

  4. Terry Hulett says

    Latency on a quiesced system with one active connection is 2-3X higher today on iWARP (RDMA/TCP/Ethernet) than it is on IB. However, with an active system and 8 simultaneously active connections the latency difference is indistinguishable. This observation is backed by the fact that many (if not most) applications have similar run times on identical clusters with the two different interconnects.

    It is the case today that 10GbE is more expensive to deploy than IB. Therefore, it is incumbent on the data center manager to weigh deployment costs, TCO, and manageability.

  5. There are some items here that need to be fixed, since they are wrong. I should say that I work at a company that sells both InfiniBand and 10G Ethernet.

    – Price – switch prices depend on the switch configuration and the margin the vendor takes. You can find 24-port IB DDR switches from $3,000 to $5,000. The claim of a 24-port IB DDR switch at $16K seems unreal (unless it is a golden switch…). The 10G 24-port switches are in the range of $16K-$24K, so if you look at price, IB is still much cheaper. Even IB QDR is cheaper than that.

    – Performance – 10G is 10Gb/s real data rate; IB DDR is 16Gb/s and QDR is 32Gb/s. Latency – if you use standard IB vs. standard 10G, it is 1us on IB vs. 8-10us with 10G. You can run MX over the 10G link layer to get lower latency, but this requires MX on both sides, and still, it will be higher than the 2us stated above (the best Eth switches are 600ns per switch hop, and MX itself is not lower than 2us…, by the way Cisco switch latency is around 3us per hop…. definitely not good for MPI….). When you run 8 jobs at the same time, with IB you will still get the 1us per job, but with 10G the latency will increase with job count. Bottom line, InfiniBand is still the best-performing interconnect, and will probably stay that way for the next few years.

    – Woven – unlike 10G NICs, you can find a variety of IB adapters, from SDR to QDR, and from PCIe x4 Gen1 to PCIe x8 Gen2. There are multiple flavors, each with different latency capabilities as well. If you take the lowest-performance DDR card (and the cheapest, of course) and run it in a PCIe x4 slot, you will get lower bandwidth than 10G… a marketing trick to show that 10G is not worse than IB.

    – RDMA – not all applications use RDMA, but some do for good reasons. Not many people know that you get zero copy with InfiniBand whether you are using RDMA or Send/Receive. The difference between RDMA and Send/Receive on IB is in the CPU overhead on the remote side (with RDMA the remote CPU is not involved in the data transaction).

    – 10GBaseT – this was the promise of 10G: use the “same” cables as 1G. Good motivation, and the cables will be cheaper, but the switches cost much, much more. Do your own math. If you need the Cat6 cables, go with 10GBaseT, but the switches/NICs will be more expensive than the SFP+ ones.

    – Cables – most installations will use SFP+ for 10G, CX4 for IB DDR, and QSFP for IB QDR. At the end of the day, the cable cost is the same for both technologies.

    Enough for now ….. 🙂

  6. Here’s the thing: While Infiniband is here to stay, the ubiquity of 10GbE is inevitable… it’s just a question of time. And the tipping point is going to be 10GbE over RJ-45, and multi-speed NICs in servers.

    And no one should underestimate the importance of being able to run your interconnect over CAT-6/7, if only for the reason that you can cut your own cables with CAT, as opposed to having to buy expensive fibre/CX4 cabling, which cannot easily be cut to length or repaired in the field. Hell, cables are the single biggest reason the blade business exists.

    Remember, we’ve seen this before with regular GbE. One minute, each NIC is $700 and the switches are hugely expensive; the next, the NICs are embedded on servers and we’re buying 24-port switches from Dell at less than $200/port. Quanta is already building low-cost 24-port 10GbE switches for their OEMs.

    Those who forget their history are doomed to repeat it.

  7. Jeff Layton says

    @John Casu – I hope you’re right, I really really do. I would love to have a high-speed network like 10GigE on my systems at GigE prices. But for the foreseeable future I can get inexpensive SDR on my systems at a price point that 10GigE can’t touch.

    I’ve been waiting for almost 5 years for inexpensive 10GigE. Every year, the vendors keep saying, “it’s here, it’s here!” and the costs just aren’t dropping fast enough.

    I lived through the GigE price drop and that was fairly easy to see coming. But I just can’t see inexpensive 10GigE coming. The NICs are still too expensive and the switch costs are just too high (As I mentioned before, looking at 24-port switch prices for 10GigE is misleading at best. Building multi-tiered Ethernet switches from 24-port switches will just kill performance).
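
    To illustrate why single-switch list prices understate the real cost, here is a sketch of the textbook non-blocking two-tier fat-tree (folded Clos) sizing when the fabric is built only from 24-port switches; the 24-port figure is just the example discussed above, and real designs often oversubscribe to cut cost:

        /* Non-blocking two-tier fat tree from k-port switches: each leaf
         * uses k/2 ports for hosts and k/2 for uplinks, so the fabric
         * tops out at k*k/2 hosts and needs k leaf plus k/2 spine switches. */
        #include <stdio.h>

        int main(void)
        {
            const int k = 24;                    /* ports per switch */
            int max_hosts   = k * k / 2;         /* 288 hosts for k = 24 */
            int switches    = k + k / 2;         /* 24 leaves + 12 spines = 36 */
            int total_ports = switches * k;      /* 864 switch ports purchased */

            printf("%d-port switches: up to %d hosts, %d switches, "
                   "%.1f switch ports bought per host\n",
                   k, max_hosts, switches, (double)total_ports / max_hosts);
            return 0;
        }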

    So I’m hoping someone comes along and sprinkles magic pixie dust on 10GigE and the prices drop to an acceptable level. Until then I just don’t see it being competitive to IB in HPC.

    BTW – in the company I work for, we are seeing a resurgence in IB on the enterprise side because of the performance for systems with lots of VMs and the fact that 10GigE is just not coming down in price like people would like.

  8. For now, you’re absolutely right. Infiniband is going to be the cost leader for at least the next 18 months.

    But once 10GbE over CAT takes hold (and it is a question of when, not if, imho), it’ll happen very quickly, because 10GbaseT is a natural evolution of the overwhelmingly ubiquitous network technology whose use is driven by so many things outside HPC.

    Actually, I’m going to change my point slightly, because I think 10GbE dominance is also dependent on when Intel & Broadcom really decide to get into the market, and when Dell decides it’s time for 10GbE.

    Also, in my experience, there’s an innate resistance to 10GbaseT, among the 10GbE startups, precisely because it will drive prices way down.

    The real question, in my mind, is: when 10GbE does become ubiquitous, will it still be considered a high-speed interconnect technology? Will Mellanox, Myricom and others have moved on to the next great thing? I hope so.

  9. Scott Atchley says

    @Gilad

    You can run MX over the 10G link layer to get lower latency, but this requires MX on both sides,

    As opposed to running IB on one side? What is on the other side?

    and still, it will be higher than the 2us stated above (the best Eth switches are 600ns per switch hop,

    You should try a Fujitsu switch. They are less than 450 ns.

    and MX itself is not lower than 2us…,

    Latest NICs, latest CPUs, just under 2 us…

    by the way Cisco switch latency is around 3us per hop…. definitely not good for MPI….).

    You said it was fine when you were selling SDR NICs…

    When you run 8 jobs at the same time, with IB you will still get the 1us per job, but with 10G the latency will increase with job count. Bottom line, InfiniBand is still the best-performing interconnect, and will probably stay that way for the next few years.

    Best micro-benchmark performance, perhaps. Let’s talk alltoall on a large fabric. 🙂

    Scott

  10. Scott..

    Gilad is the VP of Technical Marketing @ Mellanox. That’s all.