10GbE is Ready for Your Cluster. Or is it?

Linux Magazine has an article written by Dan Tuchler detailing why he thinks 10-gigabit Ethernet should be more widely considered for vanilla HPC cluster installations.  Considering that the vast majority of cluster installations fall outside the realm of the Top500 list, many of us tend to forget that the average HPC user doesn't have terabits of interconnect bandwidth.  They're simply using Gigabit Ethernet.  Tuchler argues that this high comfort level with Ethernet technologies, coupled with the falling cost of 10GbE, makes the technology ripe as an interconnect platform.

As a widely-used standard, Ethernet is a known environment for IT executives, network administrators, server vendors, and managed service providers around the world. They have the tools to manage it and the knowledge to maintain it. Broad vendor support is also a plus – almost all vendors support Ethernet.

I somewhat agree with Tuchler's point of view.  Five years ago, 10GbE prices were so far out in the stratosphere that you could rarely justify the funds to purchase a switch.  The prices *are* finally coming down to reasonable levels.  However, so are the prices of other common cluster interconnects such as Myrinet and InfiniBand.  Tuchler quotes $500 per port for 10GbE, which is very close to the current InfiniBand cost basis.  So why go 10GbE when you can buy InfiniBand with native RDMA capabilities and an integrated IP stack?  [This is really a question, folks; I'm not being sarcastic.]

Feel free to leave your comments on this one.  I'm interested to hear how the audience feels about this debate.  For more info, read Dan's article here.

Trackbacks

  1. […] West at InsideHPC asks about 10 GbE on clusters. The point I made (in two posts), and we verify every time we spec a system out for a customer, is […]

  2. […] the latest developments in 10GbE technology with respect to the HPC market.  We had quite a heated discussion some time ago on the same subject. Traditionally Ethernet has been considered a low cost and good […]

Comments

  1. Jeff Layton says

    I think the $500 is the cost of the NIC alone. You still need to buy the switch and the cables. So he’s a bit off on the per port costs.

    Here’s a blog I wrote that might help (or might not).

    http://www.delltechcenter.com/page/12-01-2008+-+10GigE+in+HPCC

    Jeff

  2. Back in May I did a quick survey of costs on the switch side. FWIW, here are the results from then; of course things will have changed since then, but it’s a start.

    Force10 S2410 10GE: 24 ports for $18K or $750/port
    Fujitsu XG700 10GE: 12 ports for $7K or $583/port
    HP 2900-24G 10GE: 24 ports for $16.5K or $687/port
    Myricom 10G-SW16LC-6C2ER 10GE: 16 ports for $12K or $750/port

    Voltaire ISR9024 IB: 24 ports for $16.5K or $687/port
    Melrow mumble IB: 24 ports for $16.5K or $687/port
    SBS EIS-4024 IB: 24 ports for $7.5K or $312/port (SDR, others are DDR)

    In short, DDR IB seems to come in slightly less than 10GE per port, at twice the data rate, while SDR IB is less than half the cost for the same data rate. I suspect the per-packet CPU utilization is also better on the IB side, though the software complexity is greater. (The division is easy to redo with current quotes; there's a throwaway helper at the end of this comment.)

    Disclaimer: I work for an HPC vendor (SiCortex) which has little interest in 10GE/MX/IB as a cluster interconnect since we have our own built in.
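
    If you want to redo the division with current quotes, it's nothing more than list price over port count. Here's a throwaway helper; the figures baked in are just a few of the rough list prices from the survey above, so swap in your own numbers before drawing conclusions.

      /* per_port.c - divide a switch list price by its port count.
       * The prices below are placeholders from the survey above, not
       * numbers to build a budget on. */
      #include <stdio.h>

      struct quote { const char *name; double list_usd; int ports; };

      int main(void)
      {
          const struct quote q[] = {
              { "Force10 S2410 10GE",  18000.0, 24 },
              { "Fujitsu XG700 10GE",   7000.0, 12 },
              { "HP 2900-24G 10GE",    16500.0, 24 },
              { "Voltaire ISR9024 IB", 16500.0, 24 },
              { "SBS EIS-4024 IB",      7500.0, 24 },
          };

          for (int i = 0; i < 5; i++)
              printf("%-22s $%6.0f / %2d ports = $%4.0f per port\n",
                     q[i].name, q[i].list_usd, q[i].ports,
                     q[i].list_usd / q[i].ports);
          return 0;
      }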

  3. John Leidel says

    Thanks to the Jeffs for a great series of comments! Like I said, it's been several years since I had the pleasure of quantifying the costs of various interconnect technologies [when I did this, 10GbE was $1200+ per port].

    Jeff [Layton], I read your blog post w/ the associated perf tests. Very cool. I’m terribly interested in your GAMMA-MPI runs.

    Jeff [Darcy], did you happen to survey the current costs of NICs as well?

  4. The IB switch pricing for 24 port DDR switches is now sitting around $4k +/- some. So it is roughly (today) $167/port. Add in a 2m CX4 cable at $60-ish, and a DDR NIC at $500-ish, and connecting 1 server to one port on a 24 port switch will run you under $750/port in total.

    Do the same analysis for 10 GbE. NICs about $500-ish, same CX4 cable. But the switches are still not cheap. The best price we have seen/heard anywhere per port (not CX4, so you have added transceiver costs, not a wise move IMO) is about $750. Rumors abound of $500/port switches somewhere.

    So from a pure price play, 10 GbE still costs more, and will until inexpensive switches start coming out. Once that happens, we would expect them to start taking over for IB. Until then, I don't expect to see much change. (The full per-connection tally, NIC plus cable plus switch port, is sketched at the end of this comment.)

    The 10 GbE stack is much easier to deal with than IB. Building OFED has been, up until very recently, a crap-shoot on anything but a small range of specific distro kernels. This was an unfortunate outgrowth of how OFED developed, but the situation is improving. Unfortunately, you won't get good things like NFS-over-RDMA without the modern kernels, which are not (officially) supported by OFED (cf. http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-docs/README.txt).

    What we have found in testing is that single-thread 10 GbE performance is OK, though multi-thread is quite good. Latency from our testing is comparable with normal IB latencies. But it is Ethernet, so NFS just works, without any appeal to RDMA to get it going. And TCP/IP just works, and works well, again without suffering the (major) performance degradation of doing IP over IB.

    But is all this worth the significantly higher price?

    That decision we must leave to the consumer. 10 GbE pricing is a problem for HPCC systems, and hopefully someone is working on a way to bring the costs down to something reasonable. The price of IB is already reasonable, and without a good reason to switch, that change likely won't happen (the stack pain is annoying, but livable, and the entire process of support is automatable).

    Just my thoughts. We support everything. The Delta-V we showed in Pervasive Software's booth was running iSCSI over 10 GbE as a target. We got a sustained 500 MB/s and 1,800 IOPS out of it for their use. It works; it just costs more.
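
    For what it's worth, the per-connection tally above is simple enough to script. This is just a back-of-the-envelope sketch using the rough street prices quoted in this comment; the numbers are placeholders and change monthly, so substitute your own quotes.

      /* connection_cost.c - rough per-node fabric cost: NIC + cable + switch port.
       * Prices are the approximate figures quoted above, nothing more. */
      #include <stdio.h>

      struct fabric {
          const char *name;
          double nic_usd;         /* adapter */
          double cable_usd;       /* 2m CX4 cable */
          double switch_port_usd; /* switch price divided by port count */
      };

      int main(void)
      {
          const struct fabric f[] = {
              /* DDR IB: ~$4k for a 24-port switch => ~$167/port */
              { "DDR InfiniBand", 500.0, 60.0, 4000.0 / 24.0 },
              /* 10 GbE: best per-port switch price seen so far */
              { "10 GbE",         500.0, 60.0, 750.0 },
          };

          for (int i = 0; i < 2; i++)
              printf("%-15s NIC $%3.0f + cable $%2.0f + switch port $%3.0f = ~$%4.0f per connected node\n",
                     f[i].name, f[i].nic_usd, f[i].cable_usd, f[i].switch_port_usd,
                     f[i].nic_usd + f[i].cable_usd + f[i].switch_port_usd);
          return 0;
      }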

  5. Hmmm … I must be missing something here. $167/port (IB) is higher than $400/port (10GbE)? Motherboards also ship with onboard IB HCAs (we are working on bids with units like this now). 10 GbE on motherboards is relatively recent compared to IB on motherboards.

    Also, the Arista switches need a transceiver, which the onboard motherboard NICs (10 GbE or IB HCA) don't, nor do the CX4-based IB switches. The SFPs do add to the cost per port.

    Again, the aren’t needed on IB. So the cost is higher. If I am wrong, please, by all means, show the analysis.

  6. Yeah, the cost is still an issue for some, and this is an interesting take on alternatives. My company actually has a product for expanding ports on testing equipment only (SPANs and TAPs). Seems we need a similar product for throughput traffic though, and at a lower cost than these.

  7. A few random points, in no particular order:

    1. 10GbE and SDR IB are *NOT* the same data rate! This is a common
    marketing misconception. With 10GbE, you can actually push darn
    close to a rate of 10Gb on the wire for large messages. IB uses
    8b/10b encoding, so you automatically lose 20% of the bits on the
    wire to protocol overhead — you're down to 8Gb. Similarly, DDR is
    really only 16Gb of delivered data performance; QDR is really 32Gb
    (the arithmetic is sketched at the end of this comment). I believe
    that this point was also made in the original article.

    2. RDMA actually gets you very little in terms of MPI (regardless of
    whether it’s IB or iWARP or …). What an MPI implementation really
    wants is hardware offload/assistance for message passing progress,
    particularly of large messages. If that hardware assistance comes in
    the form of RDMA, ok, fine. But to be blunt, MPI’s semantics are
    better matched to other forms of hardware offload.

    3. Indeed, with today’s OpenFabrics MPI implementations (including
    Open MPI), RDMA’ing an entire large message all at once can be quite
    expensive in terms of resource usage. Open MPI is capable of sending
    large messages either as a single large RDMA or a number of smaller
    sends and/or RDMAs. Which way works best for you is likely
    application-specific: it depends on factors such as (but not limited
    to):

    – how much registered memory your application is using in other
    pending communications
    – what the frequency of your communication is
    – how many peers you’re sending to
    – what communication/computation overlap you need
    – how often you invoke MPI functions that trip the internal
    progression engine
    – …etc.

    So don’t get hung up on specific technologies like RDMA. RDMA is not
    the be-all/end-all technology for HPC. In some cases, it’s not even a
    very good technology (!). Hardware offload is what is key (IMNSHO),
    and there are many different flavors to choose from.

    But at the end of the day, what you want to know is what will perform
    well *for your application*. For example, here’s a very, very
    coarse-grained set of questions that may start you down an analysis
    path for your needs: for your application(s)…

    – is 1Gb/high latency sufficient?
    – is 8Gb/low latency sufficient?
    – is 10Gb/low latency sufficient?
    – is 10Gb/medium latency sufficient?
    – is 10Gb/high latency sufficient?
    – is 16Gb/low latency sufficient?
    – is 32Gb/low latency sufficient?

    …and be sure to multiply that out if you plan to put more than one
    active network port in each server — especially as core counts go up!
    It’s insane (IMNSHO) to have one network port for 16 cores and assume
    that you won’t drop off overall network performance when all 16 MPI
    processes (or even 8… or possibly even 4!) are simultaneously
    pushing either large messages or large numbers of small messages.
    Also make sure you get the math and server topology right such that
    you can actually push (N x one_port_bandwidth) with the desired
    latency into your fabric, yadda yadda yadda…

    My point of this long ramble: it’s not about RDMA. It’s not even
    [entirely] about price. Look into exactly what you’re going to use
    your HPC resources for — what problems are you going to solve and how
    *EXACTLY* you are going to solve them. Which MPI will you use? What
    application(s)? What communication pattern(s) do they use? What
    network topology fits that? Do you *need* low latency? Do you *need*
    high bandwidth? In short: find the best technologies that fit your
    needs, not the coolest/hottest/bigger-than-your-rival’s technologies.
    Spend a little time on a quantitative analysis of your needs; you’ll
    save lots of money over the long run because you’ll get a solution
    that works best for exactly what you’re trying to do.
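
    Since I'm preaching quantitative analysis anyway: the wire-rate and per-core arithmetic above is easy to put into a throwaway program. A rough sketch follows; the 8b/10b factor and the core counts are the only real inputs, and the output is best-case arithmetic, not a benchmark.

      /* wire_math.c - effective data rate after 8b/10b encoding, plus the
       * best-case bandwidth share per MPI process when every core on a
       * node pushes data through a single port at once. */
      #include <stdio.h>

      int main(void)
      {
          /* IB signaling rates in Gb/s; 8b/10b encoding delivers 80% of that. */
          const char  *ib[]     = { "SDR", "DDR", "QDR" };
          const double signal[] = { 10.0, 20.0, 40.0 };

          for (int i = 0; i < 3; i++)
              printf("IB %s: %2.0f Gb/s signaling -> %2.0f Gb/s of data\n",
                     ib[i], signal[i], signal[i] * 8.0 / 10.0);

          /* 10GbE delivers close to the full 10 Gb/s for large messages. */
          const double port_gbps = 10.0;
          const int    cores[]   = { 4, 8, 16 };

          for (int i = 0; i < 3; i++)
              printf("1 x 10GbE port, %2d cores all sending: ~%.3f Gb/s per process (best case)\n",
                     cores[i], port_gbps / cores[i]);
          return 0;
      }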

  8. GAMMA is neat, but the best-kept secret in HPC is Open-MX (www.open-mx.org). It's a software implementation of Myricom's MX protocol, the one normally run on Myricom's hardware NICs — it sits on top of the Linux Ethernet driver, so it works with whatever Ethernet NIC you have (1Gb or 10Gb).

    Specifically, MX is just frames over ethernet, regardless of whether they are being pushed via software or hardware. In a datacenter (i.e., HPC cluster), frames over ethernet is all you need — you don’t need the huge/complex TCP stack (or other network stacks).

    I *STRONGLY* encourage everyone to give Open-MX a whirl; let's get the bugs shaken out and get people using it. I saw some *very* promising latency numbers out of Open-MX and "reasonable" 10Gb NICs (I'm not going to quote numbers because I'm a vendor and I don't want my ran-it-in-the-lab numbers to be taken authoritatively); I've even heard anecdotal stories of real-world MPI apps getting nice speedup over *1* (yes, *one*) GbE!

    Note that Open-MX and MX are both API and wire-line compatible, so you can have Open-MX on one side and MX on the other (nifty!). Therefore, Open MPI natively supports Open-MX because — well, it's just MX, and we've supported that for a long time.

    http://www.open-mx.org/
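
    If you want to kick the tires, the nice part is that no application changes are needed: any MPI latency microbenchmark will do, and with an MPI that has MX support (Open MPI does), you just select the MX/Open-MX transport instead of TCP at launch time. Here's a minimal ping-pong sketch (nothing Open-MX-specific in it):

      /* pingpong.c - minimal MPI latency microbenchmark: ranks 0 and 1
       * bounce a small message back and forth and report the average
       * one-way latency.  Run it once over TCP and once over MX/Open-MX
       * and compare. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          const int iters = 10000;
          char buf[8] = { 0 };   /* small message: we care about latency here */
          int rank;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Barrier(MPI_COMM_WORLD);

          double t0 = MPI_Wtime();
          for (int i = 0; i < iters; i++) {
              if (rank == 0) {
                  MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                  MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
              } else if (rank == 1) {
                  MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
                  MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
              }
          }
          double t1 = MPI_Wtime();

          if (rank == 0)
              printf("avg one-way latency: %.2f usec over %d iterations\n",
                     (t1 - t0) / (2.0 * iters) * 1e6, iters);

          MPI_Finalize();
          return 0;
      }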

  9. Joe,

    Arista Networks switches can be used with SFP+ 1X twinax copper cables, which are less expensive than InfiniBand 4X CX4 cables. The SFP+ twinax copper cables have an SFP+ connector on each end, so no additional transceiver is required.

  10. Kudos to Jeff Squyres for a well thought out response. Thanks for sharing that viewpoint.

  11. Since nobody seems to have linked to it here yet, here’s a big list of 10GbE products, with all the pricing information you could want.

    http://www.10gbe.net/

  12. Jeff Layton says

    To follow Jeff’s randomness… 🙂

    (1) I think Jeff Squyres makes some great points. It’s not always about price, it’s about performance and what you are doing with the fabric. There are lots of things that go into a good solution.

    (2) Pricing out small switches for 10GigE shows that 10GigE is approaching IB in price. Where it gets really fun, and Joe alluded to this, is when you start talking about larger fabrics. In my experience, when you get to larger fabrics the price per port goes up faster than the port count (i.e., it gets pretty darn expensive). Plus, if you're running TCP, you have to start worrying about spanning-tree latencies if you go for multi-tier switching (I know Woven says they have a solution for this, but to be honest I don't know much about it. I think there are others that claim to have fixed this problem – I just haven't seen much on it yet). So building multi-tier TCP fabrics is not really pretty from a microbenchmark perspective.

    (3) One other quick point – the GAMMA charts are from Doug Eadline. He and I were working on a project over at ClusterMonkey (shameless plug) and Doug was testing GAMMA. The results are pretty cool (IMHO).

    Doug is now testing Open-MX. I agree with Jeff S. that Open-MX is pretty nifty and an under-appreciated possibility for people. GAMMA doesn't allow you to mix TCP and GAMMA traffic on the same port; however, Open-MX does allow you to mix traffic. Doug is still testing, and there were a few little weird things happening in performance testing, but I think he has most of those ironed out. He's working on an article for ClusterMonkey to present his results in the near future.

    (4) I personally think using something other than TCP gives Ethernet some new life. Open-MX or GAMMA over GigE allows applications to scale a bit further and run faster. Running non-TCP over 10GigE is also something to seriously consider. There are still some issues about fabric configuration, but you can drop the latency for 10GigE to some pretty low levels.

    (5) IB is still running strong within HPC, even for smaller systems. SDR, while not 10Gb/s (it's 8Gb/s), is priced pretty low and is very attractive for smaller systems. DDR (at 16Gb/s, as Jeff reminded all of us) might come down in price with QDR (32Gb/s, as Jeff pointed out) coming into the market more and more now. Pretty amazing performance with IB.

    (6) IB is also great for smaller systems with ScaleMP. With ScaleMP you can take the cluster nodes, connected with IB, and they appear to the OS as one large SMP system. You don't have to install IB drivers or anything like that – ScaleMP takes care of that for you. You just run your MPI codes with something like shmem as the device (no need for IB) and they run just fine. Pretty cool stuff.

  13. Dan Tuchler says

    There’s some really great information on this thread – I’m enjoying the exchange, and learning a lot.

    Just to clarify on the cost topic – we see several major server vendors beginning to incorporate 10 Gig Ethernet chips on the servers. Whether this makes them “free” or not is a matter of opinion, but certainly this will help drive 10 GE chip volumes up and push the costs way down, as happened with 1GE. The cable, whether CX4 or passive SFP+, is in the range of $50 +/- and coming down. And switches now list for under $500 a port: my company, BLADE Network Technologies, makes both blade-server resident and Top of Rack switches that list for under $500 a port (yes, that’s why we are so interested in observing 10GE adoption). Certainly there are cases for other interconnects as well – it’s good that users have choices.