Reader responds to question of 10GbE adoption in HPC


The past two weeks or so have seen a lot of activity in comments and email about HPC interconnects, most particularly about the relative merits of 10 GbE and InfiniBand.

Earlier this month I pointed to Doug Eadline’s article at Linux Mag arguing that 10GbE will see increasing adoption in HPC:

There are some things to consider, but in general, “Good Old Ethernet” is taking its next big jump. I am convinced that over the next year there will be a significant up-tick in 10 GigE HPC clusters.

In a move that we just love here at insideHPC world HQ (yes, my living room), reader Patrick Chevaux responded and gave us permission to publish his thoughts:

I agree with the title of your article (10GE ready for HPC).

However I want to make a short comment about the following paragraph from Doug Eadline :

“My 10 GigE prediction is based on the following rule of thumb, Speed, Simplicity, Cost, pick any two. I believe 10 GigE will win because of simplicity and cost. IB is already faster and has better latency and if you need this level of performance you are not even looking at Ethernet direction. The joy of clustering is that one size does not fit all and you can build your cluster around your needs.”

InfiniBand has won the HPC niche market exactly because of the 3 parameters cited above.

As we all know, InfiniBand has always been and still is ahead of Ethernet in terms of “speed” (throughput or goodput, and latency). One could also argue about inherent features like flow control and separate lanes, which provide better traffic handling than good old Ethernet. With the open IB architecture and a number of ongoing developments within the IBTA (InfiniBand Trade Association), the speed advantage is going to continue, if not widen. If you are not convinced, look at the IB roadmap with QDR 8x, EDR and HDR coming soon.

Simplicity really means getting commercial hardware and software products that are proven, easy to use and provide the required high performance without having to design complex software schemes.

Ethernet is definitely easy to use, with well-known integrated hardware and software stacks. The real problem is the use of old, inefficient protocols such as IPv4, TCP, UDP and the like. As we all know, performance of the IP stack is terrible and improving it is a major software burden.

InfiniBand, on the other hand, shares the same basic hardware and software characteristics: silicon is widely available and cheap, drivers are integrated in operating systems, and users can rely on high-level programs like MPI to get very good interprocess performance.

Cost is the key point for most users. And here, InfiniBand has won the battle with very little fight. HPC specialists and system integrators like ClusterVision, who used to rely on Gigabit Ethernet, Myrinet or Quadrics, have quickly moved to selling only InfiniBand HPC interconnects, starting with InfiniBand SDR, moving to DDR in 2007 and now standardizing on InfiniBand QDR because of lower cost.

So, indeed the OpenFabrics Alliance has been doing the right thing in migrating all the InfiniBand “goodies” like RDMA, user-mode I/O (zero-copy, kernel bypass), high throughput and low latency to Ethernet. Yet, InfiniBand is the undisputed leader in the Top500 list.

I see problems for future “data center” Ethernet in not having a proper hop-by-hop flow control scheme and not having an efficient traffic separation scheme like InfiniBand’s virtual lanes.

Finally, I hope the two worlds will eventually converge:

  • InfiniBand and Ethernet PHYs are so similar that it would be stupid not to merge
  • InfiniBand has demonstrated that
    • cheap high performance host adapters can be made
    • cheap high performance switches can be made

I dream of 10GE, 40GE, 100GE, etc. products providing the best of both worlds (high performance, low cost).

Patrick, thanks for taking the time to write in. And if you’ve got thoughts on this topic, please drop us a line or leave a comment on the blog. We love to hear back from you.


  1. I have an instant reaction to Patrick’s statement:

    “Simplicity really means getting commercial hardware and software products that are proven, easy to use and provide the required high performance without having to design complex software schemes.”

    I refer to slide 5 in my presentation at the Sonoma OpenFabrics conference this year:

    It’s a simple graph of lines of code in Open MPI for each different network transport. If you don’t want to open up the zip file, I’ll copy the numbers here:

    * MX: 1,210 and 2,331 (Open MPI has 2 different “flavors” of MX support)
    * Shared memory: 2,671
    * TCP: 4,159
    * OpenFabrics: 18,627

    Yes, that’s right — OpenFabrics requires 18 *THOUSAND* lines of code in Open MPI — TCP is less than a quarter of that. MX is less than half again of that. Note that the 18K number doesn’t even include any of the memory registration code that is only used by OpenFabrics (not TCP, shared memory, MX, …etc.). So 18K is actually below the real number.
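    As a quick check on the ratios quoted above, here is a minimal Python sketch that works them out from the listed line counts. The counts come from the list; the labels "MX (flavor 1)" and "MX (flavor 2)" and the helper name `ratio` are placeholders introduced only for this illustration.

```python
# Lines of code per Open MPI transport, as quoted in the list above.
# The two "MX" labels are placeholder names; the comment does not name the flavors.
loc = {
    "MX (flavor 1)": 1210,
    "MX (flavor 2)": 2331,
    "Shared memory": 2671,
    "TCP": 4159,
    "OpenFabrics": 18627,
}


def ratio(a, b):
    """Return the size of transport a's code relative to transport b's."""
    return loc[a] / loc[b]


# Compare every transport against the OpenFabrics line count.
for name, count in loc.items():
    share = count / loc["OpenFabrics"]
    print(f"{name:14s} {count:6d} lines  {share:.1%} of OpenFabrics")

tcp_vs_openfabrics = ratio("TCP", "OpenFabrics")
smaller_mx_vs_tcp = ratio("MX (flavor 1)", "TCP")
```

    The printed shares make the comparison concrete without relying on the slide itself. These numbers measure code size only, not performance.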

    Therefore, I do believe that Doug’s statement is correct: Speed, Simplicity, Cost, pick any two. “Simplicity”, as defined by Patrick, is definitely NOT met by the API used by Infiniband (i.e., OpenFabrics verbs).

    (sorry, this just raises a favorite rallying point of mine — the OpenFabrics verbs API is actually *extremely* low level and quite difficult to write to for user-level applications. FWIW: it’s been (accurately, IMHO) remarked that OF verbs looks and feels much more like a kernel-level interface than a user-level interface)

  2. I’m glad there are armies of developers like Open-MPI expending 18k lines of code and much time to ensure that we can get good performance out of InfiniBand.

    Perhaps this explains why there’s little pickup of verbs interfaces in the enterprise: development time has a greater impact there than in HPC, where applications and middleware live on for decades.

    Anyway, back to HPC. Since these verbs are indeed so low-level, one needs to consider the impact of how they evolve and are implemented by the InfiniBand community. Here is the life-cycle of how OpenFabrics adopts new features when, say, Vendor A has to address a shortcoming in the existing interface or simply evolve the interface for a newer generation of hardware:
    1. Vendor A implements new feature through a verb which, because of the low-level nature of the API, requires modifications to Vendor A’s entire stack (software down to hardware).
    2. Vendor A proposes that OpenFabrics adopt the verb so the community can take advantage of the said new feature.
    3. OpenFabrics adopts the extension and it is rolled out in the next release.
    4. Vendors B and C scramble to *emulate* the verb in time so they too can tout Open Fabrics support.

    Here’s the problem: Vendor A is the *only* InfiniBand vendor to extend and natively implement new verbs in their silicon. Therefore, InfiniBand *is* Vendor A, while the others are me-too implementations thereof.

    Cisco realized they could only lose once the evolution of InfiniBand became so tied to Vendor A’s silicon, so they dropped InfiniBand. Plus, there are already dozens of companies taking the said silicon and customizing it for a (demanding) HPC community. This makes for tough competition on details that don’t have to be dealt with in the Ethernet world, where customers are low-touch and require less specialization.

    Now Vendor A wants to bring its vision of the world to Ethernet. It will be interesting to see how that goes. Ethernet has many more silicon providers and has always flourished within bounds that are more loosely defined than InfiniBand’s. Of course, the pace of adoption is much slower in the Ethernet world, but this shouldn’t come as a surprise since there are many silicon providers to set the rules. One has to wonder how the Ethernet community as a whole would swallow the large set of silicon-dependent nuts and bolts called “OpenFabrics verbs”.

    On the flip side, the HPC community could benefit from Vendor A leveraging its Ethernet side of the business to help sustain its niche investments in evolving InfiniBand. We already know that we have valiant armies of programmers ready to take on that challenge.

  3. “Yet, InfiniBand is the undisputed leader in the Top500 list.” … Infiniband has never been the leader in the top500 list. I will accept your statement concerning the top 50 machines, but outside of them, Ethernet has the pie. One of the biggest problems with this community is that people who work in the top 20 machines see the entire list (and computing in general) through a very focused lens. This is as irresponsible as it is inaccurate.

    I would also like to point out that many good transports have come and gone for many different reasons. The market would say that since QLogic and Mellanox are the only manufacturers of infiniband ASICs, it might not have much of a future. If it weren’t for the national labs, infiniband would have died years ago. Reference your top 50 largest machines: 50% are in the US and of those 60% are federally funded. Without that gravy train, there would be no infiniband. Watch the coming budget to dictate their survival. It is an uphill battle to convince me that legislation doesn’t shift economies. Hold the glasses down, the tablecloth is about to be pulled.

    I agree that infiniband has a very low latency and outstanding performance, but it is nowhere near the sub-100 ns that the BlueGene machines can pull off. If scientific computing is the job, then use a proprietary interconnect that outperforms infiniband and is custom built for the program’s needs. There is no need to keep this “middle” open technology alive other than for pet projects. Any cost/benefit review proves the TCO model.

    What we have here are computer people who know little of business and business people who know little of computing. The truth will not be found in either camp. The only wisdom I can offer is, “We’ll see” as we move down the roadmap. Is it the general consensus that the economy will continue to develop separate modes of communication? Do we not see the writing on the wall? If the path of transports was to be developed in parallel, then we would still have widespread ATM, X.25, Banyan, Novell, etc. Ethernet has converged everything from power, audio/video and telephone to, now, Fibre Channel. Infiniband is in its sights, and Ethernet will do to it what it did to ATM. Throw enough cash at something and you will find a solution. The tone from OFED in Sonoma to the HPC Forum is, “Get ready”: Channel I/O is going to be a layer on Ethernet. It will all be converged. Slim L2 RDMA functionality will be a VLAN configuration with TRILL enabled. 20x electrical lanes running 10GigBaud is already pushing 200Gb/sec to a CFP. The CMOS wafer is almost at 28nm. Size is not a factor, heat is not a factor, fiber is now cheaper than copper with attached optics, and now we are seeing well under 5W per port on 10Gig. It is moving too quickly for Infiniband and their captive IBTA to stay ahead.

    If you read this far and are not furious, then you have already come to the last stage of truth: widespread acceptance. This business goal and technology direction should be self-evident to you. If it is not, then you have been cooped up in your camp too long and are starting to get weird. Step outside and look around.

    Comments welcome-

  4. EE&J – whew! We appreciate our readers who comment, and this thread is definitely getting more than its share of good ones. Thanks for taking the time – I’ll bet you get a response to that.

  5. It’s fun to read articles from the past…so now 3 years later…where are we?