The HPC community is anticipating an announcement from Intel, possibly timed with the ISC14 conference, to unveil more of its fabric integration strategy and branded products. There is growing evidence, in terms of new accounts lining up to use the Intel True Scale fabric product, to indicate Intel’s strategy is well thought out – going back to the acquisition of the QLogic assets. At various conferences and user groups, we’ve been running into a growing number of Intel’s customers and partners who have been briefed under NDA by Intel, and while they are all strict about honoring their confidentiality agreements, I’m frequently hearing, “Intel really seems to have figured out this fabric integration.”
Intel’s True Scale Fabric is part of Intel’s Technical Computing Group (Raj Hazra, Charlie Wuischpard) and is targeted squarely at the high performance computing market.
We contacted Joe Yaworski, Director of Marketing for Intel’s High Performance Fabric Operations, to discuss the success of Intel’s True Scale Fabric and to break it down to better understand this growing acceptance.
This article is based on that interview and is also available as an audio podcast. Download the MP3.
insideHPC: When we talk about Intel’s True Scale fabric, what are we talking about? Switches and channel adapters, or is it broader than that?
Yaworski: Intel True Scale Fabric consists of a full line of adapters for servers in different form factors, as well as a line of fabric switches, consisting of ‘edge’ or ‘top of rack’ switches that start at as little as 18 ports. There is also a line of ‘director class’ switches that go as high as 864 ports. And then, of course, all of the management software.
insideHPC: We know True Scale is growing in popularity based on installations, but what can you tell us about the growth – and what’s happened since you guys became part of Intel? (Intel Acquires the InfiniBand business of QLogic).
Yaworski: Our revenue year on year has grown about 40%. We came into Intel about mid-2012, and there was the normal integration process that just takes time to work through. It took a while for us to get incorporated, as well as fully up to speed in Intel, while syncing up all of the Intel individuals involved in helping to evangelize high performance computing. So when you look at it, we completed that process between the second and third quarters of last year. That’s the full integration, including the training of individuals who frankly were used to selling basically CPUs and servers into the high performance computing space. So now, the team has been brought up to speed on how to sell fabrics, in particular, True Scale. So we’re now starting to reap the benefits of all that groundwork. We’ve had an excellent start to this year that’s continuing on into the second quarter. Kind of the same type of momentum – a thirty to forty plus percent increase, year on year, over what we did in revenue.
insideHPC: It seems like I’ve been hearing a lot of buzz, especially in the past quarter, Joe. Announcements from Intel – a lot of focus around HPC – and of course True Scale wins. What do you think is driving this surge? Is it just a happenstance that there’s a number of announcements this quarter or is this a sign of a bigger shift coming?
Yaworski: I think it’s a sign of a bigger shift coming. As I said, the broader ecosystem of resources at Intel now has a better understanding of fabrics and the value of fabrics, specifically the value of True Scale and what makes it different. So that’s been driving a lot, plus we’ve had a lot of customers that have basically added on to existing clusters or added on to contracts they had in the past with True Scale. One of those that you might be familiar with is the Department of Energy – the TLCC2 contract that was let a couple of years ago. It is a frame contract under which three major labs in the Department of Energy could order servers based on Intel processors and the True Scale fabric. Those installations have gone quite well, to the point where they have continued to order, and order some fairly substantial clusters.
Then in addition to that, the performance of these clusters has gone so well that the labs have actually written some very interesting white papers that go through exploring why they’re getting such good performance out of the clusters that are based on True Scale. So not only have they ordered a substantial amount of equipment, their experience with it has been very good, to the point where they’re still placing orders for True Scale.
insideHPC: Yes, I remember when we last talked, it was still news that you had won the Tri-labs business (Livermore, Sandia, and Los Alamos), and now you’re past the deployment stage. It sounds like things are going very well. What do you attribute that to? Is that the proof point that the True Scale way of approaching InfiniBand is effective?
Yaworski: If you look at when True Scale was designed and originally architected, it was architected after the marketplace had already made a decision that InfiniBand was going to be an interconnect for high performance computing. That was in about the 2003 time frame. To go back before that, InfiniBand was really being designed and architected as a channel interconnect for the data center, to replace PCI. Its whole architecture was designed for that.
Now, it did bring some advantages into the mix that were relevant to HPC, so it started to get some early use in HPC. When it didn’t catch on in the data center marketplace in that early 2000-2002 time frame, it found a very comfortable home in high performance computing. Because if you looked at it, it was a standards-based interconnect that offered relatively low latency at about 15 microseconds, versus Ethernet, at that point in time, at about 150 microseconds. And it offered anywhere from ten to a hundred times the bandwidth, when compared to a hundred megabit or gigabit Ethernet. So it had the basic fundamental architecture for high-performance computing, but the InfiniBand architecture needed a big retrofit for HPC. The area where it was retrofitted is a library called Verbs. Verbs is the interface between all of the protocols and the adapter’s device driver. Making a long story short, InfiniBand was not originally well suited for the primary protocol for HPC, that being MPI, or Message Passing Interface.
So at the time, there was a company out there called PathScale. PathScale got acquired by QLogic, and in turn QLogic’s InfiniBand group got acquired by Intel; that’s how you can map its lineage into Intel. At that time, PathScale looked at IB and said, “InfiniBand is a great interconnect, in terms of its latency and its bandwidth; however, Verbs is the wrong architecture for MPI.” So they took a page out of two of the proprietary interconnect groups that were selling into HPC, those being Myricom and Quadrics. Essentially, they designed a new interface for MPI that was specifically matched to the semantics of MPI. Instead of using InfiniBand’s connection-based design, they developed a connectionless design. So the benefit of that is, number one, we get very high MPI message rate throughput, which is key to supporting applications at scale. Second, because it is a connectionless architecture, we maintain a very low end-to-end latency at scale. So the combination of being able to handle a very high message rate, especially with short messages for MPI, and maintaining the low end-to-end latency, has proved a very effective architecture for performance that scales.
That’s essentially what the Department of Energy, the Tri-labs found, when they installed the True Scale fabric. They got not only very good performance, but they got their applications to scale even further than they had expected. So as a result of that, they wrote a couple of different white papers that basically get into the details of that performance.
insideHPC: So now, they’re going to move forward, and hopefully, replicate that scaling, based on—well, because True Scale was optimized for HPC from the beginning?
Yaworski: Correct, and so it varies from the standard InfiniBand architecture in that library called PSM. The standard architecture is connection-based, which means that the adapter has to maintain all the state information. So as you begin to scale, there’s a chance – actually an increasing chance – that you’ll have a cache miss. And once you have a cache miss, then you have to go back to main memory, pull the information out, and refresh the cache. Whereas on True Scale, with the PSM layer delivering a very high message rate and this connectionless design, we scale and maintain low latency, even at scale. When you look at the size of the clusters that the Tri-labs have deployed under this contract, which are in the thousands of nodes, and then you look at the performance of their applications and the performance of True Scale with them, they’ve had a very good experience with it.
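A rough way to see the scaling argument is to count how much per-peer state an adapter would have to track in a connection-based design as a cluster grows. The Python sketch below is illustrative only: the function names, the assumed model (every local MPI rank holds one connection to every rank on every other node), and the cache capacity are assumptions made for this example, not figures from Intel or from the InfiniBand specification.

```python
def connections_per_adapter(nodes: int, ranks_per_node: int) -> int:
    """Connection contexts one node's adapter tracks, assuming every local
    MPI rank keeps a connection to every rank on every other node."""
    remote_ranks = (nodes - 1) * ranks_per_node
    return ranks_per_node * remote_ranks


def fits_in_cache(nodes: int, ranks_per_node: int, cache_entries: int) -> bool:
    """True if all connection contexts fit in a hypothetical on-adapter cache."""
    return connections_per_adapter(nodes, ranks_per_node) <= cache_entries


if __name__ == "__main__":
    CACHE_ENTRIES = 4096  # hypothetical capacity, chosen only for illustration
    for nodes in (16, 128, 1024, 4096):
        total = connections_per_adapter(nodes, ranks_per_node=16)
        print(nodes, total, fits_in_cache(nodes, 16, CACHE_ENTRIES))
```

Under these assumptions the state grows linearly with node count, so a fixed-size cache that covers a small cluster is quickly exceeded at larger scale, which is where the cache misses described above would start to appear. A connectionless design avoids holding this per-peer state on the adapter.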
insideHPC: Joe, I wanted to jump around here a little bit, because I’ve been reading about something called “QDR-80.” How is that related to how True Scale architecture works?
Yaworski: QDR-80 is basically doubling the bandwidth going to each node. But maybe more importantly, it’s potentially doubling the message rate out of each node. And here’s the reason why.
Starting with the Sandy Bridge-based Intel processors, the PCIe bus is now part of the processor itself, so it’s embedded in each processor. In a DP, or dual-processor, or dual-socket system, you basically have two processors, each with its own PCIe bus. So we get very good performance when we put in a single card, but the remote socket has to communicate across the interface called QPI to the first socket, which has the adapter sitting on it. So there’s some increased latency, as well as other performance considerations, when doing that. So certain customers, those who require additional bandwidth, but more importantly, customers that require a higher message rate, will end up putting an adapter on each of the sockets, connected to the PCIe bus of each socket. Then, PSM has some code in it that recognizes this configuration, and does what’s called core affinity. A quick definition of core affinity is that it sends the messages from the core on a particular socket to the adapter that’s connected to it, automatically. So it is transparent to MPI or the applications. And what that means is that for inter-node communications, we eliminate that traffic over the QPI interface between the sockets. We reduce latency and we increase the message rate, almost doubling it. As you scale an application, the message rate goes up and up while the message size drops; you now have a direct path from each of the sockets of the DP system out to the network to communicate with all the other nodes. We have gotten very, very good performance out of that. In addition, the side benefit of it is you’ve doubled the bandwidth. For storage traffic and so forth, you pick up an incremental level of benefit from that.
It’s proven to be a very good architecture for certain types of applications, and certain types of customers, especially those at the very high end. That’s 10-15% of the marketplace. Otherwise, our QDR-40 – single adapter – has been very effective for the majority of the marketplace.
insideHPC: OK, so that’s the QDR-80 – the latest thing then. But I want to ask about some terminology. What do you guys mean when you talk about CPU fabric integration?
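The core affinity idea described above can be sketched in a few lines of Python. This is a minimal illustration and not PSM’s actual code: the function names, the adapter labels ("hca0", "hca1"), and the assumption that cores are numbered contiguously per socket are all hypothetical, and real systems may number cores differently.

```python
def socket_of_core(core: int, cores_per_socket: int) -> int:
    """Socket index for a core, assuming contiguous core numbering per socket."""
    return core // cores_per_socket


def pick_adapter(core: int, cores_per_socket: int, adapter_by_socket: dict) -> str:
    """Choose the adapter a core should send through.

    Prefer the adapter attached to the core's own socket. If that socket has
    no adapter (the single-adapter case), fall back to the first adapter
    listed, which implies crossing the inter-socket link.
    """
    socket = socket_of_core(core, cores_per_socket)
    if socket in adapter_by_socket:
        return adapter_by_socket[socket]
    return next(iter(adapter_by_socket.values()))


if __name__ == "__main__":
    dual = {0: "hca0", 1: "hca1"}   # one adapter per socket (QDR-80 style)
    single = {0: "hca0"}            # one adapter on socket 0 only
    print(pick_adapter(3, 8, dual))     # core 3 is on socket 0
    print(pick_adapter(11, 8, dual))    # core 11 is on socket 1
    print(pick_adapter(11, 8, single))  # socket 1 has no local adapter
```

With an adapter on each socket, every core resolves to a local adapter and no inter-node message needs to cross between sockets; with a single adapter, cores on the other socket fall back to the remote one.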
Yaworski: Ah, that gets into our next generation.
insideHPC: Oh, the good stuff.
Yaworski: No, I would term it as the better stuff. We have really good stuff now, and I would term the next gen as the better stuff. We’re not prepared to go into a lot of details about our next generation fabric. But clearly, one of the things that we will be bringing to the marketplace with it is CPU fabric integration. What that means is, today, when you look at all the high performance interconnects, they’re discrete from the CPU. So you plug in a card that uses the PCIe bus to get access to the processor. So what we mean by CPU fabric integration is that over time, we will drive it closer and closer to the processor, and eventually, it will sit inside the CPU die.
You pick up five value vectors at that point in time. One is an increase in performance; the closer you can drive the fabric to the CPU, the more things you can do to increase the overall performance of both the CPU and the fabric together. Number two, you pick up density, because now you’re not taking up any board space or PCIe slots and things like that. Number three, you also pick up the options for improved value, in terms of price/performance. Number four, you reduce power. And number five, by getting rid of things like the PCIe bus, you reduce componentry – which again reduces power – as well as improve reliability. So performance, density, price/performance, reliability, and power are all improved by this integration.
insideHPC: So good stuff coming down the pike here with CPU fabric integration. Okay, that’s tomorrow; I know that’s a ways off, and I don’t want to ask you questions you can’t answer. Let’s talk about today some more, Joe; what about the types of verticals that you’re seeing out there? Certainly the labs – with the Tri-labs as a customer, they’ve seen some benefits from what you guys are doing, but what kind of verticals are out there using True Scale?
Yaworski: Pretty much all of the major verticals that use high performance computing. And the big ones are obviously the government labs. We have a very good presence in the university space where they’re doing research – scientific, life sciences, as an instructional tool as well as a research tool. Very good pickup there with True Scale. Also the manufacturing space. Another good example is the automotive space. Many of the European designers and manufacturers of automobiles utilize True Scale, and utilize it today. So all of those very interesting and really cool cars coming out of Europe, basically, were designed on a True Scale fabric.
Going beyond that, the life sciences areas are very strong for us. When you take that into account, all of the major verticals that are using HPC today utilize True Scale, but especially government, education, life sciences, manufacturing and energy. We have significant installations throughout many of the energy producers, such as in the Middle East. Organizations like Saudi Aramco and so forth have very, very large farms of clusters all based on True Scale. It is for these reasons that the revenue and the installed base have been growing quite extensively.
insideHPC: Well, great. I know you mentioned Saudi Aramco; are there any other name brand organizations that are on the True Scale bandwagon you can tell us about?
Yaworski: Audi would be a good one. In fact, they demonstrated at last year’s International Supercomputing Conference (ISC), and this is kind of one of those neat demonstrations. They had an Audi R-5 car at the Intel booth. And as part of the demonstration, they had a high definition TV, and when you looked at the picture of the car on it, you would think, “Oh it’s just sitting in a showroom somewhere.” Well as part of the demonstration, they visually explode the car into all of its parts, and then bring it back together again through the use of a combination of high performance computing and visualization. That really got some attention as to what they were demonstrating there, and how they use high performance computing, and how they use True Scale to tie it all together. Essentially, that gives the Audi executives the ability to take a look at a car design as if it was a real picture of the car itself, and then make adjustments to it. Change, for instance, the shape of the lights, and see how that looks. Or change a door panel, and see how it looks. But also get the rendering of the reflections and so forth of it. They can clearly improve their designs overall. It was a really great demonstration and a great application for True Scale.
insideHPC: Yeah, showing how all that low latency enables a business to go forward with innovative design. So Joe, to wrap up here I wanted to ask you what’s next; you talked about CPU fabric integration, but of course InfiniBand goes in generations, right? We’re in the FDR universe now, the next one I believe is EDR; can we expect that kind of thing to be coming from Intel in the future?
Yaworski: The answer is, we will have a next generation offering. But this offering is really going to eventually set the foundation for Intel’s stated direction, which is to support Exascale. When you look at the requirements for Exascale, and what needs to be done and so forth, you can see that InfiniBand has been an excellent technology, but there are some things that need to change, and fundamentally change, to support something that will eventually handle Exascale. To put it into perspective, depending on the performance of the processor, the size of the cluster that will be needed will be somewhere between 150,000 and 200,000 servers with tens of millions of cores. And when you look at today’s high performance fabric technology, as I said, it’s great technology, and it’s brought us to this point, but to take us to the next level, we really need to re-look at what the requirements are that will lead us all the way up to being able to support Exascale deployments. One of these absolute requirements is CPU fabric integration, because the performance that’s needed, the density, and the power are all areas that have to be vastly improved to support Exascale deployments. And then of course reliability, because when you’re looking at deploying 150,000 to 200,000 servers, reliability becomes a key metric that has to be significantly improved. And then finally, the overall price/performance needs to significantly change. When you look at it, CPU fabric integration will be key, but then there are a lot of other technologies that need to go along with it to really make that work. So InfiniBand is great technology today, but as we look to go to Exascale, we’re going to need a next generation fabric. That’s essentially what Intel is looking to do.