Interview: Why Software Defined Infrastructure Makes Sense for HPC

In a software-defined datacenter, application workloads such as big data, analytics, simulation and design are serviced automatically by the most appropriate resource, whether running locally or in the cloud. To learn more, we caught up with Jay Muelhoefer, Director of Software-Defined Infrastructure at IBM.

insideHPC: IBM made some announcements last week about software-defined storage, and I was hoping that you and I could talk about this in a broader sense, about how this might affect HPC.

Jay Muelhoefer, IBM

Jay Muelhoefer: Rich, I'd be happy to, and I probably want to provide some context first. I came to IBM via the acquisition of Platform Computing. So what I hope to cover today is what's happened to Platform Computing through that acquisition. There have also been other IBM assets around HPC, namely GPFS. What's been the evolution of those items, how they come together under this concept of software-defined infrastructure, and how we're now taking these capabilities and expanding them into other initiatives that have bled into the HPC space. Really, what IBM is providing is a much broader vision and picture of how HPC can evolve and embrace a lot of these new technologies and concepts that are coming into our space.

insideHPC: That sounds terrific. I've brought your slides up here, Jay. Why don't we start with that? It's Software-Defined Infrastructure from IBM.

Jay Muelhoefer: I'm just going to jump to the second slide to provide a little bit of context here. Every slide deck needs that marketing slide, starting with this concept of innovation and design and why companies are thinking about changing what they do today. It really is that companies almost need to have an innovation-first agenda. It comes back to: what are the products we're delivering? What are the services we're delivering? Why do we exist to begin with? If you're a company or an organization – government, research, education – you are trying to do some sort of design or analytics better, faster, cheaper. How can you improve the fidelity of what you're producing, start to embrace this massive explosion of data happening inside your environment, and get your engineering and analysis out to market, to your end customers, faster? We're also under a lot of cost pressure. How can we do this in today's world, where budgets are flat or declining? People are being asked to do more. How can you administer even larger infrastructures? It really requires a new approach to the problem.

Going to slide three and taking it down a level: what do we see as the key trends? This is a forces diagram. At the top there's this unquenchable demand for additional compute. There are all sorts of new types of applications coming into the marketplace, and more end users who want to tap into high-performance computing infrastructure. I put an example on here of Hadoop. It's a new type of application that's bled into this space. I'll go into even more types of applications in more detail, but again, everybody wants to tap into these kinds of resources. What's really interesting is that more and more applications are leveraging a cluster, a scale-out type of infrastructure. People used to say clusters and grids – that sounds a little bit old. It's been around for a while, but it's really in vogue now and you're starting to see people use it more broadly.

On the supply side at the bottom, people are saying, "Okay, how do I provide the resources, whether compute, data, or networking? How can I take advantage of the latest technologies around GPUs? How can I take advantage of the cloud?" As an example I put in SoftLayer, a cloud infrastructure-as-a-service provider that IBM acquired less than two years ago and that has since built out many more data centers. They just announced four more; it's really a global network that people can tap into to support these types of applications.

On the left side, efficiency. People are trying to get more out of what they already own. Next, software-defined storage – we made some big announcements there, and I'll go into that in more detail. And on the right, business agility, so changing business requirements. How do I respond more quickly to my end users? People are using things like OpenStack. Just last year at Supercomputing we had more and more people talking about OpenStack infrastructures: how do you build your own internal private cloud and very rapidly deploy resources to those end users in a self-service manner, and then be able to do things like chargeback and everything else you need to manage that overall environment?

So, moving on to slide four. Obviously this impacts everybody on the IT side. There are more projects, an increased volume of workload, competing priorities, and more demands on the infrastructure. How are you adding more servers and more heterogeneous environments, addressing some of the newer technologies, with less time? How do you do more, more, more with fewer resources? That's really where, on slide five, we introduce the concept of software-defined infrastructure.

A common question I get is, "I've heard about software-defined data centers and software-defined environments. What does IBM mean by software-defined infrastructure, and why did we adopt that term?" It's really because software-defined data center implies thinking about one data center. We think about multiples of them; they can be located around the world, and they can be heterogeneous. How do we build out that broader infrastructure to manage all those different types of resources? Fundamentally it's all about getting people out of cluster sprawl: taking these disparate infrastructure silos and bringing them together, consolidating them virtually – sometimes physically – into an elastic shared resource pool that anybody can tap into, available either on premises or in the cloud. We have a lot of people asking us about the cloud. How do they safely and securely embrace it?

Ultimately you can see two big value propositions that people are looking for. One is that they want to accelerate the time to results for their high-performance applications. In the HPC space that's simulation, design and research; it can be analytics, it can be big data – and really drive that throughput. We did a study with the Platform Computing LSF solution, where we ran a benchmark and showed better throughput than a portfolio of other providers by up to 150x, which is pretty dramatic. If people haven't checked out that benchmark, I'd encourage them to do so.

On the right side, dramatically reduce IT costs by really increasing utilization. Get people out of silos – instead of thinking about your department or maybe one or two areas, how can you bring this together more broadly and drive up to a 4x increase in utilization? Fundamentally it all comes down to being able to pool resources and aggregate them into one shared global pool of compute and data. That's an absolutely critical point: a lot of our clients are thinking only about compute, but they also want data-aware scheduling. They really want to manage this in a much more dynamic manner and connect those resources together so that the different groups and applications can plug in at the top, with workload engines that maximize performance at scale, and finally to optimize all of that. When you do the initial placement of a workload onto a resource, the available resources change over time. How do you dynamically move the job to optimize the resources and throughput of the overall system? That's absolutely critical.
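
As a concrete illustration of that data-aware placement idea, here is a minimal Python sketch (purely hypothetical, not IBM's or Platform's actual scheduler; the Node and Job structures and the ranking rule are assumptions) of a policy that prefers nodes already holding a job's dataset and otherwise picks the least-loaded node:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_cores: int
    datasets: set = field(default_factory=set)   # datasets already resident on this node

@dataclass
class Job:
    name: str
    cores: int
    dataset: str

def place(job, nodes):
    """Data-aware placement: prefer a node that already holds the job's
    dataset (avoids staging a copy), otherwise take the least-loaded node."""
    candidates = [n for n in nodes if n.free_cores >= job.cores]
    if not candidates:
        return None                      # job stays queued until resources free up
    # Rank nodes: those holding the dataset first, then by most free cores.
    candidates.sort(key=lambda n: (job.dataset not in n.datasets, -n.free_cores))
    chosen = candidates[0]
    chosen.free_cores -= job.cores       # reserve the cores on the chosen node
    return chosen

nodes = [Node("n1", free_cores=8, datasets={"genome_a"}),
         Node("n2", free_cores=16)]
print(place(Job("align", cores=4, dataset="genome_a"), nodes).name)   # -> n1
```

A real scheduler would re-evaluate placements as utilization drifts, which is the "dynamically move the job" point above; this sketch only shows the initial data-aware choice.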

That brings me to slide six. What I'm going to do is talk a little bit from the applications at the top down, and then the next couple of slides build up from the infrastructure.

So, what we see is that there are a lot of different groupings of applications. On the left here you see high-performance analytics, which is very low-latency, parallel, sub-millisecond-response work – things like fraud and risk analytics. In the middle, moving to the right, we see people interested in Hadoop and big data applications; IBM has its own big data portfolio, and there's Cloudera. Then high-performance computing, which we're all very familiar with – there are hundreds of logos that could be put there. On the right we're also seeing interest in what we're calling long-running services, these application frameworks. All these different groups of applications are typically built in silos, and they don't need to be. What you want to be able to do is share the resources across them. If you look from left to right, it goes from very low-latency applications to very long-running ones on the far right. But we've found that the type of workload engine you need for each of these classes of applications is different, and we have one optimized for each.

A lot of people are probably familiar with Platform LSF, but if you have a super-low-latency application – financial services workloads or other near-real-time analytics – we have Platform Symphony.

We have a MapReduce solution built specifically for big data and Hadoop, and we have the Platform Application Service Controller for these long-running applications like Spark, MongoDB and Cassandra. What's really important is that we see more people with more of these different types of applications that they want to plug into that common resource management layer. You may start with one and expand out over time. It's such a powerful concept, and it's very different from having only one workload engine, optimized for one type of workload, that you're trying to repurpose. These are purpose-built for those specific applications.
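
To illustrate the "many purpose-built engines, one shared resource layer" idea, here is a minimal, hypothetical Python sketch (not Platform's actual API; the workload-class names and the mapping are assumptions based on the engines named above) of routing each class of workload to the engine built for it:

```python
# Hypothetical mapping of workload classes to the purpose-built engines
# mentioned above, all drawing on one shared resource management layer.
ENGINE_FOR_CLASS = {
    "batch_hpc": "Platform LSF",                        # throughput-oriented batch jobs
    "low_latency_analytics": "Platform Symphony",       # near-real-time, sub-millisecond work
    "mapreduce": "MapReduce engine",                     # big data / Hadoop workloads
    "long_running_service": "Application Service Controller",  # Spark, MongoDB, Cassandra
}

def route(workload_class: str) -> str:
    """Return the engine purpose-built for this class of workload."""
    engine = ENGINE_FOR_CLASS.get(workload_class)
    if engine is None:
        raise ValueError(f"no engine registered for workload class {workload_class!r}")
    return engine

print(route("low_latency_analytics"))   # -> Platform Symphony
```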

If we move on to slide seven, what you'll see here is the software-defined infrastructure with the next couple of layers built out. I'm going to talk about this from the bottom up and come back to that resource management layer. We see that people have a very heterogeneous world of resources: flash, tape, disk, and different compute – x86, Linux on z. People are also adopting new technologies like Docker and heterogeneous virtualization, all the way to cloud resources like SoftLayer. And people want to tap into all these resources as if they were virtually one. That's really what the IBM software-defined infrastructure capability does, leveraging a lot of the Platform Computing technologies. So you can see, in the infrastructure management layer moving up, the bare-metal provisioning and the virtual machine provisioning. We've also built a very deep integration with OpenStack – we've actually replaced the scheduling engine inside of it with Platform technologies – and we can build those clusters out with the Platform Cluster Manager solution.

On top of that, we have now launched IBM Spectrum Storage, which is IBM's new solution for the complete software-defined storage space, and I'm going to go into a little more detail on that in the next slide. And on top of that, you have the resource management. What you see here is a complete stack from the applications to the infrastructure – a full software-defined infrastructure middleware layer that can manage and be the software switch connecting applications to the right resources dynamically, leveraging all the great technology we've seen historically from Platform Computing in the HPC space, but now broadened out to a much larger class of applications. And on the top right you see there are other applications – ones that aren't these high-performance applications – that can also plug in and leverage Spectrum Storage. There are block-type applications, or traditional applications, that may also want to leverage OpenStack as well as SoftLayer. So it's really important that you can combine your HPC applications, your traditional applications, and some of these new types of applications – which could be mobile or social applications – that still need the capabilities of things like Spectrum Storage and Platform Cluster Manager but might not be leveraging the full resource management layer. Over time we will expand the number of workload engines to other classes of applications where it makes sense. So it's really important that when people invest in this, they see they're investing in something that's extensible and will continue to grow over time.

Now moving on to slide eight, I want to go into a little more detail about the new Spectrum Storage announcement that happened recently. We announced an investment of over a billion dollars over the next five years in software-defined storage. What IBM is doing is unleashing all the software intelligence in its storage solutions and making it available as software – either as pure software, as a cloud-based service, or in many cases as part of an appliance. And what we have here is a complete portfolio. You can see that we've grouped it into three big areas: agility, control, and efficiency. It's really important to connect these back to the value propositions, depending on what your objectives are. Again, you can start with any one of these and expand over time, or use combinations of them where it makes sense. For example, in the case of agility, one of the big announcements we made was Spectrum Accelerate. Historically that has been based on the technology in the XIV solution, but now we've made it available as software, and it's also available as a service on SoftLayer. What Spectrum Accelerate does is allow you to create your own virtual private cloud in minutes versus weeks for your traditional block-type applications.

Also on the slide here, under agility and elasticity, is Spectrum Scale. Spectrum Scale is based upon the technology that was in GPFS and then, for a short period of time, was code-named Elastic Storage. We did that because we knew we were going to be renaming the rest of the portfolio, but what it signals is that we're making a massive investment in that technology to grow it, to expand the number of use cases, to make it a lot more usable out of the gate, and to make its capabilities available on things like SoftLayer. Spectrum Scale really is that file and object solution. It scales theoretically to a yottabyte of data. It supports all these big data applications, your cloud applications, and the historical HPC requirements for a file solution.

Now, the control area is all about how you manage all your data. You want to be able to manage it whether it's located on premises, in the cloud, or anywhere around the globe. It essentially unifies and simplifies the management of all your storage and your data, as well as adding things like governance, inspection and protection.

And then finally, on the right, efficiency for your traditional applications: you want to be able to virtualize that block storage. We have Spectrum Virtualize, which does that across a heterogeneous portfolio of storage, and Spectrum Archive, which manages the placement of data into the lowest-cost tier. What's really important is that it brings together the capability to manage all your data and put it in the right place, maximizing the trade-off between performance – maybe putting it on flash – and cost efficiency, if you want to move it out to tape or into long-term cloud storage, so you get the most cost efficiency out of the solution. What's also really important is that it's proven technology – it's based on technologies that have been battle-tested at companies – with open standards, and it integrates with things like OpenStack, Hadoop and MongoDB, so you can start with any of these and grow over time.
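
To illustrate that performance-versus-cost placement trade-off, here is a minimal Python sketch (purely hypothetical, not Spectrum Archive's policy syntax; the tiers and thresholds are made up for illustration) of a policy that keeps recently accessed data on flash and ages cold data out to tape or long-term cloud storage:

```python
from datetime import datetime, timedelta

# Hypothetical tiers, ordered from fastest/most expensive to cheapest.
TIERS = [
    ("flash", timedelta(days=7)),        # hot data: accessed within the last week
    ("disk", timedelta(days=90)),        # warm data
    ("tape_or_cloud", None),             # cold data: everything older
]

def choose_tier(last_access, now=None):
    """Pick a storage tier by access recency: recent data stays on fast,
    expensive media; cold data migrates to tape or long-term cloud storage."""
    now = now or datetime.utcnow()
    age = now - last_access
    for tier, max_age in TIERS:
        if max_age is None or age <= max_age:
            return tier

print(choose_tier(datetime.utcnow() - timedelta(days=2)))     # -> flash
print(choose_tier(datetime.utcnow() - timedelta(days=400)))   # -> tape_or_cloud
```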

So, moving on to slide nine. The benefits here, I think, are really important: we see benefits at all levels of the organization. A lot of the time we're talking to IT managers, who are really trying to reduce cost while maintaining service levels, being responsive to their end users, bringing together all these silos of compute and data, and adopting new technologies. The end users really don't care where the application is running. They just want to get their results back faster, they want to be able to use more data, and they want it to be easy to use – built into their existing solutions so they don't have to learn something new. And at the CIO or CXO level, they want to lower their risk, become more agile as an overall organization, and constrain the growth of costs over time. How can they make more strategic investments and get more agile in what they're doing?

Finally, let me talk about one of our long-standing customers that is seeing its use cases expand in many of the directions I talked about today. That is the Sanger Institute and what they're doing around the human genome. They're focused on understanding the role of genetics in health and disease, and on how that can be translated into diagnostics, treatments and therapies that reduce global health burdens. So a fantastic mission, and they have been a long-term user of the Platform LSF solutions, as well as Platform Analytics, to optimize their entire policy-based workload scheduling environment. But they're also expanding into other areas. They're trying to better leverage their data, so they're looking at how to do that with things like software-defined storage, at cloud technologies to get more agility in their organization, and at new technologies like Docker. Based on their use of LSF, the things we have announced – the built-in Hadoop connector, the ability to support Docker environments, the ability to send workloads to the SoftLayer cloud – really give them the flexibility and agility they're going to need to solve not only today's problems, but the problems of the future as well.

Hopefully that gave everyone a quick introduction to the concept of software-defined infrastructure – how it's expanding, but also how it's built upon the foundation of technologies we have long been leveraging in the HPC space – and showed how HPC has really broken out into a lot of different areas of organizations and commercial enterprises today.

insideHPC: Thanks for that, Jay. It was interesting watching this talk evolve. There was a lot of the Platform Computing DNA in the first couple of slides, and then you got to the software-defined part. But what do you think is causing all the confusion out there about "software-defined"? Where does that come from?

Jay Muelhoefer: That's a great question. I think there's a lot of confusion because there's so much interest – so much potential in the benefits one can get from moving towards a software-defined environment by unlocking innovation and agility. By defining things in software you can have faster releases, and you can connect to these heterogeneous environments. What that really gives people is the ability to automate their environment and adopt the latest technologies in a much more agile and fluid way. But the confusion stems from the fact that there are other vendors out there that don't truly have software-defined offerings and are trying to wrap that moniker around any old solution and claim it to be software-defined. You know, it's interesting: IBM was just ranked by IDC as the number one provider of software-defined storage platforms, and they took a look at what we offer – the flexibility and the ability to deliver it in a software-only form factor so that people can deploy it in many different places, both on premises and in the cloud. Again, these other vendors only have one option, so they try to apply it to everything.

insideHPC: Yeah, everything has software in it, right? Storage devices, you name it – you need drivers and things like that, and that's part of what makes it work. But Jay, when I was listening to your talk, I remembered the talks from Platform and how you were rolling out your vision. I wanted to ask: now that you're part of IBM, has that broader range of resources enabled the vision from Platform to really come together as this software-defined infrastructure we heard about today?

Jay Muelhoefer: It really has. Platform has always had great technology and the ability to handle the complexity you see in some of the largest environments out there, like massive EDA environments. What we've seen is that a lot of enterprises have similar challenges, and we've been able to take that technology capability and apply it to many other workloads. Obviously, being part of IBM gives us great access to distribution, the ability to work with the sales and client teams and have those CIO-level conversations, and a real understanding of the problems across a lot of different areas. You can already see it in this presentation: connecting the Platform Computing technology to SoftLayer very deeply and natively is something Platform could not have done on its own, and it's really not something other HPC vendors can offer – both that deep on-premises capability and that cloud capability. The same goes for software-defined storage: being the provider of that file and object data management solution as well as the compute side means we can do things that optimize across those boundaries.

There are other players out there that might just be in the workload management and job scheduling area; they don't have a file management solution like that, so they can't do that kind of R&D-level integration or build those roadmaps. The data-aware scheduling I mentioned is a simple example of that, as is doing it across a very distributed on-premises as well as cloud environment. Again, these are phenomenal resources that IBM can bring and invest in to expand the vision that was at Platform Computing into the larger concept of software-defined infrastructure.

insideHPC: Jay, I wanted to clarify this idea about heterogeneous data centers. There's no such place where everything is from IBM. Does the vision you've described work for all kinds of different vendor solutions in terms of hardware?

Jay Muelhoefer: It definitely does. That's one thing that's really important. Obviously we have a long history of working with multiple x86 providers, and nothing has changed in what we're doing there. IBM does have its POWER architecture and OpenPOWER, and there are a lot of great use cases that support that, especially in the big data arena; we'll continue to build out those integrations. But across all of that you're going to get Platform Computing and Spectrum Scale, which is based upon the technology that was in GPFS. These are great technologies that maintain their openness and the ability to work with open standards, and that's part and parcel of what we're doing going forward.

insideHPC: Well, great. I'm glad you brought up GPFS, because I think this is the third time you've renamed it, right? You had GPFS, then Elastic Storage [chuckles]. You know, it makes sense the way you've plugged it in here. Is this the end game now? Is it about GPFS going forward, or is this a much broader kind of stroke?

Jay Muelhoefer: Correct, yes – we will not be renaming it [chuckles]. We have the Spectrum Storage family of solutions, which obviously shows the breadth we can provide there, and it's based upon proven technologies – I think that's one of the other great things about it. Obviously, GPFS is a very powerful solution, but while it's been powerful, it's also had some limitations from almost being too powerful. Now, with Spectrum Storage and GPFS becoming Spectrum Scale, it really is this concept built on proven customer deployments and flexibility, and it's part of something IBM is making a very large investment in. A billion-dollar investment in the Spectrum Storage family is fairly significant, and Spectrum Scale is definitely going to benefit from that. So the answer to your question is no, we will not be renaming it again.

insideHPC: All right. Well, I guess that brings us to the final question, Jay. Can customers get this stuff today?

Jay Muelhoefer: They definitely can. Everything I've spoken about today is available. What I recommend is that you work with IBM or with one of the IBM partners, and/or you can reach out to us at IBM.com/platformcomputing. It's a great way to get started – I know a lot of people have probably used that as their starting point – or you can do it with Spectrum Storage at IBM.com/spectrumstorage. Those are great starting points to get into the whole software-defined infrastructure conversation.

Watch the Slidecast | View the Slides