In this slidecast, Fritz Ferstl from Univa presents: Rev Up Your HPC Engine. The presentation explores the challenges for Workload Management systems in today’s datacenters with ever-increasing core counts.
Rich Brueckner: Welcome to the Rich Report, a podcast with news and information on high performance computing. Today my guest is from Univa: we have the CTO of the company, Fritz Ferstl. Fritz, how are you doing today?
Fritz Ferstl: I am great, Rich. How are you?
Rich Brueckner: Oh, I’m great. I’m getting over a little cold here, but I’m going to be ready for ISC when it comes. Anyway, Fritz, I thought we should catch up. There was some news yesterday about Univa working with a Formula One team in India called Sahara Force India. So I guess congratulations are in order.
Fritz Ferstl: Thank you.
Rich Brueckner: Yeah, for these teams, HPC has become a competitive weapon in Formula One. So this makes perfect sense to me.
Fritz Ferstl: I am an avid Formula One fan. I know everything about that, specifically also the importance of HPC. These guys can choose between doing HPC experiments or wind tunnel stuff. And sometimes HPC experiments are the better investment. And that stuff is run on Grid Engine.
Rich Brueckner: Well, yeah. And it makes sense. If you’re going to be productive, you need a good system management software package. So, we’ll have to wait for more details on that but in the meantime we have a slide deck here and I thought, why don’t we go through that and we’ll follow it with a Q and A.
Fritz Ferstl: Yup, okay. The title of the talk is Rev Up Your HPC Engine. That actually was the title of the talk even before yesterday’s news. Now, as for slide number two, this is the only slide where I will be talking about Grid Engine or Univa in this particular talk. I’m going to talk more about challenges in HPC systems, how they project onto workload management in general, what type of solutions you can apply, and stuff like that. But my little boilerplate is about Univa. We call ourselves data center automation experts. We try to allow our customers to do more with less when it comes to big computers or big data. We sometimes say we help those organizations play a better game of Tetris, putting workloads in the best fit.
We are based in Chicago and we have lots of offices around the globe. We have been in our current business for about three years and have acquired more than five hundred customers in that time, and most of them are actually Fortune 500 customers. You see a set of logos on the right-hand side, and they really go across all verticals, whether that’s manufacturing, oil and gas, chip design, biotech and life sciences, or the transportation business, for instance. We have a number of customers in that market as well.
In terms of products and technologies, we have four pillar products. The primary product certainly is Univa Grid Engine. Then there is UniSight, which is an accounting, reporting, and analytics package that sits on top of Univa Grid Engine. Then there is our License Orchestrator product, which helps share licenses across clusters, for instance, or across projects and similar things. And then there is UniCloud, which allows you to deploy a cluster with everything readily configured, with Grid Engine, for instance, on a set of bare metal machines, on virtual machines, on cloud nodes, or on mixes if you want to run a hybrid model. And then you can flex the size of the cluster, for instance, if that’s something that you need to do. So that’s the overview of Univa.
Now, really over to our talk on slide number three. What type of challenges for workload and resource management systems do I see in HPC these days? So, on slide number four, first and foremost, scalability is certainly always an issue, and it has been for the 20 years I have been in this business. What you see currently, if you go into a state-of-the-art HPC data center and you ask about growth numbers and stuff like this, is that the node counts often stay flat, or sometimes they even go down. The number of sockets usually pretty much stays flat. But the core counts explode. What comes along with it, in particular, is that the number of jobs that you can actually run in such a system also explodes. And that also comes with shorter run-times for those jobs, for instance, so there’s just more traffic on such a system, and that really puts a load and a strain on the workload management system.
Large commercial sites currently approach or go beyond 100,000 cores. That’s where our biggest customers basically are. And when you look at throughput clusters, they process more than 150 million jobs a month. That’s really a staggering amount, and the workload management system has to keep up with that.
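To put the 150 million jobs a month into perspective, the figure can be converted into an average dispatch rate. This is only a back-of-the-envelope sketch: the 30-day month and the assumption of a perfectly even load are simplifications that are not from the talk, and real clusters will see much higher peaks.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_jobs_per_second(jobs_per_month: int, days_in_month: int = 30) -> float:
    """Average rate needed to finish jobs_per_month within one month.

    Assumes (for illustration only) a 30-day month and an evenly spread load.
    """
    return jobs_per_month / (days_in_month * SECONDS_PER_DAY)


rate = average_jobs_per_second(150_000_000)
print(f"{rate:.1f} jobs per second, around the clock")
```

Under these assumptions the scheduler has to start, track, and finish roughly 58 jobs every second, day and night, before any burst load is taken into account.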
Another dimension is heterogeneity. First of all, you have a lot of heterogeneity in the hardware. You have all those multi-socket systems, and each of those sockets has multiple cores. And then you do partial cluster upgrades, so it’s not uncommon that in a single cluster you have several generations of chipsets and server architectures. And you have to manage all that. You have to make sure that jobs run on all of them, or that the right jobs run where they are designated to run. With those different server architectures, of course, you also have evolution in the memory architecture, in the networking, and in the storage systems, and you have to account for that as well. And then finally, nowadays you also add accelerators into it, like Nvidia GPUs or Phis from Intel.
And then of course you want to run something on that hardware, and what you want to run is also very diverse, so you have different types of job profiles. For instance, throughput jobs, which are usually small units, and lots and lots and lots of them. Sometimes you can combine them into what we call an array job, where this is basically the same type of job crunching on a huge set of data, and it gets instantiated over and over and over again until all the data is digested. Or you have large parallel jobs, which run for a longer time and consume a larger portion of the cluster. You may have interactive workloads or sessions, you know, where you have multiple things which are grouped together. Then you have reservations, where you need to reserve parts of the cluster for some future project or something like that. You may have transactional workloads, like in the big data space. And then you have hybrids of all of that, where you combine those different things, and you group them into dependencies and workflows and all of that type of stuff.
Another dimension is, I mean, you have all these different workloads, but of course you want to orchestrate them. You want to decide how the cluster is going to be used by them. For that you have policies, and you actually have a lot of different policies. First and foremost, of course, you want to automate things, so those policies are usually automated, but immediately you get the challenge of transparency: can I really be sure that what I have set forth in the policy is being implemented? Am I able to prove it to my users, for instance? Can I inspect it myself? These types of things. At least in corner cases, you want to be able to do manual overrides, so that somehow has to be woven into the picture as well. You have policies that automatically determine preferential access, like fair sharing and similar things; policies that allow you to define priorities; policies that control reservations; policies that give jobs more priority, more urgency, if they require very expensive resources, let’s say a super expensive license which could otherwise sit idle. You have quotas which you want to manage. You have deadlines you want to manage. And really the trickiest part of all of that is that those policies do not necessarily fall in step, so they often conflict. If I have, let’s say, throughput jobs and large parallel jobs, they have very different policies and requirements, and those conflicts need to be resolved somehow.
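One common way to reconcile policies that pull in different directions is to reduce each of them to a normalized score and combine the scores with tunable weights. The sketch below illustrates that general idea only. The `PendingJob` fields, the weight names, and the example numbers are hypothetical and are not taken from the talk or from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingJob:
    name: str
    share_deficit: float     # 0..1: how far the owner is below their fair share (hypothetical)
    resource_urgency: float  # 0..1: how expensive the requested resources are (hypothetical)


def priority(job: PendingJob, w_share: float = 0.5, w_urgency: float = 0.5) -> float:
    """Weighted sum of two policy scores; the weights express which policy wins."""
    return w_share * job.share_deficit + w_urgency * job.resource_urgency


def dispatch_order(jobs, **weights):
    """Return the jobs sorted from highest to lowest combined priority."""
    return sorted(jobs, key=lambda j: priority(j, **weights), reverse=True)


jobs = [
    PendingJob("underserved-user-job", share_deficit=0.9, resource_urgency=0.1),
    PendingJob("expensive-license-job", share_deficit=0.2, resource_urgency=0.6),
]

# Equal weights favor fair share; shifting weight to urgency flips the order.
print([j.name for j in dispatch_order(jobs)])
print([j.name for j in dispatch_order(jobs, w_share=0.2, w_urgency=0.8)])
```

The design choice here is that the conflict is not removed but made explicit: the weights are a single, inspectable place where the site states which goal matters more, which also helps with the transparency question raised above.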
Another angle to the picture is the variety of the different use cases which are implemented. If you just look at single jobs, the whole situation is easy, but if you look at the overarching operational model of an organization, then things get a little bit more complicated. So you could have, for instance, a classical HPC site which largely does simulation. I would say somebody like an oil and gas company would do that, or an industrial manufacturing company. So you have large parallel jobs, or many mid-sized parallel jobs, but not too many. In a verification or test scenario, like you have in chip design, you would have huge throughput clusters where you run very small jobs, but lots and lots of them. Sometimes, as I mentioned before, you can group them up and move them into an array job if you do something like a parameter study, so that makes things more efficient.
In some cases, for instance in the financial services arena, we run more into ultra-short jobs, which may take much less than a few seconds. Big data and data mining have yet a different type of job profile. And then there’s also a difference whether, for instance, you really want to use your nodes exclusively, so each job really owns a node, or whether you have shared usage. Because if you have shared usage, then you face a number of problems, like one job stealing the memory away from another job, or similar things like that. And you have to have the means to fence jobs and manage that type of conflict scenario.
So finally, to conclude my list of challenges, and that’s certainly not a final list, on slide number eight I have geographical distribution and clouds, which we have nowadays specifically in larger organizations or large institutions. If you have multiple clusters spread around the globe, then you definitely still want to be able to share at least some of those resources. That’s not only talking about servers, but also about licenses, for instance, or data, or other types of resources. If you share resources, then you have access latencies, obviously, and for data access that’s of course very pronounced. You have security issues which you have to handle. You have file system dependencies, so before a job starts, the data needs to be there, and you have to do things like pre- or post-staging.
In general, the data locality needs to be managed, so you could, for instance, bring the job to the data or bring the data to the job, and all those kinds of things play a role. Let me close this list of challenges here, and I’ll segue over to a few solutions, approaches, or best practices that either we are working on as a workload management provider or that customers can consider as part of their installation.
So first of all, it’s definitely yet another point in time where our type of software needs to evolve. I mean, software like Grid Engine, or Univa Grid Engine in particular, is over 20 years old, but that doesn’t mean that this is still the same code that you’re using. Right? The product has evolved in steps over time, and I think there is another step waiting for us right now, because the core count growth that we’re seeing and the growth in the job throughput that we need to manage definitely mandate a few core enhancements in some parts of the architecture to make that feasible. The same goes for scheduling algorithms specifically. The bigger the cluster grows, the more types of job mixes you see, and to handle those job mixes and the conflicts that arise from them efficiently, one of the things that you need to do is really improve some of the scheduling algorithms. The scheduling of ultra-short jobs is a particular challenge that wasn’t really addressed for a long time but is something that needs to be addressed.
The whole aspect of monitoring, error tracking, reporting, accounting, and analytics is a microcosm of its own. In a big cluster like the ones I’ve been mentioning before, with over 100,000 cores, at every point in time there are thousands and thousands of events happening. Keeping track of those, filtering out those which matter, and organizing the information such that you know what’s going on, what the status of the cluster is, whether there is an issue, where it is, and how to resolve it: that is really an analytics challenge in its own right, and that’s something we’re working on.
Now, a few things I would like to say about where customers and users can actually help themselves by being smarter about setting up the systems, so they run into fewer issues and fewer problematic scenarios. The first is here on slide number 11. I’ve entitled this “Street Smart.” What I mean is, don’t stand in your own way. Simplify wherever that’s possible. This is sometimes hard, organizationally speaking. If you are in a large organization, you have a number of projects or a number of departments which have conflicting interests. They all fight for their share of the valuable resource that is the cluster. But making things simpler is sometimes, on the face of it, the best solution.
If you try to accommodate everyone’s wishes, then you basically end up in a giant conflict scenario, and sometimes there is really no way to resolve that cleanly. So my general advice in such a situation is: really look at what the most important goals are, focus on them, and then do your best to satisfy the next level of goals, but always keep your eye on the most important goals. If that is, for instance, to have high throughput and high utilization, then sometimes not accommodating some of the other policies that one of your end users wants to have implemented is the best solution. Because if you then look at the data, at how much throughput you got and how much utilization you got, you will see you’re much better off than trying to be, you know, everybody’s darling.
On slide number 12, I’m basically saying, “Think different,” and let me give you a few examples of what I mean. One of our customers is running millions of jobs a day, and they really were running lots of small jobs, so the consequence was they definitely had throughput performance issues. When analyzing the situation, it came down to the data persistency of our system. Obviously, a system like Univa Grid Engine needs to persist status data for all the jobs that get submitted, all the status changes that take place, and the whole cluster situation. All these things need to be persisted, and with the classical ways of persisting that data, at some point you run into a bottleneck as you scale out. It’s just inevitable.
What these guys did was say, “Okay, let’s try something completely different.” So they went for SSD RAIDs, or actually it was a Fusion-io system which they were using for persistency. A Fusion-io system you cannot easily set up for fail-over, unless you have two, which is expensive. So they didn’t. They simply were using backups, nightly backups or whatever it was. And you might say, what happens if your cluster falls over just before a backup? You lose a whole day. But they were able to improve their performance by a factor of two or three, and even if they had a failure situation, let’s say, two times a year, which is a lot, and they actually don’t have that, but even if they had that situation, they would lose two days. On the other 363 days of the year, they have two to three times better performance, so they can easily afford to lose two or three days in compensation.
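The trade-off in this example can be checked with a little arithmetic. The sketch below is illustrative only: it takes the lower end of the quoted speedup (a factor of two), assumes each outage costs exactly one full day of work, and measures output in "days of work at the original speed". None of these modelling choices come from the customer.

```python
def productive_output(speedup: float, outages_per_year: int,
                      days_lost_per_outage: float = 1.0,
                      days_per_year: int = 365) -> float:
    """Yearly output, in units of one day of work at baseline speed."""
    productive_days = days_per_year - outages_per_year * days_lost_per_outage
    return speedup * productive_days


def break_even_days_lost(speedup: float, days_per_year: int = 365) -> float:
    """Days per year that could be lost before the faster setup stops paying off."""
    return days_per_year * (1 - 1 / speedup)


baseline = productive_output(speedup=1.0, outages_per_year=0)      # classical persistency
fast_storage = productive_output(speedup=2.0, outages_per_year=2)  # fast storage, no fail-over

print(baseline, fast_storage)     # 365.0 726.0
print(break_even_days_lost(2.0))  # 182.5
```

Under these assumptions the faster setup still delivers almost twice the yearly output, and it would take roughly half a year of lost days before the classical setup came out ahead.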
I’ve mentioned array jobs. We’re still running into customers who have ideal situations for array jobs and are still not using them, and then they suffer from performance problems because they have lots of individual jobs. Array jobs are a good solution for that. Also, if you have a chance, using more smaller jobs versus a few bigger jobs is the better choice. It’s always challenging to find room for big jobs. It’s kind of like trying to reserve a whole hotel in mid-vacation season for a conference: it’s impossible. So on a busy cluster it’s very hard to make room for those large jobs. Splitting things up into smaller chunks is definitely much easier to manage, if you have any chance to do that at all. And then also, customers usually shy away from preempting jobs, meaning actually terminating them, if a higher-priority workload comes in, or if, for instance, one of those smaller jobs really needs to be squeezed in. Often customers shy away from doing that, but on the face of it, it sometimes is the better choice. So don’t rule it out. That’s really what I’m saying here.
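The core idea of an array job is that one submission stands for many numbered tasks, and each task uses its number to pick its slice of the data. In Grid Engine, array jobs are submitted with `qsub -t` and each task can read its number from the `SGE_TASK_ID` environment variable. The sketch below shows only the index-to-data mapping a task script might use; the file names, the chunk size, and the helper functions are hypothetical.

```python
import math
import os


def task_count(n_inputs: int, chunk_size: int) -> int:
    """Number of array tasks needed to cover all inputs."""
    return math.ceil(n_inputs / chunk_size)


def inputs_for_task(all_inputs: list, task_id: int, chunk_size: int) -> list:
    """Inputs handled by one task; task ids are 1-based."""
    start = (task_id - 1) * chunk_size
    return all_inputs[start:start + chunk_size]


def current_task_id(default: int = 1) -> int:
    """Read the task number from SGE_TASK_ID, falling back when it is not numeric."""
    raw = os.environ.get("SGE_TASK_ID", "")
    return int(raw) if raw.isdigit() else default


# Hypothetical data set: 1,000 input files processed 50 per task.
files = [f"sample_{i:04d}.dat" for i in range(1, 1001)]
n_tasks = task_count(len(files), chunk_size=50)

first = inputs_for_task(files, 1, 50)
last = inputs_for_task(files, n_tasks, 50)
print(n_tasks, first[0], last[-1])
```

With this mapping, 1,000 separate submissions collapse into a single array job with 20 tasks, so the workload manager has one job object to track instead of a thousand.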
Another one of my pet peeves is to actually embrace and accept difference. I’ve talked about the scenario where you have, for instance, some big jobs and lots of small jobs, and those things usually have conflicting policies. And it’s sometimes very, very hard to get them aligned. Another way of handling the situation is to actually embrace it, and make room in the cluster for the big jobs and the small jobs. Meaning, separate them out. You can do this simply by temporarily designating a part of the cluster, let’s say, for a big job, and the rest for the small jobs. Or you do it a little bit more cleverly, meaning you automate the whole thing, and that basically comes down to cloud sharing. One of the things you could do, for instance, is break a cluster into two and let two Univa Grid Engine systems run on them: one for the big jobs, one for the smaller jobs. And then you have UniCloud, which can actually manage those two clusters and can look at them and say, “Okay, the cluster with the small jobs is currently not heavily loaded, but the cluster with the big jobs needs some more nodes, so I move a node over from the small-job cluster into the big-job cluster.”
That’s actually a very efficient way of handling things. It has two big advantages. First of all, it simplifies the configuration in either one of those clusters. And it provides total autonomy there. So in the big-job cluster, you can do whatever you want to address the needs of the big jobs, and in the small-job cluster you can do likewise for the small jobs. What you should avoid, though, is what we call meta-scheduling, where you also have two clusters, but then a meta-scheduler is sitting on top. Meta-scheduler meaning that the jobs really go through that meta-scheduler, and the meta-scheduler then determines whether a job should go to the left cluster or to the right cluster. The issue with that is, first of all, it adds a level of indirection to everything: submission, job control, monitoring, error tracking, accounting, what have you. The second problem is, if a job in principle can run on cluster A and cluster B, then cluster A and cluster B both need to be configured to allow it to run. So you really win nothing. You have no autonomy. Those clusters really need to look the same. So what’s the advantage? That’s why I am really in favor of doing something like cloud sharing.
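The cloud-sharing idea described above, where two autonomous clusters trade nodes according to load, might be sketched as a simple rebalancing rule. This is a toy model under stated assumptions, not how UniCloud works: the `Cluster` fields, the "move one idle node at a time" rule, and the `min_nodes` floor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    name: str
    nodes: int         # nodes currently assigned to this cluster
    busy_nodes: int    # nodes currently running jobs
    pending_jobs: int  # jobs waiting for resources

    @property
    def idle_nodes(self) -> int:
        return self.nodes - self.busy_nodes


def rebalance(a: Cluster, b: Cluster, min_nodes: int = 1) -> Optional[str]:
    """Move one idle node from a cluster without backlog to a saturated one."""
    for donor, receiver in ((a, b), (b, a)):
        receiver_starved = receiver.pending_jobs > 0 and receiver.idle_nodes == 0
        donor_has_slack = donor.pending_jobs == 0 and donor.idle_nodes > 0
        if receiver_starved and donor_has_slack and donor.nodes > min_nodes:
            donor.nodes -= 1
            receiver.nodes += 1
            return f"moved one node from {donor.name} to {receiver.name}"
    return None  # both clusters are fine, or neither can give up a node


small = Cluster("small-jobs", nodes=10, busy_nodes=4, pending_jobs=0)
big = Cluster("big-jobs", nodes=10, busy_nodes=10, pending_jobs=3)

print(rebalance(small, big))    # moved one node from small-jobs to big-jobs
print(small.nodes, big.nodes)   # 9 11
```

Note that the rule only looks at coarse load figures for each cluster. It never inspects or routes individual jobs, which is the difference from the meta-scheduling approach: each cluster keeps its own configuration and scheduling policy.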
Then my final advice is to use tailored solutions as much as you can. By tailoring, I mean add-ons or things like scripting, customization, job classes, wrappers, portals, all those types of things which basically add an abstraction layer to the cluster and allow end users to go through these scripts or portals and address the cluster there, instead of going there directly. If I look at our customers who have the fewest problems, then they usually are well equipped with portals, with submission scripts, using job classes and similar things. That allows them to really control how jobs get into the system. Then they can make sure that nothing is submitted that goes against the policies and similar things like that. And it’s much easier for the end user to use the cluster. As opposed to that, if you have situations where you basically say to the end users, “Well, do whatever you like,” you’re really in for trouble. Because end users will find ways of using the cluster that you have not planned for, and that will cause you trouble. I mean, we see that basically every day. Using tailored solutions is definitely something that is worth considering.
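A submission wrapper of the kind described here can be as simple as a check that runs before anything reaches the workload manager. The sketch below is a minimal illustration: the job class names, the limits, and the `JobRequest` fields are hypothetical, and the actual submission step is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobLimits:
    max_slots: int
    max_runtime_minutes: int


@dataclass(frozen=True)
class JobRequest:
    job_class: str
    slots: int
    runtime_minutes: int


# Hypothetical site policy: what each job class is allowed to ask for.
LIMITS = {
    "throughput": JobLimits(max_slots=1, max_runtime_minutes=30),
    "parallel": JobLimits(max_slots=512, max_runtime_minutes=24 * 60),
}


def check_request(req: JobRequest) -> list:
    """Return a list of policy violations; an empty list means the job may be submitted."""
    limits = LIMITS.get(req.job_class)
    if limits is None:
        return [f"unknown job class '{req.job_class}'"]
    problems = []
    if req.slots > limits.max_slots:
        problems.append(f"{req.slots} slots exceeds the limit of {limits.max_slots}")
    if req.runtime_minutes > limits.max_runtime_minutes:
        problems.append(
            f"{req.runtime_minutes} min runtime exceeds the limit of "
            f"{limits.max_runtime_minutes} min"
        )
    return problems


print(check_request(JobRequest("throughput", slots=1, runtime_minutes=10)))   # []
print(check_request(JobRequest("throughput", slots=16, runtime_minutes=10)))
```

Rejecting a request with a clear message at this layer keeps out-of-policy jobs out of the queue entirely, and it gives the end user immediate feedback instead of a job that sits pending or fails later.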
So with that, I’m actually already at my conclusions. What I would like to say is that workload and resource management systems are definitely needed more than they ever were. With cluster sizes like we have them today, there’s just nothing you can do without those systems, and when those systems stand still, then there are lots and lots of servers that produce nothing but heat and definitely don’t produce any output. So it is very crucial to have a robust workload management system in place. Specifically in the new era of cloud and big data, there is even more of a need, because big data, for instance, lets the usage of clusters spread outside of classical HPC into farther-reaching areas like analytics and so on. If you use a workload management system of the type of Univa Grid Engine, that allows you to benefit from more than 20 years of experience in workload orchestration. And you can move beyond that. As I’ve mentioned, there is a clear-cut set of challenges, and those are non-trivial. So if you want to address them, you have to build on best-in-class products and architectures, and also on development teams that know what to do with them. And then finally, being street smart about architecting the system, about configuring it, about managing it, and about giving end users access to it is also an important way you can help yourself to a better experience. So, with that, I’m through with my presentation, and I’m sure you have something like a question.
Rich Brueckner: I certainly do, Fritz, thanks for that. It strikes me, just going back to your slide there with the types of customers that you’re working with, how diverse they are: companies as different as pharma and manufacturing, right? And the types of workloads. You guys just got done with kind of a road show. Maybe it’s still going. But when you bring these users of Grid Engine together, do they have stories to share that can benefit each other, even though they’re doing very different kinds of work? Do you see that?
Fritz Ferstl: Yeah, definitely. First of all, our road show is more or less a permanent thing. We do these user forums, and we basically go to areas where lots of our customers or prospects are sitting. And sometimes you have situations, like when you go to Houston, where lots of oil and gas customers are attending, and they definitely can share. But even across the different industries there are best practices that you can apply. Let’s say a life science customer running throughput, or a financial services customer, is in no different situation than an EDA customer. There certainly are differences with applications and some of the workflows and stuff like that, but there are definitely perspectives to share. No doubt.
Rich Brueckner: Yeah, yeah. And then there’s the diversity: with Univa’s Grid Engine having been around for so long, you guys support all kinds of different architectures, everything from x86 cores to Xeon Phi or a GPU and some other types of processors as well. I mean, do you have a giant lab with all this gear, with hundreds of thousands of cores at your beck and call?
Fritz Ferstl: That would be nice. Well, no. We have sufficient gear though. It’s actually surprising how far you can get with, for instance, using container models. You were from Sun so you know good old zones.
Linux finally has something that is comparable, with LXC or Docker. And so even with, let’s say, a hundred nodes, if you use lots of containers on them, you can get to staggering amounts of simulated nodes. And that gives you a really good test bed, for instance. Of course we have systems with Nvidia GPUs and with Xeon Phis in them. We use things like Amazon, of course, you know, for some of the testing that we do. So, you know, you have to be creative.
Rich Brueckner: Sure. And then, just from an architectural standpoint, does Grid Engine run on a single core somewhere and just has a lot of little widgets running on each device that report to it or is it more of a distributed kind of architecture?
Fritz Ferstl: So, first of all, it is a distributed architecture, of course. You have what we call the execution daemon, which is our execution agent. And that runs on all of the nodes where execution is being done. Right now, that’s also true for the host nodes where a Phi or an Nvidia GPU is plugged in, although there is also talk, at least for the Phi, about running a Grid Engine daemon there. Then we have our central control unit, which is parallelized in itself, so it is a multi-threaded architecture to that end. And we’re looking into widening that space even further because, as I mentioned, there are all those scalability challenges. The nodes got bigger and bigger, and nowadays they almost look like E25K systems, you know, and we’re trying to make use of them, but at some point there is definitely a need for spreading it out even more broadly.
Rich Brueckner: I have kind of a wrap-up question here, Fritz. You started out with something near and dear to my heart: that the node count is kind of flat but the cores are just going to keep increasing. What kind of advice do you have for customers when, you know, suddenly Knights Corner comes around, and suddenly there’ll be sixty cores there or more, and that might go up 5X in the same rack? How do they prepare for that? How do they deal with what’s coming?
Fritz Ferstl: I think the major issue there is really the applications. Even today we have customers who buy those big multi-core machines just for the memory, which is kind of absurd, right? You buy a 32-core machine and then maybe use eight cores, because the application is just not using more, but you need all the memory. I think customers should think real hard about what makes sense for them, and then again, I think embracing difference is good advice there as well. Maybe you don’t need all your clusters to be equipped with the same type of nodes. Maybe for your applications you split that a little bit, and you put specific nodes which fit one application best in one part, and other nodes in another part.
Rich Brueckner: Well, Fritz, this has been fascinating, and I’d like to thank you once again for coming on the show today.
Fritz Ferstl: You’re welcome and thanks for having me.
Rich Brueckner: You bet! Okay, we’ll see you at ISC.
Fritz Ferstl: Yeah, absolutely.
Rich Brueckner: All right. Okay, folks, that’s it for the Rich Report. Stay tuned for more news and information on high performance computing.