Interview: Happy Sithole on Workload Management at CHPC

At the recent CHPC National Meeting in South Africa, Dan Olds got a chance to sit down with CHPC Director Happy Sithole on behalf of insideHPC. In the video interview shown below, Sithole describes how CHPC is on the fast track to acquire over a petaflop of computing power in the next year to meet the demands of its regional scientists and engineers. That represents a daunting task for system management, and CHPC is already using PBS Professional software from Altair to keep its systems fully utilized.

insideHPC: How are you, Happy Sithole? It’s the last day of your conference, and it was a huge success. It looks like we had, what, over 400 attendees?

Happy Sithole: Yes, we had about 400 attendees, and we had a lot of international participation, including delegates from 19 countries.

insideHPC: Plus, we’re right here at the Kruger National Park, so in addition to all the HPC, we got to go out and see the animals, and a lot of animals got to see us.

Happy Sithole: That was the whole idea.

insideHPC: It was great.

Happy Sithole: We bring everybody here so that we do HPC in a different way.

insideHPC: This is incredibly unique. I just want to talk to you a little bit about what you guys have here, the mission of CHPC. How do you see that today?

Happy Sithole: Yes, the mission of CHPC is to grow high performance computing in South Africa. But we’re also looking at growing it across our continent as well.

Many have tried that, and it’s not just about having big machines on the continent. It’s about what sort of socio-economic problems we can solve with high performance computing. So, some of the challenges we’re addressing would be: how do we improve service delivery, and how do we improve the economic competitiveness of our industries?

insideHPC: Build new industries and make the existing ones much more efficient?

Happy Sithole: Yes, and another thing we want to do is develop skills. If we can build good skills in the country, we can foster the ability to progress.

insideHPC: And as part of that mission, you’re building up the CHPC supercomputer infrastructure quite a bit. Can you tell us a little bit about the size of the HPC systems you have now?

Happy Sithole: At the moment, the largest system that we have is about 61 teraflops. That’s not going to be the case for long; we’re planning a major upgrade. It should start happening now, in the new year, and we want it completed by May or June at the latest. But the whole objective is to have part of the system available to our users early, because we have got a lot of users already.

insideHPC: And you’re making more of them all the time with the student cluster competition.

Happy Sithole: You can imagine. The queue wait is almost a month, so it’s not a good situation to be in. We want to offer services where users won’t have to wait such a long time. So we need to build resources, but this is demand driven at this stage, because we’ve got lots of demand.

insideHPC: Excellent. How much capacity do you expect to have by this time next year?

Happy Sithole: By this time next year I would say we should have over a petaflop of computational capacity.

insideHPC: Not just a peak petaflop, but a usable petaflop, right?

Happy Sithole: That’s right, a usable petaflop, based on an architecture that people want. And that’s the key: it’s a fully usable and utilized petaflop.

insideHPC: So given the scope of these computational resources, how are you handling workload management?

Happy Sithole: Workload management is very key if you look at where we’re coming from. The minute you offer huge resources like this, a lot of users come in with varying workloads. It’s very difficult to manage if you don’t have an automated system that can help you manage that. You have to make sure that you meet each and every user’s requirements. So that’s very key in our current situation.

insideHPC: I think someone mentioned that you’re using Altair PBS Works.

Happy Sithole: Yes. We started using Altair in January this year.

insideHPC: So you’ve had almost a year with it, then?

Happy Sithole: Yes, it’s been about a year since we started using Altair, and it has been very good. I think, currently, my technical manager is getting good sleep, because before she was having sleepless nights. With our previous workload manager, it was problematic whenever we had to go in and make any changes to the system. Now it just works seamlessly, without any problems. And managing the workload has definitely become much easier.
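For readers unfamiliar with PBS Professional, job submission is script driven: a user describes the resources a job needs in #PBS directives at the top of a shell script and hands the script to the scheduler with qsub. Here is a minimal sketch of such a script; the queue name, node and core counts, module name, and application binary are illustrative assumptions, not CHPC’s actual configuration.

    #!/bin/bash
    #PBS -N demo_job                        # job name (hypothetical)
    #PBS -q normal                          # queue name: an assumed, site-specific value
    #PBS -l select=2:ncpus=24:mpiprocs=24   # 2 nodes, 24 MPI ranks each (assumed sizes)
    #PBS -l walltime=02:00:00               # two-hour wall-clock limit

    cd $PBS_O_WORKDIR                       # run from the directory where qsub was invoked
    module load openmpi                     # module name is an assumption
    mpirun -np 48 ./my_app                  # hypothetical MPI binary, 2 nodes x 24 ranks

The user submits this with "qsub demo_job.sh" and watches its place in the queue with "qstat"; the month-long waits Sithole describes are exactly what a scheduler like this has to arbitrate.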

insideHPC: And when you’re moving from 61 teraflops to over a petaflop in probably about a year, that means many more users and many more workloads, and it’s all the more important to make sure that you’re getting every bit of utilization out of that system.

Happy Sithole: Yeah, absolutely. I’m looking at making sure that we can utilize those resources efficiently, because we will have so many of them. If you don’t have a proper workload manager, you might not be utilizing them efficiently. So that’s one of the key things: making sure we get a good handle on it from the start.

insideHPC: And so you’re going to be using Altair on the new systems as well?

Happy Sithole: We are going to be using Altair on the new system, because it’s our current workload manager, and so it will be our workload manager for the upgrade in a year’s time. We have got an agreement to work with Altair for the next three years.

insideHPC: There are certainly plenty of competitors out there.

Happy Sithole: Absolutely, and for the RFP, Altair came out with a very solid proposal from the beginning. That was great, as we had a very tight deadline from the publication of our RFP to implementation, and it was all going to happen during the Christmas break. So it was a very short window.

insideHPC: That’s a tight timeline.

Happy Sithole: So I was praying, perhaps. Then in January, when our people came back from the holidays, our system was up and running, so I think they were impressive in how they did the installation. After that, during the implementation phase, they came and listened to our system administrators to understand the pressure points, and from there they customized PBS so that our system administrators were able to get the right information that they needed.

So with that, I think it was more of a partnership, and that’s one thing that we appreciate here at CHPC.

insideHPC: And that’s what you need, particularly when, again, I keep thinking of going from 61 teraflops all the way up past a petaflop in such a short amount of time for a young organization. Are you aware of what’s coming in PBS Professional version 13, their massively scalable system? I’ve heard a little about it.

Happy Sithole: Yes.

insideHPC: Can you talk about that and how it might help you?

Happy Sithole: Those are some of the things that I’m looking at, because at this scale we’re looking for something that can really adapt. As you know, we might end up doing some extreme scaling. I don’t like to say exascale, because I’m talking about extreme-scale HPC systems.

insideHPC: I think a petaflop is still, in a lot of ways, extreme scale. There are a lot of systems out there, but we only crossed the petaflop mark in, what, 2008? 2007?

Happy Sithole: Yes. That’s correct.

insideHPC: Not very long ago. That is still an extremely large system.

Happy Sithole: That’s true, and it’s still a nightmare to manage workloads on those systems.

insideHPC: It sounds like PBS did a pretty good job of showing you that they’re going to grow with you, and that their capabilities are going to be well ahead of your requirements for the coming years.

Happy Sithole: Absolutely. I think it has been a pleasant year working with them. I’m looking forward to the next two years, because we have a three-year agreement to work together, and I’m hoping they will be the same as they have been in the first year of our collaboration.
