“Modern high-performance computers are built with a combination of resources including: multi-core processors, many-core processors, large caches, high-speed memory, high-bandwidth inter-processor communications fabric, and high-speed I/O capabilities. High-performance software needs to be designed to take full advantage of this wealth of resources. Whether re-architecting and/or tuning existing applications for maximum performance or architecting new applications for existing or future machines, it is critical to be aware of the interplay between programming models and the efficient use of these resources. Consider this a starting point for information regarding Code Modernization. When it comes to performance, your code matters!”
Transcript:
insideHPC: Hi. I’m Rich with insideHPC. We’re here at SC15 in Austin, Texas and we’re here at the HPE booth. I’m here with Vineeth Ram and Dave Mullally. Vineeth, what are you demonstrating here with Code Modernization?
Vineeth Ram: We’ve tailored our story for the HPC developers here, who are really worried about applications and the performance of applications. What’s really happened traditionally is that single-threaded applications have not really been able to take advantage of the multi-core processor-based server platforms. So they’ve not really been getting the most out of the platform and they’ve been leaving money on the table, so to speak. Because when you can optimize your applications for parallelism, you can take advantage of these multi-processor server platforms. And you can sometimes get up to a 10x performance boost, maybe sometimes 100x, as we’ve seen with some financial services applications, or 3x for chemistry types of simulations, as an example. So the whole idea here is to modernize these single-threaded applications, to flip them around to do more multi-threading on multi-processor server platforms. How to take advantage of that is really what we’re trying to help communicate to developers, because we’ve got a great story from Hewlett Packard Enterprise and Intel to help application developers do exactly that.
insideHPC: That sounds like it’s going to be even more important as Moore’s law slows down and we don’t get that free performance boost. You’ve got to take advantage of that parallelism in the hardware, don’t you?
Vineeth Ram: Absolutely, Rich. And what’s really happening is, you know, there’s a fight for performance here. We want every single bit of performance we can get. The nice thing is, on the hardware side, vendors like Intel have been doing that, and HP has been doing that too, to get to more and more teraflops of capacity, for example. But the trick really is about how the application is written and how it can be modified to take advantage of that hardware. Otherwise you don’t get it. So our story here for developers, and how we want to help them, is basically five things we can do.
First, we can do an entire software modernization plan for these developers, right? Sit down with them, understand the environment, and do that for them. We can also use software tools to understand the inner guts of what’s happening: what kind of CPU utilization the actual code is getting. We characterize that and give them information around it. We have subject matter experts in our center of excellence who can give them specific help with their specific application workloads, and ideas on how they can tune them. We can also help developers pre-test their applications on the latest hardware platforms, both from HP and from Intel. Finally, we can help them actually modernize their code. As I told you, we take single-threaded applications and make them more multi-threaded and parallel, to take advantage of the multi-core, multi-processor platform environment.
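As a rough illustration of that last step, taking a single-threaded loop and spreading it across workers, the Python sketch below splits an independent computation into chunks and hands them to a worker pool. The function names, the workload (a sum of squares), and the worker count are all hypothetical and were not shown in the interview.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def sum_squares(bounds: Tuple[int, int]) -> int:
    """Serial kernel (illustrative): sum of i*i for i in [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def split_range(n: int, parts: int) -> List[Tuple[int, int]]:
    """Split [0, n) into `parts` contiguous chunks of near-equal size."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for k in range(parts):
        # The first `extra` chunks take one additional element each.
        stop = start + step + (1 if k < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def parallel_sum_squares(n: int, workers: int = 4,
                         executor_cls=ThreadPoolExecutor) -> int:
    """Run the serial kernel on each chunk in a worker pool and combine."""
    chunks = split_range(n, workers)
    with executor_cls(max_workers=workers) as pool:
        # Chunks are independent, so they can be evaluated concurrently.
        return sum(pool.map(sum_squares, chunks))
```

In CPython, threads show the decomposition but will not speed up pure-Python arithmetic because of the global interpreter lock. Passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls`, from a script guarded by `if __name__ == "__main__":`, is one common way to use multiple cores.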
insideHPC: Sounds like a very comprehensive set of steps to get this done. Dave, can you give us a little bit more on the specifics of what you do?
David Mullally: Sure, I’ll be happy to. Can I show you a few things? These are some examples of improvements that you can get with code modernization that I have up here. These are just a few results using BWA, which is a next-generation genomic sequencing code, and LAMMPS, which is a molecular dynamics application. We have two data sets with LAMMPS, and you can see that the improvement you can get is oftentimes dependent on the data that you actually use. So if we look at this, we would like to look at some details, and we are using tools to do this. This is what’s happening with the CPU usage as we run the application. This is BWA; we’re looking at two sequences against a genomic database, and what you can see here in blue is the application running before modernization.
So we are running at about 70% efficiency. We’ve got a lot of big peaks and valleys, so we’ve got some problems with communication. We’ve got problems with load balance in general. After modernization, we get a much more interesting graph: in the darker color, we can see that we’ve got nearly 100% CPU utilization as we went along. And you can see we’re actually doing two sequences there and there is a break in between, which is fine. By improving the CPU utilization, we are getting an improvement in speed of about 32%.
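A back-of-the-envelope check on those figures can be sketched in Python. It rests on assumptions the interview does not state: that "70% efficiency" means average CPU utilization, that the total work is unchanged by modernization so runtime scales as 1 / utilization, and that "32%" means roughly 1.32x throughput. The function names are hypothetical.

```python
def ideal_speedup(util_before: float, util_after: float) -> float:
    """Speedup for a fixed amount of work if runtime scales as 1 / utilization."""
    if not (0 < util_before <= 1 and 0 < util_after <= 1):
        raise ValueError("utilization must be in (0, 1]")
    return util_after / util_before


def implied_utilization(util_before: float, speedup: float) -> float:
    """Average utilization that would explain a given speedup under the same model."""
    return util_before * speedup


# Assumed readings of the quoted figures (see the caveats above).
bound = ideal_speedup(0.70, 1.00)           # roughly 1.43x at best
implied = implied_utilization(0.70, 1.32)   # roughly 0.92 average utilization
```

Under this simple model, a gain of about 32% sits below the roughly 43% ceiling, which fits a run that is near 100% utilization most of the time but includes a break between the two sequences.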
insideHPC: And that drop off, that’s the code completed then?
David Mullally: Yes, the modernized code completed here, and this is where it completed with the original code. So...
insideHPC: So it’s 32% speedup; is that a realistic number to expect on these kinds of things or is it a good use case here?
David Mullally: It’s a good number, and it’s free. I like free [laughter]. For me, free is a big thing. There are other applications that I’ve got here. This is LAMMPS, and what’s fun here is this is completely different. They’ve done an excellent job of working on the parallelization here. You’ve got 100% CPU utilization as you’re running. What you can see, though, is this: again the dark color is with the modernization, and that run is about 15% shorter than before modernization, even though we’re already using 100%, beautiful parallelism. If I looked at the code, the profile would say, “Can’t be helped,” but modernization really does help, so that’s nice. With the same application, this is a different data set here. This is running a liquid-crystal calculation, with a different potential and a different path through the code, and instead of just a few percent difference, we have a 326% difference in speed because of code modernization, so this is really stunningly good.
I’ve got another thing here, because again we like cheap and we like free, so we’ve got an example here using the Xeon Phi. This is using Abaqus/Standard, which is a structural application. The results in gray show relative performance without using the Phi, and the light blue shows relative performance using two Phis. What you can see here is that we’re getting up to a 34% improvement in speed on one of our test cases, and that’s great. Now with Abaqus/Standard, the important factor is the number of tokens. The tokens cost more than the machine. So think of the hardware as being free and the software as being the cost. I care about the number of tokens, so this is showing the relative performance per token for that run. What I can see here is, again, it’s close to free. In this case, we’ve got an over 30% improvement in speed for a cost of around 3% in terms of tokens. Excellent value. So we can get that with the Phi, and it’s not free, but it’s cheap, and I will settle for cheap [laughter].
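The performance-per-token argument can be worked through with a small Python sketch. The inputs 1.30 and 1.03 are rounded readings of "over 30%" and "around 3%" from the interview, not exact measurements, and the function name is hypothetical.

```python
def perf_per_token(relative_speed: float, relative_tokens: float) -> float:
    """Relative performance delivered per license token consumed."""
    if relative_tokens <= 0:
        raise ValueError("relative_tokens must be positive")
    return relative_speed / relative_tokens


# Baseline run without the Phi: speed and token count normalized to 1.0.
baseline = perf_per_token(1.00, 1.00)

# Assumed rounded figures: about 30% faster for about 3% more tokens.
with_phi = perf_per_token(1.30, 1.03)

# Percentage gain in performance per token relative to the baseline.
gain_pct = (with_phi / baseline - 1.0) * 100.0
```

With these rounded inputs, performance per token improves by a little over 26%, which is the sense in which the extra speed is "close to free" when license tokens dominate the cost.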
insideHPC: All right. Could we expect, as we move to Knights Landing, similar kinds of potential performance gains with that kind of architecture?
David Mullally: With Knights Landing, I would expect to see better performance gains, so that’s going to be great. The truth is that the token situation needs to be negotiated.
insideHPC: Yes, I got you.
David Mullally: So you don’t want somebody to go out of business because the Phi is just too fast and too cheap [laughter].