In this video from the 2015 HPC Advisory Council Switzerland Conference, Dr. Herbert Cornelius from Intel presents: HPC and Software Modernization.
As we see Moore’s Law alive and well, more and more parallelism is introduced into all computing platforms and on all levels of integration and programming to achieve higher performance and energy efficiency. We will discuss Multi- and Many-Core solutions for highly parallel workloads with general purpose and energy efficient technologies. We will also touch on the challenges and opportunities for parallel programming models, methodologies and software tools to achieve highly efficient and highly productive parallel applications. At the end we will take a brief look towards Exascale computing.
For the next 30 minutes or so before the coffee break, I would like to try to convince you that it's good to modernize software. Are any of you developing applications, tuning applications, working with people who are optimizing and writing applications? Yeah, a few people. I'll talk about what we think the current situation is, what needs to be done, and also a little bit about how it can be done.
My presentation title is Software Modernization, in the sense that software applications, or code, should be able to take advantage of modern performance technologies. The hardware is evolving, and continues to evolve over time, adding more and more features to get more performance, so it would certainly be good if software were able to utilize all those technologies. If not, software leaves performance on the table today, and probably even more in the future if it is not modernized in a way that lets it take advantage of the underlying hardware.
Have a look at some historical data and some predictions. If you look at the TOP500 list, you'll see that in 2003 the number one machine had about 35 teraFLOPS of performance. In 2013 it was 33 petaFLOPS – from teraFLOPS to petaFLOPS in ten years. If this continues the same way, in another ten years, by 2023, we should be in the exaFLOP era of computing. If you look at the same lists, you see that in 2003 the number one system used about 5,000 cores. Ten years later it was about three million cores – going from thousands to millions in ten years. And the estimation is that if we continue this trend, in another ten years, by 2023, we should see about 2 billion cores – thousands to millions to billions.
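A back-of-the-envelope way to check that extrapolation is to assume the 2003-to-2013 growth factor simply repeats; this is a sketch using the talk's round numbers, not exact TOP500 entries:

```c
/* Project a 2023 value by assuming the 2003->2013 growth factor repeats.
   Inputs are the talk's round numbers, not exact TOP500 entries. */
double extrapolate(double v2003, double v2013) {
    double decade_factor = v2013 / v2003;  /* growth over one decade */
    return v2013 * decade_factor;          /* projected 2023 value */
}
```

Feeding in ~35 teraFLOPS to ~33 petaFLOPS puts the 2023 projection in the exaFLOP range; feeding in 5,000 to 3,000,000 cores gives roughly 2 billion cores, matching the slide's "thousands to millions to billions."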
So, what does it mean? Basically it means that probably parallelism will be there. And we see that this is an industry trend; it’s nothing specific to a certain vendor. High performance is basically achieved through parallelism, and it’s parallelism on all levels. It’s inside a core, it’s across a node, and it’s across different nodes or a cluster for example. Is there anybody who disagrees that parallel is a path forward for high performance computing? Well, seems not to be the case, so I don’t need to preach to the believers.
Have a look back at the 70s. This is a picture of a portable computer from the 70s, an IBM 5110 – portable probably in the sense that you carried it in a bag or something, I don't know. But this is history: it was a 16-bit processor, and the question is, if software was written for this platform – it might be in Fortran, it might be in BASIC or whatever – what are the chances that this software will be able to take advantage of modern processors today, like a multi-core Xeon or a many-core Xeon Phi? I'm not sure, but the chances are, I would say, probably not that high.
If you look at what's happening on the processor side, you see we are getting more cores. We have two product lines: Intel Xeon, which is a multi-core architecture, and Xeon Phi, which is a many-core architecture. We will continue to add more cores on the multi-core Xeon product line, while still trying to make a single core as fast as possible. On Intel Xeon Phi we have 512-bit SIMD today with Knights Corner. We are going to AVX-512, an upcoming instruction set shared between Xeon and Xeon Phi; it will be implemented first in the next generation of Xeon Phi, codename Knights Landing, and it will also be implemented in future Xeon processors.
You see that we are going parallel, with multi-core on the Xeon side and many-core on the Xeon Phi side. If an application is able to utilize all this functionality, good – it has strong potential for high performance. If not, it will be leaving performance on the table. To some extent I think there is a gap: there are certainly applications that can utilize it, but there are probably more applications that cannot utilize all the performance technologies available. We have a gap – we have software on the one hand, and all this parallelism and hardware on the other – and the question is: how do you get there without too much pain, and in a way that is sustainable? I'll try to show you how we see it can be done.
“Driving the urgency for code modernization is recognition that it is both a technical and an economic imperative. Code not modernized equates to system performance and financial advantage left on the table. Outdated code equates to longer run times, more test and design iterations, more modeling within a given period of time, and slower simulations. Bottom line: higher operating costs and less revenue for organizations.”
This is a quote from an article that appeared in Scientific Computing, by Doug Black, and he basically talks about the urgency of code modernization. Software should be parallelized, optimized, modernized; the problem is outdated code that is not able to utilize all these modern technologies. It's not only a technical imperative, it's also an economic one, because if software is not taking advantage, the companies who use this software will lag behind in what they can do – in performance, faster time to market, better products, newer products, and so forth. So there is not only a technical need, there is probably an even greater economic need to modernize software.
There are two different ways you can go. One is an open, industry-standard, portable way, with reusable code using standards like OpenMP 4.0 and MPI. And there are other ways which are a little more of a one-way street, where you do tuning, optimization, and modernization for more specific devices like graphics cards, FPGAs, or probably even ASICs. Sometimes that is a good way – it certainly depends on where you want to go, what you want to do, how much effort you want to put in, and how much you are able to leverage the work you're doing. We think that using open standards you are able to have a portable, scalable, and sustainable solution, versus some of the others where you are really locked in.
In doing so, there is something called Amdahl's Law, and I'm sure everybody knows Amdahl's Law. I read some articles recently on the web where people say Amdahl's Law is not valid anymore; I'm not so convinced. Driving a car from A to B, I'm always reminded of it: it doesn't matter how fast you go, it always takes the same time because of the traffic jam and what have you. It is certainly still there, and there is a quote from an IBM researcher saying, "Everyone knows Amdahl's law, but quickly forgets about it until it strikes back and hits you again." This is what I am trying to show you, because we have SIMD today in our processors – and not only in ours; all modern processors use SIMD in one shape or form for performance. Depending on how well you can use it, you can get good performance.
This is an example for a 61-core processor with 8-way SIMD per core. You see that depending on the fraction of operations you can do in SIMD vector mode, and the fraction of operations you can run in parallel with multithreading, you land at a certain point on this performance curve. If you have a high degree of parallelism and a high degree of SIMD, that is where you really see the benefits. It basically tells you that Amdahl's law strikes twice: it strikes once for using SIMD inside each core, and it strikes a second time for parallelization across different cores. You should try to utilize both as much as possible, and yes, your mileage will vary depending on where your application is on this curve.
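The "strikes twice" point can be written down as a simple two-level Amdahl model. This is only a sketch under idealized assumptions (perfect vector and thread scaling of the amenable fractions, no overheads), not the exact model behind the slide:

```c
/* Two-level Amdahl model: speedup over scalar, single-threaded code.
 * f_vec: fraction of work that vectorizes (SIMD of width vec_width)
 * f_par: fraction of work that multithreads across `cores` cores
 */
double combined_speedup(double f_vec, double f_par,
                        double vec_width, double cores) {
    /* time on one core, with the vectorizable part sped up by vec_width */
    double t_core = (1.0 - f_vec) + f_vec / vec_width;
    /* serial fraction runs on one core; parallel fraction spreads over all */
    double t_total = (1.0 - f_par) * t_core + f_par * t_core / cores;
    return 1.0 / t_total;  /* scalar single-threaded baseline = 1.0 */
}
```

For a 61-core, 8-way-SIMD part, fully vectorized and fully threaded code gets the ideal 8 × 61 = 488× speedup, while even 10% residual scalar and serial work pulls that down to roughly 41× – Amdahl striking twice.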
This is just an example to illustrate what it looks like when you're not using all the resources that are available. On a multi- or many-core processor today, if you only use one core and no SIMD, you're actually using only a very small fraction of the chip. That is what these highways indicate: there are lots of lanes, but only one car going. If you use SIMD within a single core, it's already better – you're using more, but you're still not using much of the chip and the functionality that is implemented; you have a couple of cars going behind each other in one lane. When you really start using all the capabilities of modern processors – SIMD inside the core and multithreading across the cores – then you have this parallel vector execution where you get good performance, using all the underlying technologies. And in addition you can basically join highways together, and then you have a cluster with multiple nodes, each consisting of processors which have multiple cores, which have SIMD in them.
This is an interesting example which shows you what happens when you don't use parallelism. It shows performance for different types of execution of a binomial options calculation from the financial segment, across processor generations from 2007, 2009, 2010, 2012, 2013, and 2014. The blue line on the bottom, labeled SS, is single-threaded scalar. Over those seven years you see some benefit, but it's relatively small, because scalar performance per generation improves by something like 15, maybe 20%. That will continue if you stay scalar and single-threaded. The next one is the red line, vectorized but single-threaded: a single core, but using the SIMD inside it. There you see some improvement, and it comes when we improve SIMD in hardware, going from 128-bit SSE to 256-bit AVX and AVX2 with FMA, and potentially to 512-bit in the future – you see some increase, but it's also not a big jump. When you parallelize without vectorizing – the grey line – you also gain a little, but not really that much. The real benefit comes with the yellow line, when you vectorize and parallelize, using SIMD and multithreading on modern processors. Then both effects really multiply each other and you get a good performance improvement, and this is where I think people should try to be with their applications.
The good thing is that with Xeon Phi today, codename Knights Corner, we already have a platform which really shows you how well your application can utilize the technologies of SIMD vectorization and multithreading. It has up to 61 cores, 244 threads, and 512-bit wide SIMD, so this platform tells you quite nicely how well your application can utilize the underlying technologies. You might be afraid to try it, because the answer might be that your application is not very well vectorized or parallelized and cannot extract the performance of the hardware that is available today. I'll show you two examples where people went through the process and learned exactly that.
Here's one example, an application from Germany. They ran the code, a magnetohydrodynamics code, on a Xeon processor with the Sandy Bridge architecture and on a Xeon Phi. The blue bars are the original code – lower is better – so on a single CPU socket you have a certain performance, and then you run your application on a Xeon Phi and you see it actually runs slower. Why is that? Well, as you will also see in the next example, when you move from a 3 gigahertz to a 1 gigahertz processor, I would expect it to be slower; there's no magic if you cannot really utilize all the functionality. But then they looked at the code, and they found they didn't fully utilize the SIMD capabilities – it was already quite well threaded, but not really vectorized. After analyzing and vectorizing the code – you see those red bars – Xeon Phi actually runs even faster than a Xeon, about twice as fast, which is what you would normally expect for optimized applications, and you also see that you gain performance on the Xeon platform as well. This is an example of using the Xeon Phi microarchitecture to really analyze the application, understand what's going on, and then do further tuning to gain more performance.
Another example is from Stanford University. They did some work together with a system integrator company. It's about a cosmology application called HEATCODE, and there's a nice paper describing the work they did. So what did they do? They took their code, ran it on a Xeon and on a Xeon Phi, and observed that the performance on Xeon Phi was about one third of two Xeons. Why is that? Because you moved the code from a three gigahertz out-of-order core to a one gigahertz in-order core, and if you don't use more than one core and don't use SIMD, you wouldn't really expect it to run faster – it normally runs slower. The reason was that the application had limited threading and no vectorization, and if you stopped here, you would say, "Yeah okay, Xeon Phi is not good at all," and just keep going – but that's not the real story. Once you look at the code and you vectorize and thread it, you get a huge performance improvement, and in this case – you see the blue bars, and note that this is actually a log scale – the code after optimization runs 620 times faster on Xeon Phi than before, and 125 times faster than on two CPUs. You could claim victory now and say, "Hey, yes, I got 600 times more performance on my application." Well, there is a minor detail: you're comparing apples to oranges, or bananas – you're not comparing the right things with each other.
What you actually do is put the optimized code back on Xeon, and you see that it also runs much faster on Xeon. Then you get a more realistic comparison – the one you would expect just from looking at the hardware specifications for optimized code – that Xeon Phi is about twice as fast as two Xeons. Which is not that bad, and you also see that once you optimize your code for a many-core architecture, it also runs very well on a multi-core architecture, and then you can even run your application on more than one Xeon Phi and more Xeon CPUs and get even more performance.
The good lesson here is that you can actually modernize your code, you can parallelize it in a portable way, and you maintain a single source code. You can run it on different systems, whether there's a Xeon Phi in them or not, whether there's a coprocessor or not. That's the advantage versus some of the other options: if you coded it for a GPU, it would basically just run on a GPU, you would have to maintain different source codes, and it would probably not even run on a platform where there is no GPU. Those are extra hurdles you would have to overcome. Using a portable way, using industry-standard programming models, you are able to have an application you can sustain: you do the optimization once, you parallelize and vectorize on a higher level, and you basically let the software do the mapping to the underlying hardware.
So how do you do that? How can you do that? There are lots of different ways – and those are just some buzzwords with respect to parallel computing – but there are a couple of layers you can look at. You can look at multithreading, you can look at vectorization. We've seen that OpenMP 4.0 is basically the industry standard, but you could also use Threading Building Blocks or Cilk Plus, or you could do hand-threading with pthreads or Win32 threads or what have you, but that's probably a little bit more work.
For vectorization, you can use the compiler, you can use directives, you can use vector functions, you can use array notations, you can code intrinsics – those are all available options. And then you would potentially do some blocking of your data to fit it into the cache hierarchy, and in addition you can even optimize the data layout.
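As an illustration of the data-layout point, here is a minimal sketch (the `particle` types are hypothetical, not from the talk) of the classic array-of-structures versus structure-of-arrays choice: the SoA loop walks memory with unit stride, which is what compiler auto-vectorizers handle best:

```c
#include <stddef.h>

/* Array-of-structures: fields interleaved, so consecutive x values are
   12 bytes apart -- strided access that hinders SIMD loads. */
struct particle_aos { float x, y, z; };

/* Structure-of-arrays: each field contiguous -- unit-stride, SIMD friendly. */
struct particles_soa { float *x, *y, *z; size_t n; };

void shift_aos(struct particle_aos *p, size_t n, float dx) {
    for (size_t i = 0; i < n; i++) p[i].x += dx;     /* stride: 12 bytes */
}

void shift_soa(struct particles_soa *p, float dx) {
    for (size_t i = 0; i < p->n; i++) p->x[i] += dx; /* stride: 4 bytes */
}
```

Both functions compute the same result; only the memory layout differs, and that difference is often what decides whether the compiler emits vector instructions at all.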
The way you should do multithreading is through OpenMP 4.0. It's the latest standard, and a refinement of the standard was released at Supercomputing last year in New Orleans in November. OpenMP supports SMP multithreading, and it now also supports SIMD vectorization as well as accelerator offloading, so it can do everything other API environments can do, and even more. This is just an example of how you could vectorize and parallelize a simple loop. You see the code at the bottom: you have two loops, and you fill an array with data from this function called mandel. To parallelize, you just put a #pragma omp parallel for before the outer loop, and then you can vectorize the inner loop with #pragma omp simd. You can even do that for a function, if you declare those functions as vectorizable. That gives you a very portable, very high-performance way to implement those technologies quite easily.
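A sketch of what that slide's loop nest might look like in C – the `mandel` kernel here is a stand-in written for illustration, not the slide's exact code. The pragmas are standard OpenMP 4.0, and a compiler built without OpenMP simply ignores them and runs the loops serially:

```c
/* Mark the kernel vectorizable so calls from a SIMD loop can use a vector
   variant (OpenMP 4.0 "declare simd"). */
#pragma omp declare simd
static int mandel(float cre, float cim) {
    float zre = 0.0f, zim = 0.0f;
    int it;
    for (it = 0; it < 100; it++) {            /* iterate z = z^2 + c */
        float zre2 = zre * zre - zim * zim + cre;
        zim = 2.0f * zre * zim + cim;
        zre = zre2;
        if (zre * zre + zim * zim > 4.0f)     /* escaped the set */
            break;
    }
    return it;
}

/* Threads across rows (outer loop), SIMD lanes across columns (inner loop). */
void render(int h, int w, int *image) {
    #pragma omp parallel for
    for (int y = 0; y < h; y++) {
        #pragma omp simd
        for (int x = 0; x < w; x++)
            image[y * w + x] = mandel(-2.0f + 3.0f * x / w,
                                      -1.5f + 3.0f * y / h);
    }
}
```

The same source runs on a multi-core Xeon and a many-core Xeon Phi; the compiler and runtime do the mapping to cores and SIMD lanes.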
The latest software from Intel, our Parallel Studio in its various editions, basically supports all of those standards: MPI 3.0, OpenMP 4.0, and full C++11 support. Depending on which version you get – the cluster edition has everything in it – you have multithreading and vectorization on a node and in a core, plus the Intel MPI library and the analyzer for MPI. That's basically the full package to help you get this done.
We are also engaging with the community to help people understand what software modernization is, and to actually do it. We are engaging with institutions and customers around the world – a growing list of institutions really working in this area, helping the industry move forward with software modernization. We also now have what's called the Intel Xeon Phi coprocessor application and solutions catalog, which is on the web, and you see the URL in the box there. This is where you can look at applications which are, so to speak, modernized and able to utilize modern parallel technologies, specifically on the Xeon Phi side. There is also some other reading: there's a nice book from James Reinders and Jim Jeffers, two Intel guys, on many-core programming, with lots of hands-on examples of how to do things. And there's also a nice book, also available as an eBook, from some of my colleagues on optimizing high performance computing applications using the Intel cluster tools, addressing the different aspects of MPI, multithreading, and microarchitecture optimization. If you're interested, that's quite a nice book. And with that, we are right on time for coffee. Thank you very much.