Interview: Powering up Vectorization with Intel Parallel Studio XE 2015

In this video, James Reinders describes the powerful new features in Intel Parallel Studio XE 2015.

As a new feature for our insideHPC videos, we are including the full transcript below:

insideHPC: James, I wanted to ask you about the latest version of Intel Parallel Studio XE 2015. What’s new with that product?

James Reinders: Well, we actually upgraded the name to 2015, getting ready for next year. It’s a suite of tools – compilers, libraries, analysis tools – that are very popular, especially in HPC, for really squeezing out performance. For me, the coolest features are some of the new things like explicit vectorization – the ability to tell the compiler more explicitly when to use SIMD instructions. It’s a new style of programming, and OpenMP 4.0 includes that type of functionality. Of course, we’ve done all the things to stay ahead of the competition: we’ve worked very hard to optimize our libraries and our compilation capabilities, and to keep up with standards – not just OpenMP 4.0 but a myriad of other things like MPI-3, Fortran 2003, and C++11.

The thing that really excites me is looking at OpenMP 4.0. We’ve got virtually a complete set of 4.0 features – only user-defined reductions are missing right now. The exciting thing to me is that OpenMP 4.0 brings together tasking, which it’s had since its start in ’97, with new capabilities for vectorization and for offload. Bringing those together, and being able to do them at the same time, is extraordinarily powerful. I love teaching classes about it and seeing what people can do with it. And now support for all of this is fully in our products.
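
To make that concrete, here is a minimal sketch of what OpenMP 4.0 explicit vectorization looks like in C; the function and array names are illustrative, not taken from the interview:

    /* Hypothetical example: explicitly tell the compiler to generate
       SIMD code for this loop, rather than waiting for it to decide
       on its own. Requires an OpenMP-4.0-capable compiler. */
    void saxpy(float a, const float *x, float *y, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }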

So, you can do parallelization and vectorization. You can take a loop and, say, parallelize it across a bunch of cores, but then, when you’re running on each core, vectorize it there. And the vectorization capabilities – we call them explicit vectorization. That means the programmer goes in and says, “Hey, I’m sure that vectorizing this is a good idea. Do it.” It ends the tradition of the compiler deciding whether vectorization was a good idea or not – and of you trying to tease it with switches, and with hints, and this and that.
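
In OpenMP 4.0, that combination is expressed as a single construct. A hedged sketch, again with illustrative names:

    /* Sketch: distribute the iterations across threads (cores), then
       vectorize each thread's chunk with SIMD instructions. */
    void vec_mul(const float *a, const float *b, float *c, int n)
    {
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }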

Now, this explicit programming seems to work really well for developers. I’ve seen a lot of coders using this style. We’ve had it in our compiler for a few years, but recently it was added to OpenMP 4.0. So, we support the 4.0 capability fully. And it’s very effective – it’s really worth looking at, even if you’ve tried this before. And the reason people are looking at vectorization a lot more now is that the vector lengths have gotten so wide.

For years, we’ve had SSE, able to do two double-precision operations, or four single-precision, at a time. And that’s pretty exciting, but now, with AVX-512, we’re talking about being able to do 16 single-precision operations at a time, or eight double-precision. If you ignore that, you lose a lot of performance, even at an application level. The current vector lengths obviously can speed things up a lot – maybe up to 16X on single precision. On the whole application it will obviously be less, but it’s gotten significant enough that people are taking a look at it again. And it’s the perfect time, because 4.0’s got this capability.
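
Those lane counts follow directly from the register widths; a small illustration in plain arithmetic, not tied to any particular intrinsics API:

    /* A 128-bit SSE register holds 128/64 = 2 doubles or 128/32 = 4 floats;
       a 512-bit AVX-512 register holds 512/64 = 8 doubles or 512/32 = 16 floats. */
    enum {
        SSE_DOUBLE_LANES    = 128 / 64,  /*  2 */
        SSE_FLOAT_LANES     = 128 / 32,  /*  4 */
        AVX512_DOUBLE_LANES = 512 / 64,  /*  8 */
        AVX512_FLOAT_LANES  = 512 / 32   /* 16 */
    };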

insideHPC: When I think about vectorization, I think back to the old Cray days and Fortran, but you’re bringing this to C++ and other more modern languages?

James Reinders: Absolutely. Even loops in Fortran don’t look like they used to – the good old Livermore loops that we used to vectorize against. There used to be a small number of them, and they were pretty simple. The sorts of things people want to vectorize now are more complex, whether they’re in Fortran or C or C++. And we really have this capability. It’s quite interesting to look at – quite exciting.

insideHPC: When we were talking earlier this week, you were saying how scientists now have the ability to write parallel code that looks a lot closer to their science, rather than reworking it to make it go fast and parallel. How does that work?

James Reinders: Absolutely. Usually, if you’re hand coding it and trying to get the compiler to produce exactly the code you want, you want the outermost loop to be parallel and the innermost loop to vectorize. Maybe, if there isn’t enough work, you collapse a couple of loop levels together. And the next thing you know, you end up with code that doesn’t look like the math equations you drew on the chalkboard.

With OpenMP 4.0, there are a few capabilities – the ability to say, “parallelize this, vectorize that, or do them both.” Then you can use the collapse clause, which says, “Hey, even though I’ve written it as several loops, I really want you to collapse it into one and think about it as one problem to parallelize and vectorize.” And you can give a few parameters. The next thing you know, you’ve got the code written like the science, like the math was on the blackboard you were writing it out on. It’s much more readable, and it gets better performance because of the capabilities of the compiler.
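
As a hedged sketch of the collapse clause (the grid dimensions and function name are illustrative), the two nested loops below are treated as one iteration space that is both parallelized and vectorized:

    #define NX 1024
    #define NY 1024

    /* Illustrative example: collapse the i and j loops into a single
       iteration space, then parallelize it across threads and
       vectorize it with SIMD instructions. */
    void scale_grid(float grid[NX][NY], float s)
    {
        #pragma omp parallel for simd collapse(2)
        for (int i = 0; i < NX; i++)
            for (int j = 0; j < NY; j++)
                grid[i][j] *= s;
    }

The loop nest still reads like the math – scale every grid point – while the directive carries the performance detail.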

insideHPC: Terrific. So, is this shipping today?

James Reinders: It is shipping today; our customers have been downloading the new version like crazy. And you can also go and get an evaluation copy and check it out, if you’re not already a customer.
