OpenMP at 20: Moving Forward to 5.0


Sponsored Post

This year, OpenMP*, the widely used API for shared-memory parallelism supported in many C/C++ and Fortran compilers, turns 20. OpenMP is a great example of how hardware and software vendors, researchers, and academics, volunteering to work together, can successfully design a specification that benefits the entire developer community.

Today, most software vendors track OpenMP advances closely and have implemented the latest API features in their compilers and tools. With OpenMP, application portability is assured across the latest multicore systems, including Intel® Xeon Phi™ processors.

Just this week, the OpenMP community concluded two weeks of meetings and presentations at Stony Brook University on Long Island, NY, as part of its annual OpenMPCon user conference and the International Workshop on OpenMP (IWOMP). All this activity is in preparation for the release of a preview draft of the OpenMP 5.0 specification for public comment at SC17 this November in Denver. The final 5.0 specification will be released at SC18 in 2018.

Several major extensions are anticipated for OpenMP 5.0, including extensions to the tool interface and features to facilitate debugging OpenMP applications. Also anticipated are new memory allocation mechanisms for indicating which memory should be used for particular variables or dynamic allocations on systems with multiple types of memory. (Many of the new features intended for the final OpenMP 5.0 specification will be discussed at SC17.)

When the first version of the OpenMP (Open Multi-Processing) specification was released in 1997, the intention was to provide an easy way to bring shared-memory parallelism to high-performance applications written in Fortran. Until then, parallel programming meant an explicit threading model such as pthreads, or a distributed-memory framework such as MPI. Both involved restructuring an application to fit the model and then adding runtime library calls to implement parallelism. Both approaches were difficult and time-consuming to program, and they meant maintaining a sequential version alongside various platform-dependent parallelized versions.

OpenMP took a different approach. It relied on pragmas: source code directives the programmer adds to tell the compiler which loops can be parallelized and how. The OpenMP designers realized that the programmer usually knows far more about the program than an always-cautious compiler can discover from its own static analysis of the source code. By making the directives simple and flexible, OpenMP let programmers think in parallel terms and rely on the compiler to sweat the implementation details. The same program would run in parallel on any compiler that implemented the OpenMP pragmas; on a compiler that didn't, the pragmas were simply ignored and the program ran sequentially, eliminating the need to maintain separate sequential and parallel versions of the same code.

Over its twenty-year evolution, OpenMP has continued to enable a variety of parallelization strategies and opportunities. Still, knowing whether the parallelization strategy in your program gives the best performance, and whether a different strategy might do better, requires specialized tools. The programmer needs to relate measured performance to the OpenMP constructs in the code, see where parallel and sequential time is spent, and compare ideal versus measured CPU utilization in parallel regions. With this kind of information, the programmer can discover where tuning the code will yield the biggest gain.

Intel® VTune™ Amplifier XE, along with Intel® Advisor and Intel's OpenMP-aware optimizing C/C++/Fortran compilers in Intel Parallel Studio XE 2018, enables expert tuning of industrial-strength OpenMP codes. General CPU utilization shows where the code is purely serial, and potential-gain metrics direct the tuning effort to the parallel regions with the most headroom. Intel VTune displays measured OpenMP performance metrics, such as overhead and waiting time, to help identify the root cause of performance problems: load imbalance, non-optimal granularity, or memory latency. Intel Advisor offers hints for improving loop vectorization to ensure that the full performance potential of the underlying hardware is utilized.

Together, the Intel C, C++, and Fortran compilers, Intel VTune Amplifier, and the other analysis tools that comprise Intel Parallel Studio XE 2018 give you an excellent environment for developing OpenMP programs that utilize the full potential of today's processors.

Download your free 30-day trial of Intel® Parallel Studio XE 2018