“The combination of using both MPI and OpenMP is a topic that has been explored by many developers in order to determine the most optimum solution. Whether to use OpenMP for outer loops and MPI within, or by creating separate MPI processes and using OpenMP within can lead to various levels of performance. In most cases of determining which method will yield the best results will involve a deep understanding of the application, and not just rearranging directives.”
As first multi-socket and then multi-core systems became standard, the Message Passing Interface (MPI) became one of the most popular programming models for applications that run in parallel across many sockets and cores. Shared-memory programming interfaces such as OpenMP let developers exploit the shared memory inside each server of a system built from many individual servers. The drawback is that two different programming models have to be used at the same time. The MPI 3.0 standard introduces an MPI interprocess shared memory extension (MPI SHM).
Barbara Chapman, a leading researcher in programming languages, programming models, and compilers, has been named head of the Computer Science and Mathematics Group (CSM) under the new Computational Science Initiative at the U.S. Department of Energy’s Brookhaven National Laboratory. Chapman is also a professor of Applied Mathematics & Statistics and Computer Science at Stony Brook University, where she serves as a joint appointee affiliated with the university’s Institute for Advanced Computational Science (IACS).
Ruud van der Pas from Oracle presented this talk at OpenMPcon. “Unfortunately it is a very widespread myth that OpenMP Does Not Scale – a myth we intend to dispel in this talk. Every parallel system has its strengths and weaknesses. This is true for clustered systems, but also for shared memory parallel computers. While nobody in their right mind would consider sending one zillion single byte messages to a single node in a cluster, people do the equivalent in OpenMP and then blame the programming model. Also, shared memory parallel systems have some specific features that one needs to be aware of. Few do though. In this talk we use real-life case studies based on actual applications to show why an application did not scale and what was done to change this. More often than not, a relatively simple modification, or even a system level setting, makes all the difference.”
“In this presentation, we will discuss several important goals and requirements of portable standards in the context of OpenMP. We will also encourage audience participation as we discuss and formulate the current state-of-the-art in this area and our hopes and goals for the future. We will start by describing the current and next generation architectures at NERSC and OLCF and explain how the differences require different general programming paradigms to facilitate high-performance implementations.”
“This presentation will describe how OpenMP is used at NERSC. NERSC is the primary supercomputing facility for Office of Science in the US Department of Energy (DOE). Our next production system will be an Intel Xeon Phi Knights Landing (KNL) system, with 60+ cores per node and 4 hardware threads per core. The recommended programming model is hybrid MPI/OpenMP, which also promotes portability across different system architectures.”
In this video from the Intel HPC Developer Conference at SC15, Kent Milfeld from TACC presents: OpenMP and the Intel Compiler. “The OpenMP standard has recently been extended to cover offload and SIMD. The Intel compiler has provided its own implementations of offload and SIMD for some time before the extensions to the OpenMP standard was approved, and that standard is still evolving. This talk describes what you can do with the Intel compiler that you cannot yet do in OpenMP including some where gaps are getting closed soon, and some which will remain for a while. The talk will also highlight where things are done differently between the language interfaces of the Intel compiler and the OpenMP standard. The talk is relevant both to those who seek to port existing code to the OpenMP standard, and to those who are starting afresh.”
Drug discovery has accelerated with the advent of high performance computing and new algorithms. “A structural bioinformatics algorithm, eFindSuite, can be used to demonstrate how moving the code to a highly parallel implementation can speed up the computation, by using both the Intel Xeon processor and the Intel Xeon Phi coprocessor. eFindSuite is implemented in both Fortran 77 and C++.”
The Smith-Waterman algorithm is widely used for pairwise DNA sequence alignment. The computation, which searches for patterns in very long strings over the DNA alphabet, is very demanding. On the Intel Xeon Phi, tremendous performance gains can be obtained, provided the algorithm has been modified to take advantage of parallelism.
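To make the parallelism in Smith-Waterman concrete, here is a minimal sketch in C of the standard scoring recurrence, evaluated along anti-diagonals so that the cells of one diagonal can be computed independently with an OpenMP loop. It is an illustration only, not the implementation discussed in the article: the function name `sw_score`, the scoring constants (match +2, mismatch -1, linear gap -1), and the full-matrix layout are all assumptions chosen for brevity. Without OpenMP enabled at compile time the pragma is ignored and the code runs serially.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative scoring parameters (assumptions, not taken from the article). */
enum { MATCH = 2, MISMATCH = -1, GAP = -1 };

static int max2(int x, int y) { return x > y ? x : y; }

/* Best local alignment score of a and b with a linear gap penalty.
 * Returns -1 if the score matrix cannot be allocated. */
int sw_score(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    const int n = (int)strlen(a);
    const int m = (int)strlen(b);
    const int w = m + 1;                       /* row width of the matrix */

    /* calloc zeroes the matrix, which initialises row 0 and column 0. */
    int *H = calloc((size_t)(n + 1) * (size_t)w, sizeof *H);
    if (H == NULL)
        return -1;

    int best = 0;

    /* Cell (i, j) depends on (i-1, j-1), (i-1, j) and (i, j-1), all of which
     * lie on earlier anti-diagonals. Cells sharing d = i + j are therefore
     * independent of each other and can be computed in parallel. */
    for (int d = 2; d <= n + m; ++d) {
        const int lo = (d - m > 1) ? d - m : 1;
        const int hi = (d - 1 < n) ? d - 1 : n;

        #pragma omp parallel for reduction(max : best)
        for (int i = lo; i <= hi; ++i) {
            const int j = d - i;
            const int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;

            int h = H[(i - 1) * w + (j - 1)] + s;       /* align a[i], b[j] */
            h = max2(h, H[(i - 1) * w + j] + GAP);      /* gap in b */
            h = max2(h, H[i * w + (j - 1)] + GAP);      /* gap in a */
            h = max2(h, 0);                             /* local: never negative */

            H[i * w + j] = h;
            best = max2(best, h);
        }
    }

    free(H);
    return best;
}
```

With these assumed scores, `sw_score("ACGT", "ACGT")` returns 8 (four matches) and `sw_score("ACGT", "AGT")` returns 5 (three matches and one gap). The full matrix is kept only for clarity; for the very long sequences mentioned above, a production code would store just the last few diagonals and would typically also vectorize the inner loop.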
“Our goal is to enable HPC developers to easily port applications across all major CPU and accelerator platforms with uniformly high performance using a common source code base,” said Douglas Miles, director of PGI Compilers & Tools at NVIDIA. “This capability will be particularly important in the race towards exascale computing in which there will be a variety of system architectures requiring a more flexible application programming approach.”