Combining MPI and OpenMP in a single application is a topic many developers have explored in search of the best-performing arrangement. Whether to use OpenMP for the outer loops with MPI within, or to create separate MPI processes and use OpenMP within each, can lead to very different levels of performance. In most cases, determining which method yields the best results requires a deep understanding of the application, not just a rearrangement of directives.
Experiments and test cases can be created that compare the results of fairly simple workloads on a range of hardware. Running a set of test applications both on a typical server with the latest Intel Xeon processors and on the Intel Xeon Phi coprocessor can show their relative speeds. As the number of concurrent processes grows, scaling can be measured and comparisons made. Assuming that the mechanics of using OpenMP or MPI do not affect the actual running of the code, the host processor will show better performance in some cases.
An application is more than a simple test run on a number of cores. In practice, the memory system has a large impact on performance. On a host-based system, more memory is available to each processor or core, which can greatly improve performance. Although performance will vary somewhat with the number of threads used, an application developer must consider the memory layout and the amount of memory needed per thread; looking only at the number of cores available for computation is not sufficient.
When examining the performance of lower-level kernels, experiments to determine the optimal number of MPI tasks and OpenMP threads can be performed in a controlled manner. The techniques established this way can later be applied when optimizing a larger system.
Source: Intel, USA