The Intel Xeon Phi processor, also known as Knights Landing, is a processor with a very high core count. Because it is binary compatible with Intel Xeon CPUs, the same programming models that work on the main CPU also work on the Intel Xeon Phi processor. These programming models should focus on modifying the application to take advantage of the native parallelism of the processor. With this much parallelism available, an application that is designed to run in parallel can achieve higher performance than before.
The Message Passing Interface (MPI) has been used extensively to distribute an application over many systems. While MPI is still needed when one application uses a number of Intel Xeon Phi processors, other forms of task-level parallelism should also be considered: OpenMP, the Fortran 2008 DO CONCURRENT construct, and Threading Building Blocks (TBB). There are alternatives that may lead to even better performance, but these techniques may not be as portable as those mentioned above and may be harder to maintain.
These models work for tasks that share a single memory space. Threads and tasks that are created on the Intel Xeon Phi processor remain on the system on which they were created.
Loops are typically the first place that developers look in order to keep the many cores busy on the Intel Xeon Phi processor. When deciding at what level to create a new thread, the higher the better. Each opportunity to create a thread should be investigated, and the work can be handed to the compiler by using the OpenMP parallel loop directive (PARALLEL DO) or the Fortran 2008 DO CONCURRENT construct. As noted above, it is advantageous to give as much work to each thread as possible. For a loop with millions of iterations, creating millions of tasks is not wise, because there is overhead in the system for each thread that is created and then destroyed.
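The advice to parallelize at the highest loop level can be sketched in C++ with OpenMP. This is a minimal illustration, not code from the original text: the function name row_sums and the row-major matrix layout are assumptions made for the example. The directive is placed on the outer loop so that each thread receives a block of whole rows rather than one tiny task per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original text): sums each row of a
// row-major matrix stored in a flat vector.
std::vector<double> row_sums(const std::vector<double>& m,
                             std::ptrdiff_t rows, std::ptrdiff_t cols) {
    assert(rows >= 0 && cols >= 0);
    assert(m.size() == static_cast<std::size_t>(rows) *
                       static_cast<std::size_t>(cols));

    std::vector<double> sums(static_cast<std::size_t>(rows), 0.0);

    // Parallelize the OUTER loop only. With schedule(static) and no chunk
    // size, the iterations are split into roughly equal contiguous chunks,
    // at most one per thread, so threads are not created per iteration.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t r = 0; r < rows; ++r) {
        double acc = 0.0;  // private to the iteration, so no sharing
        for (std::ptrdiff_t c = 0; c < cols; ++c) {
            acc += m[static_cast<std::size_t>(r * cols + c)];
        }
        sums[static_cast<std::size_t>(r)] = acc;  // each row written once
    }
    return sums;
}
```

Calling row_sums on a 2 x 3 matrix holding 1 through 6 yields the two row totals. If the compiler is run without OpenMP enabled, the pragma is ignored and the loop simply runs serially, which makes this kind of change low risk to try.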
OpenMP, Fortran 2008, and TBB, as mentioned above, can all help to create parallel regions in an application. The Intel Math Kernel Library (MKL) could also be considered part of this family, because it uses OpenMP within the library. OpenMP is well known, has been used for quite some time, and continues to be enhanced. While Fortran 2008 is for Fortran only, remember that significant portions of many clusters still run Fortran-based applications. Some estimates are that as much as 75% of the cycles used are for Fortran applications. Thus, to modernize some of the most significant number crunchers in use today, Fortran 2008 should be investigated. TBB is for C++ applications only; it is a library and does not require compiler modifications. An additional benefit of using OpenMP and Fortran 2008 is that they are standards, which makes code more portable.
There are other areas in an application where creating many tasks would be worthwhile. By examining the structure of the application, the developer is best placed to decide which sections of the code to run in parallel. Loops are a great starting place, but looking higher in the application may yield even more performance.
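As one possible illustration of parallelism above the loop level, the sketch below uses the OpenMP sections construct to run two independent phases of work at the same time. The Stats struct and the summarize function are hypothetical names invented for this example, and the two phases are deliberately simple stand-ins for larger independent stages of a real application.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical result type for the example.
struct Stats {
    double sum;
    double max;
};

// Hypothetical function: runs two independent phases over the same
// read-only input. Each phase writes to a different member of `out`,
// so the two sections never touch the same memory location.
Stats summarize(const std::vector<double>& data) {
    assert(!data.empty());
    Stats out{0.0, data.front()};

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            // Phase 1: total of all values.
            out.sum = std::accumulate(data.begin(), data.end(), 0.0);
        }
        #pragma omp section
        {
            // Phase 2: largest value.
            out.max = *std::max_element(data.begin(), data.end());
        }
    }
    return out;
}
```

Sections suit a small, fixed number of independent phases. When the phases themselves contain large loops, combining this with loop-level parallelism inside each phase is a separate design decision that depends on how many cores each phase can keep busy.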