“The next step is to use OpenMP directives to distribute the work across multiple threads and cores. A key OpenMP directive, #pragma omp for collapse(2), collapses the inner two loops into a single iteration space. The developer can then set the number of threads and cores to use and rerun the application to measure the performance. In one test case, three threads per physical core gave the best performance, well ahead of using just one or two threads per core.”
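As a rough illustration of the directive described above, the following C sketch collapses the inner two loops of a triple-nested loop. The function name `sweep_grid`, its parameters, and the trivial loop body are all hypothetical stand-ins, not the code from the test case; the original kernel is not shown in the source. Only the OpenMP constructs themselves (`parallel`, `for collapse(2)`, `omp_set_num_threads`) are standard.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical kernel (an assumption for illustration): run `steps`
 * sweeps over an n-by-m grid stored row-major, adding 1.0 to each cell. */
void sweep_grid(double *grid, int n, int m, int steps, int num_threads)
{
    assert(grid != NULL && n > 0 && m > 0 && num_threads > 0);

#ifdef _OPENMP
    /* Request a team size for the parallel region that follows. */
    omp_set_num_threads(num_threads);
#else
    (void)num_threads; /* built without OpenMP: runs serially */
#endif

    #pragma omp parallel
    {
        /* Every thread executes the outer loop; t is private to each thread. */
        for (int t = 0; t < steps; ++t) {
            /* collapse(2) fuses the i and j loops into one n*m iteration
             * space, which is then divided among the threads in the team. */
            #pragma omp for collapse(2)
            for (int i = 0; i < n; ++i) {
                for (int j = 0; j < m; ++j) {
                    grid[(size_t)i * m + j] += 1.0;
                }
            }
            /* The implicit barrier at the end of the omp for keeps all
             * threads on the same sweep before the next one starts. */
        }
    }
}
```

With a compiler that supports OpenMP (for example, built with `-fopenmp` on GCC or Clang), the thread count can also be set without recompiling through the standard `OMP_NUM_THREADS` environment variable, which makes it easy to rerun the same binary with one, two, or three threads per physical core and compare timings. How threads map onto cores depends on the runtime and on settings such as `OMP_PLACES` and `OMP_PROC_BIND`, so the best ratio is something to measure rather than assume.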