In this sponsored post, Ian Colle, General Manager, AWS Batch and HPC at Amazon Web Services, explores how we are always looking for resources to “do everything we want to do.” What if the number of cores were virtually unlimited?
Best Threads Per Core with Intel Xeon Phi
“When designing an application that has more threads than cores, it is important to understand the optimal number of threads to assign to each core. This value should be parameterized so that tests can easily determine the best value for a given machine. One thread per core on the Intel Xeon Phi processor will give the highest performance per thread. When the number of threads per core is set at two or four, the individual thread performance may be lower, but the aggregate performance will be greater.”
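As a rough illustration of parameterizing threads per core, the sketch below times a command once for each candidate setting. It assumes the application is an OpenMP program that honors the standard OMP_NUM_THREADS environment variable; the function names `thread_settings` and `sweep` are hypothetical and not taken from the article. It only sets the thread count, so pinning threads to cores would need separate affinity settings (for example OMP_PLACES and OMP_PROC_BIND) that are not shown here.

```python
import os
import subprocess
import time


def thread_settings(cores, threads_per_core):
    """Return environment variables for one threads-per-core setting.

    Assumption: the target program reads OMP_NUM_THREADS.
    """
    if cores < 1 or threads_per_core < 1:
        raise ValueError("cores and threads_per_core must be positive")
    return {"OMP_NUM_THREADS": str(cores * threads_per_core)}


def sweep(command, cores, options=(1, 2, 4)):
    """Run `command` once per option and return {threads_per_core: seconds}."""
    timings = {}
    for tpc in options:
        # Copy the current environment and override only the thread count.
        env = dict(os.environ, **thread_settings(cores, tpc))
        start = time.perf_counter()
        subprocess.run(command, env=env, check=True)
        timings[tpc] = time.perf_counter() - start
    return timings
```

A call such as `sweep(["./my_app"], cores=60)`, where `./my_app` stands in for your own binary, would return wall-clock times for one, two, and four threads per core. A single run per setting is noisy, so real tests would likely repeat each measurement.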
MultiLevel Parallelism with Intel Xeon Phi
“Combining MPI and OpenMP is a topic that many developers have explored in search of the optimal solution. Using OpenMP for outer loops with MPI within, or creating separate MPI processes and using OpenMP within, can lead to different levels of performance. In most cases, determining which method will yield the best results requires a deep understanding of the application, not just rearranging directives.”
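One simple way to start exploring the MPI and OpenMP tradeoff is to enumerate how a fixed budget of hardware threads can be split between MPI processes and OpenMP threads per process, then benchmark each split. The sketch below does only the enumeration; the name `hybrid_layouts` is hypothetical, and it assumes every process gets the same number of threads.

```python
def hybrid_layouts(total_threads):
    """List (mpi_ranks, omp_threads_per_rank) pairs whose product is total_threads."""
    if total_threads < 1:
        raise ValueError("total_threads must be positive")
    return [
        (ranks, total_threads // ranks)
        for ranks in range(1, total_threads + 1)
        if total_threads % ranks == 0  # keep only even splits
    ]


# Example: candidate layouts for 8 hardware threads.
for ranks, threads in hybrid_layouts(8):
    print(f"{ranks} MPI ranks x {threads} OpenMP threads")
```

Each pair would typically be handed to your MPI launcher as the process count, with OMP_NUM_THREADS set to the thread count; launcher syntax varies by MPI implementation. The enumeration does not say which layout is best. As the excerpt notes, that depends on the structure of the application.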
Reserved Cores in the Intel Xeon Phi
“In the case of the Intel Xeon Phi coprocessor, although 60 cores are commonly used for computation, there is another core that is available but not traditionally used as part of a simulation. Experiments using the 61st core for actual computation, while running a reverse Monte Carlo ray tracing application for the modeling of radiative heat transfer, demonstrated that the use of another core improved performance and that oversubscribing the coprocessor operating system thread did not degrade the performance.”
Why Modernize Code?
To speed up applications, a developer must learn to take advantage of the multiple threads, cores, and sockets found on a single server or on a cluster. Just hoping for a faster CPU won’t cut it anymore.