Modern CPUs contain multiple cores, and each core can often run several hardware threads at once. Because many applications have been written or rewritten to run many threads simultaneously, understanding the capabilities and limits of the underlying hardware is essential to maximizing performance.
When an application has more threads than there are cores, it is important to determine the optimal number of threads to assign to each core. This value should be parameterized so that tests can easily be run to find the optimum for a given machine. On the Intel Xeon Phi processor, one thread per core gives the highest performance per thread. With two or four threads per core, individual thread performance may be lower, but aggregate performance will be greater.
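One way to parameterize the threads-per-core value is to read it from the environment at startup. The following is a minimal sketch, not a prescribed method: the variable name `APP_THREADS_PER_CORE`, both function names, and the 64-core figure in the usage note are hypothetical and chosen only for illustration. The allowed range of one to four reflects the thread counts discussed here.

```python
import os


def threads_per_core_from_env(var="APP_THREADS_PER_CORE", default=1,
                              allowed=(1, 2, 3, 4)):
    """Read the threads-per-core setting from an environment variable.

    The variable name is an assumption made for this sketch. If the
    variable is unset, the default is returned.
    """
    raw = os.environ.get(var)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{var} must be an integer, got {raw!r}")
    if value not in allowed:
        raise ValueError(f"{var} must be one of {allowed}, got {value}")
    return value


def total_threads(cores, threads_per_core):
    """Total number of application threads for a given core count."""
    if cores < 1 or threads_per_core < 1:
        raise ValueError("cores and threads_per_core must be positive")
    return cores * threads_per_core
```

For example, on a hypothetical 64-core machine, setting `APP_THREADS_PER_CORE=2` before launch would make `total_threads(64, threads_per_core_from_env())` return 128, so the same binary can be retested with a different setting without rebuilding.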
Each thread that runs on a core competes for memory resources, so the more threads that are running, the greater the demand on memory. However, if the application is sensitive to memory latency, running more threads on a core is better, because it increases the chance that at any given time some thread has the data it needs ready for processing.
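The latency-hiding argument can be illustrated with a toy model. This is a rough sketch under strong assumptions, not a description of the actual hardware: each thread is assumed to alternate a fixed number of compute cycles with a fixed number of stall cycles waiting on memory, stalls from different threads are assumed to overlap perfectly, and contention for memory bandwidth is ignored. The function name and cycle counts are hypothetical.

```python
def core_utilization(threads_per_core, compute_cycles, stall_cycles):
    """Estimate the fraction of time a core has a thread ready to run.

    Toy model: one thread keeps the core busy for
    compute_cycles / (compute_cycles + stall_cycles) of the time.
    Additional threads fill the stall gaps until the core saturates
    at a utilization of 1.0.
    """
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    if compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("compute_cycles must be > 0 and stall_cycles >= 0")
    busy_fraction = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads_per_core * busy_fraction)


if __name__ == "__main__":
    # Illustrative numbers only: 20 cycles of work per 60 cycles of stall.
    for n in (1, 2, 3, 4):
        print(n, core_utilization(n, compute_cycles=20, stall_cycles=60))
```

In this model a latency-bound thread leaves the core idle most of the time, so adding threads raises aggregate utilization. Because the model ignores the competition for memory resources described above, it overstates the benefit for real workloads.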
Because of the microarchitecture of the Intel Xeon Phi processor, three threads per core are less likely to perform better than two or four threads per core. However, researchers have found that in some cases three threads per core do perform well, which suggests that the application is not sensitive to available memory resources and that other bottlenecks must be investigated.
To get top performance, using only one thread per core is probably the best choice, with two threads per core the next best alternative. Once an application is optimized and tuned, it is important that the developer experiment with the number of threads per core that might be used in production. The ability to choose this value at run time is very important for getting the highest performance from a given application.
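The experiment described above can be automated with a small sweep that times the application at each candidate setting. This is a minimal sketch under stated assumptions: `run_workload` is a hypothetical callable that runs the tuned application with the total thread count it is given, and `sweep_threads_per_core` is an illustrative name. Pinning threads to cores is left to the workload and is not handled here.

```python
import time


def sweep_threads_per_core(run_workload, cores, candidates=(1, 2, 3, 4),
                           repeats=3, timer=time.perf_counter):
    """Time a workload at each threads-per-core setting.

    run_workload is called with the total thread count
    (cores * threads_per_core). The best of `repeats` runs is kept
    for each setting to reduce noise. Returns the fastest setting
    and a dict mapping each setting to its best elapsed time.
    """
    timings = {}
    for tpc in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = timer()
            run_workload(cores * tpc)
            best = min(best, timer() - start)
        timings[tpc] = best
    fastest = min(timings, key=timings.get)
    return fastest, timings
```

The returned setting is only as good as the workload used to measure it, so the sweep is best run with production-like input on the target machine, and the winning value then supplied to the application at run time.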