Best Threads Per Core with Intel Xeon Phi

Sponsored Post

Modern CPUs contain many cores, and each core can run several hardware threads while an application executes. Because many applications have been written or rewritten to run many threads simultaneously, it is important to understand the capabilities and limits of the underlying hardware to maximize performance.

When designing an application that runs more threads than there are cores, it is important to determine the optimal number of threads to assign to each core. This value should be a parameter, so that tests can easily be run to find the optimum for a given machine. One thread per core on the Intel Xeon Phi processor gives the highest performance per thread. With two or four threads per core, individual thread performance may be lower, but aggregate performance will be greater.
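
One way to make the value a parameter is to derive all launch settings from a single threads-per-core number. The sketch below is only an illustration: the helper name `openmp_env` and the 64-core count are hypothetical, and it assumes an OpenMP application run with the Intel OpenMP runtime, which reads `KMP_HW_SUBSET` alongside the standard `OMP_NUM_THREADS`.

```python
def openmp_env(cores, threads_per_core, max_threads_per_core=4):
    """Build environment settings for one threads-per-core choice.

    Assumes an OpenMP program; KMP_HW_SUBSET is specific to the
    Intel OpenMP runtime and is used here as an assumption.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 1 <= threads_per_core <= max_threads_per_core:
        raise ValueError("threads_per_core out of range")
    return {
        # Total software threads = cores x threads per core.
        "OMP_NUM_THREADS": str(cores * threads_per_core),
        # Restrict the runtime to this many cores and threads per core.
        "KMP_HW_SUBSET": f"{cores}c,{threads_per_core}t",
    }


if __name__ == "__main__":
    # Hypothetical 64-core machine; print the settings for each candidate.
    for tpc in (1, 2, 3, 4):
        print(tpc, openmp_env(64, tpc))
```

The resulting dictionary could be merged into the environment passed to the benchmark process, so that changing one number is enough to test a different configuration.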

Each thread that runs on a core competes for memory resources, so the more threads that run, the greater the demand on memory. However, if the application is sensitive to memory latency, running more threads per core is better, because it increases the chance that the data some thread needs is ready for processing at any given time.
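
The latency argument can be illustrated with a toy model. This is a simplification, not a description of the Intel Xeon Phi hardware. It assumes each thread alternates a fixed number of compute cycles with a fixed number of cycles stalled on memory, and that the core can always switch to another ready thread. The function name and the cycle counts are made up for the example, and the model deliberately ignores the competition for memory resources described above.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Estimate the fraction of cycles a core spends on useful work.

    Toy model: every thread computes for compute_cycles, then waits
    stall_cycles on memory. Other threads fill the waiting time, so
    utilization grows with the thread count until the core is saturated.
    Bandwidth and cache contention are not modeled.
    """
    if threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("invalid model parameters")
    busy_fraction_per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * busy_fraction_per_thread)


if __name__ == "__main__":
    # Hypothetical latency-bound workload: 20 cycles of work per 80-cycle stall.
    for n in (1, 2, 3, 4):
        print(n, round(core_utilization(n, 20, 80), 2))
```

In this model a latency-bound thread leaves the core idle most of the time, so each additional thread raises utilization. A compute-bound thread with few stalls saturates the core almost on its own.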

Due to the microarchitecture of the Intel Xeon Phi processor, three threads per core is less likely to perform better than two or four threads per core. However, researchers have found that in some cases three threads per core does perform well, which indicates that the application is not sensitive to available memory resources and that there are other bottlenecks that must be investigated.

To get top performance, only using one thread per core is probably the best choice, with two threads per core the next best alternative. Once an application is optimized and tuned, the developer should experiment with the number of threads per core that might be used in production. The ability to choose this value at run time is very important for getting the highest performance for a given application.
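
The experiment described above can be sketched as a small timing sweep. This is a minimal illustration under stated assumptions: `run` stands for any caller-supplied function that executes the workload with the given threads-per-core setting, and the environment variable name `APP_THREADS_PER_CORE` is hypothetical.

```python
import os
import statistics
import time


def threads_per_core_from_env(default=1, var="APP_THREADS_PER_CORE"):
    """Read the threads-per-core choice at run time (hypothetical variable)."""
    value = int(os.environ.get(var, default))
    if value not in (1, 2, 3, 4):
        raise ValueError(f"{var} must be 1, 2, 3, or 4")
    return value


def best_threads_per_core(run, candidates=(1, 2, 3, 4), repeats=3):
    """Time run(tpc) for each candidate and return (best, median_timings)."""
    timings = {}
    for tpc in candidates:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(tpc)  # caller-supplied workload launcher
            samples.append(time.perf_counter() - start)
        # The median reduces the effect of a single noisy run.
        timings[tpc] = statistics.median(samples)
    best = min(timings, key=timings.get)
    return best, timings
```

The sweep would be run once per machine and workload. Production runs could then read the chosen value through `threads_per_core_from_env()` instead of hard-coding it.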

Comments

  1. What are developers at commercial ISVs, say in CAE, seeing in terms of performance here?

  2. >To get top performance, only using one thread per core is probably the best choice,

    Well, such a generic statement is meaningless; even with the word “probably” it makes little sense without reference to the kind of algorithms being used for benchmarking.
    Brute-force, memory-intensive, well-optimized algorithms may very likely perform better with a single thread per core, but most “smart-adaptive” algorithms, the type that spend most of their CPU time minimizing accesses to main memory, perform far better when using all hyper-threading/logical cores. All tests relevant to my development show, on average, a twofold performance gain with four-way HT versus HT disabled. Frankly, GPU/CUDA is perfect for brute-force algorithms, but the brute-force approach scales badly, which is why Phi/MIC development is so welcome. My preference would be to double or triple the number of cores by sacrificing AVX (SSE2 would be enough).