Large systems, built by linking together tens to thousands of smaller systems, create an environment for running the most complex applications. As the number of cores in both the host CPU (Intel Xeon) and the coprocessor (Intel Xeon Phi) continues to increase, much thought must be given to minimizing thread overhead when many threads need to be synchronized, as well as to the memory access pattern of each processor core. Tasks that are spread across an entire system to exploit an algorithm's parallelism should be mapped to nearby NUMA nodes to make them more efficient.
Thread library overhead is of great concern when a large number of threads is created relative to the amount of work to be done per thread. Using an artificial benchmark that computes the square roots of the elements of an array, the overhead of using a large number of threads can be measured and minimized. The benchmark was implemented on Intel Xeon processors and Intel Xeon Phi coprocessors, together with a range of Intel software tools, and then run with varying numbers of tasks. The efficiency of the application was measured, with a value of 1 representing the optimum.
In Intel TBB terminology, a task arena is an area where worker threads can share and steal tasks. A user-managed task arena allows developers to control the number of tasks that run simultaneously. A hierarchical arena allows the developer to assign a thread to the closest logical core, so that threads entering a second-level arena reside on physically close cores. With this method, neighboring indexes are processed by hardware threads that share the same caches, reducing the QPI latency of cross-socket memory accesses.
The benefits of nested parallelism in highly threaded applications can be determined and quantified. By using the Intel Xeon processor in conjunction with the Intel Xeon Phi coprocessor, memory-hungry applications can be placed and executed so as to avoid using more memory than is available on the device (especially important given the Intel Xeon Phi coprocessor's limited memory). The ability to reduce memory-access latencies across sockets and coprocessors is important for speeding up applications and achieving deterministic execution times.