“HPC has reached an inflection point with the convergence of traditional high performance computing and the emerging world of Big Data analytics. Intel’s HPC Scalable System Framework enables an unprecedented level of system balance, performance, and scalability necessary to meet the demands of both compute- and data-intensive workloads, today and well into the future.”
The benefits of nested parallelism on highly threaded applications can be determined and quantified. With the number of cores in both the host CPU (Intel Xeon) and the coprocessor (Intel Xeon Phi) continuing to increase, much thought must be given to minimizing thread overhead when many threads need to be synchronized, and to the memory access of each processor (core). Tasks that can be spread across an entire system to exploit the algorithm’s parallelism should be mapped to NUMA nodes to make them more efficient.
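To make the idea of nested parallelism concrete, here is a minimal sketch in C with OpenMP. It is not taken from the work described above: the function name `nested_sum` and its parameters are hypothetical, and the split of work (an outer team over matrix rows, an inner team over the columns of each row) is only one assumed way to arrange two levels. The `_OPENMP` guards let the same file compile and give the same result serially when OpenMP is not enabled.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sum a rows x cols matrix stored in row-major order.
 * Outer level: one team splits the rows (for example, one thread per NUMA node).
 * Inner level: each outer thread forks a second team over the columns of its row.
 * Hypothetical helper for illustration only. */
double nested_sum(const double *a, int rows, int cols,
                  int outer_threads, int inner_threads)
{
    assert(a != NULL && rows >= 0 && cols >= 0);
    assert(outer_threads > 0 && inner_threads > 0);

#ifdef _OPENMP
    /* Allow two active levels of parallel regions; otherwise the inner
     * region may run with a single thread. */
    omp_set_max_active_levels(2);
#endif

    double total = 0.0;
    #pragma omp parallel for num_threads(outer_threads) reduction(+:total)
    for (int i = 0; i < rows; ++i) {
        double row_sum = 0.0;
        /* Inner team: threads created here synchronize only with each other,
         * which keeps the barrier at the end of the loop small. */
        #pragma omp parallel for num_threads(inner_threads) reduction(+:row_sum)
        for (int j = 0; j < cols; ++j) {
            row_sum += a[(size_t)i * cols + j];
        }
        total += row_sum;
    }
    return total;
}
```

Where the threads land is normally controlled outside the code. With a standard OpenMP runtime, setting the environment variables `OMP_PLACES=sockets` and `OMP_PROC_BIND=spread,close` is one plausible way to spread the outer team across sockets and keep each inner team close to its parent; whether that matches the NUMA layout of a given machine is an assumption that should be checked.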
“Applications can be tuned to use both the Intel Xeon and the Intel Xeon Phi simultaneously, without modifying the code to just run on the coprocessor. Using a number of software tools from Intel, a coupled cluster method can be shown to achieve tremendous performance with excellent scaling.”
“As the use of coprocessors increases to speed up HPC applications, it is important to understand how much additional power the coprocessors use. With various measurements and benchmarks arising to calculate the power used while running compute- and data-intensive applications, measuring the power draw from an Intel Xeon Phi coprocessor is important to understanding the best use of resources.”
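Power readings become comparable across runs once they are turned into energy. The sketch below assumes that power samples in watts have already been collected at a fixed interval by some measurement tool; it does not model any particular Intel utility, and the names `energy_joules` and `average_watts` are hypothetical. It integrates the samples with the trapezoidal rule.

```c
#include <assert.h>
#include <stddef.h>

/* Energy in joules from n power samples (watts) taken every dt_s seconds,
 * using the trapezoidal rule. Fewer than two samples yields 0. */
double energy_joules(const double *watts, size_t n, double dt_s)
{
    assert(watts != NULL && dt_s > 0.0);
    double joules = 0.0;
    for (size_t i = 1; i < n; ++i) {
        joules += 0.5 * (watts[i - 1] + watts[i]) * dt_s;
    }
    return joules;
}

/* Average power over the sampled window: energy divided by elapsed time. */
double average_watts(const double *watts, size_t n, double dt_s)
{
    assert(n >= 2);
    return energy_joules(watts, n, dt_s) / ((double)(n - 1) * dt_s);
}
```

For three samples of 100 W taken one second apart, `energy_joules` returns 200 J. Comparing that figure for a host-only run and a run with the coprocessor active gives an energy-to-solution number rather than a peak power number.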
Designating the appropriate provider for large MPI applications is critical to taking advantage of all of the compute power available. “A modern HPC system with multiple host CPUs and multiple coprocessors, such as the Intel Xeon Phi coprocessor, housed in numerous racks can be optimized for maximum application performance with intelligent thread placement.”
“Combining a host CPU such as an Intel Xeon with a dedicated coprocessor such as the Intel Xeon Phi coprocessor has been shown in many cases to improve the performance of an application by significant amounts. When the datasets are large enough, it makes sense to offload as much of the workload as possible. But is this the case when the potential offload data sets are not as large?”