“When designing an application that contains many threads and less cores than threads, it is important to understand what is the optimal number of threads that should be assigned to a core. This value should be parameterized, in order to easily run tests to determine which is the optimum value for a given machine. One thread per core on the Intel Xeon Phi processor will give the highest performance per thread. When the number of threads per core is set at two or four, the individual thread performance may be lower, but the aggregate performance will be greater.”
The Intel Xeon Phi processor supports different types of memory, and can organize this into three types of memory mode. The new processor from Intel contains two type of memory, MCDRAM and DDR memory. These different memory subsystems are complimentary but can be used in different ways, depending on the application that is being executed. “By using these two types of memory in the same system gives flexibility to the overall system and will show an increase in performance for almost any application.”
To get maximum parallelization for an application, not only must the application be developed to take advantage of multiple cores, but should also have the code in place to keep a number of threads working on each core. A modern processor architecture, such as the Intel Xeon Phi processor, can accommodate at least 4 threads for each core. “On the Intel Xeon Phi processor, each of the threads per core is known as a hyper-thread. In this architecture, all of the threads on a core progress through the pipeline simultaneously, producing results much more quickly than if just one thread was used. The processor decides which thread should progress, based on a number of factors, such as waiting for data from memory, instruction availability, and stalls.”
“Designing a new generation of hardware with such high performance needs to make sure that developers understand the basics, and are familiar with the architecture of a new system. Single thread performance with the Intel Xeon Phi processor is significantly better than previous designs. In addition, in order to speed up performance even more, vector processing, where applicable is critical in application performance. With two vector processing units (VPUs) per core, applications can execute two 512-bit vector multiply-add instructions per cycle. Each of these cores can deliver 32 double precision operations per clock cycle. The VPU executes all of the floating point operations as well as legacy instructions from SSE to AVX to the new AVX-512 instructions.”
While there have been previous generations of AVX instructions, the AVX-512 instructions can significantly assist the performance of HPC applications. “The new AVX-512 instructions have been designed with developers in mind. High level languages that are used for HPC applications, such as FORTRAN and C/C++, through a compiler will be able to use the new instructions. This can be accomplished through the use of pragmas to direct the compilers to generate the new instructions, or users can use libraries which are tuned to the new technology.”
For decades, Intel has been enabling insight and discovery through its technologies and contributions to parallel computing and High Performance Computing (HPC). Central to the company’s most recent work in HPC is a new design philosophy for clusters and supercomputers called Intel® Scalable System Framework (Intel® SSF), an approach designed to enable sustained, balanced performance as the community pushes towards the Exascale era.