Modern processors contain a large number of cores, so getting maximum performance requires structuring an application to use as many of them as possible. Explicitly developing a program to do this can take significant effort. It is important to understand the science and algorithms behind the application, and then use whatever programming techniques are available. “Intel Threading Building Blocks (TBB) can help tremendously in the effort to achieve very high performance for the application.”
“When designing an application that contains many threads and fewer cores than threads, it is important to understand the optimal number of threads to assign to each core. This value should be parameterized, so that tests can easily be run to determine the optimum value for a given machine. One thread per core on the Intel Xeon Phi processor will give the highest performance per thread. When the number of threads per core is set at two or four, the individual thread performance may be lower, but the aggregate performance will be greater.”
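As a rough illustration of parameterizing the thread count, the sketch below takes the number of threads per core as an input and times a simple parallel sum. The helper names (`total_threads`, `parallel_sum`, `time_run`) are hypothetical and not from the quoted article, and the sketch makes no attempt to pin threads to specific cores, which is platform-specific.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: total software threads for a threads-per-core setting.
// `cores` is passed in so the same sweep can be repeated on any machine.
inline unsigned total_threads(unsigned cores, unsigned threads_per_core) {
    assert(cores > 0 && threads_per_core > 0);
    return cores * threads_per_core;
}

// Sum `data` with `num_threads` workers, one contiguous chunk per worker.
inline double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            partial[t] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0.0);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

// Wall-clock seconds for one run at the given thread count.
inline double time_run(const std::vector<double>& data, unsigned num_threads) {
    const auto start = std::chrono::steady_clock::now();
    volatile double sink = parallel_sum(data, num_threads);
    (void)sink;  // keep the result alive so the work is not optimized away
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A test harness might call `time_run(data, total_threads(cores, k))` for `k` of 1, 2, and 4 and keep the fastest setting for that machine; a real application would substitute its own kernel for the sum.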
Accelerated computing continues to gain momentum as the HPC community moves towards Exascale. Our recent Tesla P100 GPU review shows how these accelerators are opening up new worlds of performance vs. traditional CPU-based systems and even vs. NVIDIA’s previous K80 GPU product. We’ve got benchmarks, case studies, and more in the insideHPC Research Report on GPU Accelerators.
Data center sprawl is now understood to be expensive and may not deliver performance increases for all types of applications, so new technologies are coming to the rescue. A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing, hence “field-programmable”. While the use of GPUs and HPC accelerators is generally understood today, there are a number of misconceptions about FPGAs that need to be cleared up.
The Intel Xeon Phi processor supports different types of memory and can organize them into three memory modes. The new processor from Intel contains two types of memory, MCDRAM and DDR. These memory subsystems are complementary but can be used in different ways, depending on the application being executed. “Using these two types of memory in the same system gives flexibility to the overall system and will show an increase in performance for almost any application.”
FPGAs will become increasingly important for organizations that have a wide range of applications that can benefit from performance increases. Rather than taking a brute-force approach to increasing data center performance by purchasing and maintaining racks of hardware, with the associated costs, organizations may find that FPGAs can equal or exceed the performance of additional servers while reducing costs as well.
To get maximum parallelization, an application must not only be developed to take advantage of multiple cores, but should also have code in place to keep a number of threads working on each core. A modern processor architecture, such as the Intel Xeon Phi processor, can run four hardware threads on each core. “On the Intel Xeon Phi processor, each of the threads per core is known as a hyper-thread. In this architecture, all of the threads on a core progress through the pipeline simultaneously, producing results much more quickly than if just one thread were used. The processor decides which thread should progress, based on a number of factors, such as waiting for data from memory, instruction availability, and stalls.”
“Designing a new generation of hardware with such high performance means making sure that developers understand the basics and are familiar with the architecture of the new system. Single thread performance with the Intel Xeon Phi processor is significantly better than previous designs. In addition, in order to speed up performance even more, vector processing, where applicable, is critical to application performance. With two vector processing units (VPUs) per core, applications can execute two 512-bit vector multiply-add instructions per cycle. Each core can therefore deliver 32 double-precision operations per clock cycle. The VPU executes all of the floating point operations as well as legacy instructions from SSE to AVX to the new AVX-512 instructions.”
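The figure of 32 operations per cycle follows from the numbers in the quote: two VPUs, each working on a 512-bit vector of 64-bit doubles (8 lanes), with each multiply-add counted as two operations. The short sketch below spells out that arithmetic; the function name `dp_ops_per_cycle` is hypothetical, and the calculation assumes every VPU issues one fused multiply-add per cycle, which is a peak figure rather than a sustained rate.

```cpp
// Peak double-precision operations per core per clock cycle, assuming each
// VPU issues one fused multiply-add per cycle (an idealized upper bound).
constexpr unsigned dp_ops_per_cycle(unsigned vpus_per_core, unsigned vector_bits) {
    constexpr unsigned bits_per_double = 64;
    constexpr unsigned ops_per_fma = 2;  // one multiply plus one add per lane
    return vpus_per_core * (vector_bits / bits_per_double) * ops_per_fma;
}

// 2 VPUs * 8 lanes * 2 operations = 32, the value quoted in the text.
static_assert(dp_ops_per_cycle(2, 512) == 32, "2 VPUs at 512 bits give 32 ops/cycle");
```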
Since 2008, Intel and Cray have rapidly increased their collaboration to the benefit of the supercomputing market and customers. “Most recently, Cray has announced win after win for its Cray XC series systems that feature the Intel Xeon Phi processor, code-named Knights Landing and Knights Hill, which offers peak performance of over half-a-petaflop per cabinet, a 2X performance boost over previous generations. Cray is leading the charge toward many-core-CPU systems that boost application performance without the aid of GPUs.”
While there have been previous generations of AVX instructions, the AVX-512 instructions can significantly improve the performance of HPC applications. “The new AVX-512 instructions have been designed with developers in mind. High-level languages that are used for HPC applications, such as FORTRAN and C/C++, will be able to use the new instructions through a compiler. This can be accomplished through the use of pragmas to direct the compilers to generate the new instructions, or users can use libraries which are tuned to the new technology.”
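As one possible illustration of the pragma route, the sketch below marks a multiply-add loop with the OpenMP `simd` directive. The function name `axpy` is an assumption for this example, and the article does not say which pragmas it has in mind. Whether AVX-512 instructions are actually emitted depends on the compiler and on target flags (for example `-fopenmp-simd` together with an AVX-512 target option in GCC or Clang); without such flags the pragma is ignored and the loop still computes the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y[i] = a * x[i] + y[i] for every element.
// The pragma asks the compiler to vectorize the loop; it does not by itself
// guarantee any particular instruction set.
inline void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* xp = x.data();
    double* yp = y.data();
    const std::size_t n = x.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Inspecting the compiler's vectorization report or the generated assembly is the usual way to confirm which vector instructions were chosen.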