The Intel Omni-Path Architecture (Intel® OPA) whitepaper goes through the multitude of improvements that Intel OPA technology provides to the HPC community. In particular, HPC readers will appreciate how collective operations can be optimized based on message size, collective communicator size and topology using the point-to-point send and receive primitives.
Applications that use 3D Finite Difference (3DFD) calculations are numerically intensive and can be optimized quite heavily to take advantage of accelerators that are available in today’s systems. The performance of an implementation can and should be optimized using numerical stencils. Choices made when designing and implementing algorithms can affect the Arithmetic Intensity (AI), which is a measure of how efficient an implementation, by comparing the flops and memory access.
“OpenCL is a fairly new programming model that is designed to help programmers get the most out of a variety of processing elements in heterogeneous environments. Many benchmarks that are available have demonstrated that excellent performance can be obtained over a wide variety of devices. Rather than lock an application into one specific accelerator, by using OpenCL, applications can be run over on a number of different architectures with each showing excellent speedups over a native (host cpu) implementation.”
Many Universities, private research labs and government research agencies have begun using High Performance Computing (HPC) servers, compute accelerators and flash storage arrays to accelerate a wide array of research among disciplines in math, science and engineering. These labs utilize GPUs for parallel processing and flash memory for storing large datasets. Many universities have HPC labs that are available for students and researchers to share resources in order to analyze and store vast amounts of data more quickly.
The Embree kernel approach, using the Intel Xeon Phi coprocessor is applicable to many situations. The implementation can be tuned to the hardware available, using different vector widths and workloads per ray. With a flexible toolkit for rendering, applications can take advantage of the latest hardware acceleration to achieve maximum performance.
Companies already using High-performance Computing (HPC) with a Lustre file system for simulations, such as those in the financial, oil and gas, and manufacturing sectors, want to convert some of their HPC cycles to Big Data analytics. This puts Lustre at the core of the convergence of Big Data and HPC.
“An expanding area of work both on the hardware front and the software side is to modify and optimize applications to run on both the host processor and a coprocessor. Many techniques to transform applications to reduce runtime have been discussed and implemented across a wide variety of applications.”
The benefits of nested parallelism on highly threaded applications can be determined and quantified. With the number of cores in both the host CPU (Intel Xeon) and the coprocessor (Intel Xeon Phi) continues to increase, much thought must be given to minimizing the thread overhead when many threads need to be synchronized, as well as the memory access for each processor (core). Tasks that can be spread across an entire system to exploit the algorithm’s parallelism, should be mapped to the NUMA node to make them more efficient.
“Applications can be tuned to use both the Intel Xeon and the Intel Xeon Phi simultaneously, without modifying the code to just run on the coprocessor. Using a number of software tools from Intel, performance of a coupled cluster method can be demonstrated to gain a tremendous performance with excellent scaling.”