A Guide to High Performance Computing in the AWS Cloud

A new ebook from Intel and AWS explores how AWS cloud can mitigate some of today’s top HPC challenges and enable engineers and researchers to innovate without constraints. “Most HPC processing is still conducted using on-premises systems. But for those companies unwilling to invest in new hardware to handle peak demands, moving HPC workloads to the cloud may be the answer.”

Performance Gains Using Libraries

In many cases, applications that perform various simulations use some of the same math functions that many other applications use. Rather than each developer recoding the same math functions over and over, libraries, developed by experts can significantly speed up execution of the overall application. Since there can be many optimizations that experts who understand many of the nuances of the hardware would understand, it is important that developers be familiar with various libraries that are made available for HPC types of applications.

Vectorization with AVX-512 Intrinsics

“With the Intel compilers, intrinsics are recognized and the instructions are generated in-line which is a tremendous advantage. Since the Intel Xeon Phi processor when using the AVX-512 intrinsics can perform a tremendous number of floating point operations per second, it is beneficial to use intrinsics for certain math computations. To use intrinsics, all that is needed is the proper header file and then to call the desired intrinsic function.”

Best Threads Per Core with Intel Xeon Phi

“When designing an application that contains many threads and less cores than threads, it is important to understand what is the optimal number of threads that should be assigned to a core. This value should be parameterized, in order to easily run tests to determine which is the optimum value for a given machine. One thread per core on the Intel Xeon Phi processor will give the highest performance per thread. When the number of threads per core is set at two or four, the individual thread performance may be lower, but the aggregate performance will be greater.”

Memory Modes For Increased Performance on Intel Xeon Phi

The Intel Xeon Phi processor supports different types of memory, and can organize this into three types of memory mode. The new processor from Intel contains two type of memory, MCDRAM and DDR memory. These different memory subsystems are complimentary but can be used in different ways, depending on the application that is being executed. “By using these two types of memory in the same system gives flexibility to the overall system and will show an increase in performance for almost any application.”

Maximize Parallelization with Threading On A Core

To get maximum parallelization for an application, not only must the application be developed to take advantage of multiple cores, but should also have the code in place to keep a number of threads working on each core. A modern processor architecture, such as the Intel Xeon Phi processor, can accommodate at least 4 threads for each core. “On the Intel Xeon Phi processor, each of the threads per core is known as a hyper-thread. In this architecture, all of the threads on a core progress through the pipeline simultaneously, producing results much more quickly than if just one thread was used. The processor decides which thread should progress, based on a number of factors, such as waiting for data from memory, instruction availability, and stalls.”

Intel Xeon Phi Processor: A Look at the Basic Architecture

“Designing a new generation of hardware with such high performance needs to make sure that developers understand the basics, and are familiar with the architecture of a new system. Single thread performance with the Intel Xeon Phi processor is significantly better than previous designs. In addition, in order to speed up performance even more, vector processing, where applicable is critical in application performance. With two vector processing units (VPUs) per core, applications can execute two 512-bit vector multiply-add instructions per cycle. Each of these cores can deliver 32 double precision operations per clock cycle. The VPU executes all of the floating point operations as well as legacy instructions from SSE to AVX to the new AVX-512 instructions.”

New AVX-512 Instructions Boost Performance on Intel Xeon Phi

While there have been previous generations of AVX instructions, the AVX-512 instructions can significantly assist the performance of HPC applications. “The new AVX-512 instructions have been designed with developers in mind. High level languages that are used for HPC applications, such as FORTRAN and C/C++, through a compiler will be able to use the new instructions. This can be accomplished through the use of pragmas to direct the compilers to generate the new instructions, or users can use libraries which are tuned to the new technology.”

Intel® Xeon Phi™ Processor—Highly Parallel Computing Engine for HPC

For decades, Intel has been enabling insight and discovery through its technologies and contributions to parallel computing and High Performance Computing (HPC). Central to the company’s most recent work in HPC is a new design philosophy for clusters and supercomputers called Intel® Scalable System Framework (Intel® SSF), an approach designed to enable sustained, balanced performance as the community pushes towards the Exascale era.