Articles and news on parallel programming and code modernization

Internode Programming With MPI and Intel Xeon Phi Processor

“While MPI was originally developed for general-purpose CPUs and is widely used in the HPC space in this capacity, MPI applications can also be developed for and then deployed on the Intel Xeon Phi processor. With an understanding of the algorithms a specific application uses, tremendous performance can be achieved by combining OpenMP and MPI.”
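To make that combination concrete, here is a minimal hybrid MPI + OpenMP sketch of our own (not from the article): one MPI rank per node or socket, with an OpenMP thread team filling the cores inside each rank. Build commands vary by toolchain; `mpicc -fopenmp` is typical.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Ask for an MPI library that tolerates threaded callers. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank spawns its own OpenMP team for on-node parallelism. */
    #pragma omp parallel
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```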

More Than Ever, Vectorization and Multithreading are Essential for Performance

Employing a hybrid of MPI across the nodes of a cluster, multithreading with OpenMP* on each node, and vectorization of the loops within each thread multiplies the performance gains. In fact, because the latest supercomputers trade single-core speed for core count, most application codes will run slower on them than on earlier systems if they run purely sequentially. Adding multithreading and vectorization to applications is therefore now essential for running efficiently on the latest architectures.
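The two on-node layers fit together in a single construct, as in this illustrative sketch (array names and sizes are ours): OpenMP spreads the loop iterations across cores, while the `simd` clause asks the compiler to vectorize each thread's chunk.

```c
#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void)
{
    /* Threads across cores, SIMD lanes within each thread.
       Compile with an OpenMP flag such as -fopenmp or -qopenmp. */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + 2.0f * b[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}
```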

OSC Hosts Fifth MVAPICH Users Group

A broad array of system administrators, developers, researchers and students who share an interest in the MVAPICH open-source library for high performance computing will gather this week for the fifth meeting of the MVAPICH Users Group (MUG). “Dr. Panda’s library is a cornerstone for HPC machines around the world, including OSC’s systems and many of the Top 500,” said Dave Hudak, Ph.D., interim executive director of OSC. “We’ve gained a lot of insight and expertise from partnering with DK and his research group throughout the years.”

Feed The Cores – Memory Bandwidth Usage

“Memory bandwidth to the CPU has always been important; cores typically sit idle waiting for data that is not in cache to arrive from main memory. With the advanced memory capabilities of the Intel Xeon Phi processor, however, there are new concepts to understand and take advantage of.”
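One of those new concepts is the processor's on-package MCDRAM. As a hedged sketch, assuming the open-source memkind library and its hbwmalloc interface (link with -lmemkind), a bandwidth-bound array can be steered into high-bandwidth memory like this:

```c
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1 << 24;  /* illustrative size: 16M doubles */

    /* hbw_check_available() returns 0 when high-bandwidth memory exists. */
    int use_hbw = (hbw_check_available() == 0);

    /* Allocate from MCDRAM if we can, ordinary DDR otherwise. */
    double *x = use_hbw ? hbw_malloc(n * sizeof *x)
                        : malloc(n * sizeof *x);
    if (!x) return 1;

    for (size_t i = 0; i < n; i++)
        x[i] = (double)i;

    printf("x[n-1] = %f\n", x[n - 1]);

    /* Free with the allocator that was actually used. */
    if (use_hbw) hbw_free(x); else free(x);
    return 0;
}
```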

3X Performance Boost Using Intel Advisor and Intel Trace Analyzer in Astrophysics Simulations

On today’s processors, it is crucial to both vectorize (using AVX* or SIMD* instructions) and parallelize software to realize a chip's full performance potential. By optimizing their MHD astrophysics applications with tools from Intel Parallel Studio XE, and running on the latest Intel hardware, the NSU team achieved a 3X speed-up, cutting the time for calculating one standard problem from one week to just two days.

Speeding Up Big Data Analysis With Intel MKL and Intel DAAL

“New algorithms that can query massive amounts of data and draw conclusions from it have been developed, but these algorithms need to be optimized for the underlying hardware. This is where the expertise of the vendors who develop that hardware can add tremendous value. Optimizing the underlying libraries to execute with a high degree of parallelism will definitely lead to improved performance for the software and productivity gains for the organization.”
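As one small sketch of what "optimized underlying libraries" looks like in practice, Intel MKL's vector math layer replaces a hand-written loop of scalar exp() calls with a single tuned routine (link against MKL; the data here is illustrative):

```c
#include <mkl.h>
#include <stdio.h>

int main(void)
{
    const MKL_INT n = 1000;
    double a[1000], r[1000];

    for (MKL_INT i = 0; i < n; i++)
        a[i] = (double)i / n;

    /* One call computes exp() over the whole array with code
       tuned (and threaded) for the underlying hardware. */
    vdExp(n, a, r);

    printf("exp(%f) = %f\n", a[n - 1], r[n - 1]);
    return 0;
}
```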

Go with Intel® Data Analytics Acceleration Library and Go*

The Go* programming language and its developer community have grown significantly since the language's official launch by Google in 2009. Like many popular programming languages (C and Java come to mind), Go started as an experiment to design a new language that would fix some of the common problems of other languages while staying true to the basic tenets of modern programming: be scalable, productive, and readable, enable robust development environments, and support networking and multiprocessing.

Cycles Per Instruction – Why It Matters

Because CPI is the ratio of CPU cycles consumed to instructions executed, comparing one version of a code region against another requires watching both counters, not just the ratio. If an optimization changes cycles and instructions in the same proportion, the CPI stays flat and masks the improvement: a loop that retires 20% fewer instructions in 20% fewer cycles shows the same CPI even though it is genuinely faster. The goal is to lower the CPI in the hot parts of the code as well as across the overall application, while confirming that total cycles actually drop.
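A toy numeric illustration of that pitfall (the counter values are hypothetical):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical hardware-counter readings for two code versions. */
    double cycles_a = 1.0e9, instrs_a = 5.0e8;  /* version A */
    double cycles_b = 8.0e8, instrs_b = 4.0e8;  /* version B */

    printf("CPI A = %.2f\n", cycles_a / instrs_a);   /* 2.00 */
    printf("CPI B = %.2f\n", cycles_b / instrs_b);   /* 2.00: identical */
    printf("B uses %.0f%% fewer cycles\n",
           100.0 * (1.0 - cycles_b / cycles_a));     /* yet B is faster */
    return 0;
}
```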

Performance Gains Using Libraries

In many cases, applications that perform simulations use the same math functions that many other applications use. Rather than each developer re-coding those functions over and over, libraries developed by experts can significantly speed up execution of the overall application. Because such experts apply optimizations that exploit the many nuances of the hardware, developers should be familiar with the various libraries available for HPC applications.
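A brief sketch of the trade-off, using the CBLAS interface shipped with Intel MKL (or any BLAS): the dot product below could be coded by hand, but the library version already encodes the vectorization an expert would apply.

```c
#include <mkl.h>    /* or <cblas.h> with a generic BLAS */
#include <stdio.h>

int main(void)
{
    const MKL_INT n = 1000;
    double x[1000], y[1000];

    for (MKL_INT i = 0; i < n; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* Hand-written version, for comparison. */
    double naive = 0.0;
    for (MKL_INT i = 0; i < n; i++)
        naive += x[i] * y[i];

    /* Library version: same result, tuned implementation. */
    double fast = cblas_ddot(n, x, 1, y, 1);

    printf("naive = %f, cblas_ddot = %f\n", naive, fast);
    return 0;
}
```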

Deep Learning Frameworks Get a Performance Benefit from Intel MKL Matrix-Matrix Multiplication

Intel® Math Kernel Library 2017 (Intel® MKL 2017) includes new GEMM kernels that are optimized for various skewed matrix sizes. The new kernels take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and achieve high GEMM performance on multicore and many-core Intel® architectures, particularly for the matrix shapes arising from deep neural networks.
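As a hedged sketch of the kind of skewed GEMM a neural-network layer produces (the dimensions are illustrative, not from the article): a tall-skinny activation matrix times a wide weight matrix, issued as one cblas_sgemm call so MKL's optimized kernels can be used.

```c
#include <mkl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* C(m x n) = A(m x k) * B(k x n): small batch, wide layers. */
    const MKL_INT m = 32, k = 4096, n = 1024;

    float *A = malloc((size_t)m * k * sizeof *A);
    float *B = malloc((size_t)k * n * sizeof *B);
    float *C = malloc((size_t)m * n * sizeof *C);
    if (!A || !B || !C) return 1;

    for (MKL_INT i = 0; i < m * k; i++) A[i] = 0.001f * i;
    for (MKL_INT i = 0; i < k * n; i++) B[i] = 0.002f;

    /* Single-precision GEMM, row-major, no transposes. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);

    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}
```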