“An interesting aspect to prefetching is the distance ahead of the data that is being used to prefetch more data. This is a critical parameter for success and can be defined as how many iterations ahead to issue a prefetch instruction, and can be referred to as the distance. A compiler will automatically determine the distance to prefetch, and can be determined by looking at the compiler optimization reports.”
“Prefetching on a coprocessor such as the Intel Xeon Phi coprocessor can be more important than on a main CPU such as the Intel Xeon CPUs. Since the cores on the Intel Xeon Phi coprocessor are in-order, they cannot hide memory latency as compared to an out-of-order CPU. In addition, since a coprocessor does not have an L3 cache, L2 misses must then access the slower memory subsystem.”
“In a heterogeneous system that combines both the Intel Xeon CPU and the Intel Xeon Phi coprocessor, there are various options available to optimize applications. Whether one has an advantage over another is somewhat dependent on the application that is being run. Comparisons can be made comparing the two methods, as long as the algorithm lends itself to run and take advantage of either OpenMP or OpenCL.”
“Intel has incorporated Intel Solutions for Lustre Software as part of the Intel SSF because it provides the performance to move data and minimize storage bottlenecks. Lustre is also open source based, and already enjoys a wide foundation of deployments in research around the world, while gaining significant traction in enterprise HPC. Intel’s version of Lustre delivers a high-performance storage solution in the Intel SSF that next-generation HPC needs to move toward the era of Exascale.”
“The combination of using both MPI and OpenMP is a topic that has been explored by many developers in order to determine the most optimum solution. Whether to use OpenMP for outer loops and MPI within, or by creating separate MPI processes and using OpenMP within can lead to various levels of performance. In most cases of determining which method will yield the best results will involve a deep understanding of the application, and not just rearranging directives.”
As multi-socket, then multi-core systems have become the standard, the Message Passing Interface (MPI) has become one of the most popular programming models for applications that can run in parallel using many sockets and cores. Shared memory programming interfaces, such as OpenMP, have allowed developers to take advantage of systems that combine many individual servers and shared memory within the server itself. However, two different programming models have been used at the same time. The MPI 3.0 standard allows for a new MPI interprocess shared memory extension (MPI SHM).
In this week’s industry Perspective, Katie Garrison of One Stop Systems explains how GPUltima allows HPC professionals to create a highly dense compute platform that delivers a petaflop of performance at greatly reduced cost and space requirements.compute power needed to quickly process the amount of data generated in intensive applications.
Matrix multiplies can be decomposed into tiles and executed very fast on the latest generations of coprocessors. Intel has developed the hStreams library that supports task concurrency on heterogeneous platforms. The concurrency may be across nodes (Xeon, KNC, KNL-SB, KNL-LB); within a node for small matrix operations; and in the overlapping of computation and communication, particularly for tiled solutions. It relieves the user of complexity in dealing with thread affinitization, offloading, memory types, and memory affinitization.
Although liquid cooling is considered by many to be the future for data centers, the fact remains that there are some who do not yet need to make a full transformation to liquid cooling. Others are restricted until the next budget cycle. Whatever the reason, new technologies like Internal Loop are more affordable than liquid cooling and can replaces less efficient air coolers. This enables HPC data centers to still utilize the highest performing CPUs and GPUs.
An interesting use of HPC technologies is in the area of understanding the propagation of radio frequency energy in an outdoor environment. “Applications of this type need to be completed in seconds to minutes to be useful. Since the tracing of each ray is independent of another ray, this type of application can be distributed easily among the many cores of the Intel Xeon Phi coprocessor.”