Managing Lots of Tasks for Intel Xeon Phi

“OpenMP, Fortran 2008, and TBB are standards that can help create parallel regions in an application. MKL can also be considered part of this family, because it uses OpenMP within the library. OpenMP is well known, has been in use for quite some time, and continues to be enhanced. By some estimates, as much as 75 percent of HPC cycles are spent running Fortran applications; thus, in order to modernize some of the most significant number crunchers today, Fortran 2008 should be investigated. TBB is for C++ applications only and does not require compiler modifications. An additional benefit of OpenMP and Fortran 2008 is that, as standards, they allow code to be more portable.”

Programming for High Performance Processors

“Managing the work on each node can be referred to as domain parallelism. During the run of the application, the work assigned to each node can generally be isolated from the other nodes. Each node can work on its own and needs little communication with other nodes to perform the work. The main tool at this level is MPI, though developers can also take advantage of frameworks such as Hadoop and Spark (for big data analytics). Managing the work for each core or thread requires one further level of control. This type of work typically invokes a large number of independent tasks that must then share data among themselves.”

Building for the Future Aurora Supercomputer at Argonne

“Argonne National Laboratory has created a process to assist in moving large applications to a new system. Its current HPC system, Mira, will give way to the next-generation system, Aurora, which is part of the Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) joint procurement. Since Aurora contains technology that was not available in Mira, the challenge is to give scientists and developers access to some of the new technology well before the new system goes online. This allows for a more productive environment once the full-scale new system is up.”

Artificial Intelligence Becomes More Accessible

With the advent of heterogeneous computing systems that combine main CPUs with connected processors able to ingest and process tremendous amounts of data and run complex algorithms, artificial intelligence (AI) technologies are beginning to take hold in a variety of industries. Massive datasets can now drive innovation in areas such as autonomous driving, power-grid control, and data-driven business decisions. Read how AI can now be applied across industries using the latest hardware and software.

Speed Your Application with Threading Building Blocks

With modern processors that contain a large number of cores, getting maximum performance requires structuring an application to use as many cores as possible. Explicitly developing a program to do this can take a significant amount of effort. It is important to understand the science and algorithms behind the application, and then use whatever programming techniques are available. “Intel Threading Building Blocks (TBB) can help tremendously in the effort to achieve very high performance for the application.”

How Engility Delivers HPC

In this special feature, our own MichaelS reports on his SC16 meeting with Engility, a premier provider of integrated services for the U.S. government. “Complex High Performance Computing environments require careful planning, deep investigation into the technologies available, and the ability to bring a large system online. Engility is uniquely positioned to work with demanding customers that require close collaboration in order to bring state-of-the-art systems online.”

Fast Networking for Next Generation Systems

“The Intel Omni-Path Architecture is an example of a networking system that has been designed for the Exascale era. There are many features that will enable this massive scaling of compute resources. Features and functionality are designed in at both the host and the fabric levels, which enables very large scaling when all of the components are designed together. Increased reliability is a result of integrating the CPU and fabric, which will be critical as the number of nodes expands well beyond any system in operation today. In addition, tools and software have been designed to be installed and managed across the very large number of compute nodes that will be necessary to achieve this next level of performance.”

Optimizing Your Code for Big Data

Libraries that are tuned to the underlying hardware architecture can increase performance tremendously. Higher-level libraries such as the Intel Data Analytics Acceleration Library (Intel DAAL) can assist the developer with highly tuned algorithms for data analysis as well as machine learning. Intel DAAL functions can be called within other, more comprehensive frameworks that deal with the various types of data and storage, increasing the performance and lowering the development time of a wide range of applications.

Increasing the Efficiency of Storage Systems

Have you ever wondered why your HPC installation is not performing as you had envisioned? You ran small simulations. You spec’d out the CPU speed, the network speed, and the disk drive speed. You optimized your application and are taking advantage of new architectures. But now, as you scale the installation, you realize that the storage system is not performing as expected. Why? You bought the latest disk drives and expected even better than linear performance compared with the last storage system you purchased. Read how you can increase the efficiency of your storage system.

Best Threads Per Core with Intel Xeon Phi

“When designing an application that runs more threads than there are cores, it is important to understand the optimal number of threads to assign to each core. This value should be parameterized, in order to easily run tests to determine the optimum value for a given machine. One thread per core on the Intel Xeon Phi processor will give the highest performance per thread. When the number of threads per core is set to two or four, individual thread performance may be lower, but the aggregate performance will be greater.”