Intel series on developing multithreaded applications

Today the Parallel Programming Community on the Intel Software Network is publishing a collection of technical papers to give developers additional support as they learn, or improve their use of, Intel’s large tool suite for parallel programming. I like that they are publishing a series of short papers rather than a monolithic book, because it gives developers more direct entry into just the content they need.

There are 25 papers in the series so far, covering a wide range of topics from broad subjects (e.g., Granularity and Parallel Performance or Automatic Parallelization with Intel Compilers) to very narrowly focused guidance (e.g., Avoiding and Identifying False Sharing Among Threads). Taken as a whole, the series is designed so that an application developer can use the lessons and insight to improve multithreading performance on current and future Intel architectures (the whole idea of scaling forward).

Intel’s Aaron Tersteeg generously offered to give me a sneak peek at three of the papers in the series: Getting Code Ready for Parallel Execution with Intel Parallel Composer, Curing Thread Imbalance Using Intel Parallel Amplifier, and Using Intel Parallel Inspector to Find Race Conditions in OpenMP-based Multithreaded Code. I picked these three papers because they all relate to some of the exciting front-end work that Intel is doing to build tools that will enable the non-parallel specialist to develop effective parallel applications.

Getting Code Ready for Parallel Execution with Intel Parallel Composer

Parallel Composer is part of Parallel Studio, Intel’s Microsoft Visual Studio add-on suite for parallel application development, which began shipping last May (more on Parallel Studio here). Parallel Composer is where code gets written in Parallel Studio, and it builds directly upon Intel’s existing code development tools. This article provides an overview of the different approaches supported by Parallel Composer for expressing concurrency in applications: OpenMP, C++ Compiler Language Extensions (e.g., __par, __critical), Threading Building Blocks, the Win32 Threading API and Pthreads, Threaded Libraries (like the Intel Math Kernel Library, MKL), auto-parallelization, and auto-vectorization.

In addition to a quick overview of each approach, with examples that highlight the type of code each one produces, the paper offers some insight into specific situations where one approach may be preferable to the others, to help developers make the right choice. In general, the advice is balanced and honest:

As a compiler-based threading method, OpenMP provides a high-level interface to the underlying thread libraries. With OpenMP, the programmer uses OpenMP directives to describe parallelism to the compiler. This approach removes much of the complexity of explicit threading methods, because the compiler handles the details. Due to the incremental approach to parallelism, where the serial structure of the application stays intact, there are no significant source code modifications necessary. A non-OpenMP compiler simply ignores the OpenMP directives, leaving the underlying serial code intact.

With OpenMP, however, much of the fine control over threads is lost. Among other things, OpenMP does not give the programmer a way to set thread priorities or perform event-based or inter-process synchronization.
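To make the incremental point concrete, here is a minimal sketch of my own (it is not taken from the paper). The function name scale_array is hypothetical. The loop is ordinary serial C; the single directive is the only change needed to ask an OpenMP compiler to split the iterations across threads, and a compiler without OpenMP support simply ignores the pragma and runs the loop serially.

```c
/* Hypothetical helper (not from the paper): multiply every
 * element of a[] by factor. */
void scale_array(double *a, int n, double factor)
{
    int i;

    /* The directive describes the parallelism; the compiler and
     * runtime create the threads and divide the iterations.
     * Each iteration touches a different element, so no
     * synchronization is needed. */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] *= factor;
    }
}
```

Notice what is missing: there is nowhere in this code to set a thread priority or wait on an event, which is exactly the loss of fine control the paper warns about.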

Curing Thread Imbalance Using Intel Parallel Amplifier

Parallel Amplifier is another component of Parallel Studio. Amplifier builds upon a technology proof of concept that Intel posted at WhatIf.Intel.com some time ago, VTune. VTune is powerful and, despite being hard to use, quickly became the most popular download at the site, even for developers within Intel. Amplifier builds on this tool and extends the design to support non-experts, for example by incorporating visualization to help developers understand what’s going on in their codes. Amplifier is specifically targeted at improving the performance of the portion of an application running on a multicore socket.

This paper covers a specific use case for Parallel Amplifier: finding and fixing application load imbalance. Load imbalances are created when one or more threads have more work to do than the others, leaving some threads sitting idle while others are working.

Intel Parallel Amplifier…assists in fine-tuning parallel applications for optimal performance on multicore processors. Intel Parallel Amplifier makes it simple to quickly find multicore performance bottlenecks and can help developers speed up the process of identifying and fixing such problems. Achieving perfect load balance is non-trivial and depends on the parallelism within the application, workload, and the threading implementation.
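For readers who have not run into load imbalance before, the following is a hypothetical sketch of how it can arise and one common remedy. It is my own illustration, not the paper's example; row_work and fill_rows are made-up names, and the chunk size of 16 is an arbitrary assumption. The cost of iteration i grows with i, so if the iterations are handed out in equal contiguous blocks (the usual static default in many OpenMP implementations), the thread holding the last block does far more work than the thread holding the first.

```c
/* Hypothetical workload (not from the paper): the cost of
 * row i grows linearly with i. Returns 0 + 1 + ... + i. */
static double row_work(int i)
{
    double acc = 0.0;
    int j;

    for (j = 0; j <= i; j++) {
        acc += (double)j;
    }
    return acc;
}

/* Fill out[0..n-1] with row_work(i). */
void fill_rows(double *out, int n)
{
    int i;

    /* schedule(dynamic, 16) hands out chunks of 16 iterations
     * on demand, so a thread that finishes its cheap early rows
     * picks up more work instead of sitting idle. The chunk
     * size is an assumption to be tuned, not a recommendation. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (i = 0; i < n; i++) {
        out[i] = row_work(i);
    }
}
```

Dynamic scheduling adds some runtime overhead per chunk, so whether it actually helps is the kind of question a profiler such as Amplifier is meant to answer.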

This paper presents its concepts in the context of simple-to-understand example code, reasoning through the information provided to the developer by Amplifier in order to develop the critical analysis skills necessary to write efficient multicore code.

The concurrency analysis reveals that the CPU utilization on the same routine is poor (Figure 2) and the application uses 2.28 cores on average (Figure 3). The main hotspot is not utilizing all available cores; the CPU utilization is either poor (utilizing only one core) or OK (utilizing two to three cores) most of the time. The next question is whether there are any load imbalances that are contributing to the poor CPU utilization. The easiest way to find the answer is to select either Function-Thread-Bottom-up Tree or Thread-Function-Bottom-up Tree as the new granularity, as shown in Figure 4.

Using Intel Parallel Inspector to Find Race Conditions in OpenMP-based Multithreaded Code

Parallel Inspector is the focal point for application debugging in Parallel Studio. It is based on Intel Thread Checker, and Intel’s James Reinders has described it to me in the past as a “proactive bug finder.” Inspector will try to find problems that haven’t yet manifested as bugs by looking for patterns that indicate data races, deadlocks, and other usage errors that often don’t appear for long periods after release, or only show up in unpredictable ways. Parallel Inspector is used to debug multithreading errors in applications that use the Win32, Intel Threading Building Blocks or OpenMP threading models. This paper emphasizes a common use case for just one of those models, finding race conditions in OpenMP.

Again the paper is extremely practical, using simple but relevant sample code to motivate a discussion of how to use the tool to find and fix a very common problem encountered by developers of parallel applications. Although the paper focuses on OpenMP, it doesn’t actually assume much prior knowledge of OpenMP programming, taking the time to explain the basic work-sharing construct used in the example code to make the paper relevant even to those just getting started.

In Figure 1, Intel Parallel Inspector identifies the data race errors against the source line where x variable is modified, as well as the next source line with summing up the partial sums for each iteration. These errors are quite evident, as the globally defined variables x and sum are being accessed for read and write from the different threads. In addition, Intel Parallel Inspector produces the ‘Potential privacy infringement’ warning, which indicates that a variable allocated on the stack of the main thread was accessed in the worker threads.

…Once the error report is obtained and the root cause is identified with the help of Intel Parallel Inspector, developers can consider approaches to fixing the problems. General considerations for avoiding data race conditions in parallel OpenMP loops are given below, along with advice about how to fix problems in the examined code.
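The paper's sample code is not reproduced in this review, so the following is only a sketch of the kind of fix it describes, under my own assumptions: a loop with a temporary x and an accumulator sum, here written as a midpoint-rule estimate of pi. The function name integrate_pi is hypothetical, and for simplicity x and sum are locals rather than the globals the paper mentions. Without the two clauses on the directive, every thread would read and write the same x and sum, which is the race described above.

```c
/* Hypothetical example (not the paper's code): estimate pi by
 * integrating 4 / (1 + x^2) over [0, 1] with the midpoint rule. */
double integrate_pi(int num_steps)
{
    double step = 1.0 / (double)num_steps;
    double x;
    double sum = 0.0;
    int i;

    /* private(x): each thread gets its own copy of the
     * temporary, so writes to x no longer collide.
     * reduction(+:sum): each thread accumulates a private
     * partial sum, and the partial sums are combined once at
     * the end of the loop. */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}
```

Because the pragma is ignored by a non-OpenMP compiler, the same function still gives the serial answer, which makes it easy to check a parallel result against a known-good one.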

Summing up

I found these papers, and the whole series in general, to be quite helpful, striking a good balance between brevity and completeness. You won’t walk away from this series with an encyclopedic understanding of any one concept or tool, but then that isn’t the point. The focus in each is on getting to the core of a particular tool or concept, or on solving a particular problem. The papers are all very practically oriented, adding only enough background and theory to provide some minimum foundation upon which to learn. This approach makes the series accessible and relevant to developers who are intimidated by large documentation sets, or who just need enough information to get started on a particular problem they are having at a particular time. It also provides a solid foundation for self-paced learning in more detail.

If you are developing multicore applications on Intel processors, these are worth at least a quick review to familiarize yourself with what’s there. When you find yourself stuck in the future, you’ll know right where to go.
