Programming Massively Parallel Processors: A Hands-on Approach
by David B. Kirk and Wen-mei W. Hwu
Morgan Kaufmann (February 5, 2010)
I just finished reading the new book by David Kirk and Wen-mei Hwu called Programming Massively Parallel Processors. The generic title notwithstanding, readers should not come to this book expecting one of the highly theoretical and general parallel programming texts that most of us had at least some experience with in grad school. This book is very focused on one thing: teaching readers how to develop parallel applications that perform well on NVIDIA’s GPUs using NVIDIA’s CUDA language.
People learn in different ways, some responding well to a theory-based approach that only eventually gets down to implementation, and others responding well to generalization from the specific. I’m a specific kind of guy as, apparently, are the authors of this book. Kirk and Hwu wrote the book on the premise that learning the specifics of writing high performance code for GPUs with CUDA is a useful way to learn about parallel programming in general. Some suspicion of this point of view is warranted, given that the authors are both affiliated with NVIDIA (Kirk is an NVIDIA Fellow and was until 2009 the company’s chief scientist, and Hwu is principal investigator for the first NVIDIA CUDA Center of Excellence at the University of Illinois at Urbana-Champaign).
However, this book does at least as good a job of teaching general parallel principles through implementation as the other, more platform-agnostic MPI and OpenMP books I’ve read, and being tied to specific hardware gives Programming… at least one advantage those other books haven’t had. Parallel programming on any HPC system becomes more complex, and more closely targeted at specific hardware, in direct proportion to how much you care about performance. Precisely because it is tied to specific hardware, this book does a good job of teaching that lesson alongside the more generally useful patterns for parallel programming.
Outline of the book
The book starts out introducing GPUs and parallelism, the history of GPU computing, and CUDA in general in the first three chapters. Chapter 4 presents an overview of CUDA threads, including thread scheduling and basic latency hiding techniques. Chapter 5 begins to look at the device in more detail, focusing on the CUDA memories and various techniques for organizing a computation to make the best use of the highest-performing memory. Chapter 6 goes into more detail on coding for performance on the GPU, with helpful discussions and examples of techniques to hide latency in memory accesses and increase FLOPS. Chapter 7 provides a general discussion of floating-point arithmetic. Chapters 8 and 9 are application case studies that walk the reader through applying some of the techniques already discussed in the context of two specific applications: MRI reconstruction and molecular visualization. Chapter 10 attempts to generalize the heretofore GPU-centric discussion to other devices and paradigms, while Chapter 11 introduces OpenCL. The final chapter offers some comments about what the future may hold for GPUs.
Good introduction to parallelism, CUDA, and GPU programming
As an overall impression, the text is well-written, well-organized, and gives a lucid explanation of some difficult concepts. As a caveat, readers should note that this is not my first parallel programming book nor my first exposure to the concepts, but I still feel like the explanations and accompanying example code should be accessible even to beginning parallel programmers.
The book proceeds from a first easy-to-understand parallel implementation through more difficult specialized optimizations for several real-world application examples, an approach that lends relevance to the lessons. Each of the optimizations is described in prose and via accompanying pseudo-code, which keeps the pace lively but still offers the benefit of concretizing the discussion with something close to working examples. The techniques introduced along the way — memory coalescing, block decomposition, tiling, loop fusion, and so on — are techniques that are part of any parallel programmer’s tool chest, and are vital to learn. Naturally the implementation details in CUDA for a GPU will not bear much resemblance to what a programmer will find on an IBM Blue Gene or Cray XT5, but the authors are careful to concentrate not just on writing the CUDA code to make a particular optimization work (though that is covered), but also on developing the “computational thinking” that readers need to understand why the techniques work, and to spot when they’ll be useful in other settings.
A hallmark of application performance optimization is understanding how features of the hardware will influence the decision to select a particular algorithmic approach over others (however much we might wish for a robust infrastructure to optimize code without binding it to specific hardware). Programming… does a good job of describing the Tesla hardware architecture, and of demonstrating how this knowledge should be integrated into the programming process during application design. Experienced readers will recognize the thought process the authors describe, and beginning parallel programmers will have a solid example to serve as a go-by as they migrate to other platforms throughout the course of their careers.
The book includes a chapter (Chapter 10) on “computational thinking” in which the authors attempt to take a step back from their heads-down CUDA focus and provide a more generalized context for the information already presented. A chapter on OpenCL is also included. Still, in approaching this book it is important to remember that the text is primarily an introduction to parallel programming using CUDA, and a beginner will need exposure to other texts to round out his or her understanding. Fortunately the patterns and applications that Programming… relies upon to teach its lessons are fairly common, and as readers explore other material they will find ample opportunity to learn even more by directly comparing implementations in other programming approaches and for other parallel architectures to those presented in this book.
The last word
As a beginning text this book has a significant advantage that beginning texts written for MPI, OpenMP, and so on don’t have: there are 200 million CUDA-capable GPUs already deployed, and the odds are pretty good that most readers either have, or can readily get access to, a computer on which they can meaningfully learn parallel programming. If you are new to parallel programming and have access to a Tesla GPU, this book is a fine place to start your education. Readers already comfortable with parallel programming will find clear explanations of the Tesla GPU architecture and the performance implications of its hardware features, as well as a solid introduction to the principles of programming in CUDA, though they’ll probably do a lot of skimming over the already-familiar basics.
Be sure to check out the other book reviews we’ve done here at insideHPC.