Debugging Exascale: To heck with the complexity, full speed ahead!

Being in the right place at the right time is certainly a key to success. So perhaps it’s fair to ask whether tools to debug exascale applications will significantly lag the availability of the new architectures, delaying the broader usefulness of a precious resource. After all, a great many things have to be in place for efficient debugging to be possible. Debugger developers will need to be invited to the table at the earliest stages in order to make their requirements known. They require access to the architecture and need to know the number and nature of the cores and the role of special-purpose processors. They need to know the programming model, and they require a compiler and OS that have the right hooks to the debugger itself, and more. (See the sidebar “What is needed to enable debugger development?”) Beyond these “basics” are the formidable tasks of ensuring acceptable performance and creating interfaces that make the tool useful, perhaps even intuitive, providing assistance in the interpretation of unprecedented complexity.

Even given these requirements and the early stage of exascale development, developers are moving ahead with debugger concepts in the hopes of arriving at the station in time to help the exascale ultra-express depart on schedule. In this article, we ask what an exascale debugger might look like. Will it really be different from your now-average, run-of-the-mill petascale debugger, or the charmingly old-fashioned terascale one your wacky uncle Lou used to let you play with in the lab on the weekends? How will new tools reduce the complexity of million-way parallelism to interfaces and displays that we simple humans can comprehend and manage? And what sort of tool can run efficiently enough at scale without adding unbearable expense?

Can we stamp out debugging in our lifetime?

If you read a lot of science fiction, you know that with vision, technology (and the total suspension of disbelief), anything is possible. So perhaps we should ask the question, “Do we have an opportunity here to develop systems that do not require debuggers?”

Not much chance, it seems. The changes in both scale and architecture suggest the appearance of new bugs, even in tried and true codes that have been meticulously ported from petascale.

Chris Gottbrath, Rogue Wave

“Unless we think that petascale applications are going to run unchanged on exascale systems, there will need to be new code written as code is refactored and extended to be more scalable environments, and/or to run in new hardware,” says Chris Gottbrath, longtime Principal Product Manager for TotalView, now part of Rogue Wave Software. “Experience seems to indicate that pushing to higher scales can affect data layouts, message sizes, timing, or any other part of the program into domains that can expose bugs which may have been present in the program at smaller scales, but simply didn’t manifest in troubling ways.”

Dong Ahn, a computer scientist who is involved in a number of debugger development initiatives at Lawrence Livermore National Laboratory (LLNL), agrees that an exascale world without debuggers is not yet science, but fiction. “It is hard to imagine,” he says. “Although new applications will emerge, which can take a better advantage of fault tolerance techniques and thus be less prone to errors, a large number of today’s applications will also utilize exascale computing platforms. As today’s applications have discovered scale-related defects whenever they scaled to a higher level of concurrency, they will encounter a set of yet uncovered faults when scaled to exascale computing. It is very difficult for programmers to observe them at a reduced scale. In addition, human programmers will add new code in the form of enhancements or bug fixes. The reality is that there will be errors in the code no matter how hard they try.”

Humans make mistakes no matter what. Perhaps advanced fault tolerance or new statistical techniques will emerge to obviate the need for debuggers altogether. Surely that’s in the pipeline but, unfortunately, Dong Ahn thinks it may be quite a while.

Dong Ahn, LLNL

He notes, “It is true that the software engineering community has actively investigated promising statistical techniques that can automatically diagnose root causes of observed errors. But it will take a substantial amount of time and effort before the techniques will be effective in practice on exascale computing platforms. While these techniques will significantly shorten the bug diagnosing process when mature, they will never catch all classes of bugs once and for all.”

Debugging fundamentals may not change

With no reprieve from the sentence of debugging, the question becomes “What kind of debugger is suitable for exascale?” Another might be “Can we please have one on or before the install date of our shiny new exaFLOPS hardware and power supply?”

The best answer is “That depends.” It is safe to say that exascale will require game-changing advances in virtually every element, including processors, interconnects, memory, storage, and programming models. But some aspects of debugging will likely endure.

Observes Chris Gottbrath, “Software bugs will still often boil down to some of the same things that we frequently encounter now: data issues, program logic issues, and concurrency issues. If that is true then exascale-capable debuggers are ultimately going to present to the user their code and a picture of how the program is actually working — a view into variables and data structures and how those data elements are transformed. Bugs may still boil down to a single bit error in a single place in a vast parallel job. So debugger ‘fundamentals’ will continue to be important.”

David Lecomber, Allinea

With fundamentals in place, David Lecomber, CTO at Allinea Software, feels today’s user will recognize an exascale debugger. “Will an exascale debugger still be recognizably like a debugger today? I should think so. Automation will help guide you to a problem, but the fundamental components of debugging, like source code, variables, and breakpoints, will be there when you get to that problem. The programming models and languages will have a greater impact on defining what the debugger looks like than the actual number of cores.”

Lecomber, who stresses that automation may be the biggest difference in the next generation of debugging tools, adds, “Automation is one of the major opportunities — identifying differences in behavior of threads, processes or data is something that tools can do easier than people. With scalable infrastructure underneath you suddenly open up a lot of new possibilities.”
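
The kind of automated comparison Lecomber describes can be pictured with a small sketch. This is only an illustration under stated assumptions, not any vendor's implementation: the function name `find_outliers` and the `snapshot` data (one value of a loop counter per rank) are hypothetical.

```python
from collections import Counter


def find_outliers(values_by_rank):
    """Return the majority value and a dict of the ranks that differ from it."""
    counts = Counter(values_by_rank.values())
    common_value, _ = counts.most_common(1)[0]
    outliers = {
        rank: value
        for rank, value in values_by_rank.items()
        if value != common_value
    }
    return common_value, outliers


# Hypothetical snapshot: the loop counter "iteration" as observed on 8 ranks.
snapshot = {rank: 1000 for rank in range(8)}
snapshot[5] = 997  # one rank has fallen behind the others

common, outliers = find_outliers(snapshot)
agreeing = len(snapshot) - len(outliers)
print(f"{agreeing} ranks agree on {common}; outliers: {outliers}")
```

Instead of showing one value per process, the tool reports the consensus once and lists only the exceptions, which is the part a person would struggle to spot by eye at scale.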

Exascale problem solving may still involve the identification of the droplet of a single bit error in a sea of processes and threads, and current debugger technology may be persuaded to scale. However, effective interfaces and the ability to frame the query must be defined.

Michael Wolfe, PGI

“Debugger vendors Allinea and TotalView have demonstrated debugging at the extreme level: 100,000+ cores,” says Michael Wolfe, Compiler Engineer at PGI and a Fellow at STMicroelectronics. “It’s likely that even at exascale, the number of nodes will be around 100,000, with some large number of ‘cores’ at each node, so the underlying debugger technology will scale. Querying and displaying information at that level remains a challenge.”

Two themes emerge in the debugger discussion: interactivity vs. batch, and interfaces that people can comprehend and act on. What’s needed is the ability to see the big picture and drill down to useful details without getting lost or going quite mad in the labyrinthine exa-maze of threads and processes.

“Interactive debugging at the exascale may be folly”

Does single-stepping at exascale make sense? Michael Wolfe thinks not and goes as far as saying, “Interactive debugging at the exascale may be folly. Who can afford to tie up such an expensive resource while you sit at the keyboard, clicking, winding back through the stack, looking at variables, etc.?”

Dong Ahn paints a similar picture. “As the number of debugged processes increases, memory and performance overheads to support high-level debugging operations such as single-stepping and variable-inspecting are becoming increasingly expensive. The volume of debug data gathered from many processes becomes prohibitively large. Often the gathered data will not fit into the debugger’s ‘front-end’ that directly interfaces with users. Even if the front-end can manage the large volume, the performance overheads of analyzing the gathered data will become prohibitively expensive.

“Equally challenging is to trace a very large number of threads of control within the time constraint that users would tolerate,” he adds. “Users will not adopt a debugger that requires a lunch break only to single-step the application’s execution.”

Even if something resembling current debugger technology is made to scale, decisions must be made about what functionality, and therefore how much overhead, can be tolerated. As a result, the discussion of “lightweight debuggers” is on the front burner.

Lightweight debugging has advantages

Like a climber’s ultra-light gear, a lightweight debugger allows the developer to take the bare essentials along when scaling the heights. Left at the base camp may be line-by-line stepping, source code and other displays in favor of core tools such as stack back-trace analysis and perhaps something akin to a simplified MPI message queue display.

Dong Ahn has been experimenting with the STAT lightweight debugger on large-scale systems. “My development experience with both STAT and a fully-featured debugger has convinced me that a lightweight tool like STAT has significant advantages over traditional fully-featured debuggers in both achieving and sustaining high scalability. It is a focused tool that does not need to support a large set of features. Further, the contained core functionality has also provided a relatively easy path to sustain the achieved scalability. [It] does not impose high memory, network bandwidth, and I/O requirements on the same resources that the target job is using. Thus, the tool is poised well for production problems that can easily disappear if perturbed significantly.”

He is quick to point out that, in this case, less can certainly be more, commenting, “Focused functionality does not mean the capability is inferior. In contrast, it has been more capable in terms of identifying a handful of representative processes of a large scale run. Taken together, these advantages have led STAT to run successfully with up to 212,992 processes on a relatively short span, and to address many elusive bugs on the high-end computing systems.”
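
The idea of reducing a large run to a handful of representative processes can be sketched as grouping ranks by their call stacks. The code below is a minimal illustration only and is not STAT's actual interface; the function names, the frame names, and the `traces` data are all hypothetical.

```python
from collections import defaultdict


def group_by_stack(traces):
    """Map each distinct call stack (a tuple of frames) to the ranks showing it."""
    groups = defaultdict(list)
    for rank, frames in sorted(traces.items()):
        groups[tuple(frames)].append(rank)
    return dict(groups)


def representatives(groups):
    """Pick the lowest rank of every equivalence class as its representative."""
    return sorted(min(ranks) for ranks in groups.values())


# Hypothetical stack traces sampled from 1,024 ranks of a hung job.
traces = {rank: ["main", "solve", "MPI_Allreduce"] for rank in range(1024)}
traces[17] = ["main", "solve", "exchange_halo", "MPI_Recv"]
traces[512] = ["main", "solve", "exchange_halo", "MPI_Recv"]

groups = group_by_stack(traces)
for stack, ranks in groups.items():
    print(f"{len(ranks):5d} ranks: {' > '.join(stack)}")
print("candidates for a full debugger:", representatives(groups))
```

In this made-up case 1,024 processes collapse into two equivalence classes, so a fully featured debugger would need to attach to only two ranks rather than the whole job.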

Though lightweight debuggers may answer many of the expense and overhead issues in exascale, they can’t do the entire job alone. Chris Gottbrath reminds us, “In any case you still need a full scale capability eventually.”

Dong Ahn also sees the need for a family of debugging tools that communicates well with its peers and offers a full set of features. “Lightweight tools and fully-featured debuggers should build on each other’s strengths. There are many classes of bugs that fully-featured debuggers cannot resolve within certain time constraint; there are equally many classes of bugs [for which a] lightweight debugger cannot analyze root causes. Combined well, they will provide a comprehensive debugging environment for exascale computing.”

Does a return to batch debugging make sense?

Getting machine time for debugging on large systems can be very difficult. If you are not too demanding, ask for a small part of the machine, and agree to take whatever slots are available, eventually you can get your debugging session scheduled and hope all goes well. But how do you debug at exascale? If you ask for and get 10% of the machine for your experiment, you may be running at 100 PetaFLOPS! For a handful of critical applications, of course, the benefits outweigh the costs. For all others, finding a more efficient and accessible debugging model becomes key.

“Back in the early 1970s, IBM had developed a very sophisticated batch-mode debugger,” observes Michael Wolfe. “It was really advanced, just at the time when interactive debugging was starting to replace print and batch debugging. Perhaps we need to explore whether to resurrect that technology.”

Does a return to batch debugging make sense? This may seem totally retro, but batch debuggers have the advantage of running whenever machine priorities can accommodate them. Of course, they also have the disadvantage that you wait for results that may not reveal the problem.

Chris Gottbrath agrees that batch is potentially an important part of the mix. “Debuggers will need to support highly efficient workflows to reduce the amount of time that large allocations are ‘paused’ while a user inspects a bit of code. Lightweight tracing and the ability to debug optimized or partially optimized applications will be important. Batch mode debugging, trace debugging, automated analysis, and record and replay debugging may open up the option for separating the ‘debugging run’ from the ‘root cause analysis.’”
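
A toy sketch can make the separation of the “debugging run” from the “root cause analysis” concrete. It assumes a simple JSON-lines trace file and an invariant that a running total never decreases; the function names, the trace format, and the injected defect are all hypothetical and stand in for whatever a real batch or trace debugger would record.

```python
import json
import os
import tempfile


def debugging_run(trace_path, steps=5):
    """Run unattended and write one JSON record per step; nobody waits at a prompt."""
    total = 0.0
    with open(trace_path, "w") as trace:
        for step in range(steps):
            increment = -10.0 if step == 3 else 1.0  # injected defect at step 3
            total += increment
            trace.write(json.dumps({"step": step, "total": total}) + "\n")


def root_cause_analysis(trace_path):
    """Offline pass: return the first recorded step at which the total decreased."""
    previous = None
    with open(trace_path) as trace:
        for line in trace:
            record = json.loads(line)
            if previous is not None and record["total"] < previous:
                return record["step"]
            previous = record["total"]
    return None


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "trace.jsonl")
    debugging_run(path)                        # would occupy the big machine
    first_bad_step = root_cause_analysis(path)  # could run later on a workstation
print("invariant first violated at step", first_bad_step)
```

The expensive allocation is held only while the trace is recorded; the inspection happens afterwards, when no large resource is “paused” waiting for a person.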

Breakthroughs required for multi-thread debugging

The current petascale morass of intertwined threads may seem like a game of checkers compared to the four-dimensional chess game of exascale multi-threading and synchronization. Certain classes of errors related to threads may pose extreme challenges for new debuggers.

“The growing trends towards multi-threading on future architectures will lead to more thread-related errors such as elusive race conditions and deadlocks,” explains Dong Ahn. “Unprecedented parallelism will also expose many defects in the code that tries to represent or to rely on the global state, leading to an unexpected blowup of memory use. Reproducing a data race of a threaded execution using manual control of threads within a traditional debugger has already proven to be difficult even at a very low process and thread count.

“Unless a breakthrough technique is invented for threaded-code debugging, the exascale-capable debugger will continue to struggle on such errors,” he adds.

Usability and a scalable presentation paradigm

Perhaps most critical for end users are a short learning curve, better usability, and a scalable presentation paradigm that helps them see the state and then drill down into processes and threads without information overload.

Observes David Lecomber, “Usability is one of the main challenges to any debugging situation, whether at scale or even on a single core. Users do not have time to learn a debugger; they only want to fix their problem and fix it quickly. Exascale will not change this!”

Simplifying the interface and increasing ease of use, while essential, will not be easy. Chris Gottbrath points to some of the challenges. “There will simply be too much information for developers to look at all the details in all the contexts. Exascale debuggers will need improved ways to allow developers to get an overview of the entire system so that they can decide where they want to ‘dive down’ and look at things in more detail. Those improvements may come in the form of clever ways to display data and state or automated analysis of program state.”

The ability to dive into layers of granularity and control the volume of information presented at any given time may be the key to usability. The level of detail (LOD) concept may hold promise for dealing with huge amounts of detail.

Dong Ahn notes, “Presenting debug information in terms of individual processes and threads already overwhelms users at today’s moderate scale. Thus, the scheme will undoubtedly fail on exascale applications. Even if a technique uses one pixel per process, an ordinary screen real estate will disallow rendering of visuals for million-way, if not billion-way, parallelism.”

A promising scalable paradigm for debug information visualization borrows from the LOD concept. In computer graphics, LOD techniques decrease the complexity of a 3D object’s representation as the viewer moves away from it. Analogous presentation techniques for debugging progressively coarsen the representation of debug data as users inspect progressively larger numbers of processes and threads.

For example, at a coarser level, LOD techniques would use group-based idioms to represent debug information. They would create equivalence groups of processes by relating processes that behave similarly; they would also relate the equivalence groups through the degree of control flow or data dependencies between members of distinct groups. The coarse-grained macro-information would then give the users insights that are just good enough to focus their attention on a much-reduced set of groups. Ultimately, the users will be down to a few processes and threads, to which they can apply the rich set of detailed debugging techniques already available for root cause analysis.
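
One way to picture this coarsening is to group processes by a prefix of their call stack and to choose the prefix length according to how many groups a display can show. The sketch below is an assumption-laden illustration, not a description of any existing tool: the function names, the `max_groups` display budget, and the `traces` data are hypothetical.

```python
from collections import defaultdict


def groups_at_depth(traces, depth):
    """Group ranks by the first `depth` frames of their call stacks."""
    groups = defaultdict(list)
    for rank, frames in sorted(traces.items()):
        groups[tuple(frames[:depth])].append(rank)
    return dict(groups)


def finest_view_within(traces, max_groups):
    """Return (depth, groups) for the deepest prefix that fits the display budget.

    A longer prefix can only split groups, never merge them, so the search
    stops at the first depth that produces too many groups.
    """
    max_depth = max(len(frames) for frames in traces.values())
    best = (1, groups_at_depth(traces, 1))
    for depth in range(2, max_depth + 1):
        groups = groups_at_depth(traces, depth)
        if len(groups) > max_groups:
            break
        best = (depth, groups)
    return best


# Hypothetical traces from 1,000 ranks.
traces = {}
for rank in range(1000):
    if rank < 900:
        traces[rank] = ["main", "compute", "stencil"]
    elif rank < 990:
        traces[rank] = ["main", "communicate", "MPI_Waitall"]
    else:
        traces[rank] = ["main", "communicate", "MPI_Recv", f"peer_{rank}"]

# Coarse overview: at most 4 groups on screen.
depth, overview = finest_view_within(traces, max_groups=4)
print(f"overview at depth {depth}: {len(overview)} groups")

# Drill down into the one suspicious group with a larger budget.
suspects = overview[("main", "communicate", "MPI_Recv")]
subset = {rank: traces[rank] for rank in suspects}
sub_depth, detail = finest_view_within(subset, max_groups=16)
print(f"detail at depth {sub_depth}: {len(detail)} groups over {len(suspects)} ranks")
```

The overview stays small regardless of the process count, and detail appears only for the group the user chooses to inspect.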

Tools development typically lags the release of new architectures, operating systems and compilers. Debugger developers need access at the earliest possible stages to new software stacks and hardware in order to ensure timely release of these critically important tools. However, the problem goes beyond the technical. The business model for debugger development has always been challenging at the high end. Deep expertise and long focus are required, and that costs a good deal of money and motivation. The exascale community would do well to remember that highly scalable tools may be more likely to come from the few long-term vendors who have committed to this space, in close collaboration (read: access and funding) with their largest customers around the world.

That said, even though it seems a challenging task to produce great tools in a timely fashion for exascale, David Lecomber reminds us that there is cause for optimism. “It is not long ago that people were questioning whether a real full machine petascale debugger could ever work,” he says. “In the work we have done to support petascale, we’ve made a debugger that is as easy to use at 250,000 cores as it is at 100. The GUI does not become more complicated if you add more cores: it groups things together, it aggregates common attributes, and it highlights differences.”

The observations made by experts in this article point to both the challenge and the promise of a suite of scalable, easier to use debugging tools that may empower the entire HPC community and boost the efficiency of application development. Research on a new generation of tools is gaining momentum within the global exascale community. With proper road maps in place, a continued focus on collaboration and a consistent funding model, there may be hope that the software stack and a great suite of tools will arrive at the same moment as exascale architectures.

What is needed to enable debugger development?

  • A consistent execution model. At the lowest level, the OS and hardware need to provide a consistent execution model, so you can pause the execution and be sure that the state is present and visible.
  • A useful exception model within the architecture. When things go wrong, like a divide by zero, the architecture should be able to pause and let the user know what is going on.
  • The ability to create a mapping between what the programmers wrote in the code and what is actually happening on the machine. The debugger generally offers this facility by reading debug information, including a symbol table and a record of which line numbers go with which instructions. These days that mapping information is produced by the compiler, but if some other component will be doing the translation, it will have to be able to tell the debugger about the transformation the code goes through from when the user wrote it to what is actually happening on the machine.
  • Hooks provided by a tracing layer within the operating system. This allows the debugger to interact with the program the user is running, doing things like starting and stopping it and both inspecting and changing memory at a low level.
  • Hooks to interact with threads. The debugger needs to be able to discover what threads are running, and to query and control them individually.
  • Interfaces for accelerator processors. Accelerator architectures may require special attention and different interfaces.
  • Support within exascale resource management mechanisms for allocating compute resources to accomplish the debugging.