Over at the Xcelerit Blog, Jörg Lotze and Hicham Lahlou write that code portability is the key to success in a hybrid computing world with so many available processing architectures.
As a result, compromises are often made: typically easy maintenance is favoured and performance is sacrificed. That is, the code is developed for a standard CPU and not optimised for any particular platform, because maintaining separate code bases for different accelerator processors is a difficult task and the benefit is either unknown beforehand or does not justify the effort. The best solution, however, is a single, easy-to-maintain code base written in such a way that it can run on a wide variety of hardware platforms – for example, using the Xcelerit SDK. This allows developers to exploit hybrid hardware configurations to best advantage and keeps the code portable to future platforms.
The new Allinea DDT 4.0 release is designed to make it easier for scientists to debug and optimize HPC code even when they’re on the road.
“We’ve got a lot of academics, lab people, and industry people who are on the road a lot for conferences and meetings. It’s important for them to be able to work remotely,” says David Bernholdt, a senior computational scientist in R&D at ORNL. “Instead of having to put statements in the code, recompile, reset, and go through this whole long cycle, they’ll be able to pop up their clients on their laptops and figure out what’s going on right away.”
The release of Allinea DDT 4.0 includes native remote clients for Linux, Windows and Mac. These clients allow debugging of HPC applications, wherever they are hosted – on nationwide HPC resources or out in the rapidly growing HPC Cloud.
The new native client approach is paying dividends for users. “The advantage of a true native client is in the response times,” adds Chris January, VP Engineering at Allinea. “When you’re debugging code on a cluster, you don’t want a slow connection to make you step twice or accidentally delete breakpoints. Only a native client can respond quickly enough to keep users in complete control.”
Today Rogue Wave Software announced that TotalView has been selected by both the University of Luxembourg and University of Strasbourg to debug complex, multi-threaded applications. TotalView is a scalable and intuitive debugger for parallel applications written in C, C++, and Fortran. Designed to improve developer productivity, TotalView simplifies and shortens the process of developing, debugging, and optimizing complex applications.
“TotalView enables our research teams to develop and debug all of their applications faster, from simple prototypes to advanced, multi-threaded applications,” stated Sébastien Varrette, manager of the HPC department of the University of Luxembourg. “Our teams are experts in bioinformatics and engineering, but not supercomputers. With TotalView, they can leverage the easy-to-use, advanced debugging features to quickly debug their applications, so they can focus on their research goals.”
Already impressed by TotalView’s ability to significantly shorten debugging cycles, the HPC Center of the University of Strasbourg has selected TotalView for several new MPI and OpenMP applications. TotalView will be deployed on a NEC Linux cluster comprising NEC HPC1812Rd-2 InfiniBand compute nodes and NEC GPS12G4Rd-2 hybrid compute nodes with Nvidia Kepler cards.
Today Allinea announced a new scalability record on the Blue Waters supercomputer at NCSA. Now in full production mode, Blue Waters is the world’s fastest supercomputer on a university campus, with a theoretical peak of 11.62 petaflops.
While getting the machine up to speed, the Blue Waters team ran their own demanding acceptance trials with Allinea DDT debugging more than 700,000 MPI processes simultaneously.
“Having Allinea DDT in the hands of users whenever the need arises and at any scale – with its lightning-fast performance and easy-to-use interface – is a critical part of getting the scientific applications to super-petascale,” says David Lecomber, COO and founder of Allinea.

The implementation of the Blue Waters system had a tough timeline, and having a debugger ready to deploy at large scales was critical to meeting the schedule. “We knew our tool was more than ready,” says Lecomber. “We wanted the NCSA to take Allinea DDT to the extreme and see it first-hand, as real users. They came back with the news that it was 30x faster than they specified in performing common debugging tasks – without any extra tuning.”
Today Allinea Software announced that the company has cracked the “performance profiling pain barrier” with the release of Allinea MAP, a powerful performance-analysis tool simple enough for scientists to use to diagnose problems in their own code.
Allinea MAP runs without the need to instrument or compile with special options. The program annotates the source code with performance information in colored graphs so users can see any problems at a glance. More importantly, Allinea MAP is a lightweight application that adds little overhead even when scaled up to profile tens of thousands of processes.
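The low overhead of tools in this class typically comes from sampling rather than instrumenting every call. As a rough illustration of the idea only – this is not Allinea’s implementation, and the function names and the 1 ms interval below are invented – a minimal signal-based sampling profiler can be sketched in a few lines of Python:

```python
import signal
import collections

samples = collections.Counter()

def sampler(signum, frame):
    # Record which function and line are executing right now.
    # The profiled program itself is never modified or recompiled.
    samples[(frame.f_code.co_name, frame.f_lineno)] += 1

signal.signal(signal.SIGPROF, sampler)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # sample every ~1 ms of CPU time

def hot_loop():
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

hot_loop()
signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling

# The hottest lines accumulate the most samples.
for (name, line), count in samples.most_common(3):
    print(name, line, count)
```

Because the sampler only fires at a fixed interval, its cost stays roughly constant no matter how much work the program does – which is why sampling profilers scale where instrumenting ones struggle.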
“I think visual tools like Allinea MAP are the only way forward as we approach the daunting complexity of exascale computing,” says Rich Brueckner, president of the popular insideHPC news blog. “Algorithms that scale at hundreds or thousands of nodes tend to behave very differently at ultra-scale, where one has tens of thousands or even millions of nodes to contend with. How one tackles such a problem requires new approaches and ways of thinking. You are never going to make parallel computing easy. What you can do is give the programmer a way to navigate in an ocean of code.”
Allinea MAP can be combined with Allinea DDT, sharing a single interface, so when Allinea MAP shows where performance bottlenecks are forming, you can flip to the Allinea DDT view and step through the code to find the source of the problem.
“A lot of code out there is performing badly because the people who write and run it don’t have tools to rapidly and regularly analyze it. We’ve had HPC experts tell us they have to correct the same basic mistakes time after time,” says Mark O’Connor. “A single optimization found with Allinea MAP can save hundreds of thousands of core hours over the lifetime of the code, delivering results faster and letting scientists focus on their real work instead of fighting the tools.”
In this video from SC12, Mark O’Connor from Allinea demonstrates the company’s new MAP performance profiler tool. Read the Full Story.
Over at HPC Admin, Douglas Eadline writes that adding and removing software from a running cluster is not as difficult as it used to be.
Regardless of the provisioning system, the goal is to make changes without having to reboot nodes. Not all changes can be made without booting nodes (i.e., changing the underlying provisioning); however, many application packages can be added or removed without too much trouble if some simple steps are taken.
Today Allinea announced that Bull will install two 500-teraflop supercomputers with the company’s software at the Météo-France center in Toulouse. By 2016, the systems will be further upgraded for a computing capability of more than 5 petaflops.
One major challenge of this upgrade is porting applications to work smoothly on the new computers. To ease this problem, Bull selected Allinea DDT debugging tools and Allinea MAP, an MPI profiler designed for ease of use.
“Météo-France users will benefit from using these two products to debug, profile, and optimize applications, which is a big improvement that will lead to more efficient codes running on the supercomputers,” said Olivier David, Alliances Director at Bull. “The time you don’t spend on debugging, you can be running the application to get scientific results. At the end of the day, you just want to focus on science and that’s why we need Allinea Software.”
Both Allinea DDT and Allinea MAP have been fully integrated in the bullx supercomputer software suite powering the bullx supercomputers, and developers will need only minimal training before they can start spotting bottlenecks and the lines of code that slow down their applications. Read the Full Story.
In this RCE podcast, Brock Palen and Jeff Squyres speak with James Browne, Leonardo Fialho, and Ashay Rane about PerfExpert, an easy-to-use performance diagnosis tool for HPC applications with suggestions for bottleneck remediation.
This week SGI announced that the company has developed new software tools that enable customers and software developers to get the most value from Intel Xeon Phi coprocessors.
The SGI UPC (Unified Parallel C) compiler, the first UPC compiler for Intel Xeon Phi, supports MPSS, the coprocessor software stack, and enables PGAS programming on SGI servers running Intel Xeon Phi. SGI UPC supports applications in both native and offload modes. SGI MPInside, an advanced profiling and performance-analysis tool that helps developers find bottlenecks in MPI code, now also runs on Intel Xeon Phi. SGI MPInside gives developers key capabilities to improve MPI application performance, enabling “what-if” studies to project how code will perform on future architectures.
“HPC customers require technology not only to deliver the best processing and energy efficiency, but also to speed advanced codes and algorithms to deployment,” said Raj Hazra, Intel VP and GM of the Technical Computing Group. “SGI’s UPC compiler leverages the familiar programming model of Intel Xeon Phi coprocessors. This allows customers to take advantage of Intel’s new many-core technology immediately, reusing existing code while achieving the expected increase in performance.”
Over at HPC Admin, Dell’s Jeff Layton writes that the plethora of available processing architectures today makes it more important than ever to know your application.
Two basic approaches are available to help you understand your application: profiling, which gathers summary data when an application is run, and tracing, which presents a history of events as a function of time as the application executes. I believe both techniques can be used to gather information about your application so that you can begin to paint a picture of how it behaves and how it interacts with the system. In my opinion, application profiling or tracing alone is not enough: you also need to profile and trace the system while the application is running, so you get a much more complete picture of what the application is doing and what the system is doing to support it or in response to it.
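The distinction between the two approaches can be made concrete with a small Python sketch (a toy stand-in for real HPC profilers and tracers, with an invented `work` function): `cProfile` produces an aggregate profile of where time was spent, while a `sys.settrace` hook produces a time-ordered event log.

```python
import cProfile
import pstats
import io
import sys
import time

def work():
    total = 0
    for i in range(100_000):
        total += i
    return total

# Profiling: an aggregate summary (call counts, cumulative time per function).
profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())

# Tracing: a timestamped history of events in the order they occurred.
events = []

def tracer(frame, event, arg):
    if event == "call":
        events.append((time.perf_counter(), frame.f_code.co_name))
    return None  # no per-line tracing needed for this sketch

sys.settrace(tracer)
work()
sys.settrace(None)
print(events)
```

The profile tells you *where* time went overall; the trace tells you *when* each event happened, which is exactly the summary-versus-timeline trade-off described above.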
There’s no question that computer programming bugs are costly, but are they preventable? Enter the concept of reverse debugging, which allows you to step or continue your program backward in time, reverting it to an earlier execution state.
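As a toy illustration of the record-and-replay idea behind reverse debugging – this is not how any commercial engine is implemented, and `buggy_sum` is an invented example – one can record a snapshot of a function’s local variables at every executed line and then walk that history backwards:

```python
import sys

def record_execution(func, *args):
    """Run func under a trace hook, recording (line, locals) at every line event."""
    history = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            history.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return history

def buggy_sum(n):
    total = 0
    for i in range(n):
        total += i * i  # the line whose effect we want to watch
    return total

history = record_execution(buggy_sum, 3)

# "Reverse-step": walk the recorded states backwards to see how `total` evolved.
for lineno, snapshot in reversed(history):
    print(lineno, snapshot.get("total"))
```

Real reverse debuggers record far less than full snapshots (for example, only the non-deterministic inputs needed to re-execute deterministically), but the effect for the user is the same: execution state can be revisited without rerunning from scratch.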
According to recent research conducted at the University of Cambridge, the failure of organizations to adopt reverse debugging tools costs the economy $41 billion in annual programming time.
“This research confirms what our customers have been saying for years about the ability of TotalView to drastically reduce development time and costs during the debugging stage of software development,” stated Chris Gottbrath, Rogue Wave Principal Product Manager. “As a market leader in debugging technology, we continually advocate the time and cost savings benefit of ReplayEngine, Rogue Wave’s reverse debugging feature, and we are pleased to see robust academic research highlighting this technique as an important opportunity for the global economy.”
Researchers at the University of Cambridge’s Judge Business School conducted a survey which found that when respondents used advanced reverse debugging tools, they spent an average of 26% less time on debugging. Specifically, the time spent fixing bugs decreased from 25% to 18%, and the time spent reworking code decreased from 25% to 19%, while using reverse debuggers. This means that reverse debuggers have the potential to save 13% of total programming time, which translates to $41 billion in savings to the economy, or 122 more hours per year per developer for developing additional products, features, and capabilities. Read the Full Story.
Over at ACMQUEUE, Brendan Gregg from Joyent writes that performance-analysis methodology can provide an efficient means of analyzing a system or component and identifying the root cause of problems, without requiring deep expertise. Methodology can also provide ways of identifying and quantifying issues, allowing them to be known and ranked.
Methodologies in common use today sometimes resemble guesswork: trying familiar tools or posing hypotheses without solid evidence. The USE Method was developed to address shortcomings in other commonly used methodologies and is a simple strategy for performing a complete check of system health. It considers all resources so as to avoid overlooking issues, and it uses limited metrics so that it can be followed quickly. This is especially important for distributed environments, including cloud computing, where many systems may need to be checked. This methodology will, however, find only certain types of issues—bottlenecks and errors—and should be considered as one tool in a larger methodology toolbox.
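A hedged sketch of what a USE checklist might look like in code follows; the resources, metric values, and thresholds below are invented for illustration, and a real implementation would read them from /proc, sar, or a monitoring system rather than a hard-coded dictionary.

```python
# Hypothetical metric snapshots, one entry per system resource.
metrics = {
    "CPU":     {"utilization": 0.95, "saturation": 12, "errors": 0},
    "Memory":  {"utilization": 0.40, "saturation": 0,  "errors": 0},
    "Disk":    {"utilization": 0.20, "saturation": 3,  "errors": 1},
    "Network": {"utilization": 0.10, "saturation": 0,  "errors": 0},
}

def use_check(metrics, util_threshold=0.9):
    """For every resource, check Utilization, Saturation, and Errors in turn."""
    findings = []
    for resource, m in metrics.items():
        if m["errors"] > 0:
            findings.append((resource, "errors", m["errors"]))
        if m["saturation"] > 0:
            findings.append((resource, "saturated", m["saturation"]))
        if m["utilization"] >= util_threshold:
            findings.append((resource, "high utilization", m["utilization"]))
    return findings

for finding in use_check(metrics):
    print(finding)
```

The point of the method is visible in the structure itself: every resource is visited, so nothing is overlooked, yet only three metrics per resource are examined, so the full pass stays fast.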
The IBM Blue Gene/Q pushes the edge of technology by providing a leadership-class supercomputer with a homogeneous multi-core architecture and relatively low power consumption. On the June 2012 TOP500® list of supercomputers, four of the top ten systems were Blue Gene/Q’s. Since February 2012, TotalView users at Lawrence Livermore National Laboratory (LLNL), home to the system ranked first on that list, have been using a pre-release version of the TotalView debugger to port codes to take advantage of the new system. TotalView has a track record as the code and memory debugger of choice among users of IBM Blue Gene supercomputers, including JuQueen, the Blue Gene/Q at Forschungszentrum Jülich.
Over at The Exascale Report, Allinea CTO David Lecomber writes that new, innovative approaches could lead to the extreme-scale development tools needed for exascale machines.
A challenge for the future is to ensure that tool performance is maintained. The extra nodes are probably not the greatest challenge: the tree architecture in Allinea DDT can handle that. The primary concern is ensuring that the step-change in chip-level parallelism is handled well by the tools – and that will raise interesting questions for chip and device vendors and operating-system developers, as well as for tool vendors.