Community Response


In the last issue of The Exascale Report, we posted two reader-submitted questions. The editors’ choices for the best responses from the community are listed below. We also offer this comment from Argonne’s Rick Stevens, not as a response to a specific question but as a higher-level consideration:

“I don’t understand why everyone automatically assumes that existing programming paradigms will not scale. It’s not the programming paradigm that usually is the problem but the algorithm. To say we need new algorithms is of course nearly obvious. In my thinking, scale itself is not the problem we *might* need new programming models for. Our challenge is to address issues relating to managing alternative memory hierarchies, architectural changes for power management, computing embedded in memory, reliability, etc. It is likely that only if we fail to get these right will we need new programming models.

“It would be great if every time someone says we need a new programming model, they be asked to give examples where it is the programming model that is preventing things from working rather than the algorithm.”


Andrew Jones
Vice-President
HPC Services and Consulting
Numerical Algorithms Group (NAG)
Question submitted by Andrew Jones, NAG

Q: What should be done about the applications that won’t be able to exploit thousands of GPUs together in a single simulation, for example because of algorithm limitations or legacy coding issues? Will the community be supported in developing and implementing new algorithms for those codes? Or will it be acceptable for those codes to use large numbers of nodes but leave the GPUs idle? Blue Waters had a significant planning and preparation effort with the applications development community – what are the plans in this respect for Titan?

Response: Jack Wells, ORNL

ORNL is committed to helping our users take advantage of Titan through an ongoing program of training events, tutorials, and reference articles (see http://www.olcf.ornl.gov/titan/training-support/). To be sure, work will be required to move applications onto Titan, for example in restructuring codes to more fully express available parallelism and to utilize the hybrid architecture through application of open programming standards, e.g., OpenACC (http://en.wikipedia.org/wiki/OpenACC). In designing Titan, ORNL chose a node architecture with a one-to-one ratio of GPUs to CPUs specifically to allow users to transition from homogeneous to heterogeneous architectures and programming models. ORNL is working with application teams focused on the six codes mentioned in the interview (http://www.olcf.ornl.gov/titan/early-science/) as the first phase of scientific productivity on Titan. Many of the software improvements realized to date are transferable to real speedups on CPU-only, homogeneous architectures. In short, faster codes are being written. We understand that our users run at any and all HPC centers available to them, so the code that runs at ORNL, for example, must also run effectively at Argonne and Lawrence Berkeley National Laboratories.
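To make the directive-based approach concrete, here is a minimal, hypothetical sketch (not drawn from any of the six early-science codes) of how a compute-intensive loop might be annotated with OpenACC so that the compiler can offload it to the GPU:

```c
/* Hypothetical sketch: annotating a simple loop with OpenACC.
 * The pragma asks the compiler to offload the loop to an accelerator;
 * a compiler without OpenACC support simply ignores the directive and
 * the loop runs unchanged on the CPU. */
void scale_and_add(int n, double a, const double *restrict x, double *restrict y)
{
    /* copyin/copy clauses describe data movement between host and GPU memory */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Because the directives are hints layered over standard C (or Fortran), the same annotated source can still be built for a conventional CPU-only node, which is consistent with the point that many of these improvements transfer to homogeneous architectures.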

The majority of time on Titan will be allocated through the INCITE program, managed jointly by the Argonne and Oak Ridge Leadership Computing Facilities. As one should expect, the ability to effectively utilize Titan’s full, hybrid architecture will continue to be an integral part of the INCITE peer-review process. However, overall potential for computational impact will remain the primary basis for evaluating proposals for allocations on our leadership computing resources. For more information on the INCITE program and the selection process, please see https://hpc.science.doe.gov/allocations/calls/incite2012.

Response: Steve Conway, IDC

I suspect algorithm development to exploit CPU-GPU and many-GPU configurations will proceed in a semi-haphazard fashion. As you know, GPUs are still largely in an experimental phase, so it’s early in the game, and GPUs are not well suited to all applications. In industry, market forces will ultimately drive the application choices. In science, application choices will be less straightforward: low-hanging fruit, a site’s willingness/ability to write the algorithms and adapt the codes, funding, and other factors will play roles. I do believe that necessity is the mother of invention, and purchases of large numbers of GPUs by leading HPC sites will need to be justified over time.

Response: Steve Scott, NVIDIA

Future performance gains will come almost entirely from parallelism, and power constraints will dictate that HPC systems become heterogeneous in nature. CPU cores optimized for single thread performance will become increasingly power-inefficient compared to cores optimized for throughput and energy efficiency. The applications that will scale most effectively on future machines are those that expose the most parallelism. With O(GHz) clocks, codes will require O(billion-way) parallelism to achieve exaflop execution rates. If a code can’t take advantage of GPUs or other accelerators, it will risk being left behind, and should likely be restructured.
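As a rough illustration of that arithmetic (the numbers below are order-of-magnitude placeholders, not a machine specification):

```latex
% Order-of-magnitude estimate of the concurrency an exaflop machine needs
% with roughly GHz clocks and a few operations per thread per cycle.
\[
  \frac{10^{18}\ \text{flop/s (exaflop)}}
       {10^{9}\ \text{cycles/s} \times \mathcal{O}(1)\ \text{flop per thread per cycle}}
  \;\approx\; 10^{9}\ \text{concurrent threads}
\]
```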

Compiler directives such as the newly announced OpenACC standard allow the expression of parallelism in a platform-independent manner, capable of mapping to both accelerators and standard multicore CPUs. Both NCSA, with Blue Waters, and ORNL, with Titan, have chosen to embrace GPUs as the logical next step to our heterogeneous, energy-efficient future. Both groups have substantial efforts underway to optimize their codes to enhance scalability and parallelism for future platforms, and both plan to have many GPU-enabled codes ready for the bring-up of their new systems.


John Barr
Research Director High Performance Computing
The 451 Group
Question submitted by John Barr, The 451 Group

Q: It is not the thousands of processors that should concern us, but the millions of cores that are just around the corner, and the billions of threads that will be required to exploit these systems efficiently. I believe that the industry needs radical new programming paradigms for petascale and exascale systems, and that the paradigms we use today won’t scale – and even if they did, there is no realistic provision for resilience. The issues that must be addressed are massive concurrency, program hierarchy, heterogeneity, program portability and application resilience. What programming paradigm will address all of these issues?

Response: Jack Wells, ORNL

Thank you for your question.

ORNL is engaged with our vendor partners Cray, CAPS, NVIDIA, and PGI in developing compilers and libraries for Titan that support the OpenACC standard (http://en.wikipedia.org/wiki/OpenACC) to simplify parallel programming of heterogeneous CPU/GPU systems. We are also engaged with vendor partners in developing performance and debugging tools for Titan, and these will be widely available to the community through those vendors. For more information on these developments, please see http://www.olcf.ornl.gov/titan/development-tools/. We believe that these efforts will enable our users to address the issues of heterogeneity and program portability that you identify.

Moving forward, programming paradigms for exascale computers are, indeed, a big challenge. At ORNL, the paradigm we are advancing with our users is to first evaluate opportunities to restructure application codes to reveal more levels of parallelism that can be expressed on Titan’s massively parallel, hybrid architecture. This essential activity concerns application design and software engineering, in addition to programming. Through such an approach, massive concurrency can be engaged effectively and programs structured hierarchically so that they can be managed efficiently.
Application resilience is an active area of research within the HPC community with many open issues to be resolved.

Response: Steve Conway, IDC

I couldn’t agree more. Unfortunately, few users to date have been willing to move to PGAS or other more-efficient parallel programming models. The main objections are the effort of learning the new models and of rewriting or adapting existing codes. Users typically believe that many codes will need to be rewritten, and in some cases entirely rethought, for use on many-petaflop and exaflop systems, but most are delaying the inevitable as long as possible. Personally, I am also concerned that there may not be enough people on Planet Earth with the appropriate brainpower and skills to develop novel models and algorithms for all deserving applications.

Response: Steve Scott, NVIDIA

You are spot on in this assessment. The number of nodes in the largest systems will grow by only a few times to reach an exaflop, while the number of threads will grow by orders of magnitude. While there are some promising new languages emerging (Chapel or X10, anyone?), the large existing base of high-end HPC code will require a more evolutionary path. Concurrency can be addressed by adding another level of parallelism within the node. Directives are a reasonable approach for doing this, since compilers are still not able to perform the whole-program analysis required to reliably detect safe parallelism on their own. There will likely need to be hard work put into restructuring and optimizing the codes themselves to expose the parallelism that will be required.
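A small, invented example of why such directives matter: when a loop operates through pointers, the compiler generally cannot prove that iterations are independent, so the programmer asserts it explicitly and takes responsibility for that claim (the function below is purely illustrative):

```c
/* Hypothetical sketch: the compiler cannot tell from this code alone
 * whether 'in' and 'out' might overlap, so automatic parallelization is
 * unsafe. The 'restrict' qualifiers and the OpenACC 'independent' clause
 * are the programmer's assertion that the iterations do not conflict. */
void smooth(int n, const double *restrict in, double *restrict out)
{
    #pragma acc kernels loop independent copyin(in[0:n]) copyout(out[1:n-2])
    for (int i = 1; i < n - 1; i++) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}
```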

The trickier part will be adding mechanisms to express locality, so that the compiler and runtime can manage allocation in the memory hierarchy and minimize data movement. This may be feasible via code annotations. We will also need mechanisms for performing fault containment and localized replay in order to make applications more resilient. There are promising approaches along these lines being explored within NVIDIA’s DARPA UHPC Echelon project and elsewhere. In the meantime, the integration of non-volatile memory into the fabric of future supercomputers will enable much faster checkpoint and recovery times, allowing the current paradigm of user-initiated checkpoint/restart to scale much further.
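For readers unfamiliar with the pattern, here is a bare-bones sketch of user-initiated checkpoint/restart; the file path and state structure are invented, and a real application would checkpoint distributed state through MPI-IO or a checkpoint library rather than a single flat file. Faster non-volatile memory simply makes the save and load steps cheaper:

```c
/* Minimal user-initiated checkpoint/restart sketch (illustrative only). */
#include <stdio.h>

typedef struct {
    long   step;         /* current timestep */
    double field[1024];  /* application state (placeholder size) */
} State;

static int save_checkpoint(const State *s, const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t ok = fwrite(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

static int load_checkpoint(State *s, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;                 /* no checkpoint: start fresh */
    size_t ok = fread(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void)
{
    State s = { 0 };
    /* "/nvm/ckpt.bin" is a placeholder for a node-local NVM path */
    if (load_checkpoint(&s, "/nvm/ckpt.bin") != 0)
        s.step = 0;

    for (; s.step < 1000000; s.step++) {
        /* ... advance the simulation by one step ... */
        if (s.step % 10000 == 0)       /* periodic checkpoint */
            save_checkpoint(&s, "/nvm/ckpt.bin");
    }
    return 0;
}
```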

Response: Wolfgang Gentzsch, HPC Consultant

Exaflops Biodiversity

Even the dozens of cores sitting in a small cluster of, say, 8 nodes concern me today, because here we already face two levels of parallelism. Even if we control application parallelism at the node level, we don’t usually control it at the core level: either we waste cores because of memory demands, or we use all cores but then generate uncontrolled memory contention.
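A minimal sketch of those two levels, using the common MPI-plus-OpenMP combination (the work inside the loop is a placeholder; in practice the thread count per node is tuned against memory capacity and bandwidth, which is exactly the trade-off described above):

```c
/* Two levels of parallelism: MPI ranks across nodes, OpenMP threads
 * across the cores of each node. Running fewer threads than cores is
 * one crude way to trade idle cores against memory contention. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Node-level parallelism: each rank owns a slice of the problem. */
    double local_sum = 0.0;

    /* Core-level parallelism: threads share the node's memory. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + (double)(i + rank));

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("result = %f (threads per rank: %d)\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```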

Now, extrapolating this to thousands of processors (x86, GPU, ARM) and millions of cores, what concerns me most (in addition to the millions of cores) is the paradigm of heterogeneity, which seems unavoidable because of our energy-saving demands. This indeed needs a radically new programming paradigm, one which might be heterogeneous itself.

At the core of it (no pun intended), we have the demand to speed up especially the hot spots of our program, the numerical (e.g. algebraic) algorithms. This is Jack Dongarra’s task (Jack here stands synonymously for all the great library guys in this community), and as long as these guys continue to research and deliver, I am not so much worried about this area. In this area, I am also still waiting for the chip companies to develop an Algebraic Algorithm Processor optimized for handling dense and sparse matrix and vector operations.
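As a concrete (and deliberately simple) example of leaning on the library developers rather than hand-tuning the hot spot yourself, a dense matrix multiply can simply be handed to whatever tuned BLAS the site provides; the CBLAS call below is standard, though header names and link lines vary by vendor:

```c
/* Delegate the dense hot spot to a vendor-tuned BLAS (DGEMM):
 * C = 1.0 * A * B + 0.0 * C, with all matrices n x n, row-major. */
#include <cblas.h>

void dense_hotspot(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                     B, n,
                0.0, C, n);
}
```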

But there also lies a chance in this heterogeneity: with x86, GPU, and ARM architectures on the cluster node, we would have the choice of matching a set of specific operations (a computational object) onto the best-suited architecture component; we could even ‘build and buy’ the machine according to the requirements of our application program – some people call it co-design. And the diversity of species present in an ecosystem can be used as one gauge of the health of that ecosystem (a line I borrowed from a biology lesson).

Besides these ‘hot spots’, the rest is in any case Amdahl garbage (scalar dirt), and the only way to handle it is to reduce it to a minimum.
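For readers who want the formula behind that remark, Amdahl’s law makes the point precise: if a fraction p of the work parallelizes and the remaining 1 − p is the scalar dirt, the speedup on N cores is

```latex
% Amdahl's law: p = parallel fraction, N = number of cores.
\[
  S(N) \;=\; \frac{1}{(1 - p) + \dfrac{p}{N}},
  \qquad
  \lim_{N \to \infty} S(N) \;=\; \frac{1}{1 - p}
\]
```

so even a code that is 99 percent parallel can never run more than 100 times faster, no matter how many cores it is given.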

For related stories, visit The Exascale Report Archives.