EU Searches for New ‘Super Model’ in TEXT Project


Towards EXascale ApplicaTions (TEXT)

Even the most successful superstars eventually fade from the scene. Parallel programming models are no exception (yes, I’m talking about you, MPI). Researchers in several EU centers are on a search for a fresh “star” model that might appeal to the next generation of HPC developers — Gen Ex(ascale). A collaborative team of researchers in Greece, Germany, France, Spain, Switzerland, and the UK think they may have found an attractive candidate in the StarSs programming model. They will undertake a series of pragmatic tests, using petascale machines in various HPC centers across the EU and a challenging group of real-world applications, to determine if the MPI + StarSs model (implemented as MPI/SMPSs) shows promise in supporting application development for exascale environments. The project, called TEXT — Towards EXascale ApplicaTions, is part of the EU’s competitive focus on taking a leadership role in exascale software development.

Ask the HPC community about the challenges in moving to an exascale-capable programming model and you will be barraged with a formidable list. Any given discussion is likely to include synchronization, portability of mainstream MPI-based codes, programmer accessibility, fault tolerance, reduction of communication, dealing with heterogeneity, and, of course, power management. The TEXT project won’t take them all on, but it will confront a few of the biggest ones. The HPC community is asking, “Can the expense of the exascale effort be justified without providing a clear migration path for tried-and-true workhorse codes?” The MPI/SMPSs programming model is based on some not-so-new raw materials and concepts, but is thought to be one way to provide that portability while addressing key requirements for increased performance, enhanced asynchrony, and support for a variety of possible heterogeneous architectures.

Increased Performance and Scalability of Existing Codes

[Image: Jesus Labarta, BSC]

Dr. Jesus Labarta of the Barcelona Supercomputing Center (BSC), the Technical Manager for the project, describes the overall goal. “The main objective is to demonstrate that the hybrid MPI/SMPSs model (see sidebar, “StarSs Nomenclature”) is able to significantly improve the performance and scalability of existing production MPI applications. The project will deploy our current implementation of the MPI/SMPSs compiler and runtime on large production machines at different Partnership for Advanced Computing in Europe (PRACE) centers (Forschungszentrum Jülich (FZJ), Germany; the High Performance Computing Center Stuttgart (HLRS), Germany; and EPCC, the supercomputing centre at the University of Edinburgh, United Kingdom; see sidebar, “Quick View”), and will demonstrate how seven relevant applications will benefit from it. MPI/SMPSs provides an incremental and smooth path to achieve overlap between computation and communication, to overcome the overly synchronous character of other hybrid approaches, and to automatically load-balance the application. The project will also advance the status of tools and development environments for the hybrid MPI/SMPSs model, develop locality and load-balance optimizations…. Finally, the project will try to influence the evolution of future standards, in particular OpenMP.”
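
To make the first benefit Labarta names visible, here is a minimal sketch of overlapping communication and computation. It uses plain nonblocking MPI so the idea stands on its own; in MPI/SMPSs the same overlap falls out of taskifying the work around the communication. The halo arrays, interior_work(), and the neighbor ranks are illustrative, not taken from the project’s codes.

    /* Overlap of communication and computation with nonblocking MPI.
     * halo_in/halo_out, interior_work() and the ring of neighbors are
     * placeholders for illustration. */
    #include <mpi.h>

    #define HALO 1024
    static double halo_in[HALO], halo_out[HALO];

    static void interior_work(void)
    {
        /* placeholder for computation that does not need the halo */
    }

    int main(int argc, char **argv)
    {
        int rank, nranks;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        int left  = (rank - 1 + nranks) % nranks;
        int right = (rank + 1) % nranks;

        /* Start the halo exchange... */
        MPI_Irecv(halo_in,  HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(halo_out, HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ...and compute on the interior while the messages are in flight. */
        interior_work();

        /* Only the work that really needs the halo waits for it. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        MPI_Finalize();
        return 0;
    }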

Others have recognized the value of an evolutionary path for current MPI-based application software, and one of the approaches frequently mentioned as needing additional research is the coupling of a model for parallelism within a node with MPI for the placement of tasks onto nodes. One recommendation being evaluated by the community is to extend languages such as Co-array Fortran or UPC with semantics that make them safe to operate alongside MPI. StarSs is another approach.

[Image: Jose Gracia, HLRS]

TEXT research team member Dr. Jose Gracia of High Performance Computing Center Stuttgart (HLRS) says the blending of StarSs and MPI is essential. “Regarding MPI: In its SMPSs incarnation, StarSs only supports single-node parallelization, so MPI is the vehicle to bring it to large scales on many thousands of cores. The fact that we only use one MPI task per node rather than one for each core on a node, hopefully allows us to overcome the (perceived) scaling issues of MPI.”
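
To make that division of labor concrete, the sketch below shows the hybrid pattern Gracia describes: one MPI rank per node, with node-level tasks feeding the cores. OpenMP tasks are used as a stand-in for the SMPSs runtime, and compute_block() and the block decomposition are hypothetical, not drawn from the project’s applications.

    /* Hybrid sketch: one MPI rank per node, node-level tasks across its cores.
     * OpenMP tasks stand in for the SMPSs runtime; compute_block() and the
     * block decomposition are illustrative placeholders. */
    #include <mpi.h>
    #include <stdio.h>

    #define NBLOCKS   64
    #define BLOCKSIZE 4096

    static double blocks[NBLOCKS][BLOCKSIZE];

    static void compute_block(double *b, int n)
    {
        for (int i = 0; i < n; i++)          /* placeholder node-local work */
            b[i] = b[i] * 0.5 + 1.0;
    }

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Launched with one rank per node; request thread support so the
         * node-level tasking runtime can coexist with MPI communication. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        #pragma omp single
        {
            /* Each block becomes an independent task, so every core of the
             * node is fed from this single MPI rank. */
            for (int b = 0; b < NBLOCKS; b++) {
                #pragma omp task firstprivate(b)
                compute_block(blocks[b], BLOCKSIZE);
            }
        }   /* all tasks finish at the implicit barrier closing the region */

        /* Inter-node coupling still goes through MPI. */
        MPI_Allreduce(MPI_IN_PLACE, &blocks[0][0], 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d ranks (one per node), blocks[0][0] = %f\n",
                   nranks, blocks[0][0]);

        MPI_Finalize();
        return 0;
    }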

Gracia seems confident of the model’s ability to scale to a very large number of cores. “We believe that the programming model will meet the challenge of allowing applications to scale up to the level of many tens of thousands of cores.”

Tasking Capabilities and Increased Asynchrony

[Image: Dimitris Nikolopoulos, FORTH]

Another member of the research team, Dr. Dimitris Nikolopoulos of the Foundation for Research and Technology (FORTH) in Greece, stresses the importance the project places on tasking capabilities. “The project augments and enhances existing codes written in MPI with so-called tasking capabilities to extract more parallelism with the aim of exploiting the many cores you find on a chip or node within a supercomputer. The project targets a specific set of applications and a specific set of machines. The applications currently scale to tens of thousands of cores, and the project goal is to scale to more cores using a new programming model that actually complements MPI. The machines are all petascale class and available at EU sites.”

Jesus Labarta points out that the model has demonstrated key advantages for increasing asynchrony. “We do believe that exploiting more unstructured parallelism than OpenMP or many other node-level models is a very important characteristic to introduce asynchrony and malleability in parallel programs. We strongly believe these two features will be extremely important in the future and are not very well supported by many existing or proposed programming models. Our experience of evaluations of other models (CAF, X10, UPC, CUDA, OpenCL) in PRACE showed a lack of support for some of the features that we think are crucial. MPI/SMPSs does support those features; we have reported very good performance results in some codes, and we believe the status of our installation is sufficiently stable (competitive with other products), supporting both C and Fortran.”

Jose Gracia reiterates the value of asynchrony, pointing to Amdahl’s Law: “Synchronization barriers across the machine are expensive even today at rather smallish scale, but will be horrendously so on, say, 100,000 cores. Synchronization points expose load imbalance, which, in view of Amdahl’s Law, is just unacceptable even at the level of 1% (typical OS jitter).”
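
For a concrete sense of the arithmetic behind that 1% figure, Amdahl’s Law bounds the achievable speedup by the serial (or load-imbalanced) fraction s; the numbers below simply restate Gracia’s point and are not from the project:

    S(P) = 1 / (s + (1 - s)/P), which approaches 1/s as P grows
    s = 0.01 (1% jitter or imbalance)  =>  S <= 100,
    no matter how many of the 100,000 cores are available.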

Back to the Future: A Superscalar Conceptual Framework

The independent, dynamically scheduled instructions that were inherent in superscalar processors of the 90s provide the conceptual framework for StarSs, and a possible pathway to increased asynchrony and throughput. Dimitris Nikolopoulos describes it this way. “The Ss in StarSs stands for superscalar, what you probably know as dynamic instruction scheduling in processors that appeared in the 90s. You ask the programmer to annotate the code to specify regions of code as tasks. These tasks are scheduled dynamically, pretty much the same way a superscalar processor would schedule instructions. The superscalar processor would take an instruction, see the operands of the instruction, and [know] if a dependence is pending. Am I able to execute this instruction, or does the instruction have to wait for some results to be produced by other instructions? This is what the proposed programming model SMPSs does, but we are not talking about instructions, we are talking about collections of instructions annotated in the code that we call tasks. The programming model takes these tasks that have been pinpointed by the programmer, analyzes the dependencies that may exist between these tasks, and schedules the independent tasks dynamically.”
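
To make the analogy concrete, here is a minimal sketch of that annotation style. It uses OpenMP 4.x task dependences (a standard the project hoped to influence) as a stand-in for the SMPSs pragmas themselves, and the arrays and arithmetic are purely illustrative: the programmer declares what each task reads and writes, and the runtime builds the dependence graph and dispatches ready tasks, much as a superscalar core tracks instruction operands.

    /* Superscalar-style tasking in miniature: each task declares what it reads
     * and writes, and the runtime dispatches tasks as their operands become
     * ready. OpenMP task dependences stand in for the SMPSs annotations. */
    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double a[N], b[N], c[N];

        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < N; i++) {
            /* "Instruction" 1: produces a[i]. */
            #pragma omp task depend(out: a[i])
            a[i] = i * 1.0;

            /* "Instruction" 2: produces b[i]; independent of the task above,
             * so the runtime may run the two concurrently. */
            #pragma omp task depend(out: b[i])
            b[i] = i * 2.0;

            /* "Instruction" 3: consumes a[i] and b[i]; the runtime holds it
             * back until both producers have finished, with no barrier. */
            #pragma omp task depend(in: a[i], b[i]) depend(out: c[i])
            c[i] = a[i] + b[i];
        }   /* tasks complete at the implicit barrier closing the region */

        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }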

Down with Barriers!

According to Dimitris Nikolopoulos, “Barriers keep the code from accessing all of the available parallelism. If you have millions of cores, to be able to exploit them you have to be able to expose millions of pieces of independent code. Current programming models do not do that. It’s as simple as that, and this is one of the things that motivates a project like TEXT.” Dynamically scheduled independent tasks carry a very important advantage for scaling up — the potential elimination of barriers. Nikolopoulos explains, “Most current parallel codes are written in what I would say are very conservative means of synchronization. When people want to synchronize they put in a barrier, forcing all the processors of the machine to reach a certain point of execution and agree that they have reached that point. This is a global synchronization. This programming model (MPI/SMPSs) can completely eliminate the need for barriers. You say, ‘As soon as a piece of code is dependent on another piece of code and that other piece of code finishes, then the processors can continue.’” Nikolopoulos stresses that they are not all-to-all dependencies. “They are point-to-point dependencies. This reduces the synchronization overhead. This means you have more asynchrony. This is good because it translates into more available parallel code, allowing you to use more cores.”
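
As an illustration of the point-to-point style Nikolopoulos describes, the sketch below contrasts it with the conservative barrier approach. Again, OpenMP tasks stand in for the SMPSs runtime; update() and smooth() are hypothetical kernels, and each block’s first element serves as a dependence sentinel for the whole block.

    /* Global barrier vs. point-to-point dependences. A conservative code
     * would put a barrier between the two phases; with per-block dependences
     * each smooth() task starts as soon as its own update() has finished. */
    #define NB 32
    #define BS 1024

    static double grid[NB][BS];

    static void update(double *blk) { for (int i = 0; i < BS; i++) blk[i] += 1.0; }
    static void smooth(double *blk) { for (int i = 0; i < BS; i++) blk[i] *= 0.5; }

    int main(void)
    {
        #pragma omp parallel
        #pragma omp single
        {
            for (int b = 0; b < NB; b++) {
                #pragma omp task depend(out: grid[b][0])
                update(grid[b]);
            }

            /* A conservative version would place a global
             *     #pragma omp taskwait
             * here, stalling every core until the slowest block is done.
             * With the per-block dependences it is simply not needed. */

            for (int b = 0; b < NB; b++) {
                #pragma omp task depend(in: grid[b][0])   /* waits only on block b */
                smooth(grid[b]);
            }
        }
        return 0;
    }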

Given this increased asynchrony, some might be concerned about correctness. Nikolopoulos points out that the new model takes that into account. “The programming model has some formal semantics that guarantee correct execution. In fact, in superscalar processors, which have sequential program semantics, when you run a program it appears to you as a sequence of statements that you have listed in your program yourself. The SMPSs programming model has the exact same semantics. So, correctness is guaranteed. But in reality, at runtime, you allow asynchrony in parallelism. The way the programming model is implemented, it guarantees sequential semantics and correctness, but at the same time allows the runtime system to extract more parallelism.”

Access to Important Codes and Petascale Systems

The applications selected for testing in the TEXT project meet a number of criteria. They are mature codes with refined algorithms, represent the state of the art in scaling and performance, and are currently running effectively on petascale machines. Another criterion was the ability to get practical evaluations from users in a production environment. According to Jose Gracia, this requires “an existing user base or intimate knowledge of the code at the respective center. After initial porting of the applications, we would like to ask the user base to evaluate our implementation in their production environment with real problems rather than our synthetic benchmarks.”

“These are important applications to [project partners] and their customers,” says Dimitris Nikolopoulos. “They selected applications that were both really challenging and of interest to wide communities of scientists. The parallel computing patterns that are exhibited within these applications are very diverse. You have dense linear algebra, sparse linear algebra, graph algorithms, and irregular computations. So, the applications cover an interesting spectrum of patterns that you find in parallel algorithms. At the same time, they are of great interest to their authors and the communities that they serve.”

Clearly, this is an example of impressive pan-European collaboration, leveraging extremely important and jealously guarded production codes, hugely expensive petascale systems in key HPC centers of excellence, and strong, funded academic resources.

A Broad Distribution of Work and Subprojects

There is a broad distribution of subprojects across EU locations and teams within the TEXT project. The project will include groups working on: compiler and runtime, led by BSC in collaboration with FORTH and others; porting of applications and libraries with contributions from HLRS, FZJ, University of Manchester, and IBM; and tools support including debuggers, performance analysis, and performance prediction from HLRS, FZJ, IBM, BSC and others.

“There are some partners providing an application/code (University of Manchester, HLRS, FZJ, UJI, IBM, UPPA, EPCC) that will port their existing codes to MPI/SMPSs, as well as explore algorithms that better use the asynchrony supported by the model. Programming model developers (BSC) and partners experienced in runtime systems (FORTH) will work on the deployment and support the application developers in their porting. Work on developing debugging environments and performance tools will be done by HLRS, FZJ, and BSC. Improvements in the runtime (locality-aware scheduling and automatic load balancing support) and [a] GPU version of the runtime will be done by BSC and UJI,” says Jesus Labarta. This is another example of savvy geographic distribution of projects to leverage substantial expertise and to ensure support and funding across the EU.

Programmer Accessibility, Improved Interfaces

Programmer accessibility, simplicity of the interface, and access to the algorithms are also said to be key for any new parallel programming model. Gracia, Nikolopoulos, and Labarta all agree on the importance of making the new programming model more accessible to HPC programmers. Says Gracia, “Typically they are not highly skilled in parallel computing models and techniques, and are neither interested nor prepared to spend many months or years optimizing their codes for specific architectures or communication networks in order to reach high efficiency in terms of raw performance and scalability. The ideal programming model would make parallel programming…simple, but still allow it to exploit hardware efficiently. One possible path toward that ideal is to allow the programmer to share as much knowledge as possible about the algorithm at hand and leave it to the programming model’s compiler or runtime or hardware to figure out how best to parallelize it.” This is another area where the TEXT team hopes that the MPI/SMPSs model may prove useful.

The StarSs team shares a desire for simplicity of expression of parallel work with the other major programming model effort in the exascale community right now: the ParalleX project being pursued by Thomas Sterling at Louisiana State University. The MPI/SMPSs model can be described as an evolutionary approach that allows developers to continue to leverage the existing base of MPI applications. ParalleX, in contrast, is a decidedly revolutionary approach that will likely bring with it the requirement to re-implement applications from the ground up.

Jesus Labarta suggests that developers need a better interface and sense of control to succeed at exascale development, pointing out that “An interface that releases the programmer [from controlling] all the details of how a program is executed is extremely important, for portability reasons in particular. Some programmers may want to control every single detail of how a program is scheduled and executed. Our approach goes in the direction of assigning such responsibility to the runtime. We think the programmer has to focus on the algorithm and on better understanding and informing the runtime of the data and computation flow. Programs in SMPSs let the programmer concentrate on the algorithm and still provide such information to the runtime.”

Nikolopoulos concludes that the challenge is “making the programming model so that it is not something reminiscent of assembly code, but more like sequential code. People have been trying to achieve this for many years. In all that time, we have failed to bring the performance achieved through handmade MPI programs to an easier-to-develop model such as those that use the abstraction of, say, shared memory. The gap seems to widen, rather than narrow.”

Project Funding

Jesus Labarta reports that the project budget is “in the order of €3.5 million with a European Commission contribution of €2.47 million. The project has been funded as part of the call INFRA-2010-1.2.2. The partners are also involved in the International Exascale Software Project (IESP), the European Exascale Software Initiative (EESI) and PRACE.” Labarta adds, “Information technology in general, and HPC specifically, are currently drawing considerable attention from the European Community and receive a sizeable share of the Framework Programme 7 budget.” Dimitris Nikolopoulos points out that, “Though it shares partners with PRACE, TEXT is not directly related to the PRACE initiative, which is targeted at creating a multi-tiered supercomputing infrastructure across the EU. It belongs to another category of infrastructure funding programs, which are oriented toward services and not to building hardware or physical infrastructure.”

Early Steps on a Long Road

Many questions are being asked about these early exascale projects. For example, how do we get to an exascale-ready software stack, or even a programming model, without knowledge of the architecture? Is there any real chance that the software research can inform the design of the hardware? Will any of the open environment software tested in projects such as TEXT be effective on a commodity-heavy, heterogeneous mix of processors with a smattering of special-purpose accelerators thrown in?

Even in light of this uncertainty, the TEXT project appears to be a practical test of a candidate MPI-based programming model that may hold promise for answering a few key questions about programming models for exascale. That the actual production tests will be conducted on PRACE petascale systems using top-shelf HPC codes is impressive. The equally impressive team of European experts that makes up the TEXT project is betting that MPI will be with us for the long haul and that a programming model that runs on top of both shared and distributed memory will be applicable for exascale. Any new parallel programming model may have to be effective for both major paradigms of parallel architecture, because, as Dimitris Nikolopoulos points out, “We just don’t know which one will win the race.”

The Exascale Report will follow the TEXT project with great interest and keep you posted as to its progress.

StarSs Nomenclature

“StarSs is the generic name for the node level model. The idea comes from a ‘file expansion’ notation *Ss where the Ss stands for Superscalar and the * can have many possible values corresponding to the different target architectures. The runtime implementation may be different for different target architectures and thus we refer to SMPSs for the runtime implementation on top of general-purpose multicores and SMPs. Out of the several implementations, SMPSs targets the most widely available platform and is the target of the TEXT project. The targeted hybrid programming model would, therefore, be MPI/StarSs, as opposed to the widespread MPI/OpenMP. The actual implementation that will be used in the project is thus MPI/SMPSs,” explains Jesus Labarta.

Quick View: TEXT (Towards EXascale ApplicaTions) Project

Parallel programming model chosen

  • StarSs + MPI
  • Implemented as MPI/SMPSs

HPC centers and institutions involved in the project

  • Barcelona Supercomputing Center (BSC), Spain (coordinator)
  • High Performance Computing Center Stuttgart (HLRS), Germany
  • Forschungszentrum Jülich (FZJ), Germany
  • EPCC, the supercomputing centre at the University of Edinburgh, United Kingdom
  • Foundation for Research and Technology (FORTH), Greece
  • University of Manchester, United Kingdom
  • Université de Pau (UPPA), France
  • Universitat Jaume I (UJI), Spain
  • IBM Zurich, Switzerland

Applications being used to test the MPI/SMPSs hybrid (also referred to in this article as StarSs/SMPSs).

  • Linear Algebra libraries
  • SPECFEM3D (Geophysics)
  • PSC
  • PEPC and BEST (Plasma)
  • CPMD (Molecular Dynamics)
  • LS1 dyn (Combustion)

Project Duration: 24 months, starting June 2010

Project Methodology:

TEXT is among the first projects to confront the reality that a new parallel programming model will be central to scalability in an era of many-core processors within heterogeneous, hierarchical systems. That’s a lot to take in, but the project takes a very pragmatic approach. The steps, as this reporter understands them, are:

  • Identify a good programming model candidate for evaluation. StarSs/SMPSs was chosen as the basis of the project.
  • Select a set of complex, diverse codes that both represent a broad array of algorithms and are near and dear to the hearts of industrial, governmental and scientific users.
  • Get access to those codes and modify them to run under the new model.
  • Run the applications on genuine petascale systems, while optimizing their performance through the new model.
  • Measure, Lather, Rinse, Repeat, Report.
  • If a measurable outcome supports the potential of the MPI/StarSs model to go to extreme scale, start selling the idea to the exascale community.

Funding

  • €3.5 million with a European Commission contribution of €2.47 million.
  • Part of the call INFRA-2010-1.2.2.
  • The project partners are deeply involved in IESP, EESI, and PRACE.