Moving beyond Pax MPI into the Exascale

The execution model is the machine

As part of our series with some of the big thinkers in HPC today, insideHPC talked with Thomas Sterling (the Arnaud and Edwards Professor of Computer Science at Louisiana State University) about execution models and their role in the next generation of exascale computing.

As Sterling says in his talks, the execution model is the machine: not as a virtual machine, but as the vertical cross-cutting model that binds all layers into a single operational domain. We routinely interact with aspects of the execution model, ways of looking at the system that embody some subset of its fundamental characteristics. MPI, for example, isn’t an execution model; it is an API that lets developers create applications that cooperate on a single task in the Communicating Sequential Processes (CSP) execution model. The same is true of the operating systems and architectures that manifest other aspects of the CSP model, into which most of our computing falls today.

In many respects the execution model is an exercise in formalism, particularly when one looks back and realizes that through all the previous phases of computation our models have adapted to the hardware that designers presented us with, which was itself a reaction to the available technology. We’ve gotten along just fine without a lot of upfront work on models, so why bother now?

It is Sterling’s take that exascale computers will not be a refinement of where we are today. The hardware, system software, and application infrastructure needed to support billion-way parallelism are a fundamental departure from the architecture curve we have been on to date, and we need to step back and critically assess where we want to go before we strike off in whatever direction seems most attractive at the time. We probably can get to the exascale without doing a lot of heavy lifting on the execution model upfront, but with so much at stake, why take the chance?

Sterling and his team have proposed a model of exascale computing called ParalleX (you can read a good overview of execution models and ParalleX in Sterling’s talk from earlier this year to the NSA ACS Research Program Workshop). A complete description of ParalleX is beyond the scope of this article, but that presentation is a good place to start (as is the ParalleX website or this paper, IEEE CS Digital Library subscription required). Some of the key points of ParalleX are that it supports locality domains that express the hierarchical nature of exascale machines, with attributes like asynchronous action between localities but synchronous action by resources (processors) within a locality. ParalleX also has a global address space, but it is not cache-coherent, and it supports split-phase transactions so that there are no hidden idle time penalties for remote access. As you might expect from a model meant to support billions of cooperating threads, there is support for failure with micro-checkpointing. Interestingly, the model supports message-driven computation, with parcels that can carry work to data and support for continuation, a key element in both fault-tolerance and minimization of data movement.
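
The parcel and split-phase ideas described above can be illustrated with a small sketch. This is not the ParalleX or HPX API: the `Locality` class, the `remote_sum` helper, and the tuple layout of a parcel are all hypothetical names invented for illustration, and ordinary Python threads stand in for localities. The point is only the shape of the interaction: a parcel carries an action to where the data lives, the sender keeps working, and a continuation says what happens with the result.

```python
import queue
import threading


class Locality:
    """Hypothetical stand-in for a locality: owns data and runs parcels sent to it."""

    def __init__(self, name, data):
        self.name = name
        self.data = data              # data that lives at this locality
        self.inbox = queue.Queue()    # incoming parcels
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def send(self, parcel):
        # Asynchronous: the sender returns immediately (first phase of split-phase).
        self.inbox.put(parcel)

    def _run(self):
        while True:
            parcel = self.inbox.get()
            if parcel is None:        # shutdown marker (a convention of this sketch)
                break
            action, continuation = parcel
            result = action(self.data)    # the work is applied where the data lives
            continuation(result)          # the continuation decides what happens next

    def stop(self):
        self.inbox.put(None)
        self.thread.join()


def remote_sum(locality):
    """Send a 'sum' action to the data; return an event and a slot for the result."""
    done = threading.Event()
    box = {}

    def continuation(value):
        box["value"] = value
        done.set()

    locality.send((sum, continuation))
    return done, box


remote = Locality("remote", list(range(1000)))
done, box = remote_sum(remote)                # phase 1: issue the request
local_work = sum(i * i for i in range(10))    # the caller keeps doing useful work
done.wait()                                   # phase 2: consume the result when needed
remote.stop()
print(local_work, box["value"])
```

Only the action and a small result cross between the two sides; the thousand-element list never moves, which is the data-movement argument the model makes.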

It’s not a topic that gets a lot of play, but Sterling says that he’s given so many presentations in the past month that he’s starting to lose track, a sign that his ideas are gaining momentum.

insideHPC: Based on what I’ve read of your research, it is incorrect to say that MPI is an execution model but it implicitly rides on top of an execution model. As a starting point, can you describe the common programming model that most of us are working with today in MPI-based large-scale application development?

Thomas Sterling: MPI is an API (forgive the TLAs) that reflects the underlying model of computation referred to as “Communicating Sequential Processes” and informally and inexactly referred to as “message-passing.” One of the challenges of depicting an execution model is that almost any representation looks like an API itself. But the key distinction is that a programming model asserts a layer of abstraction separating the underlying system (including both hardware and software) from the programming environment and application codes. But taking on your challenge about the CSP model, see the accompanying image (Comparing CSP to ParalleX) for some quick thoughts.

This depiction of the CSP model focuses on MPI-1. Additional points, such as the single-sided accesses in MPI-2, should be considered as well, although those are a corruption of the original CSP model.
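
For readers who want a concrete picture of the CSP style, here is a minimal sketch using Python threads and a bounded queue rather than MPI itself. The `producer` and `consumer` names and the `None` end-of-stream marker are conventions of this sketch, and a queue of size one only approximates the rendezvous of the formal model. What it does show is the essential property: each process is sequential, holds private state, and interacts with others only by sending and receiving messages.

```python
import queue
import threading


def producer(channel, n):
    # A sequential process: its only interaction with others is sending messages.
    for i in range(n):
        channel.put(i)       # "send"; blocks while the one-slot buffer is full
    channel.put(None)        # end-of-stream marker (a convention of this sketch)


def consumer(channel, results):
    # Another sequential process with private state; blocks on each receive.
    total = 0
    while True:
        msg = channel.get()  # "receive"; blocks until a message arrives
        if msg is None:
            break
        total += msg
    results.append(total)


channel = queue.Queue(maxsize=1)   # small buffer approximates synchronous hand-off
results = []
procs = [
    threading.Thread(target=producer, args=(channel, 5)),
    threading.Thread(target=consumer, args=(channel, results)),
]
for p in procs:
    p.start()
for p in procs:
    p.join()
print(results[0])
```

Every blocking `get` here is time the consumer spends idle, which is the kind of waiting the interview goes on to discuss.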

insideHPC: The exascale model you propose seems to be much richer than our current execution model, since it includes control of physical resources in the bulk hardware that we don’t manage now (like power). Are these attributes that will necessarily be managed explicitly by applications, or will other layers of the system manage some of those items cooperatively with the application?

Sterling: As you note, the emerging ParalleX model encompasses dimensions of reality ignored (reasonably) by previous generation models. But now these will dominate future Exascale systems. The key dimensions that must be considered by the model and be realized at some level in the system stack (in most cases below the programming level) are:

  • Scalability
  • Efficiency
    • Mitigation of starvation, latency, overhead, and waiting for contention (SLOW)
  • Power
    • Off by two orders of magnitude or more from where we need to be in terms of total average energy per operation (25 pJ then versus 5000 pJ now)
  • Reliability
    • Single point failure mode MTBF will be less than the time required to checkpoint the system-wide memory
    • Requires a model that is fail-safe, i.e., operates through single point failures
  • Programmability
    • Removes many of the current burdens from the programmer, especially in managing resource allocation and task scheduling
    • Facilitates exposure of substantially more parallelism than is exhibited today
    • Hides performance implications of invisible system behavioral properties such as cache misses, TLB misses, and system-wide latency
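
To put the energy figures in the Power item in perspective, the following back-of-the-envelope calculation converts energy per operation into total system power. It assumes a sustained rate of 10^18 operations per second and that every operation costs the quoted average energy; both are simplifying assumptions made here for illustration, not figures from the interview beyond the two pJ values.

```python
OPS_PER_SECOND = 1e18   # assumed: one exa-operation per second, sustained
PJ = 1e-12              # joules per picojoule


def system_power_watts(energy_per_op_pj, ops_per_second=OPS_PER_SECOND):
    """Power = (energy per operation) x (operations per second)."""
    return energy_per_op_pj * PJ * ops_per_second


now_w = system_power_watts(5000)    # the "now" figure quoted in the list
target_w = system_power_watts(25)   # the "then" (target) figure

print(f"at 5000 pJ/op: {now_w / 1e9:.1f} GW")
print(f"at 25 pJ/op:   {target_w / 1e6:.1f} MW")
print(f"ratio: {now_w / target_w:.0f}x")
```

Under these assumptions the gap is a factor of 200: about 5 GW at 5000 pJ per operation against about 25 MW at 25 pJ, consistent with the "two orders of magnitude or more" in the list.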

Most of these will not be exposed to the programmer (of course they could be, and PXI, a very low-level API for ParalleX, does expose them) and will be handled in part by compilers but more often by an advanced runtime system and, in the case of overhead, by new architectural mechanisms. My opinion is that programmers should expose the parallelism of their application algorithms using an easily adopted and highly flexible programming model (I suggest event-driven threads semantic constructs) and allow the runtime system to exploit runtime knowledge, not available to the programmer or the compiler, to dynamically and adaptively optimize scheduling and resource allocations within the constraints of causality. Yes, post mortem analysis can even offset this last caveat in some cases and to some degree.

insideHPC: Is it the case that we can decide on a model of execution now and start the application and algorithm work ahead of having the hardware? I assume we can at least simulate how those systems will behave if we have the model now — is that part of the point of having this discussion now?

Sterling: My position is that we can and we must, driven by prior experience.

Before MPI was developed, there were years of earlier programming languages and software systems, both open source and proprietary. The CSP model itself was put forth in the late 1970s. You may remember Hoare’s Occam, ORNL’s PVM, and many others before the adoption of MPI in the early 1990s by the community. That’s more than a decade building up to MPI, and we only have that much time now if we are going to be prepared for the end of the next decade.

Indeed, given that we have to build the software stack and allow it to mature in readiness for production quality computing by 2018-2020, we are already behind, if history is to be our guide. However, in fairness, we are not so naïve as we were; we’ve been around this block a couple of times now (CSP, SIMD, Vectors, Dataflow) and if we choose to, we can move forward with professional alacrity. Furthermore, we don’t have to get it entirely right the first time; we never do. But we do have to get it sufficiently complete such that it can deliver useful and superior performance with respect to conventional practices. It’s not that we can’t exploit incremental methods of improvement; we just have to get on the right slope before we can let incrementalism reengage.

We can decide on many attributes of a model now knowing the kinds of properties required, such as the exposure of substantially more forms of parallelism or intrinsic system-wide latency hiding. More than one model may exhibit these properties and other desirable ones. ParalleX is an example of such a model, not necessarily the final right model. But it does provide an exemplar and formalism for experimentation of a number of powerful ideas which in some cases have been considered for well over a decade (e.g., message-driven computation, futures). We can certainly use simulation to explore future system operational properties such as those using new architectures.

But there is an easier way to employ and evolve a ParalleX-like model. Even today, some applications like Adaptive Mesh Refinement problems or dynamic graph-based applications are constrained in their scalability and efficiency using conventional programming techniques. The forms of parallelism and heavyweight synchronization semantics and mechanisms are not well-suited to these problems. Applications that take many weeks to complete nonetheless have been shown to scale to fewer than a thousand cores. A model like ParalleX may mitigate some of these constraints even now on conventional MPPs, or even commodity clusters. Implementations made practical on today’s systems therefore may be of some utility even now, at least for some special cases.

This would suggest the development of new runtime system software optimized for lightweight user threads dynamically scheduled. Our team has implemented an early example of such a runtime system, HPX. Others, like Cilk, have shown that parts of such a model can be realized efficiently. So we do not have to wait to get some useful work done with advanced models, nor to get meaningful experience with such models leading to evolved and advanced systems. It can start now. The recent RFI for the DARPA UHPC program suggests that such an approach may be valuable in the near term.
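
As a rough illustration of what "lightweight user threads dynamically scheduled" means for the programmer, the sketch below submits tasks of very uneven cost to a pool and lets the pool decide which worker runs which task and when. It uses Python's standard `concurrent.futures` module, not HPX or Cilk, and the `refine` function and the `cells` list are hypothetical stand-ins for irregular work such as mesh refinement. Python threads will not show real speedup on this CPU-bound loop; the sketch is about the programming pattern, not performance.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def refine(cell):
    # Hypothetical irregular task: its cost grows with the cell's refinement level.
    level, value = cell
    return sum(value for _ in range(10 ** level))


# (level, value) pairs; illustrative data only.
cells = [(1, 2), (3, 1), (2, 5), (0, 7)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # The program only exposes the parallelism; the pool schedules it.
    futures = {pool.submit(refine, c): c for c in cells}
    results = {}
    for fut in as_completed(futures):     # results arrive in completion order
        results[futures[fut]] = fut.result()

total = sum(results.values())
print(total)
```

The program never states which task runs where or in what order, leaving those decisions to the runtime, which is the division of labor Sterling argues for.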

insideHPC: Up to now our community has just adapted to the hardware it was given, rather than planning a model out first. What are the benefits of deciding the model first? How does it help us get to exascale faster, or improve our experience once we’re there?

Sterling: So this is the deep question, and the thing that separates the reality of this approach from some mere academic exercise.

Historically (with the possible exception of dataflow), the hardware architectures were reactions to the technology opportunities and challenges, while the execution models were derived implicitly to support the new architecture classes and guide the software and programming models that were derived to employ those architectures and their underlying enabling technologies. The dataflow model was developed first, and architectures attempted to exploit it. That was one of the reasons it failed to gain real traction. It wasn’t needed, and was suboptimal in terms of overhead and other characteristics.

Now, however, it is different. There is no obvious good way to design new architectures that satisfy the SLOW requirements for efficiency. Multicore/manycore is a necessary act of desperation in response to flatlining of clock rates due to power considerations. Our flirtation with GPUs is a recognition that alternative structures can exhibit superior performance and efficiency for certain data flow execution patterns. But neither is derived around the needs of full system structures, semantics, and operational properties; they are still focused on the local node. We know many of the stress factors implicit in Exascale in part because of studies such as the NSF Exascale Point Design Study, the DARPA Exascale studies (report here), and the DOE Exascale workshops. Even the International Exascale Software Project meetings (I am contributing to one right now in Tsukuba, Japan) are revealing needs and possible paths towards Exascale system realization.

The design of effective system-wide architectures (that is, a core architecture intended and devised to operate in cooperation with a billion other like cores) is not understood and does not fall under current architecture design practices. It is in the derivation of such a new family of core architectures, the systems that comprise them, the system software that manages them, and the programming models that apply them to real-world problems that a new execution model can serve the community pursuing Exascale at this time. It does so by providing a set of governing principles to guide the co-design of all system layers of a future class of Exascale systems, to be ready prior to the end of the next decade.

insideHPC: Who else is thinking about execution models for exascale? This isn’t a topic that’s come up in my other discussions with people who are thinking about exascale, but this might be my own selection bias toward thinking about things I already know about, hardware and programming models.

Sterling: No, you are right. This is not a common topic of discussion, which is strange because we all use models of computation. But routinely architects think they are architectures, programmers think they are programming models and languages, and OS designers think they are the operating system. In some sense they are all right, even as they are wrong in the sense of their limited perspective.

We can be faulted for living in the comfort of the era of “Pax MPI.” It has atrophied our awareness of the deeper overriding paradigm that guides us all. The Execution Model is the Machine; not as a virtual machine, but as the vertical cross-cutting model that binds all layers into a single operational domain. There are others with this perspective: Sarkar’s work at Rice, Gao’s work at Delaware, Kale’s work at Illinois, the Cilk work from MIT (although limited in scaling), and of course the venerable multi-threaded work from Smith and Callahan (both now at Microsoft). My apologies to those whom I failed to mention but should have. There is even some resistance to this philosophy although it has many decades of precedent. But one way or the other, either by planned and guided intent or by happenstance and random steps, a new model will emerge to drive the layers of structure and their interoperability comprising future systems. It has always happened when sufficient technology stress has catalyzed the essential paradigm shift, as is happening now with multi-core and GPU designs.

insideHPC: Thinking about a decade of development to get there, when do we need to have decided on an execution model if it is going to have an impact? How is such a decision made as a community?

Sterling: Our modest work is research and in that sense exploratory. It is not intended as an effort to sell an agenda or convince vendors. However, in fairness, in explaining our intellectual approach and its underlying precepts there has been an aspect of educating the community. But then, how many computer science students are taught about dataflow, vector architecture, or the SIMD model?

A generation of experience, primarily from the work of the 1980s, has largely been lost, and with it a broad perspective that may be important now as we face the ultimate challenges brought on by near nano-scale technologies within two design cycles of the present. Obviously my viewpoint is that now is the time to be thinking about and setting out on the path of a new model of computation. As in the previous phases of HPC (I think this is going to be Phase VI; see the timeline of execution models in the image), the basic ideas of an execution model were put forth early, and then many iterations and successive refinement occurred as experience nurtured convergence.

The community makes the cumulative decision through the experience of reference implementations and specific applications that benefit from them. Over time, more applications achieve advantage through them even as the model itself improves. This is done, quite possibly, in the healthy environment of a tension among multiple models and their respective implementations, each perhaps benefitting from the experiences of the others. Then, perhaps as in the case of MPI, a new community-wide programming model that takes advantage of an underlying model of computation is formalized, agreed upon, and employed. It’s not a perfect method but it has the virtue of bringing us to the next reality plateau. If my research has positive results, I hope it will receive the attention of the community. If it shows negative results, then I hope those experiences too will contribute to the alternative path that the field will ultimately take.

insideHPC wants to thank Dr. Sterling for carving out a slice of his time as he travelled around the world, trying to make sure that the exascale machines of a decade from now will have been worth the wait.