Interview: Victor Eijkhout on an Integrative Model for Parallelism


The Integrative Model for Parallelism at TACC is a new development in parallel programming. It allows for high level expression of parallel algorithms, giving efficient execution in multiple parallelism modes. We caught up with its creator, Victor Eijkhout, to learn more.

insideHPC: What is wrong with parallel programming today?

Victor Eijkhout, TACC

Victor Eijkhout: I see several problems with the state of parallel programming.

For starters, we have too many different programming models, such as threading, message passing, and SIMD or SIMT processing. This can be explained by the fact that each model corresponds to a completely different type of hardware parallelism, but it leads to codes that are somewhat hardware specific, and therefore hard to port to new systems.

In the worst case you may wind up with more than one mode of parallelism in the same code. That is hard to code, it is hard to find the right balance for performance, and you may need to rewrite that code in a couple of years as new hardware types come on the market.

So there is a question whether it is possible to abstract away from the specific mode of parallelism.

Next, it bothers me that these programming models are conceptually just a small abstraction from the hardware, and not necessarily close to the level where a domain scientist would want to express the algorithm. General systems to do this have had limited success, mostly from a performance point of view.

On the other hand, domain-specific packages like PETSc have high level abstractions that still give you high performance.

This means that it _is_ possible to have high level expressions of algorithms, but the question is how general that can be made.

And finally, if I wear my theoretician’s hat, I don’t like that there is very little theory underlying the type of parallel computing that happens in HPC. For instance, most theories of parallelism are about interactions between independent processes. This does not capture that in our SPMD model the processes are in a sense identical, and it certainly does not capture that the processes are really working together on something macroscopic like a matrix-vector product.

insideHPC: How do you propose to address these problems?

Victor Eijkhout: Many of the above problems can be solved if we take a couple of steps back and rethink how we express parallelism.

Let me return to the thought that we really want a “language” that can express algorithmic concepts without having to spell out the parallelism aspects. In particular, what you really want is a language with “sequential semantics”: the user should only have to write a sequential program that says “matrix times vector, inner product of the output with another vector”. The parallelism is not in those statements; it comes from the fact that the matrix and the vectors are distributed objects. Unfortunately, systems that tried this have not found mainstream success for a number of reasons, one being that compilers and middleware just can’t derive messages and task dependencies all on their own. The programmer clearly has to specify something about the parallelism. The question is: how little is enough?
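As a concrete picture of what such sequential semantics might look like, here is a minimal C++ sketch. The names (dist_vector, dist_matrix, matvec, inner_product) are invented for illustration and plain in-memory data stands in for the distributed objects; this is not the actual IMP API, only the flavor of a program whose source never mentions processes or messages.

```cpp
// Illustrative sketch only: the types and routines below are hypothetical,
// not the IMP interface. The point is that main() reads as a sequential
// program; in a distributed run the objects would be spread over processes.
#include <cstdio>
#include <vector>

using dist_vector = std::vector<double>;               // stand-in for a distributed vector
using dist_matrix = std::vector<std::vector<double>>;  // stand-in for a distributed matrix

// y = A*x : in a distributed setting the data dependencies would be derived
// from the object distributions, not written by the user.
dist_vector matvec(const dist_matrix &A, const dist_vector &x) {
  dist_vector y(A.size(), 0.0);
  for (size_t i = 0; i < A.size(); ++i)
    for (size_t j = 0; j < x.size(); ++j)
      y[i] += A[i][j] * x[j];
  return y;
}

// Global inner product: in a distributed run this would imply a reduction.
double inner_product(const dist_vector &y, const dist_vector &z) {
  double sum = 0.0;
  for (size_t i = 0; i < y.size(); ++i) sum += y[i] * z[i];
  return sum;
}

int main() {
  dist_matrix A = {{2, 0}, {0, 3}};
  dist_vector x = {1, 1}, z = {1, 2};
  dist_vector y = matvec(A, x);         // "matrix times vector"
  double alpha = inner_product(y, z);   // "inner product of the output with another vector"
  std::printf("alpha = %g\n", alpha);   // no ranks, no messages anywhere in the source
  return 0;
}
```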

My contribution addresses this question for “data parallel operations”: operations where each component of a distributed output object is independently computable from certain components of a distributed input object. (This covers much of linear algebra, mesh operations in Finite Elements, many graph and machine learning algorithms, et cetera.) The somewhat remarkable fact is then that specifying two things about an algorithm is enough to derive MPI messages, task dependencies, and such.

Firstly you need to specify the distribution of an object, which I define slightly differently from usual. Secondly you need the “signature” of the function you’re computing, which is something that is completely independent of the parallelism! And then everything falls into place, as should be obvious from the concepts, but I wrote some code to demonstrate it anyway….
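To see why those two specifications can suffice, here is a small self-contained sketch. It is an illustration of the idea rather than IMP code: I assume a block distribution of N elements over P processes and a three-point stencil signature (output i depends on inputs i-1, i, i+1). The elements each process needs but does not own, and hence the messages, then follow mechanically.

```cpp
// Illustration of the idea, not IMP code: given only a distribution and a
// stencil signature, the boundary elements each process must receive from
// its neighbours -- i.e. the messages -- can be computed automatically.
#include <algorithm>
#include <cstdio>

int main() {
  const int N = 16, P = 4;  // global problem size, number of processes (example values)
  for (int p = 0; p < P; ++p) {
    // Block distribution: process p owns indices [lo, hi).
    int lo = p * N / P, hi = (p + 1) * N / P;
    // Signature applied to the owned output block: inputs [lo-1, hi+1) are
    // needed, clipped to the global domain.
    int need_lo = std::max(lo - 1, 0), need_hi = std::min(hi + 1, N);
    // Whatever is needed but not owned must arrive as a message.
    if (need_lo < lo)
      std::printf("process %d receives index %d from process %d\n", p, need_lo, p - 1);
    if (need_hi > hi)
      std::printf("process %d receives index %d from process %d\n", p, hi, p + 1);
  }
  return 0;
}
```

Running this prints, for every process, exactly which boundary indices must arrive from its neighbors; nothing about communication was specified beyond the distribution and the signature.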

If you realize that both task dependencies and messages are really the dependency arcs in a dataflow formulation, you now have an intermediate representation, automatically derived, that can be interpreted in multiple parallelism modes.
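A toy rendering of that observation, with a struct invented purely for illustration (not IMP’s internal representation): a single dependency arc can be read either as a message between processes or as an ordering constraint between shared-memory tasks.

```cpp
// Hypothetical sketch of the intermediate representation idea: an arc in the
// dataflow graph records only who produces a piece of data and who consumes
// it. Realizing the arc is a backend choice: between processes it becomes a
// send/receive pair; between shared-memory tasks it becomes a task-graph
// ordering constraint. The user program is unchanged either way.
#include <cstdio>
#include <vector>

struct arc {
  int producer_task, consumer_task;  // endpoints in the dataflow graph
  int first_index, last_index;       // data range carried by the arc
};

int main() {
  std::vector<arc> arcs = { {0, 1, 3, 4}, {2, 1, 8, 8} };  // example arcs
  for (const arc &a : arcs) {
    // Distributed-memory interpretation: a message.
    std::printf("MPI: task %d sends indices [%d,%d] to task %d\n",
                a.producer_task, a.first_index, a.last_index, a.consumer_task);
    // Shared-memory interpretation: a task ordering constraint.
    std::printf("tasks: task %d must finish before task %d starts\n",
                a.producer_task, a.consumer_task);
  }
  return 0;
}
```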

insideHPC: What is an Integrative Model for Parallelism and what prompted you to develop it?

Victor Eijkhout: The Integrative Model for Parallelism is my formalization of these ideas, and their implementation as a C++ library. My design for the IMP concepts has gradually materialized over the past four or so years. Basically, it started by seeing some ideas that I disagreed with, followed by my trying to formulate what I would find an elegant solution.

There are three levels to the IMP model.

At the top is an API of concepts that allow for programming with distributed objects, that is, in sequential semantics. I hope that this API is sufficiently expressive for a range of HPC algorithms, but that is something I need to prove by demonstration.

Then there is the intermediate dataflow representation of the algorithm, which is derived from the API calls, but not programmed explicitly.

Finally, that intermediate representation is interpreted in terms of the mechanisms of some underlying parallel programming system such as MPI.

The remarkable thing is that, because the intermediate representation is independent of the parallelism mode, the same source code can actually be translated to message passing, or a task graph model in shared memory, or hybrid combinations of these.

insideHPC: What are the next steps?

Victor Eijkhout: I have some prototype software. I’ve used it to realize some examples, for instance a sparse matrix conjugate gradient algorithm, and that code performs (as both MPI and OpenMP tasks) with about the right performance characteristics. Right now I’m working on the Lulesh proxy-app; its gather and scatter operations make a good test of how expressive the IMP vocabulary is.

Next I want to show that hybrid shared/distributed memory computing is possible from the exact same user program. Fancy things that are typically hard to do, such as load migration, communication avoidance, and 1.5D and 2.5D algorithms, should all be possible in IMP, but the software does not exist yet.

And I have a bunch more research questions that need to be tackled. Co-design is another research direction: since the IMP model keeps track of where data is, hardware can be simplified to lose out-of-order processing, coherence, and transparent caches. This should give considerable energy savings, even at the current state of technology.

In practical terms, a next step is finding a collaboration to develop IMP into a production-level system for some specific application.

insideHPC: How can our readers learn more?

Victor Eijkhout: The best source for information is the project web page: http://pages.tacc.utexas.edu/~eijkhout/imp/ which has links to some reports to get the reader started, as well as the code repository.

Victor Eijkhout is the author of “Introduction to Scientific and Technical Computing” which is free to download and also available to purchase in hard copy.


Comments

  1. We developed such a programming model (namely, partitioned global workflow) here at ETH Zurich in 2011. It allows one to write distributed parallel applications while staying within the serial paradigm, just stating where the code parts should be executed (with the rest, including MPI transfers and threading, inferred from the dataflow graph).

    You might check bindcxx.org for the implementation details (and linear algebra framework based on this model at ambientcxx.org).

    • Victor Eijkhout says

      Your library has a different concept of sequential. You seem to be working with a sequential program that explicitly spawns tasks/threads on other cores/nodes. In my model, the user program never talks about processes or threads or nodes: it operates on global objects.

      • It’s not exactly true. The shared-memory parallelism is completely implicit and is inferred from the DAG of operations that is collected during program execution. What you see are simply function decorators that enable the framework’s functionality. For example, in the Ambient version there are no decorators, but one has to “register” the kernels.

        What is indeed explicit are the scopes. In the beginning we were trying to schedule this DAG completely automatically across the MPI processes, but after some experiments we decided to make the scope marking semi-explicit (acting on the C++ scope) due to the high cost of suboptimal scheduling. (In the Ambient version, if the scope is not specified, it checks the affinity of an object to perform the automatic scheduling.)

        And finally, in our case the objects are actually global, with every process having knowledge of every object’s state in order to perform automatic data transfer in case of a DAG dependency…