How are parallel jobs launched?

To launch a parallel job means to start a process on each compute node and then to link those processes together through the high-performance network, such as InfiniBand or Myrinet. Of course, this means that the processes must somehow know about each other to make the connection. One mechanism is to use a shared disk, if all of the compute nodes have access to a distributed file system. A more portable method is to rely on the management network (usually Ethernet) to communicate each process’s information during startup. This latter scenario implies that the computing system must rely on sockets even if DAPL will be the main communications library at runtime.
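As a rough illustration, the sketch below (Python, with a made-up JSON handshake) shows the kind of bootstrap exchange a launcher performs over the management network: a rendezvous process on the launch node collects each rank's fast-network endpoint over ordinary TCP sockets and broadcasts the complete table back, after which the ranks can connect to one another directly.

    # Minimal sketch of the bootstrap exchange (hypothetical message format):
    # a rendezvous process on the launch node collects each rank's
    # fast-network endpoint over plain TCP sockets, then sends the full table
    # back so the ranks can connect to one another over the fast interconnect.
    import json
    import socket

    def serve_rendezvous(port, nranks):
        """Run on the launch node: gather all endpoints, then broadcast them."""
        endpoints, conns = {}, []
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("", port))
            srv.listen(nranks)
            while len(conns) < nranks:
                conn, _ = srv.accept()
                info = json.loads(conn.recv(4096).decode())
                endpoints[info["rank"]] = info["endpoint"]  # e.g. an IB queue-pair id
                conns.append(conn)
            table = json.dumps(endpoints).encode()
            for conn in conns:          # every process receives the complete table
                conn.sendall(table)
                conn.close()

    def register(server, port, rank, endpoint):
        """Run in each launched process: report my endpoint, wait for the table."""
        with socket.create_connection((server, port)) as conn:
            conn.sendall(json.dumps({"rank": rank, "endpoint": endpoint}).encode())
            return json.loads(conn.recv(65536).decode())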

As for actually launching the jobs, the most straightforward way is to SSH into each compute node and run the program with whatever parameters are needed to find the other processes. While this task can be automated with a script, the launch will be pretty slow for a large number of compute nodes, as the SSH commands must be run one at a time from a single root process.
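A minimal sketch of that naive approach, assuming passwordless SSH; the hostnames, program path, and arguments are made up for illustration:

    # Serial SSH launch: one ssh invocation per node, issued one after another
    # from the launch node. Hostnames and program path are hypothetical.
    import subprocess

    nodes = ["node01", "node02", "node03", "node04"]   # assumed hostnames
    rendezvous = "headnode:7777"                       # assumed rendezvous address

    procs = []
    for rank, node in enumerate(nodes):
        procs.append(subprocess.Popen(
            ["ssh", node, "/opt/apps/solver",          # hypothetical program
             "--rank", str(rank),
             "--nprocs", str(len(nodes)),
             "--rendezvous", rendezvous]))

    for p in procs:                                    # wait for the whole job
        p.wait()

The processes run concurrently once started, but the launch itself is serial: each SSH connection has to be set up in turn, and that setup time is what hurts on a big cluster.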

A more sophisticated method of launching jobs is to have a user-level daemon run on each compute node. When the user wants to start a parallel job, the local daemon launches the process on the local node and then alerts the daemon on an arbitrary neighbor node. That daemon then launches the process and alerts its neighbor, and so on. This method is much quicker and is becoming the scheme of choice.
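The sketch below is a conceptual illustration of that propagation (it is not MPD's actual wire protocol): a small daemon on each node starts the local process when a launch request arrives, then forwards the request to the next node, so no single node has to contact every other node.

    # Conceptual sketch of daemon-based launch propagation. The port, message
    # format, and program arguments are assumptions for illustration only.
    import json
    import socket
    import subprocess

    DAEMON_PORT = 6000          # assumed port the per-node daemon listens on

    def launch_job(nodes, argv):
        """Run on the launch node: hand the request to the first daemon."""
        request = {"argv": argv, "nodes": nodes, "next": 0}
        with socket.create_connection((nodes[0], DAEMON_PORT)) as conn:
            conn.sendall(json.dumps(request).encode())

    def handle_launch(request):
        """Run inside a daemon when a launch request arrives."""
        rank, nodes = request["next"], request["nodes"]
        subprocess.Popen(request["argv"] + ["--rank", str(rank)])  # local process
        if rank + 1 < len(nodes):                      # pass the request along
            request["next"] = rank + 1
            with socket.create_connection((nodes[rank + 1], DAEMON_PORT)) as conn:
                conn.sendall(json.dumps(request).encode())

    def daemon_loop():
        """The per-node daemon: accept launch requests forever."""
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("", DAEMON_PORT))
            srv.listen(8)
            while True:
                conn, _ = srv.accept()
                with conn:
                    handle_launch(json.loads(conn.recv(65536).decode()))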

The Multipurpose Daemon (MPD) from MPICH2 provides this functionality, not just for MPI but for parallel jobs in general. It has not been widely pushed, however, and appears to see little development outside Argonne. The other major MPI implementation, Open MPI, has spun off a project to provide more widespread parallel job management: the Open Run-Time Environment (OpenRTE). There are plenty of other job managers that are not tied to a particular MPI implementation, such as BProc, Globus, and Condor.

As for when to launch what, small systems can allow interactive loading via MPD or OpenRTE. Large-scale clusters, however, will need a job queue for better management. The most common is the Portable Batch System (PBS), available as OpenPBS or TORQUE. (The commercial PBS Pro boasts more functionality.)
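For batch use, the job is usually described by a short PBS script handed to qsub. The sketch below, in Python for consistency with the earlier examples, generates and submits such a script; the job name, resource request, and program are placeholders rather than a recommended configuration.

    # Illustrative batch submission to OpenPBS/TORQUE. The resource request,
    # job name, and program are placeholders.
    import subprocess

    pbs_script = """#!/bin/sh
    #PBS -N solver_test
    #PBS -l nodes=4:ppn=2
    #PBS -l walltime=00:30:00
    cd $PBS_O_WORKDIR
    mpiexec -n 8 ./solver
    """

    # qsub reads the job script from standard input when no file is given and
    # prints the id of the queued job.
    job_id = subprocess.run(["qsub"], input=pbs_script, capture_output=True,
                            text=True, check=True).stdout.strip()
    print("submitted job", job_id)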

All in all, parallel job launch is a pretty difficult task to do correctly. Ideally the launcher should be scalable, fault tolerant, easy to use, and so on. Building one from scratch for a new communications library is a fair bit of extra effort, so MPD and OpenRTE definitely have their advantages. Many smaller libraries don’t even bother using these environments directly and instead merely call MPI to “bootstrap.” Whatever the communications environment of the future looks like, parallel job launch will always be a difficult part of the process.