[comp.sys.isis] ISIS for a distributed queueing system?

bernhold@qtp.ufl.edu (David E. Bernholdt) (12/10/89)

We're installing MDQS as a batch queueing system on our network.  It
can handle submitting to queues on other machines via the network, but
only to specific machines.  We want to create a "distributed queueing
system" for the whole network where any host can submit a job to the
distributed queue, and jobs would be processed by the first available
compute server.

I've recently become aware of ISIS, and it appears to be quite capable
of something like this, but I wonder if this might be overkill?  I'm a
chemist, without too much experience in distributed programming as of
yet -- I would appreciate comments on the idea of using ISIS to
implement this distributed queue.  Is there anything simpler/better to
do it with? Has anyone done this kind of thing?

If it matters, we're presently running Suns, but that may change.
Odds are that everything we have will have NFS, RPC, and Yellow Pages,
at least.

Thanks.
-- 
David Bernholdt			bernhold@qtp.ufl.edu
Quantum Theory Project		bernhold@ufpine.bitnet
University of Florida
Gainesville, FL  32611		904/392 6365

rcbc@honir.cs.cornell.edu (Robert Cooper) (12/12/89)

In article <818@orange19.qtp.ufl.edu> bernhold@qtp.ufl.edu (David E.
Bernholdt) writes: 
   We want to create a "distributed queueing
   system" for the whole network where any host can submit a job to the
   distributed queue, and jobs would be processed by the first available
   compute server.

ISIS would be an ideal vehicle for such a job queueing system.
You appear to need reasonable reliability: jobs should still be
processed when some machines have crashed, and submitted jobs should
not be lost when failures occur. It is unlikely
that fast performance is a strong criterion, so a simple straightforward
use of ISIS would be suitable, especially since you are more interested in
doing Chemistry than programming. 

I would structure the application as three programs: Submit, Queue, and
Server. Submit would be the user command to submit a job. Queue would
reliably maintain the job queue.  Server would extract jobs from the Queue
for processing, and notify Queue when a job had been completed. More
details of each program follow.

The Queue program would maintain the queue of jobs. It would accept
"enqueue" messages from the Submit program to enqueue a new job, "status"
messages that would report on whether a previously submitted job had
completed, and a "cancel" message that cancelled a submitted but as yet
unprocessed job. The Queue program would accept "getwork" messages from
Server programs. A reply to this message would be sent back to the server
when a job was available. When a job completed, or failed, the Server
program would send a "done" message to the Queue containing a
success/failure code. Some other things would be needed to turn this into a
proper service. For instance, some way to start up and shut down the service
is needed.
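
To make the message set above concrete, here is a minimal single-process
sketch of the Queue program's bookkeeping. It is written in Python for
brevity rather than C or Fortran, and it makes no ISIS calls at all: the
class name JobQueue, the method names, and the job record layout are
assumptions chosen for illustration, not anything taken from ISIS or MDQS.

```python
from collections import deque


class JobQueue:
    """Hypothetical single-copy job queue for the five message types."""

    def __init__(self):
        self.next_id = 1
        self.pending = deque()       # ids of queued jobs, oldest first
        self.idle_servers = deque()  # servers with an unanswered "getwork"
        self.jobs = {}               # job id -> {"cmd", "state", "server"}

    def enqueue(self, cmd):
        """Handle "enqueue": record the job, then try to hand it out."""
        job_id = self.next_id
        self.next_id += 1
        self.jobs[job_id] = {"cmd": cmd, "state": "queued", "server": None}
        self.pending.append(job_id)
        return job_id, self._dispatch()

    def getwork(self, server):
        """Handle "getwork": the reply is deferred until a job exists."""
        self.idle_servers.append(server)
        return self._dispatch()

    def cancel(self, job_id):
        """Handle "cancel": only a job that has not started can be cancelled."""
        job = self.jobs.get(job_id)
        if job is None or job["state"] != "queued":
            return False
        self.pending.remove(job_id)
        job["state"] = "cancelled"
        return True

    def done(self, job_id, ok):
        """Handle "done": store the success/failure code from the Server."""
        self.jobs[job_id]["state"] = "succeeded" if ok else "failed"

    def status(self, job_id):
        """Handle "status": report the current state of a job."""
        job = self.jobs.get(job_id)
        return job["state"] if job else "unknown"

    def _dispatch(self):
        """Pair waiting jobs with waiting servers; return replies to send."""
        replies = []
        while self.pending and self.idle_servers:
            job_id = self.pending.popleft()
            server = self.idle_servers.popleft()
            self.jobs[job_id]["state"] = "running"
            self.jobs[job_id]["server"] = server
            replies.append((server, job_id))
        return replies
```

A Server that calls getwork() before any job exists gets an empty reply
list; the next enqueue() then returns a (server, job_id) pair for it, which
is the deferred "getwork" reply described above.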

As stated, the program is not reliable. It can be made reliable by
replicating the Queue program as an ISIS process group -- 3 members should
be sufficient. Very few changes to the three programs are required to do
this.  The simplest way is for the Submit and Server programs to multicast
all messages to the Queue group using the default atomic broadcast
protocol. Each Queue program would replicate the job queue data structure
and the atomic broadcast would ensure that the same sequence of updates
occurs at each replica. The ISIS state transfer mechanism can be used to
permit newly started up Queue programs to obtain a copy of the queue data
structures.  The Queue service would still be vulnerable to total failure,
i.e. simultaneous failure of all the Queue replicas. By using the logging
and recovery tool this eventuality can be survived as well. Again, only
minor changes to the basic message handling and queue data structures are
needed.
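
The replication argument rests on one property: if every replica applies
the same updates in the same order, and each update is deterministic, the
replicas stay identical. The sketch below illustrates only that property,
in Python and without any ISIS calls. The deliver() function is a local
stand-in for an ordered multicast, snapshot()/from_snapshot() stand in for
state transfer, and replay() stands in for recovery from a log; all of
these names are hypothetical, and "getwork" handling is left out to keep
it short.

```python
import copy


class QueueReplica:
    """Hypothetical replica holding one copy of the job table."""

    def __init__(self):
        self.next_id = 1
        self.jobs = {}  # job id -> state string

    def apply(self, update):
        """Apply one update; the result depends only on state and update."""
        kind = update[0]
        if kind == "enqueue":
            self.jobs[self.next_id] = "queued"
            self.next_id += 1
        elif kind == "cancel":
            if self.jobs.get(update[1]) == "queued":
                self.jobs[update[1]] = "cancelled"
        elif kind == "done":
            if update[1] in self.jobs:
                self.jobs[update[1]] = "succeeded" if update[2] else "failed"

    def snapshot(self):
        """Copy of the full state, as handed to a newly started replica."""
        return {"next_id": self.next_id, "jobs": copy.deepcopy(self.jobs)}

    @classmethod
    def from_snapshot(cls, snap):
        """Build a new replica from a snapshot (stand-in for state transfer)."""
        replica = cls()
        replica.next_id = snap["next_id"]
        replica.jobs = copy.deepcopy(snap["jobs"])
        return replica


def deliver(replicas, update, log):
    """Stand-in for ordered multicast: log once, apply at every replica."""
    log.append(update)
    for replica in replicas:
        replica.apply(update)


def replay(log):
    """Rebuild a replica after total failure by re-applying the log in order."""
    replica = QueueReplica()
    for update in log:
        replica.apply(update)
    return replica


log = []
group = [QueueReplica() for _ in range(3)]
for update in [("enqueue",), ("enqueue",), ("cancel", 2), ("done", 1, True)]:
    deliver(group, update, log)
```

After the loop, all three replicas hold the same job table. A replica built
with from_snapshot(group[0].snapshot()) or with replay(log) holds it too,
which is why a late joiner or a restart after total failure can rejoin
without the Submit and Server programs noticing.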

The resulting programs should total no more than perhaps 500 lines of C or
Fortran, of which about 100 would be calls to ISIS routines. Chapter One of
the ISIS Manual leads you through a slightly more complicated application,
and would be a good guide to implementing the job queue service.
The ISIS Manual also describes more sophisticated and efficient reliability
and distribution methods which you might consider too.

                              -- Robert Cooper (rcbc@cs.cornell.edu)