bernhold@qtp.ufl.edu (David E. Bernholdt) (12/10/89)
We're installing MDQS as a batch queueing system on our network. It can handle submitting to queues on other machines via the network, but only to specific machines. We want to create a "distributed queueing system" for the whole network where any host can submit a job to the distributed queue, and jobs would be processed by the first available compute server.

I've recently become aware of ISIS, and it appears to be quite capable of something like this, but I wonder if this might be overkill? I'm a chemist, without too much experience in distributed programming as of yet -- I would appreciate comments on the idea of using ISIS to implement this distributed queue. Is there anything simpler/better to do it with? Has anyone done this kind of thing?

If it matters, we're presently running Suns, but that may change. Odds are that everything we have will have NFS, RPC, and Yellow Pages, at least.

Thanks.

--
David Bernholdt                 bernhold@qtp.ufl.edu
Quantum Theory Project          bernhold@ufpine.bitnet
University of Florida
Gainesville, FL 32611           904/392 6365
rcbc@honir.cs.cornell.edu (Robert Cooper) (12/12/89)
In article <818@orange19.qtp.ufl.edu> bernhold@qtp.ufl.edu (David E. Bernholdt) writes:

We want to create a "distributed queueing system" for the whole network where any host can submit a job to the distributed queue, and jobs would be processed by the first available compute server.

ISIS would be an ideal vehicle for such a job queueing system. You appear to need reasonable reliability: jobs should still be processed when some machines have crashed, and submitted jobs should not be lost when failures occur. It is unlikely that fast performance is a strong criterion, so a simple, straightforward use of ISIS would be suitable, especially since you are more interested in doing chemistry than programming.

I would structure the application as three programs: Submit, Queue, and Server. Submit would be the user command to submit a job. Queue would reliably maintain the job queue. Server would extract jobs from the Queue for processing, and notify Queue when a job had been completed. More details of each program follow.

The Queue program would maintain the queue of jobs. It would accept "enqueue" messages from the Submit program to enqueue a new job, "status" messages that report on whether a previously submitted job has completed, and "cancel" messages that cancel a submitted but as yet unprocessed job. The Queue program would also accept "getwork" messages from Server programs. A reply to this message would be sent back to the server when a job was available. When a job completed, or failed, the Server program would send a "done" message to the Queue containing a success/failure code.

Some other things would be needed to turn this into a proper service. For instance, some way to start up and shut down the service is needed.

As stated, the program is not reliable. It can be made reliable by replicating the Queue program as an ISIS process group -- 3 members should be sufficient.
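The Queue message handling described above (enqueue, status, cancel, getwork, done) could be sketched roughly as follows. This is a single-process, in-memory illustration in Python, not ISIS code: the class JobQueue, its method names, and the state strings are all hypothetical, and message transport, replication, and logging are left out entirely.

```python
from collections import deque


class JobQueue:
    """Hypothetical in-memory sketch of the Queue program's message handlers."""

    def __init__(self):
        self.next_id = 1
        self.jobs = {}          # job id -> {"cmd", "state", "server", "code"}
        self.pending = deque()  # ids of queued jobs, oldest first
        self.idle = deque()     # servers whose "getwork" is still unanswered

    def enqueue(self, cmd):
        """'enqueue' from Submit: record the job; return (job id, replies)."""
        job_id = self.next_id
        self.next_id += 1
        self.jobs[job_id] = {"cmd": cmd, "state": "queued",
                             "server": None, "code": None}
        self.pending.append(job_id)
        return job_id, self._dispatch()

    def getwork(self, server):
        """'getwork' from a Server: answered only once a job is available."""
        self.idle.append(server)
        return self._dispatch()

    def status(self, job_id):
        """'status' from Submit: report the state of a submitted job."""
        job = self.jobs.get(job_id)
        return job["state"] if job else "unknown"

    def cancel(self, job_id):
        """'cancel' from Submit: only jobs not yet handed to a server."""
        job = self.jobs.get(job_id)
        if job is None or job["state"] != "queued":
            return False
        self.pending.remove(job_id)
        job["state"] = "cancelled"
        return True

    def done(self, job_id, code):
        """'done' from a Server: store the success/failure code (0 = success)."""
        job = self.jobs.get(job_id)
        if job is not None:
            job["state"] = "done" if code == 0 else "failed"
            job["code"] = code

    def _dispatch(self):
        """Pair waiting servers with queued jobs, oldest first.

        Returns the replies to send, as (server, job id, cmd) tuples.
        """
        replies = []
        while self.pending and self.idle:
            job_id = self.pending.popleft()
            server = self.idle.popleft()
            job = self.jobs[job_id]
            job["state"], job["server"] = "running", server
            replies.append((server, job_id, job["cmd"]))
        return replies
```

In this sketch a getwork that arrives before any job exists simply parks the server in idle; the next enqueue then returns a reply addressed to that server. Keeping every handler deterministic (no clocks, no random choices) is a deliberate choice, since that is what lets identical copies of the structure stay in step when they are fed the same messages in the same order.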
Very few changes to the three programs are required to replicate the Queue program. The simplest way is for the Submit and Server programs to multicast all messages to the Queue group using the default atomic broadcast protocol. Each Queue program would hold a replica of the job queue data structure, and the atomic broadcast would ensure that the same sequence of updates occurs at each replica. The ISIS state transfer mechanism can be used to permit newly started Queue programs to obtain a copy of the queue data structures.

The Queue service would still be vulnerable to total failure, i.e. simultaneous failure of all the Queue replicas. By using the logging and recovery tool this eventuality can be survived as well. Again, only minor changes to the basic message handling and queue data structures are needed.

The resulting programs should total no more than perhaps 500 lines of C or Fortran, of which about 100 would be calls to ISIS routines. Chapter One of the ISIS Manual leads you through a slightly more complicated application, and would be a good guide to implementing the job queue service. The ISIS Manual also describes more sophisticated and efficient reliability and distribution methods which you might consider too.

-- Robert Cooper (rcbc@cs.cornell.edu)
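The replication argument in this reply rests on every Queue replica applying the same sequence of updates. A minimal sketch of that idea, in plain Python rather than ISIS: apply_msg, deliver, and the dictionary layout are hypothetical names chosen for illustration, the loop in deliver merely stands in for an ordered multicast, and copy.deepcopy stands in for state transfer to a newly started replica.

```python
import copy


def new_state():
    """Empty replica state (hypothetical layout)."""
    return {"queue": [], "idle": [], "running": {}, "finished": {}}


def apply_msg(state, msg):
    """Apply one message deterministically to one replica's state."""
    kind = msg[0]
    if kind == "enqueue":
        state["queue"].append(msg[1])
    elif kind == "getwork":
        state["idle"].append(msg[1])
    elif kind == "done":
        state["running"].pop(msg[1], None)
        state["finished"][msg[1]] = msg[2]
    # Hand queued jobs to waiting servers, oldest first.
    while state["queue"] and state["idle"]:
        state["running"][state["queue"].pop(0)] = state["idle"].pop(0)


def deliver(replicas, msg):
    """Stand-in for ordered multicast: each replica sees msg at the same
    position in its message sequence."""
    for replica in replicas:
        apply_msg(replica, msg)


def run_demo():
    replicas = [new_state() for _ in range(3)]
    log = [("enqueue", "job-a"), ("getwork", "sun1"),
           ("enqueue", "job-b"), ("done", "job-a", 0)]
    for msg in log[:2]:
        deliver(replicas, msg)
    # One replica is restarted: it obtains its state by copying a live one.
    replicas[2] = copy.deepcopy(replicas[0])
    for msg in log[2:]:
        deliver(replicas, msg)
    return replicas
```

Because apply_msg depends only on the current state and the message, the three replicas returned by run_demo compare equal, including the one that joined partway through. Surviving the loss of all replicas at once would additionally need the state or message log written to disk, which this sketch does not attempt.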