ken@gvax.cs.cornell.edu (Ken Birman) (07/25/90)
I get a lot of questions about using ISIS to build a scheduler that would farm out work of some sort to idle machines on a big network. My group is probably going to build a service for this but if you need something urgently, it isn't hard to cobble together a workable solution with surprisingly little code. Here's a note I mailed to someone whose network includes 600 workstations, many of which are idle at any given instant. The approach to use is roughly this: MOTHER MACHINES: 4-6 OF THE MACHINES ON YOUR NET We run ISIS on a few machines, say 4-6 or so of them, set up to fail independently (not on the same disks, NFS units, etc). On this set of machines we run an ISIS server that I will call the SCHEDULER. It maintains a task pool and a cycle-server pool. On one side it accepts new simulation tasks and on the other it chats with CYCLE SERVERS. Each mother machine runs one copy of the SCHEDULER program, which form a process group and wait for work requests. (See below) CYCLE SERVERS: THE REMAINING 600 MACHINES We modify /etc/rc to start a program called CYCLE-MASTER. The life of CYCLE-MASTER is as follows: it forks off and runs a task called CYCLE_SERVER and waits for it to die. If it dies, it runs it again. This code would all be part of a single program, even though it has two modes of execution. CYCLE-SERVER should monitor system idle time or something. When the machine has been idle for a long enough period of time it becomes active. Once active it will run simulation subtasks until the owner of the machine comes back and then it will exit (causing CYCLE-MASTER to spin off a new CYCLE-SERVER that will resume monitoring). I can think of a number of criteria that could be used to detect idle machines. When active, a CYCLE_SERVER uses isis_remote to connect to one of the mother machines. Set it up to pick from the list of 6 randomly. If the connect fails, it should try another mother machine until it finds one. Once connected it uses pg_lookup to lookup the SCHEDULER process group. Because there will be hundreds of CYCLE-SERVERS you can't use pg_client in the present version of ISIS, but starting in fall you'll do so and everything will be by BYPASS communication. Anyhow, having looked up the group it sends a message to register itself as ready to run tasks. It will now periodically receive messages from the SCHEDULER asking it to do simulation tasks. When it finishes them, it sends back a message indicating that it is done and idle again. Don't use RPC for this (the memory associated with having hundreds of active subtasks could get to be a problem). Just use asynchronous IPC. SCHEDULER GROUP: The scheduler group receives register messages and adds the server to its processor pool. It also receives new simulation task descriptions and adds these to a "pending task pool". Its scheduler subtask works by picking a task from the task pool and assigning it to an idle cycle server. It then marks the server as busy and sends it the message that fires off the simulation job. The server would use replicated data to maintain the information about who is doing what. The solution is like the one I wrote up in the book on Distributed Computing. It isn't hard but if there is a slightly tricky aspect in this whole solution, this would be it... The issue is to make sure that all the scheduler processes can deduce which task went to which server. One example of a way to solve this is to use ABCAST to send in the registration requests for cycle servers and for new tasks, in which case all schedulers can run the same algorithm to figure out who does what, and the mother scheduler for a specific server can be responsible for interacting with it. A CBCAST solution is only slightly harder; you use a token holder who decides who should do each task and periodically sends out messages telling the others what it decided. The mother-machines' SCHEDULER uses proc_watch to watch its child SERVERS and broadcasts a message to drop them from its server pool if they fail or drop out because the owner of the workstation has returned. FRONT END: A front-end program would maintain an interactive status display and support a simple command set associated with starting and cancelling jobs. If you want to get fancy, it could easily handle resource limits too, so that infinite loops can be detected and the program killed if necessary. COMMENTS: Some comments: notice that the cycle servers are not members of a process group. They just run stand-alone. The only process group is the scheduler group, and this shouldn't be too big (4-6 is a good size). Also, note that most of ISIS is confined to the 6 mother machines. The other 600 all use isis_remote to connect in and don't have anything special on them except our "cycle server" program, which is spun off by /etc/init. If your system is heterogeneous there are some things to keep in mind: first, the scheduler will need to know enough about the configuration needs of a job to make sureit places it on an appropriate machine. Also, there may be issues of file placement and access to other sorts of resources that get tricky here. Oh, and one final caution: isis_remote has a bug in V2.0 and won't run between machines with incompatible byte ordering. This is fixed in V2.1. Whether or not you compile with BYPASS communication the above will run quite fast. Later this summer when pg_client runs in bypass mode, you can just add one of these calls, recompile, benefit from it immediately. Further reading: the CONDOR system from Wisconsin does something like this but has a very fancy (and costly) approach to IO; I wouldn't do what they do, because I believe in risc software. AT&T has a system called TUXEDO for this sort of thing and HP and SUN probably have products too. No idea how good these products are. If I were building this (and I know ISIS very well) I would estimate that the cycle server program should require about 1-5 pages of C code to develop and the job scheduler another 4 or 5 pages of code. I could get this written and running in a few days, maximum. If you want to do a really spiffy job with heterogeneous resources, a fancy display and so forth, of course, this could easily stretch into a much harder job. We'll probably include something like this in the ISIS demos in the future. If someone out there builds one, please consider letting us know. Ken