[comp.sys.isis] Roll-your-own supercomputer

ken@gvax.cs.cornell.edu (Ken Birman) (07/25/90)

I get a lot of questions about using ISIS to build a scheduler that
would farm out work of some sort to idle machines on a big network.
My group is probably going to build a service for this but if you need
something urgently, it isn't hard to cobble together a workable solution
with surprisingly little code.

Here's a note I mailed to someone whose network includes 600 workstations,
many of which are idle at any given instant.  

The approach to use is roughly this:

MOTHER MACHINES: 4-6 OF THE MACHINES ON YOUR NET
  We run ISIS on a few machines, say 4-6 or so of them, set up to
  fail independently (not on the same disks, NFS units, etc).  On
  this set of machines we run an ISIS server that I will call the SCHEDULER.
  It maintains a task pool and a cycle-server pool.  On one side it accepts
  new simulation tasks and on the other it chats with CYCLE SERVERS.

  Each mother machine runs one copy of the SCHEDULER program; together these
  copies form a process group and wait for work requests.  (See below)
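
  As a very rough sketch, each SCHEDULER copy might be structured as below.
  (The calls -- isis_init, isis_entry, pg_join, isis_mainloop -- are written
  from memory, and the entry numbers, handler names and group name are made
  up for the example; check the manual before trusting any of the details.)

      #include "isis.h"

      #define REGISTER   1        /* entry: a cycle server signs up        */
      #define NEW_TASK   2        /* entry: a new simulation task arrives  */
      #define TASK_DONE  3        /* entry: a cycle server finished a task */

      address *sched_gaddr;

      void reg_handler(), task_handler(), done_handler();

      void
      scheduler_main()
      {
          /* Join (or create) the scheduler group; one copy of this
             program runs on each mother machine.                     */
          sched_gaddr = pg_join("scheduler", 0);
          /* From here on the entry handlers do all the work.         */
      }

      main()
      {
          isis_init(0);                       /* attach to local ISIS */
          isis_entry(REGISTER,  reg_handler,  "reg_handler");
          isis_entry(NEW_TASK,  task_handler, "task_handler");
          isis_entry(TASK_DONE, done_handler, "done_handler");
          isis_mainloop(scheduler_main, 0);   /* never returns        */
      }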

CYCLE SERVERS: THE REMAINING 600 MACHINES
  We modify /etc/rc to start a program called CYCLE-MASTER.  The life of
  CYCLE-MASTER is as follows: it forks off and runs a task called
  CYCLE-SERVER and waits for it to die.  When it dies, it starts another one.
  This code would all be part of a single program, even though it has
  two modes of execution.
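
  The CYCLE-MASTER half is just a fork-and-wait loop; a minimal sketch in
  plain C (cycle_server() stands for the monitoring/worker code described
  next):

      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      extern void cycle_server();       /* the CYCLE-SERVER mode, below */

      void
      cycle_master()
      {
          for (;;) {
              pid_t pid = fork();
              if (pid == 0) {           /* child: become a CYCLE-SERVER */
                  cycle_server();
                  _exit(0);
              }
              if (pid > 0)              /* parent: wait for it to die   */
                  waitpid(pid, (int *)0, 0);
              sleep(30);                /* ... then spin off a new one  */
          }
      }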

  CYCLE-SERVER should monitor system idle time (console or keyboard
  inactivity, load average, and so on).  When the machine has been idle for
  a long enough period of time it becomes active.
  Once active it will run simulation subtasks until the owner of the machine
  comes back and then it will exit (causing CYCLE-MASTER to spin off a
  new CYCLE-SERVER that will resume monitoring).  I can think of a number of
  criteria that could be used to detect idle machines; one simple possibility
  is sketched below.
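
  Here is one crude test, based on how recently anyone touched the console
  devices (the device names are assumptions for this sketch and vary from
  system to system; load average or rwho data would work just as well):

      #include <sys/types.h>
      #include <sys/stat.h>
      #include <time.h>

      #define IDLE_SECS  (15*60)        /* owner gone for 15 minutes */

      /* Returns 1 if none of the console devices has been touched for
         IDLE_SECS seconds.                                           */
      int
      machine_is_idle()
      {
          static char *devs[] = { "/dev/console", "/dev/kbd", "/dev/mouse", 0 };
          struct stat st;
          time_t now = time((time_t *)0);
          int i;

          for (i = 0; devs[i]; i++)
              if (stat(devs[i], &st) == 0 && now - st.st_atime < IDLE_SECS)
                  return 0;             /* recent activity: not idle */
          return 1;
      }

  CYCLE-SERVER would poll something like this every minute or so: it goes
  active when the test starts succeeding and exits as soon as it fails again.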

  When active, a CYCLE-SERVER uses isis_remote to connect to one of the mother
  machines.  Set it up to pick randomly from the list of mother machines.  If
  the connect fails, it should try another one until it finds a machine that
  answers.
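
  The connect logic might look something like this; try_connect() is a
  stand-in for the actual isis_remote call (whose arguments I won't trust to
  memory here), and the host names are obviously placeholders:

      #include <stdlib.h>
      #include <unistd.h>

      static char *mothers[] = { "mother1", "mother2", "mother3",
                                 "mother4", "mother5", "mother6" };
      #define NMOTHERS  (sizeof mothers / sizeof mothers[0])

      extern int try_connect();   /* wraps isis_remote(); -1 on failure */

      /* Start at a random mother machine (assume srand() was called at
         startup) and keep going around the list until some connection
         succeeds.                                                     */
      void
      connect_to_scheduler()
      {
          int start = rand() % NMOTHERS, i;

          for (;;) {
              for (i = 0; i < NMOTHERS; i++)
                  if (try_connect(mothers[(start + i) % NMOTHERS]) != -1)
                      return;
              sleep(60);          /* all of them down?  wait and retry */
          }
      }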

  Once connected it uses pg_lookup to look up the SCHEDULER process group.
  Because there will be hundreds of CYCLE-SERVERS you can't use pg_client
  in the present version of ISIS, but starting in the fall you will be able
  to, and everything will then go by BYPASS communication.  Anyhow, having looked up
  the group it sends a message to register itself as ready to run tasks.

  It will now periodically receive messages from the SCHEDULER asking it 
  to do simulation tasks.  When it finishes them, it sends back a message
  indicating that it is done and idle again.  Don't use RPC for this (the
  memory associated with having hundreds of active subtasks could get
  to be a problem).  Just use asynchronous IPC.
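
  In ISIS terms both notifications are just asynchronous broadcasts to the
  scheduler group, roughly as below (entry numbers as in the scheduler sketch
  above; my_hostname and task_id are whatever the server uses to identify
  itself and the job, and the cbcast argument order and format strings are
  from memory, so treat this as a sketch):

      /* Register as an available cycle server.  The trailing 0 asks
         for no replies, so the broadcast is asynchronous -- which is
         what keeps memory use down with hundreds of servers.         */
      cbcast(sched_gaddr, REGISTER, "%s", my_hostname, 0);

      /* ... later, when a simulation subtask finishes ...            */
      cbcast(sched_gaddr, TASK_DONE, "%s,%d", my_hostname, task_id, 0);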

SCHEDULER GROUP:
  The scheduler group receives register messages and adds the server to its
  cycle-server pool.  It also receives new simulation task descriptions and
  adds these to a "pending task pool".  Its scheduler subtask works by picking
  a task from the task pool and assigning it to an idle cycle server.
  It then marks the server as busy and sends it the message that fires off
  the simulation job. 
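
  In data-structure terms the scheduler state is just two tables and a
  matching loop; something like the following plain C (start_task() stands
  for whatever sends the "go" message to a cycle server):

      #define MAXSERVERS  1000
      #define MAXTASKS    1000

      struct server {
          char host[64];
          int  busy;                    /* currently running a task?   */
          int  task;                    /* which one, or -1            */
      } servers[MAXSERVERS];
      int nservers;

      struct task {
          int  id;
          char cmd[256];                /* what to run                 */
          int  assigned;                /* handed to a server yet?     */
      } tasks[MAXTASKS];
      int ntasks;

      extern void start_task();

      /* Called whenever a server registers, a new task arrives or a
         task completes: match pending tasks with idle servers.       */
      void
      schedule()
      {
          int s, t;

          for (t = 0; t < ntasks; t++) {
              if (tasks[t].assigned)
                  continue;
              for (s = 0; s < nservers && servers[s].busy; s++)
                  ;
              if (s == nservers)
                  return;               /* no idle servers left        */
              servers[s].busy = 1;
              servers[s].task = t;
              tasks[t].assigned = 1;
              start_task(&servers[s], &tasks[t]);
          }
      }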

  The schedulers would use replicated data to maintain the information about
  who is doing what.  The solution is like the one I wrote up in the
  book on Distributed Computing.  It isn't hard but if there is a slightly
  tricky aspect in this whole solution, this would be it...  The issue
  is to make sure that all the scheduler processes can deduce which 
  task went to which server.  One example of a way to solve this is to
  use ABCAST to send in the registration requests for cycle servers and
  for new tasks, in which case all schedulers can run the same algorithm
  to figure out who does what, and the mother scheduler for a specific
  server can be responsible for interacting with it.  A CBCAST solution
  is only slightly harder; you use a token holder who decides who should do
  each task and periodically sends out messages telling the others what
  it decided.
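
  The reason the ABCAST version works is that every scheduler sees the
  registrations, task submissions and completions in the same total order, so
  identical code run at each copy reaches identical conclusions; only one copy
  then acts on each decision.  Roughly, reusing the tables from the sketch
  above (pick_idle_server, server_is_mine and start_task are hypothetical
  helpers):

      extern struct server *pick_idle_server();
      extern int server_is_mine();
      extern void start_task();

      /* Runs at every scheduler when an ABCAST-delivered NEW_TASK
         arrives.  Same delivery order everywhere, same tables, same
         deterministic choice -- so all copies agree on the outcome.  */
      void
      assign(struct task *t)
      {
          struct server *s = pick_idle_server();

          if (s == 0)
              return;                   /* task stays pending          */
          s->busy = 1;
          t->assigned = 1;

          /* Only the scheduler this server registered through (its
             "mother") actually contacts it; the others just update
             their replicated tables.                                  */
          if (server_is_mine(s))
              start_task(s, t);
      }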

  Each mother machine's SCHEDULER uses proc_watch to watch the CYCLE-SERVERS
  connected through it and broadcasts a message to drop them from the server
  pool if they fail or drop out because the owner of the workstation has
  returned.

FRONT END:
  A front-end program would maintain an interactive status display and
  support a simple command set associated with starting and cancelling jobs.
  If you want to get fancy, it could easily handle resource limits too,
  so that infinite loops can be detected and the program killed if necessary.
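
  A cheap way to approximate the infinite-loop case: let the job submission
  carry a CPU-time budget, and have the cycle server apply it with setrlimit
  just before exec'ing the simulation, e.g.:

      #include <sys/time.h>
      #include <sys/resource.h>

      /* In the cycle server's child process, just before exec'ing the
         job: the kernel kills the job if it exceeds its CPU budget.   */
      void
      apply_cpu_limit(long cpu_seconds)
      {
          struct rlimit rl;

          rl.rlim_cur = rl.rlim_max = cpu_seconds;
          setrlimit(RLIMIT_CPU, &rl);
      }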

COMMENTS:
  Some comments: notice that the cycle servers are not members of a 
  process group.  They just run stand-alone.  The only process group is
  the scheduler group, and this shouldn't be too big (4-6 is a good size).
  Also, note that most of ISIS is confined to the 6 mother machines.  The
  other 600 all use isis_remote to connect in and don't have anything
  special on them except our "cycle server" program, which is started
  from /etc/rc.

  If your system is heterogeneous there are some things to keep in mind:
  first, the scheduler will need to know enough about the configuration
  needs of a job to make sure it places it on an appropriate machine.  Also,
  there may be issues of file placement and access to other sorts of
  resources that get tricky here.  Oh, and one final caution: isis_remote
  has a bug in V2.0 and won't run between machines with incompatible byte
  ordering.  This is fixed in V2.1.

  Whether or not you compile with BYPASS communication, the above will run
  quite fast.  Later this summer, when pg_client runs in bypass mode, you
  can just add one of these calls, recompile, and benefit from it immediately.

  Further reading: the CONDOR system from Wisconsin does something like this
  but has a very fancy (and costly) approach to IO; I wouldn't do what they
  do, because I believe in RISC software.  AT&T has a system called TUXEDO
  for this sort of thing, and HP and SUN probably have products too.  I have
  no idea how good these products are.

If I were building this (and I know ISIS very well) I would estimate that
the cycle server program should require about 1-5 pages of C code to develop
and the job scheduler another 4 or 5 pages of code.  I could get this written
and running in a few days, maximum.  If you want to do a really spiffy job
with heterogeneous resources, a fancy display and so forth, of course, this
could easily stretch into a much harder job.  We'll probably include something
like this in the ISIS demos in the future.  If someone out there builds one,
please consider letting us know.

Ken