[comp.unix.internals] restarting processes

vixie@wrl.dec.com (Paul Vixie) (09/04/90)

I'd like to do this also.  But if your process has pipes open to other
processes, then those other processes would have to be restarted in the
same state if your process was to be restarted "correctly".  If you had
files open, those same files would have to be there when you restarted,
with the same contents.  If you had a physical device file open, the
results could be confusing (let's say someone else dismounts your tape
and mounts one of their own -- can you get your tape back to the same
"state" it was in when you restart your program?).  And of course, if
you had any network connections open, then all of this stickiness extends
to whatever processes you're talking to on (the) remote machine(s).

This kind of restartability wasn't on the UNIX designers' minds, and the
system call interface has absolutely no architectural support for it.
The thing you're trying to do is usually done at the application layer,
as in "commit" operations in databases, and like that.
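
To make the application-layer approach concrete, here is a minimal sketch of a program that checkpoints its own progress and resumes from the last saved state.  Everything in it is an assumption made for illustration: the names save_checkpoint, load_checkpoint and sum_range, the JSON state file, and the stop_after parameter (which only simulates an interruption).  It covers state the program itself owns; it does nothing about pipes, devices, or network connections.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # The rename is the "commit": readers see the old or the new file,
        # never a half-written one.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def sum_range(limit, path, stop_after=None):
    """Sum 0..limit-1, committing progress after every step.

    stop_after (hypothetical) simulates the process being killed after
    that many steps; the function then returns None.
    """
    state = load_checkpoint(path, {"next": 0, "total": 0})
    steps = 0
    while state["next"] < limit:
        if stop_after is not None and steps >= stop_after:
            return None
        state["total"] += state["next"]
        state["next"] += 1
        save_checkpoint(path, state)
        steps += 1
    return state["total"]
```

Calling sum_range(10, "ckpt.json", stop_after=4) and then sum_range(10, "ckpt.json") picks up at step 4 rather than starting over, which is about as much "restart" as the application can promise by itself.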

If all you really want to do is stop for a backup, then "kill -STOP" will
work (unless you're burdened with some offbrand kernel that doesn't have
job control).  You might also be able to get some mileage out of "undump",
which is subject to the restrictions noted above but it's at least something.
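
For the stop-for-a-backup case, the same thing "kill -STOP" does from the shell can be sketched with the standard os.kill call.  The helper names pause, resume and run_while_stopped are made up for this example, and it assumes a POSIX system where you have permission to signal the target process.

```python
import os
import signal


def pause(pid):
    """Stop the process, as "kill -STOP pid" would."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let the process continue, as "kill -CONT pid" would."""
    os.kill(pid, signal.SIGCONT)


def run_while_stopped(pid, action):
    """Stop pid, run action() (e.g. the backup), then always resume pid."""
    pause(pid)
    try:
        return action()
    finally:
        resume(pid)
```

The try/finally is there so the process gets its SIGCONT even if the backup step fails.  Note that this only freezes the one process; anything it was talking to keeps running.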

The Sprite operating system has something called "process migration", but
as far as I know a migrating process' locks on system resources are not
released during migration, just lifted up and punched down elsewhere --
this restriction makes everything easy, since all your network connections
and file pointers and so on just stay open while your process moves to
some other CPU.
--
Paul Vixie
DEC Western Research Lab	<vixie@wrl.dec.com>
Palo Alto, California		...!decwrl!vixie

zeke@shamash.cdc.com (Robert Scott) (09/04/90)

In article <1990Sep3.235815.17361@wrl.dec.com>, vixie@wrl.dec.com (Paul Vixie) writes:
> I'd like to do this also.  But if your process has pipes open to other
> processes, then those other processes would have to be restarted in the
> same state if your process was to be restarted "correctly".  If you had
> files open, those same files would have to be there when you restarted,
> with the same contents.  If you had a physical device file open, the
> results could be confusing (let's say someone else dismounts your tape
> and mounts one of their own -- can you get your tape back to the same
> "state" it was in when you restart your program?).  And of course, if
> you had any network connections open, then all of this stickiness extends
> to whatever processes you're talking to on (the) remote machine(s).
> 
> This kind of restartability wasn't on the UNIX designers' minds, and the
> system call interface has absolutely no architectural support for it.
> The thing you're trying to do is usually done at the application layer,
> as in "commit" operations in databases, and like that.
> 
> Stuff deleted...

On most Control Data machines running NOS or NOS/VE, and the old Cyber 205
supercomputer, there is a facility called "checkpointing" the system.  When
the operator does this, the state of all running processes is saved, complete
with open file info and everything.  After the checkpoint, the system can be
brought down for maintenance or whatever, and then restored to the initial
running state by reloading the system and going through a "restart" process
to reload and restore the executing jobs.

I believe that on the Cyber 205 we could also checkpoint individual jobs.  The
big difference between UNIX and VSOS (the 205 OS), though, is that each 205 job
is almost always a single process unless it is a system task.

As Paul writes above, UNIX poses many potential problems for this kind of
operation.  Remember, UNIX was written basically as a small computer OS for
interactive access, and wasn't originally intended to be running weather
models or other large programs that might have to run on a supercomputer
for 24 hours before completing.  On the large mainframes, particularly in
the scientific computing arena, huge data reductions and repetitive
calculations are the norm, as is batch input/output.  As a matter of normal
operations on these giant pieces of iron, programs and entire OS states need
to be saved so that the machine can be serviced or a higher priority program run.

Checkpoint on UNIX would be a nice idea, though.


Zeke

~~~~~~~~~~~ From the Shrine of the "Last Gasp of ETA Systems" ~~~~~~~~~~~~~
Extra zesty disclaimer:  MINE! MINE! ALL MINE! <chortle snort froth drool>
Robert K. "Zeke" Scott        internet: zeke@eta.cdc.com
Control Data Corp, Supercomputer Support Group