[comp.unix.wizards] Process restart.

ag@elgar.UUCP (Keith Gabryelski) (11/10/88)

[Follow up to .wizards --kmg]

In article <8831@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <16@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>>How does one stop a process in a way that it can be restarted after a
>>cold boot?
>
>You obviously can't,

Well, it would seem to me that it would require some kernel hacks, but
is at least feasible.

	Freeze the process.

	Save the relevant information to a file including proc entry,
	user info, file descriptor information, blah, blah, blah.

	Remove the process.

Restoring would have to take into account resources that may no longer
be available, but it is at least doable, eh?

>in general.

What does `in general' mean?
-- 
ag@elgar.CTS.COM         Keith Gabryelski          ...!{ucsd, jack}!elgar!ag

gwyn@smoke.BRL.MIL (Doug Gwyn ) (11/11/88)

In article <17@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>>in general.
>What does `in general' mean?

In general, a process is quite a dynamic thing, particularly with
respect to its interaction with its environment.  For example,
what is currently displayed on one or more terminal screens
might be important, or input may be coming from another process
via a pipe, etc.  The total amount of information necessary to
restore all relevant factors upon restart is impractical for the
general case.  Only rather simple-minded uses of processes can
be properly restarted from snapshots.

To take a specific example, I defy you to restart a snapshot
of the "layers" program using any general-purpose mechanism.

peter@ficc.uu.net (Peter da Silva) (11/12/88)

In article <16@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>How does one stop a process in a way that it can be restarted after a
>cold boot?

You can't, in the general case. However it's quite feasible to make a
process snapshot itself so that it can be restarted, so long as it's
willing to re-open its files after restart. This is how the old PDP-11
Adventure program worked. I wrote a variant on this to save precompiled
FORTH executables after porting John James' PDP-11 Forth to Version 7.

Basically, you need to (1) have an 'I am restarting' flag, and...

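#include <stdio.h>

static int restarted = 0;	/* 1 only inside a snapshot image */
void (*restart_func)();		/* where to resume after a restart */
char *restart_files[_NFILE];	/* names of the files that were open */
char *restart_mode[_NFILE];	/* fopen modes to re-open them with */
long restart_offset[_NFILE];	/* offsets to seek back to */

main(ac, av)
int ac;
char **av;
{
	if(restarted) {
		/* set restarted to 0. */
		/* re-fopen files, and seek to the saved offsets. */
		/* call saved restart_func. */
	}

	...

}

/* FAIL and OK are assumed to be defined elsewhere. */
snapshot(func)
void (*func)();
{
	FILE *fp;
	int pid;

	if((pid = fork()) < 0) {
		complain();
		return FAIL;
	}
	if(pid == 0) {
		restarted = 1;
		restart_func = func;
		abort();	/* child dumps core with the flag set */
	}
	wait((int *)0);		/* let the child finish writing core */
	if(!(fp = fopen("core", "r+"))) {
		complain();
		return FAIL;
	}
	/* convert core header to an a.out header, and set data end
		to _end. */
	fclose(fp);
	return OK;
}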
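
The "re-fopen files, and seek to the saved offsets" step can be sketched
in isolation.  This is only an illustration for read-only streams, not
the code from the Adventure or Forth programs; the names tracked_open,
record_offsets and reopen_all, and the MAX_TRACKED limit, are made up
for the example.  It assumes the file names stay valid and the files
still exist at restart time.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_TRACKED 16	/* hypothetical limit, standing in for _NFILE */

struct tracked_file {
	const char *name;	/* path given to the original fopen() */
	const char *mode;	/* mode to use when re-opening */
	long offset;		/* position recorded before the snapshot */
	FILE *fp;		/* live stream; stale after a restart */
};

static struct tracked_file tracked[MAX_TRACKED];
static int ntracked = 0;

/* Open a file and remember how to re-open it after a restart. */
FILE *tracked_open(const char *name, const char *mode, const char *reopen_mode)
{
	FILE *fp;

	assert(name != NULL && mode != NULL && reopen_mode != NULL);
	if (ntracked >= MAX_TRACKED)
		return NULL;
	if ((fp = fopen(name, mode)) == NULL)
		return NULL;
	tracked[ntracked].name = name;
	tracked[ntracked].mode = reopen_mode;
	tracked[ntracked].offset = 0L;
	tracked[ntracked].fp = fp;
	ntracked++;
	return fp;
}

/* Call just before the snapshot: record the position of every stream. */
int record_offsets(void)
{
	int i;

	for (i = 0; i < ntracked; i++) {
		tracked[i].offset = ftell(tracked[i].fp);
		if (tracked[i].offset < 0L)
			return -1;
	}
	return 0;
}

/* Call after a restart: the old FILE pointers are useless, so open
   each file again and seek back to where the snapshot left off. */
int reopen_all(void)
{
	int i;

	for (i = 0; i < ntracked; i++) {
		tracked[i].fp = fopen(tracked[i].name, tracked[i].mode);
		if (tracked[i].fp == NULL)
			return -1;
		if (fseek(tracked[i].fp, tracked[i].offset, SEEK_SET) != 0)
			return -1;
	}
	return 0;
}
```

Files open for writing need more care: re-opening with "w" would
truncate them, so a mode such as "r+" would be the likelier choice
there, and buffered output has to be accounted for before the snapshot.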
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation
"Have you hugged  U  your wolf today?"     uunet.uu.net!ficc!peter
Disclaimer: My typos are my own damn business.   peter@ficc.uu.net

ag@elgar.UUCP (Keith Gabryelski) (11/12/88)

In article <8857@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>Only rather simple-minded uses of processes can
>be properly restarted from snapshots.

Snapshots are not the only means of process restart, but they are the
most likely for what I was thinking.

>To take a specific example, I defy you to restart a snapshot
>of the "layers" program using any general-purpose mechanism.

I doubt a shell is something someone would want to restart (although
migrating one is a different matter).  You would probably want some of the
processes that are running under the shell, though.

Long running processes that don't have any means of shutdown/restart
built into them are what I am thinking of.

Let's say we have this process computing prime numbers (or some other
simple case) and the system needs to be shutdown because of some fatal
error.  Can a snapshot be done?
-- 
ag@elgar.CTS.COM         Keith Gabryelski          ...!{ucsd, jack}!elgar!ag

bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee) (11/13/88)

In article <18@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>
>Long running processes that don't have any means of shutdown/restart
>built into them are what I am thinking of.
>
>Let's say we have this process computing prime numbers (or some other
>simple case) and the system needs to be shutdown because of some fatal
>error.  Can a snapshot be done?

I did exactly this about two years ago.  My implementation of
M.O.Rabin's probabilistic primality test ran for about a week of real time
on a uVax II surviving multiple reboots/system crashes before finding a 1000
digit probabilistic prime....  I don't know how much real CPU time it took
-- the machine was a general purpose machine (I ran my program niced 19) and
I didn't keep track of timing info.  In retrospect it would have been easy:
I had it checkpoint every 5 minutes of CPU time anyway, so all I needed to
do was increment a counter.  Anyway, since the program's I/O behavior is
very simple (it generated output only just before completing, and I only
redirected its stdout to a file), it was particularly simple to checkpoint
the process.
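
The CPU-time bookkeeping described above might look roughly like this.
It is a sketch, not the code from the primality program: the struct and
function names are invented for illustration, and the caller is assumed
to pass in the current value of clock() each time around its main loop.

```c
#include <assert.h>
#include <time.h>

struct cpu_checkpointer {
	clock_t interval;	/* CPU time between checkpoints, in ticks */
	clock_t last;		/* CPU time at the previous checkpoint */
	unsigned long count;	/* checkpoints taken so far */
};

/* Start counting from "now"; "seconds" is the checkpoint interval. */
void cpu_ckpt_init(struct cpu_checkpointer *c, double seconds, clock_t now)
{
	assert(c != NULL && seconds > 0.0);
	c->interval = (clock_t)(seconds * CLOCKS_PER_SEC);
	c->last = now;
	c->count = 0;
}

/* Returns 1, and bumps the counter, once a full interval of CPU time
   has passed since the last checkpoint; otherwise returns 0. */
int cpu_ckpt_due(struct cpu_checkpointer *c, clock_t now)
{
	if (now - c->last < c->interval)
		return 0;
	c->last = now;
	c->count++;
	return 1;
}

/* CPU seconds accounted for by the checkpoints counted so far. */
double cpu_ckpt_seconds(const struct cpu_checkpointer *c)
{
	return (double)c->count * ((double)c->interval / CLOCKS_PER_SEC);
}
```

For the total to survive a reboot, the counter itself has to be part of
the state that gets checkpointed.  Note also that clock() restarts from
a small value in each new process and can wrap around on systems with a
narrow clock_t, so "last" should be reset with cpu_ckpt_init-style logic
after every restart.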

I thought about the case of replacing open/close with library routines and
syscall'ing the traps after saving state; at a checkpoint, we can lstat the
known descriptors so we can restore.  This would work only for files, of
course, and I didn't bother.  I may do this at a later date....

The code that I _do_ have simply checkpoints the data/stack portion of the
address space.  Note that this includes the stdio buffers etc, so if I _did_
decide to save file descriptor states all I need to do at restart is to
lseek to the old location... assuming the program doesn't lseek around also.
If it did, I'd have to copy all the files to get _their_ state at the time
of the checkpoint (bleh).  Restart is performed by running the program with
a switch specifying the checkpoint file, whereupon the state from the file
is loaded into the current address space (i.e., your program would have to
recognize a flag and call my restore function).  I have versions of this
code running on Vaxen and IBM RTs.
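
The "recognize a flag and call a restore function" arrangement can be
sketched as follows.  This swaps in a simpler technique than the one
described above: instead of dumping the data/stack portion of the
address space, it saves one explicit struct.  The -r switch, the struct
layout, the magic number and all function names are assumptions made
for the example, and the rename-over-old-file step assumes POSIX
rename() semantics.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CKPT_MAGIC 0x434B5054L	/* "CKPT"; hypothetical file tag */

struct app_state {		/* hypothetical: all that must survive */
	long magic;
	unsigned long iteration;
	double partial_sum;
};

/* Write to a temporary file, then rename it over the old checkpoint,
   so an interrupted write does not clobber the previous one. */
int checkpoint_save(const char *path, const struct app_state *st)
{
	char tmp[256];
	struct app_state out;
	FILE *fp;

	assert(path != NULL && st != NULL);
	if (strlen(path) + 5 > sizeof tmp)
		return -1;
	sprintf(tmp, "%s.tmp", path);
	out = *st;
	out.magic = CKPT_MAGIC;
	if ((fp = fopen(tmp, "wb")) == NULL)
		return -1;
	if (fwrite(&out, sizeof out, 1, fp) != 1) {
		fclose(fp);
		remove(tmp);
		return -1;
	}
	if (fclose(fp) != 0) {
		remove(tmp);
		return -1;
	}
	return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read a checkpoint back; reject files without the expected tag. */
int checkpoint_load(const char *path, struct app_state *st)
{
	struct app_state in;
	FILE *fp;
	int ok;

	if ((fp = fopen(path, "rb")) == NULL)
		return -1;
	ok = fread(&in, sizeof in, 1, fp) == 1 && in.magic == CKPT_MAGIC;
	fclose(fp);
	if (!ok)
		return -1;
	*st = in;
	return 0;
}

/* Returns 1 if state was restored via "-r file", 0 for a fresh
   start, and -1 if a restore was requested but failed. */
int init_state(int argc, char **argv, struct app_state *st)
{
	memset(st, 0, sizeof *st);
	if (argc == 3 && strcmp(argv[1], "-r") == 0)
		return checkpoint_load(argv[2], st) == 0 ? 1 : -1;
	return 0;
}
```

A raw struct dump like this is only readable by the same program on the
same kind of machine, which is the same limitation an address-space
dump has.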

I currently have three 1000-digit probabilistic primes.  Does any factoring
wizard want a 2000 digit compos... :-)

To generate 100 digit probabilistic primes (probability 1 - 2^-40), it takes
129.3u 0.7s 2:28 87% on an IBM RT/APC and 290.2u 0.1s 8:49 54% on a uVax III.

The primality code uses the cmump library package developed here at CMU
(cmump is based on the mp package from BTL), so probably won't be useful
unless you have source license or you're willing to rewrite it.  As for the
checkpointing code, I'm willing [and able] to share.  I only use Unix
syscalls and the code should have no Mach dependencies.

-bsy
-- 
Internet:	bsy@cs.cmu.edu		Bitnet:	bsy%cs.cmu.edu%smtp@interbit
CSnet:	bsy%cs.cmu.edu@relay.cs.net	Uucp:	...!seismo!cs.cmu.edu!bsy
USPS:	Bennet Yee, CS Dept, CMU, Pittsburgh, PA 15213-3890
Voice:	(412) 268-7571

gwyn@smoke.BRL.MIL (Doug Gwyn ) (11/14/88)

In article <18@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>Let's say we have this process computing prime numbers (or some other
>simple case) and the system needs to be shutdown because of some fatal
>error.  Can a snapshot be done?

Well, now that you're restricting yourself to the "doable" cases,
these are the sort of programs that I have periodically write out
useful intermediate results, which they can later use to resume
PORTABLY, without needing any special help from the system,
linker, etc.  In fact I had one of these I ran at night on Rice's
IBM 1620 back in 1968; when someone else wanted to use the machine
he could just flip a sense switch and the program would soon punch
out an intermediate set of data that I used next time to continue
the job where it had been interrupted.  Ah, for the good old days.
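
A present-day analogue of the sense switch could be a signal that asks
the program to write out its intermediate data and stop.  The sketch
below counts primes and is only an illustration: the choice of SIGTERM,
the two-number text file format and every function name are assumptions
made for the example, not anything from the 1620 program.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t want_checkpoint = 0;

static void on_signal(int sig)
{
	(void)sig;
	want_checkpoint = 1;	/* just set a flag; I/O happens in the loop */
}

static int is_prime(unsigned long n)
{
	unsigned long d;

	if (n < 2)
		return 0;
	for (d = 2; d <= n / d; d++)
		if (n % d == 0)
			return 0;
	return 1;
}

/* Plain text keeps the file readable on a different machine. */
static int save_progress(const char *path, unsigned long next,
			 unsigned long found)
{
	FILE *fp = fopen(path, "w");

	if (fp == NULL)
		return -1;
	fprintf(fp, "%lu %lu\n", next, found);
	return fclose(fp) == 0 ? 0 : -1;
}

static int load_progress(const char *path, unsigned long *next,
			 unsigned long *found)
{
	unsigned long n, f;
	FILE *fp = fopen(path, "r");
	int got;

	if (fp == NULL)
		return -1;
	got = fscanf(fp, "%lu %lu", &n, &f);
	fclose(fp);
	if (got != 2)
		return -1;
	*next = n;
	*found = f;
	return 0;
}

/* Count primes below limit, resuming from path if it can be read.
   Sets *finished to 0 if a checkpoint request cut the run short. */
unsigned long count_primes(const char *path, unsigned long limit,
			   int *finished)
{
	unsigned long n = 2, found = 0;

	assert(path != NULL && finished != NULL);
	(void)load_progress(path, &n, &found);	/* no file: start fresh */
	signal(SIGTERM, on_signal);
	for (; n < limit; n++) {
		if (want_checkpoint) {
			save_progress(path, n, found);
			*finished = 0;
			return found;
		}
		if (is_prime(n))
			found++;
	}
	*finished = 1;
	return found;
}
```

Running the program again with the same file name picks up at the
saved candidate, with no help needed from the kernel or the linker.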

jerryp@cmx.npac.syr.edu (Jerry Peek) (11/28/88)

In article <16@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>How does one stop a process in a way that it can be restarted after a
>cold boot?

There are some interesting papers at the end of the Winter '88 USENIX
Proceedings -- pages 357 on -- that cover some similar things.

--Jerry Peek, Northeast Parallel Architectures Center, Syracuse, NY
  jerryp@cmx.npac.syr.edu
  +1 315 443-1722