ag@elgar.UUCP (Keith Gabryelski) (11/10/88)
[Follow up to .wizards --kmg]

In article <8831@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <16@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>>How does one stop a process in a way that it can be restarted after a
>>cold boot?
>
>You obviously can't,

Well, it would seem to me that it would require some kernel hacks, but it is at least feasible. Freeze the process. Save the relevant information to a file, including the proc entry, user info, file descriptor information, blah, blah, blah. Remove the process. Restoring would have to take into account resources that may no longer be available, but it is at least doable, eh?

>in general.

What does `in general' mean?
-- 
ag@elgar.CTS.COM Keith Gabryelski ...!{ucsd, jack}!elgar!ag
gwyn@smoke.BRL.MIL (Doug Gwyn) (11/11/88)
In article <17@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>>in general.
>What does `in general' mean?

In general, a process is quite a dynamic thing, particularly with respect to its interaction with its environment. For example, what is currently displayed on one or more terminal screens might be important, or input may be coming from another process via a pipe, etc. The total amount of information necessary to restore all relevant factors upon restart is impractical for the general case. Only rather simple-minded uses of processes can be properly restarted from snapshots.

To take a specific example, I defy you to restart a snapshot of the "layers" program using any general-purpose mechanism.
peter@ficc.uu.net (Peter da Silva) (11/12/88)
In article <16@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>How does one stop a process in a way that it can be restarted after a
>cold boot?

You can't, in the general case. However, it's quite feasible to make a process snapshot itself so that it can be restarted, so long as it's willing to re-open its files after restart. This is how the old PDP-11 Adventure program worked. I wrote a variant on this to save precompiled FORTH executables after porting John James' PDP-11 Forth to Version 7.

Basically, you need to (1) have an 'I am restarting' flag, and...

    static int restarted = 0;
    void (*restart_func)();
    char *restart_files[_NFILE];
    char *restart_mode[_NFILE];
    long restart_offset[_NFILE];

    main(ac, av)
    int ac;
    char **av;
    {
        if(restarted) {
            /* set restarted to 0. */
            /* re-fopen files, and seek to the saved offsets. */
            /* call saved restart_func. */
        }
        ...
    }

    snapshot(func)
    void (*func)();
    {
        FILE *fp;

        if(!fork()) {
            /* child: mark the image as a restart and dump core */
            restarted = 1;
            restart_func = func;
            abort();
        }
        wait((int *)0);   /* let the child finish writing core */
        if(!(fp = fopen("core", "r+"))) {
            complain();
            return FAIL;
        }
        /* convert core header to an a.out header, and set data end to _end. */
        fclose(fp);
    }
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation
"Have you hugged  U  your wolf today?"     uunet.uu.net!ficc!peter
Disclaimer: My typos are my own damn business.   peter@ficc.uu.net
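The file bookkeeping in the outline above (the restart_files, restart_mode and restart_offset arrays) can be sketched in portable C. This is only an illustration under stated assumptions, not the code from the post: the names tracked_fopen, record_offsets, reopen_all and MAX_TRACKED are hypothetical, and the core-to-a.out conversion is left out entirely because it depends on the machine's file formats. The caller is assumed to fflush its output streams before recording offsets.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_TRACKED 16          /* assumption: stands in for _NFILE */

/* One entry per stream the program wants back after a restart. */
struct tracked {
    const char *path;           /* file name; must outlive the snapshot */
    const char *reopen_mode;    /* fopen mode used on restart, e.g. "r+" */
    long offset;                /* position recorded before the snapshot */
    FILE *fp;                   /* live stream; stale after a restart */
};

static struct tracked restart_table[MAX_TRACKED];
static int restart_count = 0;

/* Open a file and remember how to get it back. reopen_mode should not
 * truncate: a file first opened with "w" is re-opened with "r+". */
FILE *tracked_fopen(const char *path, const char *mode,
                    const char *reopen_mode)
{
    FILE *fp;

    assert(path != NULL && mode != NULL && reopen_mode != NULL);
    if (restart_count >= MAX_TRACKED)
        return NULL;
    fp = fopen(path, mode);
    if (fp == NULL)
        return NULL;
    restart_table[restart_count].path = path;
    restart_table[restart_count].reopen_mode = reopen_mode;
    restart_table[restart_count].offset = 0L;
    restart_table[restart_count].fp = fp;
    restart_count++;
    return fp;
}

/* Call just before taking the snapshot: note where each stream is. */
int record_offsets(void)
{
    int i;

    for (i = 0; i < restart_count; i++) {
        long pos = ftell(restart_table[i].fp);

        if (pos < 0L)
            return -1;
        restart_table[i].offset = pos;
    }
    return 0;
}

/* Call first thing after a restart: the old FILE pointers are useless,
 * so re-open every file by name and seek to the recorded offset. */
int reopen_all(void)
{
    int i;

    for (i = 0; i < restart_count; i++) {
        FILE *fp = fopen(restart_table[i].path,
                         restart_table[i].reopen_mode);

        if (fp == NULL)
            return -1;
        if (fseek(fp, restart_table[i].offset, SEEK_SET) != 0) {
            fclose(fp);
            return -1;
        }
        restart_table[i].fp = fp;
    }
    return 0;
}
```

Keeping a separate re-open mode matters because re-opening with "w" would truncate a file the program had already half written. The table lives in ordinary static data, so it would be carried along in a data-segment snapshot of the kind the post describes.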
ag@elgar.UUCP (Keith Gabryelski) (11/12/88)
In article <8857@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>Only rather simple-minded uses of processes can
>be properly restarted from snapshots.

Snapshots are not the only means of process restart, but they are the most likely for what I was thinking.

>To take a specific example, I defy you to restart a snapshot
>of the "layers" program using any general-purpose mechanism.

I doubt a shell is something someone would want to restart (although migrate is a different matter). You would probably want some of the processes that are running under the shell, though.

Long running processes that don't have any means of shutdown/restart built into them are what I am thinking of.

Let's say we have this process computing prime numbers (or some other simple case) and the system needs to be shutdown because of some fatal error. Can a snapshot be done?
-- 
ag@elgar.CTS.COM Keith Gabryelski ...!{ucsd, jack}!elgar!ag
bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee) (11/13/88)
In article <18@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>
>Long running processes that don't have any means of shutdown/restart
>built into them are what I am thinking of.
>
>Let's say we have this process computing prime numbers (or some other
>simple case) and the system needs to be shutdown because of some fatal
>error. Can a snapshot be done?

I did exactly this about two years ago. My implementation of M. O. Rabin's probabilistic primality test ran for about a week of real time on a uVax II, surviving multiple reboots and system crashes, before finding a 1000 digit probabilistic prime....

I don't know how much real CPU time it took -- the machine was a general purpose machine (I ran my program niced 19) and I didn't keep track of timing info. In retrospect it would have been easy: I had it checkpoint every 5 minutes of CPU time anyway, so all I needed to do was increment a counter.

Anyway, since the program's I/O behavior is very simple (it generated output only just before completing, and I only redirected its stdout to a file), it was particularly simple to checkpoint the process. I thought about the case of replacing open/close with library routines and syscall'ing the traps after saving state; at a checkpoint, we can lstat the known descriptors so we can restore. This would work only for files, of course, and I didn't bother. I may do this at a later date....

The code that I _do_ have simply checkpoints the data/stack portion of the address space. Note that this includes the stdio buffers etc., so if I _did_ decide to save file descriptor states, all I would need to do at restart is lseek to the old location... assuming the program doesn't lseek around also. If it did, I'd have to copy all the files to get _their_ state at the time of the checkpoint (bleh).
Restart is performed by running the program with a switch specifying the checkpoint file, whereupon the state from the file is loaded into the current address space (i.e., your program would have to recognize a flag and call my restore function). I have versions of this code running on Vaxen and IBM RTs.

I currently have three 1000 digit probabilistic primes. Does any factoring wizard want a 2000 digit compos... :-)

To generate 100 digit probabilistic primes (probability 1 - 2^-40), it takes

    129.3u 0.7s 2:28 87% on an IBM RT/APC and
    290.2u 0.1s 8:49 54% on a uVax III.

The primality code uses the cmump library package developed here at CMU (cmump is based on the mp package from BTL), so it probably won't be useful unless you have a source license or you're willing to rewrite it. As for the checkpointing code, I'm willing [and able] to share. I only use Unix syscalls and the code should have no Mach dependencies.

-bsy
-- 
Internet: bsy@cs.cmu.edu
Bitnet: bsy%cs.cmu.edu%smtp@interbit
CSnet: bsy%cs.cmu.edu@relay.cs.net
Uucp: ...!seismo!cs.cmu.edu!bsy
USPS: Bennet Yee, CS Dept, CMU, Pittsburgh, PA 15213-3890
Voice: (412) 268-7571
--
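The checkpoint-every-N-minutes-of-CPU-time idea and the restart switch can be sketched in C. Note the swapped-in technique: this does not dump the data and stack portions of the address space as the post describes; it saves one explicit struct, which is a simpler stand-in. The struct fields and the names checkpoint_save, checkpoint_load and checkpoint_due are hypothetical, and replacing the old checkpoint by rename is assumed to behave as it does on POSIX systems.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical program state; everything needed to resume goes here. */
struct state {
    unsigned long candidate;    /* next value to test */
    unsigned long found;        /* results so far */
    unsigned long checkpoints;  /* checkpoints taken; times the interval,
                                   this approximates CPU time used */
};

/* Write the state to "<path>.tmp", then rename it over path, so a crash
 * in mid-write leaves the previous checkpoint intact. */
int checkpoint_save(const char *path, const struct state *s)
{
    char tmp[512];
    FILE *fp;

    assert(path != NULL && s != NULL);
    if (strlen(path) + 5 > sizeof tmp)      /* room for ".tmp" and NUL */
        return -1;
    sprintf(tmp, "%s.tmp", path);
    fp = fopen(tmp, "wb");
    if (fp == NULL)
        return -1;
    if (fwrite(s, sizeof *s, 1, fp) != 1) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Load a checkpoint written by checkpoint_save on the same machine type. */
int checkpoint_load(const char *path, struct state *s)
{
    struct state tmp;
    FILE *fp;
    size_t got;

    assert(path != NULL && s != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    got = fread(&tmp, sizeof tmp, 1, fp);
    fclose(fp);
    if (got != 1)
        return -1;
    *s = tmp;
    return 0;
}

/* Return 1 and reset *last once interval_sec of CPU time has passed. */
int checkpoint_due(clock_t *last, double interval_sec)
{
    clock_t now = clock();

    if ((double)(now - *last) / CLOCKS_PER_SEC < interval_sec)
        return 0;
    *last = now;
    return 1;
}
```

A main loop would call checkpoint_due(&last, 300.0) on each pass and, when it returns 1, increment the checkpoints field and call checkpoint_save. At startup the program would look for its restart switch and call checkpoint_load with the named file. The binary record is only readable on the same kind of machine that wrote it.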
gwyn@smoke.BRL.MIL (Doug Gwyn) (11/14/88)
In article <18@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>Let's say we have this process computing prime numbers (or some other
>simple case) and the system needs to be shutdown because of some fatal
>error. Can a snapshot be done?

Well, now that you're restricting yourself to the "doable" cases: these are the sort of programs that I have periodically write out useful intermediate data, which they can later use to resume, PORTABLY, without needing any special help from the system, linker, etc.

In fact I had one of these that I ran at night on Rice's IBM 1620 back in 1968; when someone else wanted to use the machine he could just flip a sense switch, and the program would soon punch out an intermediate set of data that I used the next time to continue the job where it had been interrupted. Ah, for the good old days.
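A present-day analogue of the sense switch might look like the following C sketch, in which a signal takes the place of the switch and the intermediate data is a small text file. Everything here is an assumption made for illustration: the prime-counting job, the file format, and the names watch_for_shutdown, save_progress, load_progress and count_primes are hypothetical and do not come from the post.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Set by the signal handler; plays the role of the sense switch. */
static volatile sig_atomic_t dump_requested = 0;

static void on_shutdown(int sig)
{
    (void)sig;
    dump_requested = 1;
}

/* Ask to be told about SIGINT and SIGTERM instead of dying at once. */
void watch_for_shutdown(void)
{
    signal(SIGINT, on_shutdown);
    signal(SIGTERM, on_shutdown);
}

/* Progress is two decimal numbers in a text file, readable anywhere. */
int save_progress(const char *path, unsigned long next, unsigned long count)
{
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    if (fprintf(fp, "%lu %lu\n", next, count) < 0) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}

int load_progress(const char *path, unsigned long *next, unsigned long *count)
{
    unsigned long n, c;
    FILE *fp = fopen(path, "r");
    int got;

    if (fp == NULL)
        return -1;
    got = fscanf(fp, "%lu %lu", &n, &c);
    fclose(fp);
    if (got != 2)
        return -1;
    *next = n;
    *count = c;
    return 0;
}

static int is_prime(unsigned long n)
{
    unsigned long d;

    if (n < 2)
        return 0;
    for (d = 2; d <= n / d; d++)    /* trial division up to sqrt(n) */
        if (n % d == 0)
            return 0;
    return 1;
}

/* Count primes below limit, resuming from path if it exists. Sets
 * *finished to 0 if a shutdown request cut the run short. */
unsigned long count_primes(const char *path, unsigned long limit,
                           int *finished)
{
    unsigned long n = 2, count = 0;

    assert(path != NULL && finished != NULL);
    (void)load_progress(path, &n, &count);  /* no file: start fresh */
    *finished = 1;
    for (; n < limit; n++) {
        if (dump_requested) {
            dump_requested = 0;
            *finished = 0;
            break;
        }
        if (is_prime(n))
            count++;
    }
    save_progress(path, n, count);          /* always leave a resume point */
    return count;
}
```

The handler only sets a flag; the actual writing happens in the main loop, at a point where the state is consistent. Because the saved data is plain text, the job could be resumed on a different machine, which is the portability the post stresses.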
jerryp@cmx.npac.syr.edu (Jerry Peek) (11/28/88)
In article <16@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>How does one stop a process in a way that it can be restarted after a
>cold boot?

There are some interesting papers at the end of the Winter '88 USENIX Proceedings -- pages 357 on -- that cover some similar things.

--Jerry Peek, Northeast Parallel Architectures Center, Syracuse, NY
  jerryp@cmx.npac.syr.edu +1 315 443-1722