[comp.unix.questions] Checkpointing and the Rollback of Processes

rside@uvicctr.UUCP (Robert Side) (08/29/88)

First of all, I would like to *thank* all the people who responded
to my problem. I tried to reply to everyone, but I guess I have
not mastered the mailing program on our system yet.

Second, I have a follow-up problem, which I will post in a
separate article.

I originally wrote on checkpointing and the rollback of processes:

> I have a problem I hope somebody can help me with.
> 
> Long Summary:
> I would like to be able to *checkpoint* a running process
> so that the process, which is under user control, can be rolled back to a
> given checkpoint and restarted.
> 
> My idea to solve the problem:
> The way I have been thinking to solve the problem is to save
> the process's data, stack and registers when a checkpoint
> occurs and when the user rollbacks the process, the saved
> data, stack, and registers are copied into the process's memory
> image and hopefully the process will think it is back at the time
> the checkpointed was taken.
> 
> Caveats:
> Sun-3 workstations running Sun UNIX 4.2 Release 3.3. There will
> be open files as well as open sockets. The ptrace system call can
> be used.
> 
> What I need help with:
> I would like to know if the problem can be solved,
> what literature (if any) has been written on the above problem,
> what problems will arise, and, MOST OF ALL, how to do it.
> 
> Please email responses (But I do read these groups) and I will
> summarize.
> 
> *MANY* thanks in advance and any help will be greatly appreciated.
> 
> Rob Side
> 
> Robert Side <rside@uvunix.uvic.cdn>
> UUCP:	...!{ubc-vision,uw-beaver,ssc-vax}!uvicctr!rside
> BITNET:	rside@uvunix.bitnet

--------------------
Jeff Woolsey <uw-beaver!ames!ucbcad!nsc.NSC.COM!woolsey> writes

You've neglected one biggie: open files, and their positions.  Another,
not quite so biggie: process environment (particularly the current
working directory, if the process has written out files it will later
want to read).

Of course, if the checkpoint is handled by the program itself, it can
make sure that it happens at a good time (no open files, etc.).  If the
checkpoint is handled by something external, so that you could use it
to checkpoint ANYTHING (except programs running with privilege), you'll
have to worry about all this stuff.

Good luck.

Jeff Woolsey  woolsey@nsc.NSC.COM  -or-  woolsey@umn-cs.cs.umn.EDU

--------------------
uunet!jetson.UPMA.MD.US!john (John Owens)  writes

Check out the undump mechanism used in GNU Emacs.  It writes an
executable image of the current process.  It's used to turn certain
pre-loaded data into shared read-only text, but you could adapt it to
your uses.  The only problem is knowing what your open files are.  If
you are able to, you could set a flag in the dumped image that your
program will read on start, and it will reopen the files, fix the
stack, and do a longjmp to a setjmp that you've stored before the
undump.  You can also do an ftell on all the files during the
checkpoint and lseek during the restore....

Good luck!
---
John Owens		john@jetson.UPMA.MD.US
SMART HOUSE L.P.	uunet!jetson!john		(old uucp)
+1 301 249 6000		john%jetson.uucp@uunet.uu.net	(old internet)

--------------------
uunet!unisoft!cander (Charles Anderson) writes

I will assume that you don't care about files being changed.  Rolling
them back (without just copying them) could be a problem without some
help from the O.S.  Here's a simple solution that the 4.2 dump program
uses: fork and let the child do the work/transaction.  If you need to
rollback, just have the child exit.  The parent is then in exactly the
same state as when the "checkpoint" happened.  Dump uses this to deal
with potential tape problems.  You could do any number of forks (up to
the per-user process limit) to maintain any number of concurrent
checkpoints.  To roll forward or "commit the transaction" you could
signal the parent(s) and have him/her/them exit.  I realize it's kind
of quick and dirty and it may be expensive if the process is big, but
it will work.

Otherwise, you could try to write the whole data segment out to disk to
checkpoint and do a setjmp().  Then to rollback, you could read the
data segment back in and longjmp().  I don't know if it would work, but
it sounds good.

Let me know what you decide on.  It sounds like an interesting
problem.

Charles.  {sun,uunet,ucbvax,pyrmaid}!unisoft!cander

--------------------
uunet!dalsqnt!vector!chip (Chip Rosenthal) writes

>The way I have been thinking to solve the problem is to save
>the process's data, stack and registers when a checkpoint
>occurs

Setjmp/longjmp does this for the stack and registers.
---
Chip Rosenthal     chip@vector.UUCP | I've been a wizard since my childhood.
Dallas Semiconductor   214-450-0486 | And I've earned some respect for my art.

--------------------
der Mouse  <mcgill-vision!uunet!Larry.McRCIM.McGill.EDU!mouse> writes

I implemented something similar once.  What I did was to checkpoint a
process into a file for later resumption, but the constraints were
somewhat different.  In particular, the whole point was to be able to
restore a simulator run after a crash, which makes restoring open
files and so on effectively impossible.  This is the difficult part of
this: open files.  My "solution" was to force the program to close all
files before checkpointing; this was feasible in our case.

Have you considered forking and letting one process run on, with the
"resumption" consisting of switching to the other process?  Depending
on what you want, this might be good enough.

Doing this would involve just adding two syscalls, one to dump a
process and one to restore it.  Yes, it's possible.  I wouldn't attempt
it without kernel source, but then I get very dogmatic about having
source.  I'd be glad to send you the code I have for dumping and
restoring later, in another process, though it won't be directly useful.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

----------------

Again thanks to those that replied

Rob Side
-- 
Robert Side <rside@uvunix.uvic.cdn>
UUCP:	...!{ubc-vision,uw-beaver,ssc-vax}!uvicctr!rside
BITNET:	rside@uvunix.bitnet

laman@ivory.SanDiego.NCR.COM (Mike Laman) (08/31/88)

In article <484@uvicctr.UUCP> rside@uvicctr.UUCP (Robert Side) writes:
>First of all. I would like to *thank* all the people that responded
>to my problem. I tried to reply to everyone but I guess I have
>not mastered the mailing program on our system yet.
	:
	:
	:
>I originally wrote on checkpointing and the rollback of processes
>
>> I have a problem I hope somebody can help me with.
>> 
>> Long Summary:
>> I would like to be able to *checkpoint* a running process
>> so that the process, which is under user control, can be rolled back to a
>> given checkpoint and restarted.
>> 
	:
	[ Deleted the rest of his "original" message ]
	[ Deleted a couple messages Robert included ]
	:

>uunet!unisoft!cander (Charles Anderson) writes
>
	:
	[ Deleted one suggestion from Charles's message to Robert ]
	:
>Otherwise, you could try to write the whole data segment out to disk to
>checkpoint and do a setjmp().  Then to rollback, you could read the
>data segment back in and longjmp().  I don't know if it would work, but
>it sounds good.
>
	:
	[ Deleted a couple messages Robert included ]
	:
I just wanted to add my two cents' worth on the subject of writing out
an arbitrary area of data in one process and reading it back in in
another process later.  It is possible; after all, that's how "rogue"
saves a game.  But let me warn you of a non-obvious problem you can
encounter.  If the area you are saving contains the stdio library's
data, and you use stdio itself to write out that data, you will have a
problem.  When you write out the ``_iob[]'' table, it will show that a
slot (among others, of course) is in use: namely, the one you are
using to write out the data.  Eventually you finish writing out the
data and (as a good programmer :-)) ``fclose()'' the file.  That frees
the stdio ``_iob[]'' slot and closes the file descriptor.  But be
careful when you read the data back in later (probably in some other
process): the ``_iob[]'' slot was "open" at the time the data was
saved.  After you have restored all the data, you need to ``fclose()''
that once-open stream (whose descriptor is really closed) so you can
free up the slot.  Otherwise, each time you restore from a saved
image, you will eat up another stdio ``_iob[]'' slot.  On many systems
you will get to save the image about 17 times (20 - 3 for stdin,
stdout, and stderr); then your ``fopen()''s will fail because the
``_iob[]'' table is full.

And don't worry, I'm not even going to mention the lack of portability
to systems with a non-contiguous data space.  Hmmm.  I guess I did.

Mike Laman

P.S.  When you think about this, you really start to worry about the
guts of various libraries, with their static (initialized-only-once)
data.  You'd better hope they are initialized properly.  Example:
terminfo curses - don't play a restored game of rogue on a different
terminal!  The internally static data is for the original terminal
type!  Generally speaking, you're getting into a headache with this
approach.