[comp.unix.wizards] Checkpoint/Restart

fay@ksr.UUCP (Peter Fay) (05/08/90)

Does anyone out there have any experience using checkpoint/restart as a user
or coding it in the O.S.? I know Unicos claims to have that facility and have
read the Unicos manual, but I know little about how it or any other Unix
checkpoint is implemented.

What are the issues that come up when putting this into the O.S. (esp. Unix)?
What capabilities do users really want and need?
Does checkpoint save only files that are currently opened or all that have
been accessed in the program to date?
What is a Unicos process "category" and how does that relate to pid and
chkpnt?
Is there a runtime environment that does all the chkpnt/restart for the
applications or is it a roll-your-own situation?

-pete fay
{harvard.harvard.edu!ksr!fay}

purdon@athena.mit.edu (James R Purdon) (05/09/90)

In article <646@ksr.UUCP> fay@ksr.UUCP (Peter Fay) writes:
>Does anyone out there have any experience using checkpoint/restart as a user
>or coding it in the O.S.? I know Unicos claims to have that facility and have
>read the Unicos manual, but I know little about how it or any other Unix
>checkpoint is implemented.

I've had experience using the checkpoint facility, both at the command level
and the subroutine level.  There can be problems if files required by the
checkpointed job are changed or if the checkpointed job does not own all
of its files (in the case of NQS jobs), but within these limits it appears to
be a reliable utility (warning: my view is biased).

>What are the issues that come up when putting this into the O.S. (esp. Unix)?

I don't know the answer to this, but my guess is that you have to worry about
saving an image of memory in a file in such a way that you can load it back
into memory and start right where you left off.

>What capabilities do users really want and need?

Users want a seamless, transparent, painless facility.

>Does checkpoint save only files that are currently opened or all that have
>been accessed in the program to date?

It does not even save files - just the buffers and positional pointers.  If
your files change, the checkpoint can no longer be restarted.  This does not
seem to be a problem for most of the user programs I've encountered.

>What is a Unicos process "category" and how does that relate to pid and
>chkpnt?

I'm not sure what you mean by this.  In addition to pids, UNICOS supports
"jobs": a login process and its children, which share a jid.  Either pids
or jids may be used in checkpoint requests.

>Is there a runtime environment that does all the chkpnt/restart for the
>applications or is it a roll-your-own situation?

The NQS batch system can issue its own checkpoint commands, but the user
can also issue checkpoint requests (which the NQS batch system will honor).
In this situation, restart only occurs after a crash and is done automatically
by NQS.

In the interactive environment, the users are on their own.  This includes
processes placed into the background by nohup, at, and & (but not NQS
jobs).  Checkpoint and restart requests may be issued at will, and are
available at both the command line and subroutine level.

While you certainly could automatically checkpoint interactive processes
from some central process, I wouldn't recommend it for obvious reasons.

Jim

--
James Purdon
purdon@cons1.mit.edu

mdelany%hbapn1.prime.com@relay.cs.net ( Mark Delany) (08/16/90)

Mark Holcomb <mth@ROLF.STAT.UGA.EDU> writes:

> I've felt the need for a new tool that Sun doesn't have.

> Ever have a process that's been running for six weeks, and will
> need another week to finish, when you need to make level 0 backups
> or would like to shut the computer down for a bad storm?

> I need a tool that would stop a running process and let it be
> restarted at a later date.

> I've thought of a couple of ways it might be done:

[A number of suggestions deleted]

...

A general solution would have to re-establish and re-position all open
files, sockets, message queues, pipes, semaphores, shared-memory segments,
environment variables, (add your favourite externally visible entity) to
exactly the same state as they were previously.
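
For just one of the entities in that list, an open regular file, the
re-positioning step might look like the sketch below. It is a minimal
illustration, not a general solution: struct saved_file, save_file_state()
and restore_file_state() are hypothetical names, and the sketch assumes the
file still exists under the same name and has not changed between
checkpoint and restart.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record of one open regular file, as a checkpoint might keep it. */
struct saved_file {
    char  path[1024];   /* name the file was opened under */
    int   flags;        /* open(2) flags to reuse on restart */
    off_t offset;       /* file position at checkpoint time */
};

/* Capture the name, flags, and current offset of an open descriptor.
 * Returns 0 on success, -1 on error. */
int save_file_state(int fd, const char *path, int flags, struct saved_file *out)
{
    off_t pos;

    assert(path != NULL && out != NULL);
    pos = lseek(fd, 0, SEEK_CUR);          /* query the position without moving it */
    if (pos == (off_t)-1 || strlen(path) >= sizeof out->path)
        return -1;
    strcpy(out->path, path);
    out->flags = flags;
    out->offset = pos;
    return 0;
}

/* Reopen the file and seek back to the saved offset.
 * Returns the new descriptor, or -1 on error. */
int restore_file_state(const struct saved_file *in)
{
    int fd;

    assert(in != NULL);
    /* Never create or truncate on restart: the old contents must survive. */
    fd = open(in->path, in->flags & ~(O_CREAT | O_TRUNC | O_EXCL));
    if (fd < 0)
        return -1;
    if (lseek(fd, in->offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The restored descriptor will generally have a different number than the
original unless the caller also arranges that (for example with dup2), and
nothing comparable exists for sockets or pipes, which is the point of the
list above.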

Once you've done this, it's a simple matter of re-constructing your memory
image.

Finally, you have to hope that none of the code in your program has
stashed the PID or date away in memory somewhere as these may be different
when you next restart the prog :-)

Seriously, doing this in any substantive manner is difficult and I'm sure
it would be virtually impossible to bullet-proof it on UNIX.

When confronted with this array of problems, most people opt for
individualized, per-program solutions for those progs that run for long
periods.


Mark D.

mwm@decwrl.dec.com (Mike) (08/17/90)

>> Mark Holcomb <mth@ROLF.STAT.UGA.EDU> writes:
>> > I need a tool that would stop a running process and let it be
>> > restarted at a later date.
>> 
>> [problems deleted]
>> Seriously, doing this in any substantive manner is difficult and I'm sure
>> it would be virtually impossible to bullet-proof it on UNIX.

Yes - but you don't need it bullet-proofed; you just need it to work
most of the time. After all, being able to restart 90% of the time is
much better than being able to restart 0% of the time. Other OSs
provide this facility (or similar ones) in the face of these
difficulties; Unix ought to be able to.

In fact, if I recall correctly, UNICOS (the Cray SysV port) does
provide a checkpoint facility, for exactly the kind of long-running
processes that Mark was asking about.

Why does this line come to mind: "Do the easy 90% and give it to the
users; do the hard 10% only if they then ask for it."

	<mike

langley@ds1.scri.fsu.edu (Randolph Langley) (08/17/90)

There is a paper, "Job and Process Recovery In A UNIX-based Operating
System", by Brent Kingsbury and John Kline, that talks about UNICOS's
checkpointing/restarting capabilities. It is available in the Cray
documentation distribution, and I would guess directly from Cray.

I also note that the authors have e-mail addresses: they are
brent@yafs.cray.com and jtk@hall.cray.com.

rdl

dave@csd4.csd.uwm.edu (Dave Rasmussen) (08/17/90)

From article <24193@adm.BRL.MIL>, by mwm@decwrl.dec.com (Mike):
> 
> In fact, if I recall correctly, UNICOS (the Cray SysV port) does
> provide a checkpoint facility, for exactly the kind of long-running
> processes that Mark was asking about.
> 
> Why does this line come to mind: "Do the easy 90% and give it to the
> users; do the hard 10% only if they then ask for it."
> 
Convex mentions this may appear in their next release 4th qtr or so as well.

--
Internet:dave@uwm.edu, Uucp:uwm!dave, Bitnet:dave%uwm.edu@INTERBIT
AT&T:414-229-5133 USnail:Dave Rasmussen-CSD,Box 413 EMS380,Milwaukee,WI 53201

gwyn@smoke.BRL.MIL (Doug Gwyn) (08/18/90)

In article <24193@adm.BRL.MIL> mwm@decwrl.dec.com (Mike) writes:
>>> > I need a tool that would stop a running process and let it be
>>> > restarted at a later date.
>>> Seriously, doing this in any substantive manner is difficult and I'm sure
>>> it would be virtually impossible to bullet-proof it on UNIX.
>Yes - but you don't need it bullet-proofed; you just need it to work
>most of the time. After all, being able to restart 90% of the time is
>much better than being able to restart 0% of the time. Other OSs
>provide this facility (or similar ones) in the face of these
>difficulties; Unix ought to be able to.

Other operating systems do not have the rich process environment
that UNIX provides.  If there are only a small number of things that
need to be straightened out in a batch-processing environment, then
system-provided checkpointing is feasible.

>Why does this line come to mind: "Do the easy 90% and give it to the
>users; do the hard 10% only if they then ask for it."

Why does the thought come to mind "anyone whose application requires
only a 90% chance of executing successfully shouldn't be using the
computer at all"?

Any application that is EXPECTED to run for a long time should have
interruptibility features built into it.  I did this back in 1967,
and have little sympathy for people who are too lazy to deal with it.

gkn@ucsd.Edu (Gerard K. Newman) (08/18/90)

In article <13611@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>Any application that is EXPECTED to run for a long time should have
>interruptibility features built into it.  I did this back in 1967,
>and have little sympathy for people who are too lazy to deal with it.

True enough, but a minor nit:  suppose I am a more-or-less non-computer
literate type, who is using some canned commercial software (pick your
own favorite package -- there are lots of them) to do some lengthy
calculation.  In this case, it would be a real plus for the operating
system to provide some easy (even automatic) means for periodic
checkpointing of the job state.  Such systems exist, and many have
existed for quite some time.

I think it's a bit unfair for every user of a system to have to
invent a way to do this specific to their particular application.
In many cases it may not be possible (the above "canned software"
problem being an example).

I agree that adding this capability to many varieties of Unix may
require much skull sweat, especially to get it right.  But in the
environment here at SDSC (and in other places) checkpointing is a
remarkably useful feature.

Cheers,

gkn
San Diego Supercomputer Center

montnaro@spyder.crd.ge.com (Skip Montanaro) (08/19/90)

In article <17543@ucsd.Edu> gkn@ucsd.Edu (Gerard K. Newman) writes:

   I think it's a bit unfair for every user of a system to have to
   invent a way to do this specific to their particular application.
   In many cases it may not be possible (the above "canned software"
   problem being an example).

I would agree with the above statements if

	a) the effort of creating a programmer/user-transparent
	general-purpose solution was not much more difficult than writing a
	programmer/user-visible application-specific solution,

	b) it was impossible (nearly so) to create application-specific
	solutions to the problem, or

	c) most applications actually needed it.

However, as has been discussed in this and other newsgroups off-and-on over
the past couple of years

	a) it is very hard to solve the general-purpose problem, systems
	like CRAY's checkpoint/restart facility, and the University of
	Wisconsin's RU/Condor systems notwithstanding,

	b) for most applications that need such facilities, they aren't
	terribly difficult to write,

	c) very few applications actually need such facilities.

Given the difficulty of adding a general solution to (various flavors of)
Unix, it is probably wiser to do it on a case-by-case basis. It is unlikely
that most of the relatively few applications that need checkpoint/restart
capabilities will need the full range of capabilities that will need to be
accounted for in a general solution.

As a common case, consider many scientific applications. They typically read
in a large data set, munch on it in an iterative manner for a long period of
time, then write out another large data set. Checkpointing an application of
this sort is pretty trivial. Just write out the intermediate state of the
computation "every so often". If it must be restarted, it can be directed to
read the checkpointed data, restarting the computation from that point.

If the application crashes during the initial input phase, no expensive
computation has been lost. There's a checkpoint facility in place during the
iterative solve phase. During the final output phase, if an error occurs
(such as a full disk, head crash, or system failure), you fall back to the
last checkpoint during the compute phase (if you can recover it from the
disk).
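
The "write out the intermediate state every so often" scheme described
above can be sketched as follows. This is only an illustration under
stated assumptions: struct solver_state and its fields are made up, and
checkpoint_save() and checkpoint_load() are hypothetical names. The state
is written to a temporary file and then renamed into place, so that a
crash in the middle of a write leaves the previous checkpoint intact.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical solver state: whatever is needed to resume the iteration. */
struct solver_state {
    long   iteration;
    double residual;
    double x[4];
};

/* Write the state to "<path>.tmp", force it to disk, then rename it over
 * <path>.  Returns 0 on success, -1 on error. */
int checkpoint_save(const char *path, const struct solver_state *s)
{
    char tmp[1024];
    FILE *f;
    int n, ok;

    assert(path != NULL && s != NULL);
    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || n >= (int)sizeof tmp)
        return -1;
    f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    ok = fwrite(s, sizeof *s, 1, f) == 1
      && fflush(f) == 0
      && fsync(fileno(f)) == 0;
    if (fclose(f) != 0)
        ok = 0;
    if (!ok || rename(tmp, path) != 0) {
        remove(tmp);                 /* do not leave a half-written file behind */
        return -1;
    }
    return 0;
}

/* Read a previously saved state.  Returns 0 on success, -1 if there is
 * no usable checkpoint (in which case the caller starts from scratch). */
int checkpoint_load(const char *path, struct solver_state *s)
{
    FILE *f;
    size_t got;

    assert(path != NULL && s != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}
```

A solver would call checkpoint_load() once at startup and checkpoint_save()
every N iterations. Dumping the raw struct is only safe when the same
binary on the same machine reads it back; a portable format would need
more work.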

Another example is text editors. Most editors I've used over the past
several years (Emacs of several flavors, vi, EDT) provided some sort of
checkpoint or playback facility. (EDT's playback was fun to watch.)

As to the second point (canned software packages), checkpoint/restart
capabilities should be treated as a competitive advantage of one package
over another. If your vendor(s) don't provide such facilities, and you need
them, lean on them.  If there's a vendor that does, factor that into your
evaluation. They won't provide it until they realize you need it. The best
way to get them to realize it is with your pocketbook.

--
Skip (montanaro@crdgw1.ge.com)

stripes@eng.umd.edu (Joshua Osborne) (08/19/90)

In article <17543@ucsd.Edu> gkn@ucsd.Edu writes:
[...]
>I think it's a bit unfair for every user of a system to have to
>invent a way to do this specific to their particular application.
>In many cases it may not be possible (the above "canned software"
>problem being an example).

Yes it is.  That's why the people who write the application should do it.
If the OS comes with a package that can do a large part of the work for the
application then the writer will be more likely to do it, but there is no
way the OS can do it.  For example, take a program that runs on jolt,
does lots of number crunching, sometimes feeds numbers to coke, and sometimes
gets numbers from pepsi.  How could any program that exists only on jolt
handle this?  It has to get coke & pepsi (which may not run the same Unix,
or may not even run Unix) to save the state of whatever the process on
jolt is talking to.  Not very possible.

>I agree that adding this capability to many varieties of Unix may
>require much skull sweat, especially to get it right.  But in the
>environment here at SDSC (and in other places) checkpointing is a
>remarkably useful feature.

No, not skull sweat.  Impossible.
Not 100% impossible, but 10% impossible.  Things that talk to the network
are impossible for the OS to save.  Things that talk to other processes are
hard to save.
-- 
           stripes@eng.umd.edu          "Security for Unix is like
      Josh_Osborne@Real_World,The          Mutitasking for MS-DOS"
      "The dyslexic porgramer"                  - Kevin Lockwood
"Is that a shell script?"                                 - David J. MacKenzie
"Yeah, kinda sticks out like a sore thumb in the middle of a kernel" - K. Lidl

gwyn@smoke.BRL.MIL (Doug Gwyn) (08/19/90)

In article <17543@ucsd.Edu> gkn@ucsd.Edu (Gerard K. Newman) writes:
>I think it's a bit unfair for every user of a system to have to
>invent a way to do this specific to their particular application.

I didn't say that every user needed to do this.  However, every
developer of long-running applications, who had BETTER be computer
literate, should consider such a feature.

>I agree that adding this capability to many varieties of Unix may
>require much skull sweat, especially to get it right.

It is utterly impossible to "get it right" in many cases.  Our Crays
also provide checkpointing, and often we find applications cannot be
properly restarted.  This is not Cray's fault, either, but is inherent
in the rich environment that a UNIX process may be interacting with,
some of which simply cannot be accurately reproduced at a later time.

mwm@decwrl.dec.com (Mike) (08/21/90)

>> >Why does this line come to mind: "Do the easy 90% and give it to the
>> >users; do the hard 10% only if they then ask for it."
>> 
>> Why does the thought come to mind "anyone whose application requires
>> only a 90% chance of executing successfully shouldn't be using the
>> computer at all"?

There's a bad assumption there - that a 90% restart facility
automatically means that any given process will automatically restart
only 90% of the time. Try a more realistic assumption - 90% of the
processes on the system don't use any facilities that would break
restarting. That means an applications programmer only needs to ensure
that the application in question never uses the 10% of the Unix
facilities that don't work under restart. And if that 10% includes
something critical for a lot of people - they can ask for it, and the
hard part can be done.

>> Other operating systems do not have the rich process environment
>> that UNIX provides.  If there are only a small number of things that
>> need to be straightened out in a batch-processing environment, then
>> system-provided checkpointing is feasible.

And I remember people bragging about how cheap and small Unix
processes were. How things have changed.

	<mike

gwyn@smoke.BRL.MIL (Doug Gwyn) (08/21/90)

In article <24229@adm.BRL.MIL> mwm@decwrl.dec.com (Mike) writes:
>And I remember people bragging about how cheap and small Unix
>processes were. How things have changed.

One thing never seems to change, and that is people bringing totally
irrelevant comments into these discussions.  The discussion to that
point did not involve cheapness/smallness of processes or the converse.

mike@BRL.MIL ( Mike Muuss) (08/21/90)

>> And I remember people bragging about how cheap and small Unix
>> processes were. How things have changed.

UNIX processes still are pretty cheap, compared to more "traditional"
operating systems (like OS/360).  The real source of difficulty
in checkpoint/restart comes from interfaces to "stateful" resources,
like:

*)  Tape drives.  Need to get the right reel back, in the right position.
And hope that no other application or user has modified the tape
in the interval between checkpoint and restart.

*)  Terminals.  All the terminal modes should be saved and restored.
What about other processes that might have come along in the meantime
and started using the terminal, on restart?

*)  Network connections.  The system can't keep the connection open
while it's down.  In general, it is not possible for the operating system
to know how to restore the state of a network connection.  Even saving
the entire output stream and re-sending is not likely to have the
right result.

*)  Temporary files.  If the process depends on files in /tmp (which
may or may not be open at the instant that the checkpoint is taken),
and the system has a policy of clearing /tmp on reboot, then trouble
will result.

Therefore, I assert that it is the state of the I/O system, not the state
of the UNIX processes, that is hard to checkpoint.  Indeed, it is trivial
to checkpoint file pointers, PID's, and other aspects of the *process*
state.  It isn't too hard to make sure that files have not changed
between checkpoint and restart times.
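
One inexpensive way to make that check is to record a few stat(2) fields
for each file at checkpoint time and compare them at restart. The sketch
below assumes that device, inode, size, and modification time are a good
enough fingerprint; struct file_stamp, stamp_record() and
stamp_unchanged() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical fingerprint of a file as seen at checkpoint time. */
struct file_stamp {
    dev_t  dev;     /* device holding the file */
    ino_t  ino;     /* inode number */
    off_t  size;    /* length in bytes */
    time_t mtime;   /* last modification time */
};

/* Record the fingerprint of `path`.  Returns 0 on success, -1 on error. */
int stamp_record(const char *path, struct file_stamp *out)
{
    struct stat st;

    assert(path != NULL && out != NULL);
    if (stat(path, &st) != 0)
        return -1;
    out->dev   = st.st_dev;
    out->ino   = st.st_ino;
    out->size  = st.st_size;
    out->mtime = st.st_mtime;
    return 0;
}

/* Compare `path` as it is now against a saved fingerprint.
 * Returns 1 if it looks unchanged, 0 if it differs, -1 if it cannot be
 * examined (for example because it has been removed). */
int stamp_unchanged(const char *path, const struct file_stamp *saved)
{
    struct file_stamp now;

    assert(saved != NULL);
    if (stamp_record(path, &now) != 0)
        return -1;
    return now.dev == saved->dev
        && now.ino == saved->ino
        && now.size == saved->size
        && now.mtime == saved->mtime;
}
```

This misses an edit that keeps the size the same within one tick of the
modification clock; a checksum of the contents would be stricter, at the
cost of reading every file.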

So, please don't bash the UNIX Process concept.  Checkpoint/restart
in any non-trivial I/O environment is *hard*.

Cray Research has been rather successful in implementing checkpoint/
restart in their UNICOS version of UNIX.  I believe that they have
reported on this work, but offhand I don't have any references.

	Best,
	 -Mike Muuss

bzs@world.std.com (Barry Shein) (08/22/90)

TOPS-20 made this sort of thing trivial via the SAVE command. It just
rolled all of your current foreground processes' virtual memory into a
file. Kinda like a core dump, but re-executable. Actually, the
foreground processes' virtual memory was always just kind of there,
sort of like being able to TSTP a process and then adb (ahem, DDT) it.
Not horribly different than adb (et al) defaulting to "core", tho I
think you could continue stepping a stopped job (CMS also had that
virtual memory quality, certainly before TOPS-20, but I don't remember
any easy way to save it to a file and restart it.)

TOPS-20 would issue an interrupt (signal) when the program was
restarted which could be trapped to re-init anything you wanted,
again, not that different from SIGCONT, but across a checkpoint.

*BUT*, it was surely fraught with all the problems mentioned for Unix,
nothing magic, the process had to be able to reinit itself when it got
a restart interrupt, and hope that nothing in the external state had
changed much.
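
On the Unix side, the SIGCONT analogy can be sketched as below: the
handler only sets a flag, and the main loop notices it and re-initializes
whatever external state it depends on. This illustrates the pattern only;
it is not how TOPS-20 or any checkpoint facility delivers its restart
interrupt, and install_resume_handler() and consume_resume_event() are
hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler, cleared by the main loop. */
static volatile sig_atomic_t need_reinit = 0;

static void on_resume(int signo)
{
    (void)signo;
    need_reinit = 1;    /* only set a flag; real work happens outside the handler */
}

/* Arrange for SIGCONT to be noticed.  Returns 0 on success, -1 on error. */
int install_resume_handler(void)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_resume;
    rc = sigemptyset(&sa.sa_mask);
    assert(rc == 0);                /* cannot fail for a valid pointer */
    (void)rc;
    sa.sa_flags = SA_RESTART;       /* restart interrupted system calls */
    return sigaction(SIGCONT, &sa, NULL);
}

/* Called from the main loop.  Returns 1 once after each resume, else 0;
 * on 1 the caller should reopen or revalidate its external resources. */
int consume_resume_event(void)
{
    if (!need_reinit)
        return 0;
    need_reinit = 0;
    return 1;
}
```

As with the TOPS-20 interrupt, the handler tells the program only that
time has passed; deciding what must be re-initialized is still up to the
application.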

So experience bears out what people are trying to say.

Some of the problems with checkpoint/restart are probably also
potential problems with SIGTSTP'd jobs (try seeing how long you can ^Z
a local uucico process and still continue where you left off.)

Another concern is that it seems to me that once TOPS-20 had a SAVE
facility it tended to get in the way of other design decisions. The
question "why doesn't TOPS-20 do this" was sometimes
answered with "if they did that then SAVE couldn't work right." I seem
to remember this coming up in some peculiarities with the RESCAN
buffer design (sort of like Unix's argv/argc), or maybe it was just
that it never worked quite right on restarted jobs.

That's the real design problem: it has the potential of becoming an
enormous, draconian tail wagging a quite harried dog if the OS should
promise to do this. I vote for the library routine and applications
being responsible.

(History buffs, earn points for valuable prizes! Didn't OS/MVT do this
kind of cold/warm reboot, where warm reboots, when possible, just
continued everything other than perhaps the job active when the system
crashed?)
-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

CES00661%UDELVM@pucc.princeton.edu ( Bob Rahe) (08/22/90)

   Barry Shein hits the nail right on the proverbial head about the tail
wagging the dog - the checkpoint restart code becoming the overwhelming
decision maker in the system.

  I don't know about the historical question ala OS/MVT doing this warm
restart but Burroughs (now Unisys) did this stuff back in the '70s with the
B7700 class mainframes.  The demo was to have a 2x2 system (2 procs and 2 io
procs) and walk up to a running system and pull a proc card out (!).  It would
mostly do the right thing.  It also would, about 75% of the time, restart after
a power failure with only 'currently running on the proc' processes not coming
back, and even those would sometimes work.  They seemed to abandon this as the
MCP got more elaborate for the same reason TOPS dropped it - it seemed all
kinds of nice new features couldn't be made to work if they had to be restart-
able.  Impressive when it worked tho - for the '70s anyway.


      Bob

lars@spectrum.CMC.COM (Lars Poulsen) (08/23/90)

In article <24239@adm.BRL.MIL> mike@BRL.MIL ( Mike Muuss) writes:
> Checkpoint/restart in any non-trivial I/O environment is *hard*.

True, indeed. It is fairly instructive to look at how (other?)
commercial operating systems have dealt with this issue. As could be
expected, there is a wide variety of checkpoint/restart implementations.

The earliest checkpoint/restart implementations in the days of
single-user machines were just memory dumps, with tape drive
repositioning and a way to notify the application that it had been
restarted. IBM70{4,9,40,90,94} type stuff. CDC3600 SCOPE.

When direct-access storage came along, it was originally small, and used
for temporary files; so it was copied to the checkpoint tape. The
checkpoint system that I know best - UNIVAC 1100 EXEC-8 - is of this
type. A checkpoint file is usually a tape file, containing a memory
image, all spool files (input and output) and all temporary files. File
pointers are not an issue, since all permanent disk files are direct
access files (the read/write calls have a file position in them) so
"file pointers" live in user space.

Even so, the checkpoints were complex enough that my installation (an
academic computing center) disabled the checkpoint facility since
ill-structured checkpoint restarts often crashed the system. (How about
restarting from a checkpoint taken on a different system - or before
last week's sysgen?)

Interestingly enough, EXEC-8 retrograded in later releases to provide a
lesser checkpoint (memory image only) known as a "partial checkpoint" as
a cheaper and safer alternative.

> ...  The real source of difficulty in checkpoint/restart comes from
>interfaces to "stateful" resources, like:

Yes, there is a TON of state information to be preserved. For all but
trivial tasks, this involves many megabytes of file space.
>
>*)  Tape drives.  Need to get the right reel back, in the right position.

Easy, compared to the other stuff.

>*)  Terminals.  All the terminal modes should be saved and restored.
>What about other processes that might have come along in the meantime
>and started using the terminal, on restart?

Indeed, the semantics of shared terminal devices are a great source of
implementation problems. This is probably a misfeature.

>*)  Network connections.  The system can't keep the connection ... 

Agreed. Other than the controlling terminal, network connections should
be banned. And the controlling terminal should be a disconnectable
virtual terminal like VMS' VTAxxx: device.

>*)  Temporary files.  If the process depends on files in /tmp ...

The biggest problem here, is that UNIX does not know the concept of
temporary files. A _real_ temporary file is what you have after
	fd = creat("/tmp/xxxx" ...
	unlink("/tmp/xxxx");
But UNIX would have no way of restoring such a beast, I think.
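
A fuller sketch of that idiom shows why there is nothing to restore from.
It swaps in mkstemp() for the creat() call above so that the name is
unique; open_nameless_temp() and link_count() are hypothetical helper
names. Once the name is unlinked, the data is reachable only through the
descriptor, so a restart has no path to reopen.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a scratch file and immediately remove its name.
 * `name_template` must be a writable string ending in "XXXXXX";
 * mkstemp() overwrites it with the name it chose.
 * Returns an open descriptor, or -1 on error. */
int open_nameless_temp(char *name_template)
{
    int fd;

    assert(name_template != NULL);
    fd = mkstemp(name_template);
    if (fd < 0)
        return -1;
    if (unlink(name_template) != 0) {   /* the file lives on until the last close */
        close(fd);
        return -1;
    }
    return fd;
}

/* Number of directory entries still naming the open file, or -1 on error.
 * After a successful unlink this is normally 0. */
long link_count(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}
```

A checkpoint facility that wanted to preserve such a file would have to
copy its contents into the checkpoint itself, since the file disappears
when the process exits or the system goes down.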

>Therefore, I assert that it is the state of the I/O system, not the state
>of the UNIX processes, that is hard to checkpoint.  Indeed, it is trivial
>to checkpoint file pointers, PID's, and other aspects of the *process*
>state.  It isn't too hard to make sure that files have not changed
>between checkpoint and restart times.

But in many cases you DO want to change the file. Sometimes the failure
you are recovering from was caused by bad data in a permanent file. You
want to be able to fix the bad record and then restart from the last
checkpoint before that record was seen.

The biggest can of worms has not even been touched upon here: what about
the state of a large DBMS that the checkpointed process may be
accessing? Do you want to restore it to the state when the checkpoint
was taken, thus backing out all updates since the large job failed?
When the job failed, were all transactions performed by the job backed
out? If so, the before-and-after-looks need to be part of the
checkpoint so they can be re-installed. What if those records have been
updated since the checkpoint?

The biggest jobs, which need checkpoints the most, provide the biggest
cans of worms.
-- 
/ Lars Poulsen, SMTS Software Engineer
  CMC Rockwell  lars@CMC.COM

forsyth@minster.york.ac.uk (08/24/90)

Checkpoint/restart facilities in older operating systems were
once suggested to me as a good place to start
if you wanted to break into the system
(only as a demonstration, of course).  They were indeed.
Some systems saved a lot of system state,
yet were careless about checking it when restarting the job.
The data might well be saved in a file (on disc or tape)
that could then be modified in helpful ways.

For instance, one operating system saved the equivalent
of the UNIX u area, and also the equivalent of the inode for each
file.  It was not too much work to patch the base/extent pointers for a file
to point to the part of the disc where the user names and passwords
were kept (unencrypted).  A program would open a file
and checkpoint itself; on restart, it would read the file
and print it out in an attractive form.

The information saved by some systems was so low-level
it is hard to see how it could be checked.
I suppose the information might have been encrypted instead.

How many of the systems that provided checkpoints
also gave security problems (if only initially)?
For instance, Lars Poulsen mentions that the UNIVAC EXEC system would
crash if the checkpoint file was `ill-structured'.
That suggests a lack of checking, which could be turned
to a villain's advantage.