[comp.unix.cray] checkpoint/restart

fay@ksr.UUCP (Peter Fay) (05/08/90)

Does anyone out there have any experience using checkpoint/restart as a user
or coding it in the O.S.? I know Unicos claims to have that facility and have
read the Unicos manual, but I know little about how it or any other Unix
checkpoint is implemented.

What are the issues that come up when putting this into the O.S. (esp. Unix)?
What capabilities do users really want and need?
Does checkpoint save only files that are currently opened or all that have
been accessed in the program to date?
What is a Unicos process "category" and how does that relate to pid and
chkpnt?
Is there a runtime environment that does all the chkpnt/restart for the
applications or is it a roll-your-own situation?

-pete fay
{harvard.harvard.edu!ksr!fay}

purdon@athena.mit.edu (James R Purdon) (05/09/90)

In article <646@ksr.UUCP> fay@ksr.UUCP (Peter Fay) writes:
>Does anyone out there have any experience using checkpoint/restart as a user
>or coding it in the O.S.? I know Unicos claims to have that facility and have
>read the Unicos manual, but I know little about how it or any other Unix
>checkpoint is implemented.

I've had experience using the checkpoint facility, both at the command level
and the subroutine level.  There can be problems if files required by the
checkpointed job are changed or if the checkpointed job does not own all
of its files (in the case of NQS jobs), but within these limits it appears to
be a reliable utility (warning: my view is biased).

>What are the issues that come up when putting this into the O.S. (esp. Unix)?

I don't know the answer to this, but my guess is that you have to worry about
saving an image of memory in a file in such a way that you can load it back
into memory and start right where you left off.
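
To make the save-and-resume idea concrete, here is a minimal application-level
sketch in Python.  It is not how UNICOS does it: a kernel facility would save
the whole process image, while this only saves a state dictionary the program
chooses to keep.  The names save_checkpoint, load_checkpoint, and run are
hypothetical, and stop_at exists only to simulate an interruption.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write state to path atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash never leaves a
    # half-written checkpoint under the real name.
    os.replace(tmp, path)

def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def run(path, limit, stop_at=None):
    """Sum 0..limit-1, checkpointing after every step.

    stop_at (hypothetical, for illustration) simulates a crash at that step.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < limit:
        if stop_at is not None and state["i"] == stop_at:
            return None  # simulated interruption
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling run() a second time with the same path picks up from the last saved
step instead of starting over, which is the "start right where you left off"
behavior in miniature.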

>What capabilities do users really want and need?

Users want a seamless, transparent, painless facility.

>Does checkpoint save only files that are currently opened or all that have
>been accessed in the program to date?

It does not even save files - just the buffers and positional pointers.  If
your files change after the checkpoint, the checkpoint can no longer be
restarted.  This does not seem to be a problem for most of the user programs
I've encountered.
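
The consequence of saving positions rather than file contents can be sketched
at user level in Python.  This is an illustration under assumptions, not the
UNICOS mechanism: the function names are hypothetical, and using size plus
modification time to detect a changed file is my own choice of check.

```python
import os

def checkpoint_file(f):
    """Record the position in an open file plus a fingerprint of the file.

    Only metadata is kept; the file's contents are not copied.
    """
    f.flush()
    st = os.fstat(f.fileno())
    return {
        "path": f.name,
        "offset": f.tell(),
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
    }

def restart_file(record, mode="rb"):
    """Reopen the file at the saved offset, refusing if it has changed."""
    st = os.stat(record["path"])
    if (st.st_size, st.st_mtime_ns) != (record["size"], record["mtime_ns"]):
        raise RuntimeError("file changed since checkpoint: " + record["path"])
    f = open(record["path"], mode)
    f.seek(record["offset"])
    return f
```

Because the record holds only a path and an offset, a restart is meaningful
only if the file on disk is still the one that was open at checkpoint time;
that is why the sketch refuses to continue when the fingerprint differs.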

>What is a Unicos process "category" and how does that relate to pid and
>chkpnt?

I'm not sure what you mean by this.  In addition to pids, UNICOS supports
"jobs" - the login process and its children, which share a jid.  Either pids
or jids may be used in checkpoint requests.

>Is there a runtime environment that does all the chkpnt/restart for the
>applications or is it a roll-your-own situation?

The NQS batch system can issue its own checkpoint commands, but the user
can also issue checkpoint requests (which the NQS batch system will honor).
In this situation, restart occurs only after a crash and is done automatically
by NQS.

In the interactive environment, the users are on their own.  This includes
processes placed into the background by nohup, at, and & (but not NQS
jobs).  Checkpoint and restart requests may be issued at will, and are
available at both the command line and subroutine level.

While you certainly could automatically checkpoint interactive processes
from some central process, I wouldn't recommend it for obvious reasons.

Jim

--
James Purdon
purdon@cons1.mit.edu