fay@ksr.UUCP (Peter Fay) (05/08/90)
Does anyone out there have any experience using checkpoint/restart as a user or coding it in the O.S.? I know Unicos claims to have that facility and have read the Unicos manual, but I know little about how it or any other Unix checkpoint is implemented.

What are the issues that come up when putting this into the O.S. (esp. Unix)?

What capabilities do users really want and need?

Does checkpoint save only files that are currently opened or all that have been accessed in the program to date?

What is a Unicos process "category" and how does that relate to pid and chkpnt?

Is there a runtime environment that does all the chkpnt/restart for the applications or is it a roll-your-own situation?

-pete fay
{harvard.harvard.edu!ksr!fay}
purdon@athena.mit.edu (James R Purdon) (05/09/90)
In article <646@ksr.UUCP> fay@ksr.UUCP (Peter Fay) writes:
>Does anyone out there have any experience using checkpoint/restart as a user
>or coding it in the O.S.? I know Unicos claims to have that facility and have
>read the Unicos manual, but I know little about how it or any other Unix
>checkpoint is implemented.

I've had experience using the checkpoint facility, both at the command level and the subroutine level. There can be problems if files required by the checkpointed job are changed, or if the checkpointed job does not own all of its files (in the case of NQS jobs), but within these limits it appears to be a reliable utility (warning: my view is biased).

>What are the issues that come up when putting this into the O.S. (esp. Unix)?

I don't know the answer to this, but my guess is that you have to worry about saving an image of memory in a file in such a way that you can load it back into memory and start right where you left off.

>What capabilities do users really want and need?

Users want a seamless, transparent, painless facility.

>Does checkpoint save only files that are currently opened or all that have
>been accessed in the program to date?

It does not even save files - just the buffers and positional pointers. If your files change after the checkpoint, the checkpoint can no longer be restarted. This does not seem to be a problem for most of the user programs I've encountered.

>What is a Unicos process "category" and how does that relate to pid and
>chkpnt?

I'm not sure what you mean by this. In addition to pids, UNICOS supports "jobs": a job is the login process and its children, and it is identified by a jid. Either pids or jids may be used in checkpoint requests.

>Is there a runtime environment that does all the chkpnt/restart for the
>applications or is it a roll-your-own situation?

The NQS batch system can issue its own checkpoint commands, but the user can also issue checkpoint requests (which the NQS batch system will honor). In this situation, restart only occurs after a crash and is done automatically by NQS.
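
To make the point about files concrete: a checkpoint that records only positional pointers, and not the file contents, might look roughly like the sketch below. This is a user-level illustration in Python, not how UNICOS implements it; the function names and the JSON checkpoint format are hypothetical, and the size/modification-time comparison is only an assumed stand-in for whatever consistency check a real facility performs.

```python
import json
import os


def checkpoint_file(f, ckpt_path):
    """Record the current offset of an open file plus a fingerprint of it.

    The file's contents are NOT copied; only the position is saved.
    """
    f.flush()  # make sure pending writes are reflected in the size
    st = os.fstat(f.fileno())
    state = {
        "path": f.name,
        "offset": f.tell(),
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
    }
    with open(ckpt_path, "w") as out:
        json.dump(state, out)
    return state


def restart_file(ckpt_path, mode="rb"):
    """Reopen the checkpointed file at its saved offset.

    Refuses to restart if the file looks different from checkpoint time.
    """
    with open(ckpt_path) as src:
        state = json.load(src)
    st = os.stat(state["path"])
    if (st.st_size, st.st_mtime_ns) != (state["size"], state["mtime_ns"]):
        raise RuntimeError("file changed since checkpoint: " + state["path"])
    f = open(state["path"], mode)
    f.seek(state["offset"])
    return f
```

With this scheme a restart resumes reading exactly where the checkpoint left off, but any change to the file in between invalidates the checkpoint, which matches the limitation described above.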

In the interactive environment, the users are on their own. This includes processes placed into the background by nohup, at, and & (but not NQS jobs). Checkpoint and restart requests may be issued at will, and are available at both the command line and subroutine level. While you certainly could automatically checkpoint interactive processes from some central process, I wouldn't recommend it for obvious reasons.

Jim
--
James Purdon
purdon@cons1.mit.edu