fay@ksr.UUCP (Peter Fay) (05/08/90)
Does anyone out there have any experience using checkpoint/restart as a user
or coding it in the O.S.? I know Unicos claims to have that facility and have
read the Unicos manual, but I know little about how it or any other Unix
checkpoint is implemented.

What are the issues that come up when putting this into the O.S. (esp. Unix)?

What capabilities do users really want and need?

Does checkpoint save only files that are currently opened or all that have
been accessed in the program to date?

What is a Unicos process "category" and how does that relate to pid and
chkpnt?

Is there a runtime environment that does all the chkpnt/restart for the
applications or is it a roll-your-own situation?

-pete fay {harvard.harvard.edu!ksr!fay}
purdon@athena.mit.edu (James R Purdon) (05/09/90)
In article <646@ksr.UUCP> fay@ksr.UUCP (Peter Fay) writes:
>Does anyone out there have any experience using checkpoint/restart as a user
>or coding it in the O.S.? I know Unicos claims to have that facility and have
>read the Unicos manual, but I know little about how it or any other Unix
>checkpoint is implemented.

I've had experience using the checkpoint facility, both at the command
level and the subroutine level. There can be problems if files required by
the checkpointed job are changed or if the checkpointed job does not own all
of its files (in the case of NQS jobs), but within these limits it appears
to be a reliable utility (warning: my view is biased).

>What are the issues that come up when putting this into the O.S. (esp. Unix)?

I don't know the answer to this, but my guess is that you have to worry
about saving an image of memory in a file in such a way that you can load it
back into memory and start right where you left off.

>What capabilities do users really want and need?

Users want a seamless, transparent, painless facility.

>Does checkpoint save only files that are currently opened or all that have
>been accessed in the program to date?

It does not even save files - just the buffers and positional pointers.
If your files change, the checkpoint can no longer be restarted. This does
not seem to be a problem for most of the user programs I've encountered.

>What is a Unicos process "category" and how does that relate to pid and
>chkpnt?

I'm not sure what you mean by this. In addition to pids, UNICOS supports
"jobs": a job is the login process and its children, and it has a jid.
Either pids or jids may be used in checkpoint requests.

>Is there a runtime environment that does all the chkpnt/restart for the
>applications or is it a roll-your-own situation?

The NQS batch system can issue its own checkpoint commands, but the user
can also issue checkpoint requests (which the NQS batch system will honor).
In this situation, restart only occurs after a crash and is done
automatically by NQS.

In the interactive environment, the users are on their own. This includes
processes placed into the background by nohup, at, and & (but not NQS jobs).
Checkpoint and restart requests may be issued at will, and are available at
both the command line and subroutine level. While you certainly could
automatically checkpoint interactive processes from some central process, I
wouldn't recommend it for obvious reasons.

Jim
--
James Purdon
purdon@cons1.mit.edu
mdelany%hbapn1.prime.com@relay.cs.net ( Mark Delany) (08/16/90)
Mark Holcomb <mth@ROLF.STAT.UGA.EDU> writes:
> I've felt the need for a new tool that Sun doesn't have.
> Ever have a process that's been running for six weeks, and will
> need another week to finish when you need to make level 0 backups
> or would like to shut the computer down for a bad storm.
> I need a tool that would stop a running process and let it be
> restarted at a later date.
> I've thought of a couple of ways it might be done:

[A number of suggestions deleted]
...

A general solution would have to re-establish and re-position all open
files, sockets, message queues, pipes, semaphores, shared-memory segments,
environment variables, (add your favourite externally visible entity) to
exactly the same state as they were previously. Once you've done this, it's
a simple matter of re-constructing your memory image.

Finally, you have to hope that none of the code in your program has stashed
the PID or date away in memory somewhere as these may be different when you
next restart the prog :-)

Seriously, doing this in any substantive manner is difficult and I'm sure
it would be virtually impossible to bullet-proof it on UNIX. When confronted
with this array of problems, most people opt for individualized, per-program
solutions for those progs that run for long periods.

Mark D.
mwm@decwrl.dec.com (Mike) (08/17/90)
>> Mark Holcomb <mth@ROLF.STAT.UGA.EDU> writes:
>> > I need a tool that would stop a running process and let it be
>> > restarted at a later date.
>>
>> [problems deleted]
>> Seriously, doing this in any substantive manner is difficult and I'm sure
>> it would be virtually impossible to bullet-proof it on UNIX.

Yes - but you don't need it bullet-proofed; you just need it to work
most of the time. After all, being able to restart 90% of the time is
much better than being able to restart 0% of the time. Other OSs
provide this facility (or similar ones) in the face of these
difficulties; Unix ought to be able to.

In fact, if I recall correctly, UniCOS (the Cray SysV port) does
provide a checkpoint facility, for exactly the kind of long-running
processes that Mark was asking for it for.

Why does this line come to mind: "Do the easy 90% and give it to the
users; do the hard 10% only if they then ask for it."

	<mike
langley@ds1.scri.fsu.edu (Randolph Langley) (08/17/90)
There is a paper, "Job and Process Recovery In A UNIX-based Operating System", by Brent Kingsbury and John Kline, that talks about UNICOS's checkpointing/restarting capabilities. It is available in the Cray documentation distribution, and I would guess directly from Cray. I also note that the authors have e-mail addresses: they are brent@yafs.cray.com and jtk@hall.cray.com. rdl
dave@csd4.csd.uwm.edu (Dave Rasmussen) (08/17/90)
From article <24193@adm.BRL.MIL>, by mwm@decwrl.dec.com (Mike):
>
> In fact, if I recall correctly, UniCOS (the Cray SysV port) does
> provide a checkpoint facility, for exactly the kind of long-running
> processes that Mark was asking for it for.
>
> Why does this line come to mind: "Do the easy 90% and give it to the
> users; do the hard 10% only if they then ask for it."
>

Convex mentions this may appear in their next release 4th qtr or so as well.
--
Internet:dave@uwm.edu, Uucp:uwm!dave, Bitnet:dave%uwm.edu@INTERBIT
AT&T:414-229-5133 USnail:Dave Rasmussen-CSD,Box 413 EMS380,Milwaukee,WI 53201
gwyn@smoke.BRL.MIL (Doug Gwyn) (08/18/90)
In article <24193@adm.BRL.MIL> mwm@decwrl.dec.com (Mike) writes:
>>> > I need a tool that would stop a running process and let it be
>>> > restarted at a later date.
>>> Seriously, doing this in any substantive manner is difficult and I'm sure
>>> it would be virtually impossible to bullet-proof it on UNIX.
>Yes - but you don't need it bullet-proofed; you just need it to work
>most of the time. After all, being able to restart 90% of the time is
>much better than being able to restart 0% of the time. Other OSs
>provide this facility (or similar ones) in the face of these
>difficulties; Unix ought to be able to.

Other operating systems do not have the rich process environment
that UNIX provides. If there are only a small number of things that
need to be straightened out in a batch-processing environment, then
system-provided checkpointing is feasible.

>Why does this line come to mind: "Do the easy 90% and give it to the
>users; do the hard 10% only if they then ask for it."

Why does the thought come to mind "anyone whose application requires
only a 90% chance of executing successfully shouldn't be using the
computer at all"?

Any application that is EXPECTED to run for a long time should have
interruptibility features built into it. I did this back in 1967,
and have little sympathy for people who are too lazy to deal with it.
gkn@ucsd.Edu (Gerard K. Newman) (08/18/90)
In article <13611@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>Any application that is EXPECTED to run for a long time should have
>interruptibility features built into it. I did this back in 1967,
>and have little sympathy for people who are too lazy to deal with it.

True enough, but a minor nit: suppose I am a more-or-less non-computer
literate type, who is using some canned commercial software (pick your own
favorite package -- there are lots of them) to do some lengthy calculation.
In this case, it would be a real plus for the operating system to provide
some easy (even automatic) means for periodic checkpointing of the job
state. Such systems exist, and many have existed for quite some time.

I think it's a bit unfair for every user of a system to have to
invent a way to do this specific to their particular application.
In many cases it may not be possible (the above "canned software"
problem being an example).

I agree that adding this capability to many varieties of Unix may
require much skull sweat, especially to get it right. But in the
environment here at SDSC (and in other places) checkpointing is a
remarkably useful feature.

Cheers,
gkn
San Diego Supercomputer Center
montnaro@spyder.crd.ge.com (Skip Montanaro) (08/19/90)
In article <17543@ucsd.Edu> gkn@ucsd.Edu (Gerard K. Newman) writes:
> I think it's a bit unfair for every user of a system to have to
> invent a way to do this specific to their particular application.
> In many cases it may not be possible (the above "canned software"
> problem being an example).
I would agree with the above statements if
a) the effort of creating a programmer/user-transparent
general-purpose solution was not much more difficult than writing a
programmer/user-visible application-specific solution,
b) it was impossible (nearly so) to create application-specific
solutions to the problem, or
c) most applications actually needed it.
However, as has been discussed in this and other newsgroups off-and-on over
the past couple of years
a) it is very hard to solve the general-purpose problem, systems
like CRAY's checkpoint/restart facility, and the University of
Wisconsin's RU/Condor systems notwithstanding,
b) for most applications that need such facilities, they aren't
terribly difficult to write,
c) very few applications actually need such facilities.
Given the difficulty of adding a general solution to (various flavors of)
Unix, it is probably wiser to do it on a case-by-case basis. It is unlikely
that most of the relatively few applications that need checkpoint/restart
capabilities will need the full range of capabilities that will need to be
accounted for in a general solution.
As a common case, consider many scientific applications. They typically read
in a large data set, munch on it in an iterative manner for a long period of
time, then write out another large data set. Checkpointing an application of
this sort is pretty trivial. Just write out the intermediate state of the
computation "every so often". If it must be restarted, it can be directed to
read the checkpointed data, restarting the computation from that point.
If the application crashes during the initial input phase, no expensive
computation has been lost. There's a checkpoint facility in place during the
iterative solve phase. During the final output phase, if an error occurs
(such as a full disk, head crash, or system failure), you fall back to the
last checkpoint during the compute phase (if you can recover it from the
disk).
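A minimal sketch of the "write out the intermediate state every so often"
approach described above. Everything named here is an assumption made for
illustration: the JSON state layout, the function names, and the toy
computation (a running sum of squares) are not from the post. The one design
choice worth noting is that the checkpoint is written to a temporary file and
then renamed over the old one, so a crash in the middle of a write still
leaves the previous checkpoint to fall back on.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path via a temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.replace(tmp, path)     # atomic on POSIX: old checkpoint stays valid until here
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint has been written yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def run(path, total_steps, every=100, stop_after=None):
    """Toy iterative computation that resumes from the last checkpoint.

    stop_after is a hypothetical knob that simulates an interruption.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return None  # simulated crash: work since the last checkpoint is lost
        state["acc"] += state["step"] ** 2
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling run() a second time with the same path picks up from the last saved
step rather than from zero; only the iterations since that checkpoint are
repeated.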
Another example is text editors. Most editors I've used over the past
several years (Emacs of several flavors, vi, EDT), provided some sort of
checkpoint or playback facilities. (EDT's playback was fun to watch.)
As to the second point (canned software packages), checkpoint/restart
capabilities should be treated as a competitive advantage of one package
over another. If your vendor(s) don't provide such facilities, and you need
them, lean on them. If there's a vendor that does, factor that into your
evaluation. They won't provide it until they realize you need it. The best
way to get them to realize it is with your pocketbook.
--
Skip (montanaro@crdgw1.ge.com)
stripes@eng.umd.edu (Joshua Osborne) (08/19/90)
In article <17543@ucsd.Edu> gkn@ucsd.Edu writes:
[...]
>I think it's a bit unfair for every user of a system to have to
>invent a way to do this specific to their particular application.
>In many cases it may not be possible (the above "canned software"
>problem being an example).

Yes it is. That's why the people who write the application should do it.
If the OS comes with a package that can do a large part of the work for the
application then the writer will be more likely to do it, but there is no
way the OS can do it. For example, take a program that runs on jolt that
does lots of number crunching & sometimes feeds numbers to coke and
sometimes gets numbers from pepsi. How could any program that exists only on
jolt handle this? It has to get coke & pepsi (which may not run the same
Unix, or may not even run Unix) to save the state of whatever the process on
jolt is talking to. Not very possible.

>I agree that adding this capability to many varieties of Unix may
>require much skull sweat, especially to get it right. But in the
>environment here at SDSC (and in other places) checkpointing is a
>remarkably useful feature.

No, not skull sweat. Impossible. Not 100% impossible, but 10% impossible.
Things that talk to the network are hard for the OS to save. Things that
talk to other processes are hard to save.
--
stripes@eng.umd.edu             "Security for Unix is like
Josh_Osborne@Real_World,The      Mutitasking for MS-DOS"
"The dyslexic porgramer"                  - Kevin Lockwood
"Is that a shell script?" - David J. MacKenzie
"Yeah, kinda sticks out like a sore thumb in the middle of a kernel" - K. Lidl
gwyn@smoke.BRL.MIL (Doug Gwyn) (08/19/90)
In article <17543@ucsd.Edu> gkn@ucsd.Edu (Gerard K. Newman) writes:
>I think it's a bit unfair for every user of a system to have to
>invent a way to do this specific to their particular application.

I didn't say that every user needed to do this. However, every
developer of long-running applications, who had BETTER be computer
literate, should consider such a feature.

>I agree that adding this capability to many varieties of Unix may
>require much skull sweat, especially to get it right.

It is utterly impossible to "get it right" in many cases. Our Crays
also provide checkpointing, and often we find applications cannot be
properly restarted. This is not Cray's fault, either, but is inherent
in the rich environment that a UNIX process may be interacting with,
some of which simply cannot be accurately reproduced at a later time.
mwm@decwrl.dec.com (Mike) (08/21/90)
>> >Why does this line come to mind: "Do the easy 90% and give it to the
>> >users; do the hard 10% only if they then ask for it."
>>
>> Why does the thought come to mind "anyone whose application requires
>> only a 90% chance of executing successfully shouldn't be using the
>> computer at all"?

There's a bad assumption there - that a 90% restart facility
automatically means that any given process will automatically restart
only 90% of the time. Try a more realistic assumption - 90% of the
processes on the system don't use any facilities that would break
restarting.

That means an applications programmer only needs to ensure that the
application in question never uses the 10% of the Unix facilities that
don't work under restart. And if that 10% includes something critical
for a lot of people - they can ask for it, and the hard part can be
done.

>> Other operating systems do not have the rich process environment
>> that UNIX provides. If there are only a small number of things that
>> need to be straightened out in a batch-processing environment, then
>> system-provided checkpointing is feasible.

And I remember people bragging about how cheap and small Unix
processes were. How things have changed.

	<mike
gwyn@smoke.BRL.MIL (Doug Gwyn) (08/21/90)
In article <24229@adm.BRL.MIL> mwm@decwrl.dec.com (Mike) writes:
>And I remember people bragging about how cheap and small Unix
>processes were. How things have changed.

One thing never seems to change, and that is people bringing totally
irrelevant comments into these discussions. The discussion to that
point did not involve cheapness/smallness of processes or the converse.
mike@BRL.MIL ( Mike Muuss) (08/21/90)
>> And I remember people bragging about how cheap and small Unix
>> processes were. How things have changed.

UNIX processes still are pretty cheap, compared to more "traditional"
operating systems (like OS/360). The real source of difficulty in
checkpoint/restart comes from interfaces to "stateful" resources, like:

*) Tape drives. Need to get the right reel back, in the right position.
   And hope that no other application or user has modified the tape in
   the interval between checkpoint and restart.

*) Terminals. All the terminal modes should be saved and restored.
   What about other processes that might have come along in the meantime
   and started using the terminal, on restart?

*) Network connections. The system can't keep the connection open while
   it's down. In general, it is not possible for the operating system
   to know how to restore the state of a network connection. Even
   saving the entire output stream and re-sending is not likely to have
   the right result.

*) Temporary files. If the process depends on files in /tmp (which may
   or may not be open at the instant that the checkpoint is taken), and
   the system has a policy of clearing /tmp on reboot, then trouble will
   result.

Therefore, I assert that it is the state of the I/O system, not the state
of the UNIX processes, that is hard to checkpoint. Indeed, it is trivial
to checkpoint file pointers, PID's, and other aspects of the *process*
state. It isn't too hard to make sure that files have not changed
between checkpoint and restart times.

So, please don't bash the UNIX Process concept. Checkpoint/restart in
any non-trivial I/O environment is *hard*.

Cray Research has been rather successful in implementing
checkpoint/restart in their UNICOS version of UNIX. I believe that they
have reported on this work, but offhand I don't have any references.

	Best,
	 -Mike Muuss
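The remark that it "isn't too hard to make sure that files have not changed
between checkpoint and restart times" can be sketched roughly as follows.
This is an illustration under stated assumptions, not a description of what
UNICOS does: the record layout and function names are invented here, change
detection uses only size and modification time (a content hash would be
stricter), and the files are assumed to be open read-only in binary mode so
that tell() is a plain byte offset.

```python
import os


def snapshot_open_files(files):
    """Record path, current offset, size and mtime for each open file object."""
    records = []
    for f in files:
        st = os.fstat(f.fileno())
        records.append({
            "path": f.name,
            "offset": f.tell(),
            "size": st.st_size,
            "mtime_ns": st.st_mtime_ns,
        })
    return records


def reopen_files(records):
    """Reopen each recorded file at its saved offset.

    Raises RuntimeError if a file's size or mtime differs from the snapshot,
    i.e. the restart is refused rather than run against changed data.
    """
    reopened = []
    try:
        for rec in records:
            st = os.stat(rec["path"])
            if (st.st_size, st.st_mtime_ns) != (rec["size"], rec["mtime_ns"]):
                raise RuntimeError(rec["path"] + " changed since the checkpoint")
            f = open(rec["path"], "rb")
            f.seek(rec["offset"])
            reopened.append(f)
    except BaseException:
        for f in reopened:
            f.close()
        raise
    return reopened
```

Note what this does not cover: tapes, terminals, and network connections have
no equivalent of "stat the path and seek", which is the point of the list
above.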
bzs@world.std.com (Barry Shein) (08/22/90)
TOPS-20 made this sort of thing trivial via the SAVE command. It just
rolled all of your current foreground processes' virtual memory into a
file. Kinda like a core dump, but re-executable.

Actually, the foreground processes' virtual memory was always just
kind of there, sort of like being able to TSTP a process and then adb
(ahem, DDT) it. Not horribly different than adb (et al) defaulting to
"core", tho I think you could continue stepping a stopped job (CMS
also had that virtual memory quality, certainly before TOPS-20, but I
don't remember any easy way to save it to a file and restart it.)

TOPS-20 would issue an interrupt (signal) when the program was
restarted which could be trapped to re-init anything you wanted,
again, not that different from SIGCONT, but across a checkpoint.

*BUT*, it was surely fraught with all the problems mentioned for Unix,
nothing magic, the process had to be able to reinit itself when it got
a restart interrupt, and hope that nothing in the external state had
changed much. So experience bears out what people are trying to say.

Some of the problems with checkpoint/restart are probably also
potential problems with SIGTSTP'd jobs (try seeing how long you can ^Z
a local uucico process and still continue where you left off.)

Another concern is that it seems to me that once TOPS-20 had a SAVE
facility it tended to get in the way of other design decisions. The
question "why doesn't TOPS-20 do this" was sometimes answered with
"if they did that then SAVE couldn't work right." I seem to remember
this coming up in some peculiarities with the RESCAN buffer design
(sort of like Unix's argv/argc, or maybe it was just that it never
worked quite right on restarted jobs.)

That's the real design problem, it has the potential of becoming an
enormous, draconian tail wagging a quite harried dog if the OS should
promise to do this. I vote for the library routine and applications
being responsible.

(History buffs, earn points for valuable prizes! Didn't OS/MVT do this
kind of cold/warm reboot, where warm reboots, when possible, just
continued everything other than perhaps the job active when the system
crashed?)
--
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD
CES00661%UDELVM@pucc.princeton.edu ( Bob Rahe) (08/22/90)
Barry Shein hits the nail right on the proverbial head about the tail
wagging the dog - the checkpoint restart code becoming the overwhelming
decision maker in the system.

I don't know about the historical question ala OS/MVT doing this warm
restart but Burroughs (now Unisys) did this stuff back in the '70s with the
B7700 class mainframes. The demo was to have a 2x2 system (2 procs and 2 io
procs) and walk up to a running system and pull a proc card out (!). It
would mostly do the right thing. It also would, about 75% of the time,
restart after a power failure with only 'currently running on the proc'
processes not coming back, and even those would sometimes work.

They seemed to abandon this as the MCP got more elaborate for the same
reason TOPS dropped it - it seemed all kinds of nice new features couldn't
be made to work if they had to be restartable.

Impressive when it worked tho - for the '70s anyway.

Bob
lars@spectrum.CMC.COM (Lars Poulsen) (08/23/90)
In article <24239@adm.BRL.MIL> mike@BRL.MIL ( Mike Muuss) writes:
> Checkpoint/restart in any non-trivial I/O environment is *hard*.

True, indeed. It is fairly instructive to look at how (other?)
commercial operating systems have dealt with this issue. As could be
expected, there is a wide variety of checkpoint/restart implementations.

The earliest checkpoint/restart implementations in the days of
single-user machines were just memory dumps, with tape drive
repositioning and a way to notify the application that it had been
restarted. IBM70{4,9,40,90,94} type stuff. CDC3600 SCOPE.

When direct-access storage came along, it was originally small, and
used for temporary files; so it was copied to the checkpoint tape.
The checkpoint system that I know best - UNIVAC 1100 EXEC-8 - is of
this type. A checkpoint file is usually a tape file, containing a
memory image, all spool files (input and output) and all temporary
files. File pointers are not an issue, since all permanent disk files
are direct access files (the read/write calls have a file position in
them) so "file pointers" live in user space.

Even so, the checkpoints were complex enough that my installation (an
academic computing center) disabled the checkpoint facility since
ill-structured checkpoint restarts often crashed the system. (How about
restarting from a checkpoint taken on a different system - or before
last week's sysgen).

Interestingly enough, EXEC-8 retrograded in later releases to provide a
lesser checkpoint (memory image only) known as a "partial checkpoint"
as a cheaper and safer alternative.

> ... The real source of difficulty in checkpoint/restart comes from
>interfaces to "stateful" resources, like:

Yes, there is a TON of state information to be preserved. For all but
trivial tasks, this involves many megabytes of file space.

>
>*) Tape drives. Need to get the right reel back, in the right position.

Easy, compared to the other stuff.

>*) Terminals. All the terminal modes should be saved and restored.
>What about other processes that might have come along in the meantime
>and started using the terminal, on restart?

Indeed, the semantics of shared terminal devices are a great source of
implementation problems. This is probably a misfeature.

>*) Network connections. The system can't keep the connection ...

Agreed. Other than the controlling terminal, network connections should
be banned. And the controlling terminal should be a disconnectable
virtual terminal like VMS' VTAxxx: device.

>*) Temporary files. If the process depends on files in /tmp ...

The biggest problem here is that UNIX does not know the concept of
temporary files. A _real_ temporary file is what you have after

	fd = creat("/tmp/xxxx" ...
	unlink("/tmp/xxxx");

But unix would have no way of restoring such a beast, I think.

>Therefore, I assert that it is the state of the I/O system, not the state
>of the UNIX processes, that is hard to checkpoint. Indeed, it is trivial
>to checkpoint file pointers, PID's, and other aspects of the *process*
>state. It isn't too hard to make sure that files have not changed
>between checkpoint and restart times.

But in many cases you DO want to change the file. Sometimes the failure
you are recovering from was caused by bad data in a permanent file. You
want to be able to fix the bad record and then restart from the last
checkpoint before that record was seen.

The biggest can of worms has not even been touched upon here: What about
the state of a large DBMS that the checkpointed process may be accessing?
Do you want to restore it to the state when the checkpoint was taken,
thus backing out all updates since the large job failed? When the job
failed, were all transactions performed by the job backed out? If so,
the before-and-after-looks need to be part of the checkpoint so they can
be re-installed. What if those records have been updated since the
checkpoint?

The biggest jobs, which need checkpoints the most, provide the biggest
cans of worms.
--
/ Lars Poulsen, SMTS Software Engineer
  CMC Rockwell  lars@CMC.COM
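The creat()/unlink() idiom mentioned above can be demonstrated with a short
sketch. It assumes a POSIX system (unlinking an open file is not allowed
everywhere), and the helper name is invented for illustration. The point it
shows is that after the unlink the file has no name and a link count of zero,
yet is still fully usable through the descriptor. A checkpoint that records
only path names and offsets has nothing to re-open, so the file's contents
would have to be copied into the checkpoint image itself.

```python
import os
import tempfile


def anonymous_temp_file():
    """Create a temp file and unlink it at once; return (fd, former_path)."""
    fd, path = tempfile.mkstemp()
    os.unlink(path)  # the name is gone; the open descriptor keeps the data alive
    return fd, path


fd, old_path = anonymous_temp_file()
os.write(fd, b"scratch data")
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 100)                    # still readable via the descriptor
name_still_exists = os.path.exists(old_path)
link_count = os.fstat(fd).st_nlink         # no directory entry refers to it
os.close(fd)                               # storage is released here
```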
forsyth@minster.york.ac.uk (08/24/90)
Checkpoint/restart facilities in older operating systems were once
suggested to me as a good place to start if you wanted to break into the
system (only as a demonstration, of course). They were indeed.

Some systems saved a lot of system state, yet were careless about checking
it when restarting the job. The data might well be saved in a file (on disc
or tape) that could then be modified in helpful ways. For instance, one
operating system saved the equivalent of the UNIX u area, and also the
equivalent of the inode for each file. It was not too much work to patch
the base/extent pointers for a file to point to the part of the disc where
the user names and passwords were kept (unencrypted). A program would open
a file and checkpoint itself; on restart, it would read the file and print
it out in an attractive form.

The information saved by some systems was so low-level it is hard to see
how it could be checked. I suppose the information might have been
encrypted instead.

How many of the systems that provided checkpoints also gave security
problems (if only initially)? For instance, Lars Poulsen mentions that the
UNIVAC EXEC system would crash if the checkpoint file was `ill-structured'.
That suggests a lack of checking, which could be turned to a villain's
advantage.
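One way to address the "lack of checking" is to make the restart code refuse
any checkpoint that has been altered since it was written. The sketch below
uses a keyed hash (HMAC-SHA256) for that; this is a technique swapped in for
illustration, not something any of the systems mentioned is known to have
done, and it provides integrity only, not the encryption suggested above. It
assumes the key is held by the system and is not readable by the user who
owns the checkpoint file. The JSON state layout and function names are
hypothetical.

```python
import hashlib
import hmac
import json

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def seal_checkpoint(state, key):
    """Serialize state and prepend an HMAC tag computed with the system key."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def open_checkpoint(blob, key):
    """Verify the tag before trusting anything in the checkpoint."""
    tag, payload = blob[:TAG_LEN], blob[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("checkpoint failed integrity check; refusing to restart")
    return json.loads(payload.decode("utf-8"))
```

With this in place, patching a saved file pointer (as in the password-file
trick described above) changes the payload without a matching tag, and the
restart is rejected before the patched values are used.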