vic@cs.arizona.edu (Vicraj T. Thomas) (07/28/90)
I am looking for examples of "traditional" operating systems (i.e. centralized OSs) that allowed user processes to periodically checkpoint their state so that, in case of a failure and subsequent recovery, they could be restarted from the last checkpoint. Names of such OSs or pointers to papers that might contain this information would be greatly appreciated.
Thanks,
Vic
--
vic@cs.arizona.edu | Dept. of Computer Science
..!{uunet|noao}!arizona!vic | University of Arizona, Tucson, AZ 85721
gkn@ucsd.Edu (Gerard K. Newman) (07/29/90)
In article <5496@darkstar.ucsc.edu> vic@cs.arizona.edu (Vicraj T. Thomas) writes:
>
>I am looking for examples of "traditional" operating systems (i.e. centralized
>OSs) that allowed user processes to periodically checkpoint their state so
>that, in case of a failure and subsequent recovery, they could be restarted
>from the last checkpoint. Names of such OSs or pointers to papers that might
>contain this information would be greatly appreciated.
UNICOS, from Cray Research, is one such. CTSS (from NERSC/LLNL), though less in the mainstream, also allows this.
Cheers
gkn
San Diego Supercomputer Center
bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee) (07/31/90)
In article <5536@darkstar.ucsc.edu>, gkn@ucsd.Edu (Gerard K. Newman) writes:
|>
|> In article <5496@darkstar.ucsc.edu> vic@cs.arizona.edu (Vicraj T. Thomas) writes:
|> >
|> >I am looking for examples of "traditional" operating systems (i.e. centralized
|> >OSs) that allowed user processes to periodically checkpoint their state so
|> >that, in case of a failure and subsequent recovery, they could be restarted
|> >from the last checkpoint. ...
|>
|> UNICOS, from Cray Research is one such. Also, though less in the mainstream,
|> CTSS (from NERSC/LLNL) also allows this.
Do you mean to include just the state of the process from the point of view of the OS, or do you include any external servers with which the process has communicated via IPC? [I'm including the file system as part of the traditional OS.]
To be more concrete, if we use BSD Unix as an example, is it sufficient to save the current working directory, the installed signal handlers, the state of the various alarms, the position of all open file descriptors (and the contents of the files, I presume [expensive]), the contents of the address space of the process (presumably just the data and stack segments), and the contents of the user registers? The general case of including IPC sockets is certainly _much_ more complicated.
If what I described above satisfies your definition, then I'd claim that traditional BSD Unix can be made to perform generic checkpointing with a little bit of user code. A few years ago, I implemented a restricted form of this, which saves and restores only the address space and the registers, to allow some long-running jobs to survive reboots and crashes. With the help of my rc file, I had the system continue from the last checkpoint automatically after reboot. Extending my code to save the state of the file descriptors, file contents, etc. can easily be done by replacing the C stubs for certain syscalls, and the techniques used can easily be applied to other flavors of Unix as well.
Also, any OS that supports transaction processing could be argued to have this property....
-*-*-
Bennet S. Yee, +1 412 268-7571
School of Cucumber Science, Cranberry Melon, Pittsburgh, PA 15213-3890
Internet: bsy+@cs.cmu.edu
Uunet: ...!seismo!cs.cmu.edu!bsy+
Csnet: bsy+%cs.cmu.edu@relay.cs.net
Bitnet: bsy+%cs.cmu.edu@cmuccvma
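As a rough illustration of the OS-visible state listed in the post above, here is a minimal Python sketch that records only the current working directory and the offsets of open file descriptors, plus a small application-defined value standing in for the data segment. It is not the poster's code: the names `save_checkpoint`, `restore_checkpoint`, the `open_files` mapping, and the JSON checkpoint format are all hypothetical, and signal handlers, alarms, registers, the raw address space, and file contents are deliberately left out.

```python
import json
import os


def save_checkpoint(path, open_files, app_state):
    """Write a small, explicit snapshot of process state to `path`.

    open_files: hypothetical bookkeeping kept by the program itself,
                mapping a file name to its open descriptor.
    app_state:  any JSON-serializable value, standing in for the data
                segment (an assumption made for this sketch).
    """
    snapshot = {
        "cwd": os.getcwd(),
        # lseek with offset 0 from SEEK_CUR reports the current position
        # without moving it.
        "files": {name: os.lseek(fd, 0, os.SEEK_CUR)
                  for name, fd in open_files.items()},
        "app_state": app_state,
    }
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(snapshot, f)
        f.flush()
        os.fsync(f.fileno())
    # Renaming over the old file is atomic on POSIX, so a crash leaves
    # either the previous checkpoint or the new one, never a partial file.
    os.replace(tmp, path)


def restore_checkpoint(path):
    """Rebuild the saved state; returns (open_files, app_state)."""
    with open(path) as f:
        snapshot = json.load(f)
    os.chdir(snapshot["cwd"])
    open_files = {}
    for name, offset in snapshot["files"].items():
        # File contents are NOT restored here; the file must still exist.
        fd = os.open(name, os.O_RDWR)
        os.lseek(fd, offset, os.SEEK_SET)
        open_files[name] = fd
    return open_files, snapshot["app_state"]
```

A long-running job would call `save_checkpoint` periodically and, when started after a reboot (for example from an rc file, as the post describes), call `restore_checkpoint` first if a checkpoint file exists. File names in `open_files` are assumed to be absolute or relative to the saved working directory.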
tom@stl.stc.co.uk (Tom Thomson) (08/01/90)
In article <5496@darkstar.ucsc.edu> vic@cs.arizona.edu (Vicraj T. Thomas) writes:
>I am looking for examples of "traditional" operating systems (i.e. centralized
>OSs) that allowed user processes to periodically checkpoint their state so
>that, in case of a failure and subsequent recovery, they could be restarted
>from the last checkpoint. Names of such OSs or pointers to papers that might
>contain this information would be greatly appreciated.
Were there really any operating systems that didn't have this after the mid 60s??
The first one that did this that I used was English Electric's System 4 J-level OS. Then Multijob did too. ICT's George III had the facility; that was a bit later. ICL's VME has had it since day 1. In fact the facility is so ordinary that a modern mainframe OS without it would be remarkable.
I can't point to papers; there must have been some on those early OSs, but it's so long ago ......
Tom Thomson
pcg@cs.aber.ac.uk (Piercarlo Grandi) (08/02/90)
"vic" == Vicraj T. Thomas writes:
vic> I am looking for examples of "traditional" operating systems (i.e.
vic> centralized OSs) that allowed user processes to periodically
vic> checkpoint their state so that, in case of a failure and subsequent
vic> recovery, they could be restarted from the last checkpoint.
EXEC-8, later renamed OS 1100, the OS for the Univac (later Sperry,
later UNISYS) 1100 series of 36-bit machines. Its facility is quite
comprehensive and well thought out.
--
Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
rdavis@relay.EU.net (Ray Davis) (09/25/90)
In <5496@darkstar.ucsc.edu> vic@cs.arizona.edu (Vicraj T. Thomas) writes:
>I am looking for examples of "traditional" operating systems (i.e. centralized
>OSs) that allowed user processes to periodically checkpoint their state so
>that, in case of a failure and subsequent recovery, they could be restarted
>from the last checkpoint. Names of such OSs or pointers to papers that might
>contain this information would be greatly appreciated.
ConvexOS 9.0 supports process and process hierarchy checkpoint and restart. Mail me if you want more info.
Ray Davis
Convex Computer GmbH, Frankfurt, West Germany
rdavis@convex.com, unido!connie!rdavis, uunet!convex!rdavis, +49-69-666-8081