[comp.os.research] OSs supporting checkpointing: looking for examples.

vic@cs.arizona.edu (Vicraj T. Thomas) (07/28/90)

I am looking for examples of "traditional" operating systems (i.e. centralized
OSs) that allowed user processes to periodically checkpoint their state so
that, in case of a failure and subsequent recovery, they could be restarted
from the last checkpoint.  Names of such OSs or pointers to papers that might
contain this information would be greatly appreciated.

Thanks,

< Vic

-- 

--------
vic@cs.arizona.edu              Dept. of Computer Science
..!{uunet|noao}!arizona!vic     University of Arizona, Tucson, AZ 85721

gkn@ucsd.Edu (Gerard K. Newman) (07/29/90)

In article <5496@darkstar.ucsc.edu> vic@cs.arizona.edu (Vicraj T. Thomas) writes:
>
>I am looking for examples of "traditional" operating systems (i.e. centralized
>OSs) that allowed user processes to periodically checkpoint their state so
>that, in case of a failure and subsequent recovery, they could be restarted
>from the last checkpoint.  Names of such OSs or pointers to papers that might
>contain this information would be greatly appreciated.

UNICOS, from Cray Research, is one such.  Though less in the mainstream,
CTSS (from NERSC/LLNL) also allows this.

Cheers

gkn
San Diego Supercomputer Center

bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee) (07/31/90)

In article <5536@darkstar.ucsc.edu>, gkn@ucsd.Edu (Gerard K. Newman) writes:
|> 
|> In article <5496@darkstar.ucsc.edu> vic@cs.arizona.edu (Vicraj T. Thomas) writes:
|> >
|> >I am looking for examples of "traditional" operating systems (i.e. centralized
|> >OSs) that allowed user processes to periodically checkpoint their state so
|> >that, in case of a failure and subsequent recovery, they could be restarted
|> >from the last checkpoint. ...
|> 
|> UNICOS, from Cray Research is one such.  Also, though less in the mainstream,
|> CTSS (from NERSC/LLNL) also allows this.

Do you mean to include just the state of the process from the point of view
of the OS, or do you include any external servers with which the process
has communicated via IPC?  [I'm including the file system as part of the
traditional OS.]

To be more concrete, if we use BSD Unix as an example, is saving the current
working directory, the installed signal handlers, the state of the various
alarms, the position of all open file descriptors (and the contents of the
files, I presume [expensive]), the contents of the address space of the
process (presumably just the data and stack segment), and the contents of
the user registers sufficient?  The general case of including IPC sockets is
certainly _much_ more complicated.
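
As a rough illustration of the two simplest items in that list (the current
working directory and the positions of open files), a user-level sketch in
Python might look like the following.  The function names are hypothetical,
only files opened read-only in binary mode are assumed, and signal handlers,
alarms, file contents, registers and the address space are not handled at all.

```python
import os
import pickle
import tempfile

def capture_state(open_files):
    """Pickle the cwd and the (path, offset) of each file opened with mode 'rb'."""
    files = [(os.path.abspath(f.name), f.tell()) for f in open_files]
    return pickle.dumps({"cwd": os.getcwd(), "files": files})

def restore_state(blob):
    """Return to the saved cwd and reopen each file at its saved offset."""
    state = pickle.loads(blob)
    os.chdir(state["cwd"])
    reopened = []
    for path, offset in state["files"]:
        f = open(path, "rb")
        f.seek(offset)
        reopened.append(f)
    return reopened

def demo():
    """Read 4 bytes, checkpoint, drop the file object, restore, read the rest."""
    path = os.path.join(tempfile.mkdtemp(), "data.bin")
    with open(path, "wb") as out:
        out.write(b"0123456789")
    f = open(path, "rb")
    f.read(4)
    blob = capture_state([f])
    f.close()                      # stands in for losing the process
    (g,) = restore_state(blob)
    rest = g.read()                # continues from offset 4
    g.close()
    return rest
```

Writable files are the harder case, since the file contents at checkpoint
time would also have to be saved for the offsets to mean anything on restart.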

If what I described above satisfies your definition, then I'd claim that
traditional BSD Unix can be made to perform generic checkpointing with a
little bit of user code.  A few years ago, I implemented a restricted form
of this which saves/restores only the address space and the registers to
allow some long running jobs to survive reboots/crashes.  With the help of
my rc file, I had the system continue from the last checkpoint automatically
after reboot.  Certainly extending my code to save the state of the file
descriptors, file contents, etc. can easily be done by replacing the C stubs
for certain syscalls, and the techniques used can easily be applied to other
flavors of Unix as well.
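
The checkpoint-then-resume cycle itself can be sketched at the application
level.  This is not the address-space/register dump described above; it is a
simpler stand-in, assuming the job can name its own state explicitly (here a
loop counter and a running total).  All names, the checkpoint interval and the
simulated crash are hypothetical.

```python
import os
import pickle

def save_checkpoint(path, state):
    """Write to a temp file, then rename over the old checkpoint in one step."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # push the data to disk before the rename
    os.replace(tmp, path)          # a crash leaves the old or the new file

def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def run_job(path, n_steps, every=100, crash_at=None):
    """Sum 0..n_steps-1, checkpointing after every `every` completed steps."""
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n_steps:
        if crash_at is not None and state["i"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Running the same command again after a failure (for instance from a startup
script) picks up at the last saved step; work done since that checkpoint is
simply repeated.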

Also, any OS that supports transaction processing could be argued to have
this property....

-*-*-
Bennet S. Yee, +1 412 268-7571
School of Cucumber Science, Cranberry Melon, Pittsburgh, PA 15213-3890
Internet: bsy+@cs.cmu.edu		Uunet: ...!seismo!cs.cmu.edu!bsy+
Csnet: bsy+%cs.cmu.edu@relay.cs.net	Bitnet: bsy+%cs.cmu.edu@cmuccvma

tom@stl.stc.co.uk (Tom Thomson) (08/01/90)

In article <5496@darkstar.ucsc.edu> vic@cs.arizona.edu (Vicraj T. Thomas) writes:
>I am looking for examples of "traditional" operating systems (i.e. centralized
>OSs) that allowed user processes to periodically checkpoint their state so
>that, in case of a failure and subsequent recovery, they could be restarted
>from the last checkpoint.  Names of such OSs or pointers to papers that might
>contain this information would be greatly appreciated.

Were there really any operating systems that didn't have this after the
mid 60s??
 
The first one that did this that I used was English Electric's System 4
J-level OS.  Then Multijob did too.  ICT's George III had the facility,
that was a bit later.  ICL's VME has had it since day 1.   In fact the
facility is so ordinary that a modern mainframe OS without it would be
remarkable.
 
I can't point to papers; there must have been some on those early OSs, but
it's so long ago ......
 
Tom Thomson

pcg@cs.aber.ac.uk (Piercarlo Grandi) (08/02/90)

"vic" == Vicraj T. Thomas writes:

vic> I am looking for examples of "traditional" operating systems (i.e.
vic> centralized OSs) that allowed user processes to periodically
vic> checkpoint their state so that, in case of a failure and subsequent
vic> recovery, they could be restarted from the last checkpoint.

EXEC-8, later renamed OS 1100, the OS for the Univac (later Sperry,
later UNISYS) 1100 36-bit machines. Its facility is quite comprehensive
and well thought out.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

rdavis@relay.EU.net (Ray Davis) (09/25/90)

In <5496@darkstar.ucsc.edu> vic@cs.arizona.edu (Vicraj T. Thomas) writes:

>I am looking for examples of "traditional" operating systems (i.e. centralized
>OSs) that allowed user processes to periodically checkpoint their state so
>that, in case of a failure and subsequent recovery, they could be restarted
>from the last checkpoint.  Names of such OSs or pointers to papers that might
>contain this information would be greatly appreciated.

ConvexOS 9.0 supports process and process hierarchy checkpoint
and restart.  Mail me if you want more info.

Ray Davis
Convex Computer GmbH, Frankfurt, West Germany
rdavis@convex.com, unido!connie!rdavis, uunet!convex!rdavis, +49-69-666-8081