[comp.std.unix] Checkpointing for Unix?

breynolds@UCSD.EDU (Bill Reynolds) (04/27/91)

Submitted-by: breynolds@UCSD.EDU (Bill Reynolds)

I originally posted this to comp.unix.questions. It was then
recommended to me that I post here as well.

>Greetings,
>	We are a computational physics group running a network of Sun 
>and SGI workstations. We often have long-running jobs on many of our
>machines. This leads to problems when a machine that has a job in the
>third day of a five-day run needs to be taken down. What we would like
>is a routine to checkpoint a job to a disk file for later reloading
>into memory. I've looked at undump, but it isn't adequate; we need to
>restart the job where it was interrupted. I've also looked at condor,
>but it seems to be a fly-with-a-sledgehammer type solution. I'm
>wondering if there are any simple Unix/Sun/SGI utilities to do
>checkpointing. (I know that such facilities exist for Crays.)

I would also like to add that such a facility would have to support
Fortran and would have to be simple enough to use that someone with
only a background in scientific computing could use it (i.e., no system
calls, no calls to C routines from Fortran, etc.). It has also been
suggested that I modify the undump code. I find this a daunting
task (any takers?). (By the way, I have not actually gotten undump
working on the Sun or the SGI.)

--
_______________________________________________________________________
						|  Bill Reynolds
						|  bill@inls1.ucsd.edu

[ First of all, there is Dan Bernstein's Poor Man's Checkpointing Package, 
  posted to alt.sources (I think) a month or three ago.  Also, one of
  the POSIX subgroups specifies checkpointing, that being the main reason
  I'm posting this.  I will let others (who are likely to be more
  knowledgeable about it) comment further, if they wish. -- mod ]

Volume-Number: Volume 23, Number 47