breynolds@UCSD.EDU (Bill Reynolds) (04/27/91)
Submitted-by: breynolds@UCSD.EDU (Bill Reynolds)

I originally posted this to comp.unix.questions. It was then
recommended to me that I post here as well.

>Greetings,
>	We are a computational physics group running a network of Sun
>and SGI workstations. We often have long-running jobs on many of our
>machines. This leads to problems when a machine that has a job in the
>third day of a five-day run needs to be taken down. What we would like
>is a routine to checkpoint a job to a disk file for later reloading
>into memory. I've looked at undump, but it isn't adequate; we need to
>restart the job where it was interrupted. I've also looked at Condor,
>but it seems to be a fly-with-a-sledgehammer type solution. I'm
>wondering if there are any simple Unix/Sun/SGI utilities to do
>checkpointing. (I know that such facilities exist for Crays.)

I would also like to add that such a facility would have to support
Fortran and would have to be simple enough that someone with only a
background in scientific computing could use it (i.e. no system calls,
no calls to C routines from Fortran, etc.).

It has also been suggested that I modify the code to undump. I find
this a daunting task (any takers?). (By the way, I have not actually
gotten an undump working for the Sun or the SGI.)

--
_______________________________________________________________________
| Bill Reynolds | bill@inls1.ucsd.edu

[ First of all, there is Dan Bernstein's Poor Man's Checkpointing
Package, posted to alt.sources (I think) a month or three ago.
Also, one of the POSIX subgroups specifies checkpointing, that being
the main reason I'm posting this. I will let others (who are likely
to be more knowledgeable about it) comment further, if they wish.
-- mod ]

Volume-Number: Volume 23, Number 47