[comp.unix.questions] Checkpoints for large jobs

william@syma.sussex.ac.uk (William Craven) (08/06/90)

We have a large number of long-running background jobs on the Sequent
Symmetry at Sussex University. Over the last few weeks we have had to
reboot the system frequently. Each reboot killed the long-running jobs,
so the users had to rerun them. The repeated reruns delayed their
results and were beginning to annoy them very much.

Because of this I was wondering whether there is a system which will allow
a job to restart from where it was when last killed, either by means of
checkpointing or setjmp/longjmp. If there is such a scheme I would be
grateful for pointers.

As a side issue: can one reload a core file into a process? If so,
how?

Many thanks,

William Craven

UNIX Systems,			william@syma.sussex.ac.uk
Computing Service		+44-273-606755 ext 2970
University of Sussex
Brighton, BN1 9QJ
United Kingdom

jim@cs.strath.ac.uk (Jim Reid) (08/07/90)

In article <3193@syma.sussex.ac.uk> william@syma.sussex.ac.uk (William Craven) writes:

   I was wondering whether there is a system which will allow a job to
   restart from where it was when last killed, either by means of
   checkpointing or setjmp/longjmp. If there is such a scheme I would
   be grateful for pointers.

Yes. A Unix process has symbols end, etext and edata whose addresses
mark, respectively, the ends of the uninitialized data, text and
initialized data "segments" of its address space. All that's needed is
to write out the data space and somehow bodge a stack pointer using
setjmp/longjmp. When the process is restarted, it uses malloc to grow
the data space if needed and then reads the file containing the
dumped data. The process then has to re-open the files it had open
before the dump and finally do a longjmp to put the stack back to
a known state before resuming execution. See end(3).
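
As a rough illustration of the two ingredients above, here is a minimal
sketch, assuming a system whose linker provides the symbols documented in
end(3). The function names bss_size and jump_back_once are made up for this
example. It only inspects the segment boundaries and returns to a saved
stack context; it is not a working checkpointer.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>

/* Linker-provided symbols from end(3). Only their addresses are
   meaningful; their contents must not be read or written. */
extern char etext, edata, end;

/* Size in bytes of the uninitialized data (bss), which lies between
   the end of initialized data (edata) and the end of bss (end). */
size_t bss_size(void)
{
    return (size_t)((uintptr_t)&end - (uintptr_t)&edata);
}

static jmp_buf resume_point;

/* Save a stack context with setjmp, then come back to it with longjmp.
   Returns 2: one pass for the initial setjmp, one for the jump back. */
int jump_back_once(void)
{
    volatile int passes = 0;   /* volatile so the value survives longjmp */

    if (setjmp(resume_point) == 0) {
        passes = 1;
        longjmp(resume_point, 1);   /* stands in for the restart path */
    }
    return passes + 1;
}
```

A real restart would also have to save and restore the stack contents and
everything allocated above end, which this sketch leaves out.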

This is more or less what sendmail does to create a frozen
configuration file. On Sequents, all bets are off if the process has used
shared/private memory with lightweight processes created by m_fork(3).
The formats of executable files and core dumps are given by the man
pages for a.out and core, though these files are not nice to poke
around in.

		Jim

montnaro@spyder.crd.ge.com (Skip Montanaro) (08/11/90)

On a case-by-case basis, you may be able to modify your applications so they
will recover. For instance, if your application is an iterative solver of
some sort, you may be able to checkpoint the intermediate data periodically.
When the program is restarted, a flag can be set so the program initializes
from the intermediate solution data.
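
A minimal sketch of this kind of application-level checkpoint, assuming the
solver's whole state fits in one small struct. The names solver_state,
save_checkpoint, load_checkpoint and run_solver are hypothetical, and the
"work" is a stand-in for a real iteration.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical solver state: everything needed to resume the iteration. */
struct solver_state {
    long   iteration;
    double x[4];
};

/* Write the state to a temporary file, flush it to disk, then rename it
   over the checkpoint, so a crash mid-write leaves the previous
   checkpoint intact. Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const struct solver_state *s)
{
    char tmp[1024];
    FILE *f;
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    if (fwrite(s, sizeof *s, 1, f) != 1 || fflush(f) != 0 ||
        fsync(fileno(f)) != 0) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Fill *s and return 0 if a complete checkpoint exists, else return -1
   and leave *s untouched. */
int load_checkpoint(const char *path, struct solver_state *s)
{
    struct solver_state loaded;
    FILE *f = fopen(path, "rb");
    size_t n;

    if (f == NULL)
        return -1;
    n = fread(&loaded, sizeof loaded, 1, f);
    fclose(f);
    if (n != 1)
        return -1;
    *s = loaded;
    return 0;
}

/* Run (or resume) a toy iteration up to `total` steps, checkpointing
   every `every` steps. Returns the number of steps done in this run. */
long run_solver(const char *path, long total, long every)
{
    struct solver_state s = {0, {0.0, 0.0, 0.0, 0.0}};
    long done = 0;

    load_checkpoint(path, &s);      /* on failure s keeps its fresh values */
    while (s.iteration < total) {
        s.x[0] += 1.0;              /* stand-in for one real solver step */
        s.iteration++;
        done++;
        if (every > 0 && s.iteration % every == 0)
            save_checkpoint(path, &s);
    }
    return done;
}
```

The struct is written in the machine's own binary layout, so a checkpoint
made this way is only good on the same kind of machine and the same build of
the program.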

There was a system a few years ago (maybe 1986?) developed at the University
of Wisconsin that allowed jobs to be restarted (modulo some special I/O
situations). It was reported in a USENIX conference of that era.

Also, UNICOS on the CRAY has a checkpointing facility. You might investigate
it, and ask Sequent why they haven't got something similar.


--
Skip (montanaro@crdgw1.ge.com)

mike@cream.cs.wisc.edu (Mike Litzkow) (08/14/90)

Yes, checkpointing is one part of the Condor system (previously called RU).
Condor uses cycles on idle workstations by migrating processes to them.  When
the workstations subsequently come under use by their normal users, the Condor
jobs are checkpointed, and later moved to another idle workstation to continue
execution.

The checkpointing is accomplished by causing the process to dump core, then
combining parts of the core file with parts of the original executable.  The
software keeps track of which files have been opened and re-opens them after
return from a checkpoint.  This is accomplished by linking the user program
with special versions of "crt0.o" and "libc.a".
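
To give a flavour of the file bookkeeping, here is a minimal sketch of an
open() wrapper that remembers enough to re-open files later. It is not
Condor's code; the names tracked_open, record_offsets and reopen_all, the
table size and the path limit are all assumptions made for the example.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_TRACKED 64

struct tracked_file {
    int   in_use;
    int   fd;        /* descriptor number the program is using */
    int   flags;     /* open flags to use when re-opening */
    off_t offset;    /* file position recorded before the checkpoint */
    char  path[256];
};

static struct tracked_file table[MAX_TRACKED];

/* open() wrapper that remembers how and where the file was opened.
   Returns the descriptor, or -1 on failure. */
int tracked_open(const char *path, int flags, mode_t mode)
{
    int i, fd;

    if (strlen(path) >= sizeof table[0].path)
        return -1;
    fd = open(path, flags, mode);
    if (fd < 0)
        return -1;
    for (i = 0; i < MAX_TRACKED; i++) {
        if (!table[i].in_use) {
            table[i].in_use = 1;
            table[i].fd = fd;
            /* A later re-open must not create or truncate the file. */
            table[i].flags = flags & ~(O_CREAT | O_EXCL | O_TRUNC);
            table[i].offset = 0;
            strcpy(table[i].path, path);
            return fd;
        }
    }
    close(fd);                       /* table full */
    return -1;
}

/* Before a checkpoint: record the current offset of each tracked file. */
void record_offsets(void)
{
    int i;

    for (i = 0; i < MAX_TRACKED; i++)
        if (table[i].in_use)
            table[i].offset = lseek(table[i].fd, 0, SEEK_CUR);
}

/* After a restart: re-open each file on its old descriptor number and
   seek back to the recorded offset. Returns 0 on success, -1 on failure. */
int reopen_all(void)
{
    int i, fd;

    for (i = 0; i < MAX_TRACKED; i++) {
        struct tracked_file *t = &table[i];

        if (!t->in_use)
            continue;
        fd = open(t->path, t->flags);
        if (fd < 0)
            return -1;
        if (fd != t->fd) {
            if (dup2(fd, t->fd) < 0) {
                close(fd);
                return -1;
            }
            close(fd);
        }
        /* Offset is -1 for files that could not be seeked (e.g. pipes). */
        if (t->offset != (off_t)-1 &&
            lseek(t->fd, t->offset, SEEK_SET) == (off_t)-1)
            return -1;
    }
    return 0;
}
```

The sketch ignores close(), dup() and relative paths after a chdir(), all of
which a real replacement library would have to track as well.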

Condor is available without charge by anonymous ftp from "shorty.cs.wisc.edu"
(128.105.2.8).  Just log in as "ftp" and give your user name for a password.
Then "cd" to the condor directory and take a look at the Readme file.  You will
be instructed to fetch a compressed binary file; remember to set your ftp
to "binary" mode for that.

The checkpointing is set up so you can use it without process migration or
remote execution if that is desired.  It compiles and runs on a
Sequent Symmetry.

-- mike