mike@cream.cs.wisc.edu (Mike Litzkow) (08/16/90)
Archive-name: condor/14-Aug-90
Original-posting-by: mike@cream.cs.wisc.edu (Mike Litzkow)
Original-subject: Re: Checkpoints for large jobs
Archive-site: shorty.cs.wisc.edu [128.105.2.8]
Archive-directory: /condor
Reposted-by: emv@math.lsa.umich.edu (Edward Vielmetti)

Yes, checkpointing is one part of the Condor system (previously called RU).
Condor uses cycles on idle workstations by migrating processes to them.
When a workstation subsequently comes back into use by its normal user,
the Condor jobs running on it are checkpointed and later moved to another
idle workstation to continue execution.

The checkpointing is accomplished by causing the process to dump core,
then combining parts of the core file with parts of the original
executable. The software keeps track of which files have been opened and
re-opens them after return from a checkpoint. This is accomplished by
linking the user program with special versions of "crt0.o" and "libc.a".

Condor is available without charge by anonymous ftp from
"shorty.cs.wisc.edu" (128.105.2.8). Just log in as "ftp" and give your
user name for a password. Then "cd" to the condor directory and take a
look at the Readme file. You will be instructed to fetch a compressed
binary file; remember to have your ftp set to "binary" mode for that.

The checkpointing is set up so you can use it without process migration
or remote execution if that is desired. It compiles and runs on a
Sequent Symmetry.

-- mike
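
To illustrate the file-tracking idea mentioned in the post (remember which files are open, then re-open them after a checkpoint), here is a rough sketch. It is not Condor's code: Condor does this in C inside its special "libc.a", and the details of that are not given here. The names FileTable, snapshot, restore, and demo are made up for the example, and returning an old-to-new descriptor mapping is a simplifying assumption.

```python
import os
import tempfile


class FileTable:
    """Hypothetical table of open files, kept so they can be re-opened later."""

    def __init__(self):
        self.entries = {}  # fd -> (absolute path, open flags)

    def open(self, path, flags, mode=0o644):
        # Open the file and remember how it was opened.
        fd = os.open(path, flags, mode)
        self.entries[fd] = (os.path.abspath(path), flags)
        return fd

    def close(self, fd):
        os.close(fd)
        del self.entries[fd]

    def snapshot(self):
        # Record path, flags, and current offset for every tracked descriptor.
        return [
            (fd, path, flags, os.lseek(fd, 0, os.SEEK_CUR))
            for fd, (path, flags) in self.entries.items()
        ]

    @classmethod
    def restore(cls, snap):
        # Re-open each file and seek back to the saved offset.
        # Assumption: we hand back a mapping from old fd to new fd
        # instead of forcing the original descriptor numbers.
        table = cls()
        mapping = {}
        for old_fd, path, flags, offset in snap:
            # Drop flags that would create or truncate the file on re-open.
            safe = flags & ~(os.O_CREAT | os.O_TRUNC | os.O_EXCL)
            new_fd = os.open(path, safe)
            os.lseek(new_fd, offset, os.SEEK_SET)
            table.entries[new_fd] = (path, flags)
            mapping[old_fd] = new_fd
        return table, mapping


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.txt")
        with open(path, "wb") as f:
            f.write(b"hello, condor")

        table = FileTable()
        fd = table.open(path, os.O_RDONLY)
        first = os.read(fd, 7)       # read part of the file
        snap = table.snapshot()      # "checkpoint" the file state
        table.close(fd)              # simulate the process going away

        restored, mapping = FileTable.restore(snap)
        rest = os.read(mapping[fd], 100)  # continues where the read left off
        restored.close(mapping[fd])
        return first, rest
```

Calling demo() reads the first few bytes, takes the snapshot, closes the file, and then picks up reading from the saved offset after the restore. A real checkpoint also has to save the process's memory and registers, which this sketch does not attempt.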