[comp.arch] What you need for crash recovery

albaugh@dms.UUCP (Mike Albaugh) (02/05/91)

From article <1991Feb02.112415.6180@kithrup.COM>, by sef@kithrup.COM (Sean Eric Fagan):
> In article <2880@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>>In article <13252@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>> [...] Oh, UNIX doesn't have any automatic crash recovery.
>>Neither do NOS/BE and VM/CMS last time I tried.  But Unicos can do this.
> [... Some Cybers flush all of memory on power-fail, then suck it back and go ]
> 
> Unix doesn't have automatic crash recovery, *in general*, because there is
> no standard way to do it.  Note that SysV has a SIGPWR, but I don't think
> anyone actually uses it.  However, it shouldn't be too difficult to hook up
> a UPS to a unix box, and write a driver that gets a signal from the UPS and
> puts a copy of system memory onto disk; all you need to do, then, is add an
> option to the startup sequence to "recover," and, again, you're set.
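
A minimal sketch of the signal half of the scheme quoted above, assuming a UPS monitor that delivers SIGPWR to interested processes. The names power_fail_pending, on_power_fail and install_power_handler are made up for illustration, and the SIGUSR1 fallback is only there so the sketch builds on systems that lack SIGPWR. Actually dumping memory to disk, and the "recover" option at startup, are not shown.

```c
#include <signal.h>

/* SIGPWR exists on System V and Linux but not everywhere; the SIGUSR1
   fallback is an assumption made only so this sketch compiles elsewhere. */
#ifdef SIGPWR
#define POWER_SIGNAL SIGPWR
#else
#define POWER_SIGNAL SIGUSR1
#endif

/* Set by the handler, polled by the program's main loop (hypothetical name). */
static volatile sig_atomic_t power_fail_pending = 0;

/* Keep the handler trivial: just record that the signal arrived. */
static void on_power_fail(int signo)
{
    (void)signo;
    power_fail_pending = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int install_power_handler(void)
{
    return signal(POWER_SIGNAL, on_power_fail) == SIG_ERR ? -1 : 0;
}
```

A program would call install_power_handler() once at startup and test power_fail_pending at safe points in its main loop, saving whatever state it needs before the UPS runs out.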

	HA HA HA (did I miss a smiley?) We have mainly VMS vaxen, a few Sparc
Suns, and some SYSV '[34]86's (plus a handful of Regulus 68K boxes, don't ask).
Anyway, power failure is about the _least_ common reason for system crashes.
Going into a coma with power on is much more common on all these machines.
Saving and restoring all memory on a comatose machine gets you right back to
a comatose machine. _Real_ crash recovery involves checkpointing individual
jobs, so they can come back up once you have restored the OS to sanity (perhaps
by shooting the node that was vomiting on the ethernet). The problems of
resources like files being changed, or going away entirely, in the meantime
are far from trivial, even in the case of a power-fail induced crash. Consider
the case of failure of one branch circuit, bringing down _some_ of your systems.
Meanwhile, other machines serving disks to the "victims" chug merrily along.
And then there are those locks on the database at the far end of a T1 span....
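
To make "checkpointing individual jobs" concrete, here is a rough sketch in C, assuming a POSIX-like system where rename() replaces the target atomically. The struct job_state layout, the CKPT_MAGIC tag and the function names are all hypothetical. It saves only the job's own state; it does nothing about files that changed underneath the job or locks held on remote machines, and a real version would also need to flush the file to disk before the rename.

```c
#include <stdio.h>
#include <stdint.h>

#define CKPT_MAGIC 0x434B5054u  /* "CKPT", an arbitrary tag for this sketch */

/* Hypothetical job state: a long-running loop and its partial result. */
struct job_state {
    uint32_t magic;
    uint64_t next_iter;
    double   partial_sum;
};

/* Write the state to "<path>.tmp", then rename it over path, so that a job
   dying mid-write leaves the previous checkpoint untouched.
   Returns 0 on success, -1 on failure. */
static int checkpoint_save(const char *path, const struct job_state *st)
{
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;

    struct job_state out = *st;
    out.magic = CKPT_MAGIC;
    int ok = fwrite(&out, sizeof out, 1, f) == 1;
    if (fclose(f) != 0)
        ok = 0;
    if (!ok) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read a checkpoint back; rejects missing, short or untagged files.
   Returns 0 on success, -1 on failure (in which case *st is unchanged). */
static int checkpoint_load(const char *path, struct job_state *st)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    struct job_state in;
    int ok = fread(&in, sizeof in, 1, f) == 1 && in.magic == CKPT_MAGIC;
    fclose(f);
    if (!ok)
        return -1;
    *st = in;
    return 0;
}
```

On restart the job calls checkpoint_load() first and resumes from next_iter if it succeeds, or starts from scratch if it fails. Everything outside the job's own memory is still the hard part.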

	Key phrase: distrust simple answers to complex questions.

					Mike

| Mike Albaugh (albaugh@dms.UUCP || {...decwrl!pyramid!}weitek!dms!albaugh)
| Atari Games Corp (Arcade Games, no relation to the makers of the ST)
| 675 Sycamore Dr. Milpitas, CA 95035		voice: (408)434-1709
| The opinions expressed are my own (Boy, are they ever)

lusol@vax1.cc.lehigh.edu (02/07/91)

     Jim Giles talks about U*X job recovery after a system crash, line
disconnect, or a program error such as a divide fault.  The various replies
indicate that U*X does a poor job in these situations, although some flavors can
handle some of these situations properly.

     For instance Dik T. Winter mentions that UNICOS has automatic job
recovery after a crash, and Colin Plumb mentions undump and Mach's macho file
format for recovering an aborted job.

     But it seems there is NO CONSISTENCY in the U*X world with regard to job
recovery in general.

     There is an operating system I use that handles all of these situations
very nicely.

        1) Job recovery after a system crash
          The operating system supports active job recovery; there is no
          need to periodically write checkpoint files on a job-by-job
          basis. It even recovers after most hardware failures except a
          loss of power to the machine room.  With this feature you can
          deadstart the machine at any time for whatever reason and not
          lose your ANSYS and ADINA grinders.

        2) Job recovery after a line disconnect
          When you log in the operating system automatically displays a
          list of your detached jobs.  You either select a detached job
          or continue your current session.  There is no way another user
          can gain control of your job.

        3) Program recovery after a fault
          Run under control of the debugger and you can change the necessary
          variables and restart the job.  Simple.  No undumping and converting
          core files to executables.

     My question:  what is the situation in the U*X world with regard to these
three problems?  Which flavors can do what?  Is there any flavor that handles
these situations as nicely as CDC's NOS/VE OS (-:?

Steve

Lehigh University Computing Center
Stephen.O.Lidie@CDC1.CC.Lehigh.EDU

mmm@cup.portal.com (Mark Robert Thorson) (02/08/91)

Quoting from the Intel 80C196KB User's Guide:

"It is recommended that unused areas of code be filled with NOPs and
periodic jumps to an error routine or RST (reset chip) instructions.
This is particularly important in the code around lookup tables,
since if lookup tables are executed undesired results will occur.
Wherever space allows, each table should be surrounded by 7 NOPs
(the longest 80C196KB instruction has 7 bytes) and a RST or jump to
error routine instruction.  Since RST is a one-byte instruction,
the NOPs are not needed if RSTs are used instead of jumps to an error
routine.  This will help to insure a speedy recovery should the
processor have a glitch in the program flow."
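
The RST variant of that advice can be sketched as a host-side build step in C, run over a ROM image before it is burned. The opcode value 0xFF for RST is an assumption here and should be checked against the User's Guide; the function name and the used[] map are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed one-byte RST opcode for the MCS-96; verify against the manual. */
#define OPCODE_RST 0xFFu

/* Fill every byte of image not marked in used[] with RST.
   used[i] != 0 means byte i holds real code or table data.
   Returns the number of bytes filled. */
static size_t fill_unused_with_rst(uint8_t *image, const uint8_t *used,
                                   size_t len)
{
    size_t filled = 0;
    for (size_t i = 0; i < len; i++) {
        if (!used[i]) {
            image[i] = OPCODE_RST;
            filled++;
        }
    }
    return filled;
}
```

Because RST is a single byte, a stray jump into the filled region hits a reset on the very next fetch regardless of alignment, which is why the quoted text says the NOP padding is not needed in this case.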

Obviously some people have worse error recovery problems than others, eh?

dana@locus.com (Dana H. Myers) (02/13/91)

In article <39039@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>Quoting from the Intel 80C196KB User's Guide:
>
>"It is recommended that unused areas of code be filled with NOPs and
>periodic jumps to an error routine or RST (reset chip) instructions.

[ rest of advice regarding error recovery deleted ]

>Obviously some people have worse error recovery problems than others, eh?

   Sure. The MCS-96 family, of which the 80C196KB is a more recent member,
is intended for use as an embedded microcontroller. One use of this family
is in automotive engine control (my prototype fuel injection project uses
an 8097, for instance), and the automotive environment is very noisy.
Other applications of the MCS-96 include machine control, once again in
an electrically and mechanically harsh environment.

  The error recovery needs in these situations are particularly demanding.
Timely recovery is often required to avoid damage to the system under
control. BTW - the MCS-96 family also has a watchdog timer.
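
For anyone who has not met a watchdog timer, here is a small host-side model of the idea in C. It is not the MCS-96 register interface; struct watchdog and the wdt_* functions are invented for illustration. The control loop must "kick" the timer regularly, and if it ever fails to (because it is hung or has wandered off into the weeds) the timer expires and forces a reset.

```c
#include <stdint.h>

/* Software model of a watchdog; all names here are hypothetical. */
struct watchdog {
    uint32_t timeout_ticks;  /* ticks allowed between kicks */
    uint32_t elapsed;        /* ticks since the last kick */
    int      reset_count;    /* how many times the watchdog has fired */
};

static void wdt_init(struct watchdog *w, uint32_t timeout_ticks)
{
    w->timeout_ticks = timeout_ticks;
    w->elapsed = 0;
    w->reset_count = 0;
}

/* Called by healthy code to show it is still making progress. */
static void wdt_kick(struct watchdog *w)
{
    w->elapsed = 0;
}

/* Advance one tick; returns 1 if the watchdog fired (a simulated reset). */
static int wdt_tick(struct watchdog *w)
{
    if (++w->elapsed >= w->timeout_ticks) {
        w->elapsed = 0;
        w->reset_count++;
        return 1;
    }
    return 0;
}
```

In real hardware the tick comes from a free-running counter and the "reset" is an actual chip reset, so recovery does not depend on the runaway code cooperating.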

-- 
 * Dana H. Myers KK6JQ 		| Views expressed here are	*
 * (213) 337-5136 		| mine and do not necessarily	*
 * dana@locus.com		| reflect those of my employer	*