[comp.arch] Fault Tolerance

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/06/90)

In article <1990Feb2.035201.21073@tandem.com> jimbo@tandem (Jim Lyon) writes:
>TMR schemes
>try to shoot the insane processor before it manages to poison the
>outside world. 

Ah, TMR?? Thread Maintenance and Repair??? Test, Monitor, Recovery???

"Shooting" of course implies that there is a way for healthy machines
to do things to insane systems. So, not only does one have "are you
alive" messages, one also has "die" messages. Does Tandem try to keep
insane machines from sending this message?

>In summary, checkpointing not only allows you to survive most of your
>hardware failures, but also most of your operating system bugs, most
>of your database manager bugs, most of your communication protocol bugs,
>most of your transaction manager bugs, and even many of your application
>bugs.

I'm impressed. That's quite a long list.

>If you can't design it [or redesign it] from the start, don't use
>checkpointing.

Do you hold out any hope for automation, or for schemes that trade
off efficiency for ease of retrofit?

>So, what DO you do if you want a high-reliability Unix system? You:
	[list of what are basically cleanups]

Yes, the press reported that Tandem's Unix had fixes in some 800
places where the kernel used to just throw up its hands. Obviously, a
lot of work has been put in.

What ever happened to the Auragen Unix kernel? They did checkpointing
between process pairs, and synchronized them at intervals. (Each Unix
signal caused a synch, because it had to interrupt both processes at
exactly the same instruction.) Synchronization also involved paging
out all dirty pages: certainly an argument against the VAX, which
doesn't know who's dirty.

I believe the Auragen people also pulled some kernel functions into
server processes, where it was easier to make them survive. This
makes the various kernelization projects (such as Mach) sound ever
more attractive.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

yodaiken@freal.cs.umass.edu (victor yodaiken) (02/06/90)

In article <7840@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>
>What ever happened to the Auragen Unix kernel? They did checkpointing
>between process pairs, and synchronized them at intervals. (Each Unix
>signal caused a synch, because it had to interrupt both processes at
>exactly the same instruction.) Synchronization also involved paging
>out all dirty pages: certainly an argument against the VAX, which
>doesn't know who's dirty.
>
>I believe the Auragen people also pulled some kernel functions into
>server processes, where it was easier to make them survive. This
>makes the various kernelization projects (such as Mach) sound ever
>more attractive.
>-- 
>Don		D.C.Lindsay 	Carnegie Mellon Computer Science


The Auragen idea was very simple and, in my biased (I worked for Auragen)
opinion, is still the best plan for a fault tolerant system. Although
Auragen died dismally, the o.s. lives on in a Nixdorf machine (Nixdorf
is also reported to be in trouble; makes you wonder).
The basic idea is to force all process i/o to go through messages. A primary
process is associated with an inactive backup process on another machine.
All messages transmitted by the original process must be transmitted to
3 sites: the destination, the destination's backup, and the backup of the
transmitting process. The backup can discard the message, and just keep a
count of how many messages the primary has sent since the last checkpoint.
Every message accepted by the primary process must also have been delivered
to its backup and the backup of the sender. When a primary process dies, its
backup is re-started and whenever it sends a message the count of messages
sent by the primary is consulted. If this count is non-zero, the count is
decremented, and the message is discarded: the process is unaware of the
difference, but the o.s. knows the message was previously transmitted and
does not need to be re-sent. Whenever the process tries to read a message, 
it should have messages previously read by the primary already on its input
queue. Whenever the queues of messages get too big, or the count gets too
high, or whatever, the backup can be synced: that is, pages of the primary
can be written out to backed-up store, and the backup's count and input
message queue can be cleared.
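
The bookkeeping described above can be sketched roughly as follows. This is
only an illustrative model, not Auragen's code: the class name `Backup`, its
method names, and the toy "doubling" process in `demo` are all made up for the
example, and the sketch assumes the sent-message count is counted down by one
for each suppressed send, so that suppression stops once the backup has caught
up with the dead primary.

```python
from collections import deque


class Backup:
    """Hypothetical model of the inactive backup of one primary process."""

    def __init__(self):
        self.sent_count = 0   # messages the primary has sent since the last sync
        self.inbox = deque()  # messages the primary has accepted since the last sync

    # --- called while the primary is alive ---
    def note_primary_send(self):
        # The backup discards the message body and only counts it.
        self.sent_count += 1

    def note_primary_receive(self, msg):
        # Every message accepted by the primary is also queued here.
        self.inbox.append(msg)

    def sync(self):
        # Checkpoint: primary state is saved, so the history can be dropped.
        self.sent_count = 0
        self.inbox.clear()

    # --- called after the primary has died and the backup is running ---
    def send(self, msg, network):
        if self.sent_count > 0:
            # Assumed behaviour: the primary already sent this one, so
            # count it off and silently drop it.
            self.sent_count -= 1
            return False
        network.append(msg)
        return True

    def receive(self):
        # Replays exactly what the primary read, in the same order.
        return self.inbox.popleft()


def demo():
    backup, network = Backup(), []
    # Primary reads 1, 2, 3 and replies with doubles, but dies before
    # sending the third reply.
    for n in (1, 2, 3):
        backup.note_primary_receive(n)
    for reply in (2, 4):
        network.append(reply)
        backup.note_primary_send()
    # Backup restarts from the last checkpoint and re-executes.
    while backup.inbox:
        backup.send(2 * backup.receive(), network)
    return network, backup
```

In this toy run the outside world sees each reply once: `demo()` leaves the
network holding `[2, 4, 6]`, with the first two sends of the backup suppressed.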

We had a message bus which forced 3-or-none acking of messages, but this is
not strictly necessary. There was a recent article in the ACM SIGOPS
newsletter on how to apply the Auragen scheme to Mach. There are a lot of
complications hidden in the simplicity of this method, and I don't know how
fast it could work in a generic distributed system architecture. For example,
"time" system calls must go to a backed-up system server, i.e. must involve a
message transaction; otherwise the backup will not see the same time as the
primary, and the recovery might disintegrate. On the other hand, perhaps the
generic distributed system architecture can't run any o.s. fast.
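
The point about "time" calls can be illustrated with a small sketch. It
assumes the time reply is treated like any other incoming message, so a copy
lands on the backup's input queue; `LoggedClock` and its method names are
hypothetical and only stand in for the backed-up system server.

```python
import time
from collections import deque


class LoggedClock:
    """Hypothetical time service whose replies are also queued for the backup."""

    def __init__(self, source=time.time):
        self.source = source         # where the real time comes from
        self.backup_queue = deque()  # copies of the replies sent to the primary

    def primary_now(self):
        t = self.source()
        self.backup_queue.append(t)  # the same reply is delivered to the backup
        return t

    def backup_now(self):
        # During recovery the backup replays the primary's value instead of
        # reading its own local clock, so both executions see the same time.
        return self.backup_queue.popleft()
```

If the backup read its own clock instead, any decision that depends on the
time (a timeout, a timestamp written into a record) could differ from what the
primary did, and the replayed message counts would no longer line up.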

tim@nucleus.amd.com (Tim Olson) (02/06/90)

In article <7840@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
| In article <1990Feb2.035201.21073@tandem.com> jimbo@tandem (Jim Lyon) writes:
| >TMR schemes
| >try to shoot the insane processor before it manages to poison the
| >outside world. 
| 
| Ah, TMR?? Thread Maintenance and Repair??? Test, Monitor, Recovery???

Triple Modular Redundancy.  This is a 3-way voting scheme where each
output is generated by 3 modules and checked.  The final output is
subject to "majority rule".
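
As a rough software illustration of the voting step (real TMR voters are
hardware; the function name `tmr_vote` and its return convention are invented
for this sketch):

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return (majority_value, index_of_dissenting_module_or_None).

    Raises RuntimeError if all three module outputs differ, since there
    is then no majority to follow.
    """
    outputs = (a, b, c)
    value, votes = Counter(outputs).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: all three modules disagree")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, (dissenters[0] if dissenters else None)
```

For example, `tmr_vote(7, 9, 7)` returns `(7, 1)`: the output is 7 and module
1 is the one that was outvoted.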

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

rtrauben@cortex.Sun.COM (Richard Trauben) (02/07/90)

I am curious about the exact mechanism available to excise a bad processor
or bad processor pair once a bad processor element is detected. This is
especially important for non-TMR, say PE-pairs where only differences are
reported as a kill-my-PE-pair. 

Can anyone who has designed this explain the typical FT kill-me mechanism?

There seem to be several possible kill-me schemes:
1. reset-and-hold-me-down,
2. tristate-me-and-never-let-me-go,
3. relinquish-bus-ownership-and-stop-arbiter-from-ever-granting-me-again,
4. interrupt-me-and-vector-to-branch-to-self.

Presumably no-one is interested in dumping the state of a failed PE-pair's
write-back cache; execution would resume from the last process checkpoint.

How about resuming from the checkpoint and unintentionally resending
redundant mass-store and datacom messages? I/O caching and TCP/IP
packet sequence numbers might conceal some of these problems, but probably
not all.
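
To make the sequence-number idea concrete, here is a minimal receiver-side
sketch. It is not how any particular protocol does it: `DedupReceiver` is a
made-up name, delivery is assumed to be in order per sender, and the scheme
only helps if the restarted sender regenerates the same sequence numbers (that
is, its counter is part of the checkpointed state).

```python
class DedupReceiver:
    """Hypothetical receiver that drops messages it has already seen."""

    def __init__(self):
        self.next_seq = {}   # sender id -> next sequence number expected
        self.delivered = []  # payloads actually passed up to the application

    def accept(self, sender, seq, payload):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return False     # resent after a restart from checkpoint; drop it
        self.next_seq[sender] = seq + 1
        self.delivered.append(payload)
        return True
```

A sender that got as far as message 2, died, and resumed from a checkpoint
taken after message 0 would resend 1 and 2; the receiver above drops both and
accepts 3. Side effects that bypass such a filter (a raw disk write, say) are
not covered, which is presumably the "probably not all" case.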

Back to the voter/exciser: would the vote-tallying circuit itself be
duplicated (to stop an insane vote tallier from bringing down the
system)? Presumably redundant tally clusters are required to avoid
single points of failure and keep running.

In summary, can someone suggest a pointer into FT literature beyond
Computer Structures: Principles and Examples by Bell et al.?
This is a fascinating area.

Thanks in advance,

Richard

danh@halley.UUCP (Dan Hendrickson) (02/13/90)

In article <35@exodus.Eng.Sun.COM> rtrauben@cortex.EBay.Sun.COM (Richard Trauben) writes:
>
>In summary, can someone suggest a pointer into FT literature beyond
>Computer Structures: Principles and Examples by Bell et al.?
>This is a fascinating area.
>
>Thanks in advance,
>
>Richard

I am new to the FT world, and have found the book "Design and Analysis of
Fault Tolerant Digital Systems" by Barry Johnson to be useful in getting
up to speed.