lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/06/90)
In article <1990Feb2.035201.21073@tandem.com> jimbo@tandem (Jim Lyon) writes:
>TMR schemes
>try to shoot the insane processor before it manages to poison the
>outside world.

Ah, TMR?? Thread Maintenance and Repair??? Test, Monitor, Recovery???

"Shooting" of course implies that there is a way for healthy machines
to do things to insane systems. So, not only does one have "are you
alive" messages, one also has "die" messages. Does Tandem try to keep
insane machines from sending this message?

>In summary, checkpointing not only allows you to survive most of your
>hardware failures, but also most of your operating system bugs, most
>of your database manager bugs, most of your communication protocol bugs,
>most of your transaction manager bugs, and even many of your application
>bugs.

I'm impressed. That's quite a long list.

>If you can't design it [or redesign it] from the start, don't use
>checkpointing.

Do you hold out any hope for automation, or for schemes that trade off
efficiency for ease of retrofit?

>So, what DO you do if you want a high-reliability Unix system? You:
>[list of what are basically cleanups]

Yes, the press reported that Tandem's Unix had fixes in some 800
places where the kernel used to just throw up its hands. Obviously, a
lot of work has been put in.

What ever happened to the Auragen Unix kernel? They did checkpointing
between process pairs, and synchronized them at intervals. (Each Unix
signal caused a synch, because it had to interrupt both processes at
exactly the same instruction.) Synchronization also involved paging
out all dirty pages: certainly an argument against the VAX, which
doesn't know who's dirty.

I believe the Auragen people also pulled some kernel functions into
server processes, where it was easier to make them survive. This
makes the various kernelization projects (such as Mach) sound ever
more attractive.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science
yodaiken@freal.cs.umass.edu (victor yodaiken) (02/06/90)
In article <7840@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>
>What ever happened to the Auragen Unix kernel? They did checkpointing
>between process pairs, and synchronized them at intervals. (Each Unix
>signal caused a synch, because it had to interrupt both processes at
>exactly the same instruction.) Synchronization also involved paging
>out all dirty pages: certainly an argument against the VAX, which
>doesn't know who's dirty.
>
>I believe the Auragen people also pulled some kernel functions into
>server processes, where it was easier to make them survive. This
>makes the various kernelization projects (such as Mach) sound ever
>more attractive.
>-- 
>Don		D.C.Lindsay 	Carnegie Mellon Computer Science

The Auragen idea was very simple and, in my biased (I worked for
Auragen) opinion, is still the best plan for a fault tolerant system.
Although Auragen died dismally, the o.s. lives on in a Nixdorf machine
(Nixdorf is also reported to be in trouble, makes you wonder).

The basic idea is to force all process i/o to go through messages. A
primary process is associated with an inactive backup process on
another machine. All messages transmitted by the original process must
be transmitted to 3 sites: the destination, the destination's backup,
and the backup of the transmitting process. The backup can discard the
message, and just keep a count of how many messages the primary has
sent since the last checkpoint. Every message accepted by the primary
process must also have been delivered to its backup and the backup of
the sender.

When a primary process dies, its backup is re-started, and whenever it
sends a message the count of messages sent by the primary is
consulted. If this count is non-zero, the count is decremented, and
the message is discarded: the process is unaware of the difference,
but the o.s. knows the message was previously transmitted and does not
need to be re-sent.
Whenever the process tries to read a message, it should have messages
previously read by the primary already on its input queue. Whenever
the queues of messages get too big, or the count gets too high, or
whatever, the backup can be synced: the pages of the primary can be
written out to backed-up store, and the backup's count and input
message queue can be cleared. We had a message bus which forced
3-or-none acking of messages, but this is not strictly necessary.
There was a recent article in the ACM SIGOPS newsletter on how to
apply the Auragen scheme to Mach.

There are a lot of complications hidden in the simplicity of this
method, and I don't know how fast it could work in a generic
distributed system architecture. For example, "time" system calls
must go to a backed-up system server, i.e. must involve a message
transaction; otherwise, the backup will not see the same time as the
primary, and the recovery might disintegrate. On the other hand,
perhaps the generic distributed system architecture can't run any
o.s. fast.
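[To make the replay bookkeeping concrete, here is a minimal sketch of
the backup-side logic described above. The class and method names are
mine, not Auragen's, and real recovery obviously involves the o.s. and
message bus rather than a single-process toy.]

```python
from collections import deque

class BackupProcess:
    """Toy model of an Auragen-style backup's bookkeeping between checkpoints."""

    def __init__(self):
        self.sent_count = 0         # messages the primary has sent since last sync
        self.input_queue = deque()  # messages the primary has read since last sync

    # Normal operation: the bus delivers copies of the primary's traffic here.
    def primary_sent(self, msg):
        # The backup discards the outgoing message body; only the count matters.
        self.sent_count += 1

    def primary_received(self, msg):
        # Kept so the backup can replay the primary's reads after a failure.
        self.input_queue.append(msg)

    def checkpoint(self):
        # After the primary's pages are written to backed-up store,
        # the count and the input queue are cleared.
        self.sent_count = 0
        self.input_queue.clear()

    # Recovery: the backup re-executes from the last checkpoint.
    def send(self, msg, deliver):
        if self.sent_count > 0:
            # Duplicate of a message the primary already sent: drop it.
            self.sent_count -= 1
            return
        deliver(msg)  # past the replay point; send for real

    def receive(self):
        # Reads are satisfied from the saved queue until it is exhausted.
        return self.input_queue.popleft()
```

During recovery the re-executed process is unaware of any of this: its
first `sent_count` sends vanish silently, and its reads return exactly
what the dead primary saw.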
tim@nucleus.amd.com (Tim Olson) (02/06/90)
In article <7840@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
| In article <1990Feb2.035201.21073@tandem.com> jimbo@tandem (Jim Lyon) writes:
| >TMR schemes
| >try to shoot the insane processor before it manages to poison the
| >outside world.
| 
| Ah, TMR?? Thread Maintenance and Repair??? Test, Monitor, Recovery???

Triple Modular Redundancy. This is a 3-way voting scheme where each
output is generated by 3 modules and checked. The final output is
subject to "majority rule".

-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
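[The voting rule is simple enough to state in a few lines. A toy
sketch, not taken from any particular machine: three modules compute
the same output, and the voter forwards whichever value at least two
of them agree on.]

```python
def tmr_vote(a, b, c):
    """Return the majority of three module outputs; a lone dissenter is outvoted."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    # All three disagree: no majority exists, so the voter can only
    # signal an unrecoverable (multiple) fault.
    raise RuntimeError("no majority: multiple-module fault")

# One module goes insane; the other two outvote it.
assert tmr_vote(42, 42, 99) == 42
```

Note that TMR masks any single-module fault without needing to decide
*which* module failed; the disagreeing module is simply outvoted (and
can be flagged for excision afterwards).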
rtrauben@cortex.Sun.COM (Richard Trauben) (02/07/90)
I am curious about the exact mechanism available to excise a bad
processor or bad processor pair once a bad processor element is
detected. This is especially important for non-TMR, say PE-pairs,
where only differences are reported, as a kill-my-PE-pair. Can anyone
who has designed this explain the typical FT kill-me mechanism?

There seem to be several possible kill-me schemes:

1. reset-and-hold-me-down,
2. tristate-me-and-never-let-me-go,
3. relinquish-bus-ownership-and-stop-arbiter-from-ever-granting-me-again,
4. interrupt-me-and-vector-to-branch-to-self.

Presumably no-one is interested in dumping the state of a failed
PE-pair's write-back$; execution would resume from the last process
checkpoint. How about resuming from the checkpoint and unintentionally
resending redundant mass store and datacom messages? I/O caching and
TCP/IP packet sequence numbers might conceal some of these problems,
but probably not all.

Back to the voter/exciser: would the vote-tallying circuit itself be
duplicated? (So that an insane vote-tallier is stopped from bringing
down the system.) Presumably redundant tally-clusters are required to
stop single point failures and keep running.

In summary, can someone suggest a pointer into FT literature beyond
Computer Structures: Principles and Examples by Bell, et al.?
This is a fascinating area.

Thanks in advance,

Richard
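[The sequence-number concealment mentioned above works because the
receiver, not the sender, decides what is redundant. A minimal sketch
under my own naming, not any particular protocol's implementation:
the receiver remembers the next sequence number it expects and drops
anything older, so messages blindly re-sent after a checkpoint
rollback do no harm.]

```python
class DedupReceiver:
    """Receiver-side duplicate suppression by sequence number (toy model).

    Reordering and gaps are ignored here; a real protocol like TCP
    must also handle those.
    """

    def __init__(self):
        self.next_expected = 0
        self.accepted = []

    def deliver(self, seq, payload):
        if seq < self.next_expected:
            # A duplicate from before the rollback: drop it silently.
            return False
        self.accepted.append(payload)
        self.next_expected = seq + 1
        return True
```

A PE resuming from a checkpoint can then re-send its whole
post-checkpoint output stream; only the genuinely new tail gets
through.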
danh@halley.UUCP (Dan Hendrickson) (02/13/90)
In article <35@exodus.Eng.Sun.COM> rtrauben@cortex.EBay.Sun.COM (Richard Trauben) writes:
>
>In summary, can someone suggest a pointer into FT literature beyond
>Computer Structures: Principles and Examples by Bell, et al.?
>This is a fascinating area.
>
>Thanks in advance,
>
>Richard

I am new to the FT world, and have found the book "Design and Analysis
of Fault Tolerant Digital Systems" by Barry Johnson to be useful in
getting up to speed.