[comp.arch] Fault Tolerant Micros

dhepner@hpisod2.HP.COM (Dan Hepner) (01/18/90)

From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>Some
>machines had redundant hardware (e.g. two ALUs) and could reconfigure
>to cut failed units out of the "processor complex". I don't see
>micros following this path.

Fault-tolerant micro machines are following a path at least somewhat
like this.  The general scheme, as implemented by Stratus, Sequoia,
and Tandem (S2), involves redundant CPUs and a scheme for 'cutting
out' any offender that produces a wrong answer.

>Live CPU recovery has become much less interesting since
>multiprocessors came along. With the right software, a failed
>processor does not imply a failed process. For example, Tandem
>checkpoints each process regularly, so that a different processor can
>do a prompt checkpoint-resumption.

Tandem has apparently decided that this was not the correct model
for implementing fault tolerance, although I've searched for and not
yet found an official statement on just how the S2 does do FT.

>The CPU and IO interconnects have
>to be up to it, of course (dual port those disks).

Dual porting is a classic question in FT.  You're going to use
redundant disks, of course.  Once you have redundant disks, and
have paid attention to your interconnect scheme to ensure that
a path failure won't take down both disks, then dual porting
will not enhance FT, if FT is defined as "the ability to sustain
any single point of failure".


>And besides: if a
>master/checker pair of CPUs disagree, which one was the one that
>failed? Better to ignore them both and force the board into self test
>mode.

This scheme works fine for Stratus, but one can get to roughly
the same place by using three CPUs and tossing out any one which
disagrees with the other two.
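
To make the voting idea concrete, here's a toy three-way vote in C.
The names and the word-at-a-time granularity are mine; real machines
vote in hardware, not in a C routine, but the logic is the same:

#include <stdio.h>

/* Majority vote over three redundant results.  Returns the agreed
 * value and, via *failed, the index (0-2) of the dissenting unit,
 * or -1 if all three agree.  If all three differ there is no
 * majority, and -2 reports "no valid answer". */
static long vote3(long a, long b, long c, int *failed)
{
    if (a == b && b == c) { *failed = -1; return a; }
    if (a == b)           { *failed = 2;  return a; }
    if (a == c)           { *failed = 1;  return a; }
    if (b == c)           { *failed = 0;  return b; }
    *failed = -2;         return a;     /* triple disagreement */
}

int main(void)
{
    int failed;
    long v = vote3(42, 42, 41, &failed);    /* unit 2 is "crazy" */

    printf("voted value %ld, failed unit %d\n", v, failed);
    return 0;
}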

>Well, nonstop machines are ruggedized and rated for e.g. sudden
>overpressures (no kidding). This might influence a chip company to
>change its chip packaging, but not its chip design.

Maybe you could expand on this.

Great.  A discussion of fault tolerance.

Dan Hepner
dhepner@hpda.hp.com

rwpratt@Neon.Stanford.EDU (Robert W. Pratt) (01/19/90)

In article <13910004@hpisod2.HP.COM> dhepner@hpisod2.HP.COM (Dan Hepner) writes:
>From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>>Live CPU recovery has become much less interesting since
>>multiprocessors came along. With the right software, a failed
>>processor does not imply a failed process. For example, Tandem
>>checkpoints each process regularly, so that a different processor can
>>do a prompt checkpoint-resumption.
>
>Tandem has apparently decided that this was not the correct model
>for implementing fault tolerance, although I've searched for and not
>yet found an official statement on just how the S2 does do FT.

I think a more accurate statement would be that Tandem decided not
to do checkpointing for fault tolerance under UNIX System V.3,
since that would have called for radical (IMHO) changes to UNIX.
Guardian (Tandem's proprietary OS) still uses checkpointing.

Disclaimer:  I did not work on the S2, and the above is
exclusively my opinion, not Tandem's or Stanford's.

                       Bob P.

-- 
                             Bob Pratt
INTERnet:   pratt@jessica.stanford.edu     (much more reliable)
            pratt_robert@comm.tandem.com   (checked more, but flaky)
USMail:     2225 Sharon Rd. #323 Menlo Park, CA. 94025

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (01/19/90)

In article <13910004@hpisod2.HP.COM> dhepner@hpisod2.HP.COM
	(Dan Hepner) writes:
>From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>>Live CPU recovery has become much less interesting since
>>multiprocessors came along. With the right software, a failed
>>processor does not imply a failed process. For example, Tandem
>>checkpoints each process regularly, so that a different processor can
>>do a prompt checkpoint-resumption.
>
>Tandem has apparently decided that this was not the correct model
>for implementing fault tolerance, although I've searched for and not
>yet found an official statement on just how the S2 does do FT.

The most basic thing is to contain the damage, so it's usual to use
messages between machines that don't share memory. The next problem
is the data that was inside the failed thing. There are several ways
out:
        1- whatever it was doing is dead. (Rarely acceptable, but if
           the customer is just buying a lottery ticket, I suppose
           you can ask him to do it again.)
        2- another CPU recomputes the lost data from its copy of
           the last checkpoint, and copies of all the messages
           and signals sent to the dead machine.
        3- another CPU has been computing in parallel, and contains
           data that should be identical to the lost data.

Case 1 is an application-level choice, but still needs error
detection. Case 2 is done with software and with error detection.
Case 3 can be done like case 2, or it can be done with a master-and-
two-checkers. As you pointed out, the three chips can hold a vote,
and decide that only two of the three contain valid data.  That can
be a lot of data: with on-chip caches, and PIDs, it can also be data
from several different processes.
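
Here's a toy of case 2 in C, just to show the shape of it: a backup
holding the last checkpoint plus the messages delivered since, and
rebuilding the dead machine's state from those.  Everything here (the
names, the trivial "state") is invented for illustration:

#include <stdio.h>

#define MAXLOG 16

/* The "process" is just a running total updated by incoming
 * messages.  The backup keeps the last checkpoint of that total plus
 * every message delivered since, so it can rebuild the state that
 * died with the primary. */
struct backup {
    long checkpoint;          /* state as of the last checkpoint */
    long log[MAXLOG];         /* messages since the checkpoint   */
    int  nlog;
};

static void take_checkpoint(struct backup *b, long state)
{
    b->checkpoint = state;
    b->nlog = 0;              /* older log entries are now useless */
}

static void log_message(struct backup *b, long msg)
{
    if (b->nlog < MAXLOG)
        b->log[b->nlog++] = msg;
}

static long recover(const struct backup *b)
{
    long state = b->checkpoint;
    int i;

    for (i = 0; i < b->nlog; i++)
        state += b->log[i];   /* replay the messages */
    return state;
}

int main(void)
{
    struct backup b;
    long primary = 0;
    long msg;

    take_checkpoint(&b, primary);
    for (msg = 1; msg <= 5; msg++) {     /* messages arrive */
        primary += msg;
        log_message(&b, msg);
    }
    /* the primary dies here; the backup rebuilds its state */
    printf("lost state %ld, recovered state %ld\n",
           primary, recover(&b));
    return 0;
}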

>>Well, nonstop machines are ruggedized and rated for e.g. sudden
>>overpressures (no kidding). This might influence a chip company to
>>change its chip packaging, but not its chip design.
>Maybe you could expand on this.

I'm probably out of touch with ruggedizing, but there are companies
that specialize in it. They like things that tolerate flexing. They
like sealed spaces (to keep out salt air and conductive dust and
tobacco tar). They like to coat things: a grease layer is used on
some automotive chips. They like hermetically sealed chips, and they
used to like ceramic over plastic. They used to worry about pin
corrosion, but I don't know what they think of TAB. The system specs
can involve higher ambient temperatures, explosion (overpressure),
shock, high voltage shorted onto the rack, ground currents on the
Ethernet shield, ambient RF, industrial-strength noise on the power
lines, locks on the rack door, having to push a telex signal through
a one-henry coil.  I recall a system that failed from being too near
a gamma-ray source: but that wasn't in the spec.

(The gamma rays were just enough so that an EPROM would forget one
bit after about a month. We solved this with squared-law shielding.
That is, we pushed the desk further down the hall.)

-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

dhepner@hpisod2.HP.COM (Dan Hepner) (02/03/90)

From: mash@mips.COM (John Mashey)
>
>	a) Various people do fault-tolerance various ways.  How about
>	people who know posting some things to explain how they work,
>	and what the strengths and weaknesses of the various ways are?

While various people do FT various ways, it is reasonable to decide
to discuss a restricted set.  If the space shuttle needs five
computers, programmed by different companies using different
algorithms, that's fine, but it would appear to be beyond the scope
of how FT is best done using microprocessors.

The ways divide cleanly into two: checkpointing vs. redundantly
executing instructions with enough processors to guarantee
completion regardless of any failure.  All FT schemes should
include some means of detecting failure, of course.

"The problem" which must be solved by any FT scheme is how
to get each instruction  executed once and only once WRT
the user.  Checkpointing solves this by saving enough
process state in a place available to another processor
to restart the process in the event of original processor 
failure.  Any post checkpoint execution by the failed
processor is guaranteed to be abandoned.
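
A toy sketch of the once-and-only-once rule, in C.  The checkpoint
records the last unit of work whose results were made visible to the
user; the backup resumes from there, so committed work is never
redone and the failed processor's uncommitted work is done exactly
once.  This is my own illustration, not Tandem's mechanism:

#include <stdio.h>

/* The checkpoint records the first unit of work NOT yet committed.
 * Anything the failed processor did beyond the checkpoint was never
 * made visible, so the backup simply does it again. */
struct checkpoint {
    int next_work;
};

static void commit(struct checkpoint *cp, int work)
{
    printf("committed work unit %d\n", work);  /* visible to the user */
    cp->next_work = work + 1;
}

int main(void)
{
    struct checkpoint cp = { 0 };
    int w;

    /* Primary commits units 0-2, then fails partway through unit 3;
     * unit 3's partial results were never committed. */
    for (w = 0; w < 3; w++)
        commit(&cp, w);

    /* Backup takes over at the checkpoint: units 0-2 are not redone,
     * and units 3-4 appear exactly once as far as the user can see. */
    for (w = cp.next_work; w < 5; w++)
        commit(&cp, w);
    return 0;
}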

Redundant processor machines always execute the instruction
with more than one processor (3 for Tandem, 4 for Stratus),
and compare results.  Miscomparisons result in reliable detection
of "crazy processors", including cache, logic, or whatever.

>	b) Of particular interest in this discussion: are there features
>	in fault-tolerant OLTP systems that:

OLTP, while clearly correlated with FT, is a separate topic.  Tandem's
Guardian line does seem to assume that the best reason one might need
FT is OLTP, but other markets for FT exist as well.  Communications
applications stand out, telecom in particular.

>		a) are in UNIX
>		b) aren't in UNIX, but could be
>		c) aren't in UNIX, but would require complete rewrites
>			to get them there.

While there are a bunch of requests for UNIX enhancements from
the OLTP community, these are OLTP requests, not FT requests.

There are also a bunch of requests for UNIX enhancements from
people who advocate reliable OSs.  Laudable, no doubt, but again
not FT requests.

Ideally FT would exist completely in the hardware, and present
a platform to the OS which looks like a non-FT machine.

The reality is that this can't be quite true.  FT vendors will
always be required to supply whatever kernel support that their 
idiosyncratic implementation requires, and an OS port to such a 
machine will always be more difficult than on a non-FT platform.

However, by and large the progress of UNIX, and the related progress
of DBMS software, should proceed without undue concern for the needs
of FT.  The ball is in the other court: FT machines should be, and are
being, designed with the needs of UNIX and DBMS software in mind.


>		d) What's the tradeoff between:
>			degree of software fault-tolerancy
>		and
>			ability to run standard software, with no changes

"Software fault tolerance" implies that _someone_ must figure out
when and what to checkpoint, or that an extreme penalty will be paid
because everything is checkpointed.  How much standard software
has such checkpointing already programmed in?  None, of course.
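
For illustration, this is the sort of thing "software fault
tolerance" asks of an application: explicit checkpoint calls at
intervals the programmer has to choose.  The checkpoint() routine
below is imaginary; the point is only that standard software contains
no such calls:

#include <stdio.h>
#include <string.h>

/* The programmer must decide both WHEN to call checkpoint() and WHAT
 * state to hand it; checkpoint after every record and you pay
 * heavily, checkpoint rarely and you redo a lot of work after a
 * failure. */
struct app_state {
    long records_done;
    long running_total;
};

static void checkpoint(const struct app_state *s)
{
    /* A real system would ship this to a backup processor;
     * here it is only printed. */
    printf("checkpoint: %ld records, total %ld\n",
           s->records_done, s->running_total);
}

int main(void)
{
    struct app_state s;
    long rec;

    memset(&s, 0, sizeof s);
    for (rec = 1; rec <= 1000; rec++) {
        s.running_total += rec;
        s.records_done++;
        if (s.records_done % 100 == 0)  /* programmer-chosen interval */
            checkpoint(&s);
    }
    return 0;
}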

>Anyway, I'd observe that from the publicly-available data, it is clear that
>the 2 Tandem product lines don't really overlap very much, and are aimed
>at different markets, for different reasons, and hence, trying to read too
>much into this about the merit or lack thereof of a specific technical
>feature just doesn't make sense.

I guess we just come to different interpretations of publicly
available, and maybe ambiguous, information.  It still looks to me
like a technological generation change.  Time will certainly tell.

>the VAX meant that DEC thought PDP-11s were Wrong Things, of course I'd
>have objected.  Unless my memory fails, I thought DEC made something like
>$1B last year on PDP-11-based products... 10 years after the introduction of
>the VAX.  Although the two overlapped in some areas, they didn't at all in
>others.  

I guess what is at the core of the disagreement over what is
appropriate for discussion is this: many of the problems of moving
from older to newer technology are universal to all successful
companies, and discussing one instantiation doesn't seem directed
at the subject actually under discussion.

HP continues to sell "traditional" 16-bit HP-3000s, while having moved
to 32-bit RISC.  When they heard the HP-PA announcement, most observers
speculated that HP was moving away from the 16-bit architecture because
HP believed it could do better than before.  HP surely never concluded
that traditional HP-3000s were "wrong things", nor of course will
Tandem ever come to such a conclusion about a product which is on par
with the HP-3000 in being successful both commercially and
technically.

Obsolescence does not imply wrongness.  As shown by the PDP-11, it
doesn't imply lack of commercial success, and certainly doesn't
imply lack of support.

>As a matter of style, I believe that it is much better to carefully
>label speculations as such, and ask questions, than to make strong-sounding
>statements that can easily mislead the casual observer.

I'll accept your advice, and thank you for what is indeed a better
way of presenting the case.  But I'll point out that anyone who is
misled by a strong-sounding statement in comp.arch is a good
candidate to be sold a bridge.

>I have no desire to suppress such discussions, as the interactions of
>technology and business are extremely important to understand.

The problem is that companies, _all_ long-lived technology companies,
control technology for business reasons.  Taking company announcements
at face value is not the way to understand such interactions.

>-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Dan Hepner      

Not a statement of Hewlett-Packard Co.

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/06/90)

In article <13910010@hpisod2.HP.COM> dhepner@hpisod2.HP.COM 
	(Dan Hepner) writes:
>Ideally FT would exist completely in the hardware, and present
>a platform to the OS which looks like a non-FT machine.

I'm not so sure. How would this catch the software bugs that Tandem's
scheme does catch?

>Redundant processor machines always execute the instruction
>with more than one processor (3 for Tandem, 4 for Stratus),
>and compare results.  Miscomparisons result in reliable detection
>of "crazy processors", including cache, logic, or whatever.

There is "lockstep" redundant execution, and then there are
looser forms.

Lockstep redundancy is very simple to build, but it cannot catch
Heisenbugs - order dependencies that aren't supposed to be in the
software, but are anyway. Looser redundancy schemes declare
synchronization events at (say) a kilohertz. This is nowhere near as
clear-cut, because processes may not have been scheduled in the same
order on all machines. Both interrupts and traps become interesting
topics in such a system. And you can't expect all machines to reach
the same synchronization event at the exact same moment.
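
A crude model of the loose scheme, just to show where detection
happens: two replicas run free, and only a digest of their state is
compared at each synchronization event, so a divergence shows up at
the next event rather than on the cycle it occurred.  The numbers and
names are invented:

#include <stdio.h>

#define SYNC_INTERVAL 100    /* compare every 100 "instructions" */

/* One step of the replicated computation; the faulty replica takes a
 * single-bit hit partway through. */
static unsigned long step(unsigned long state, long i, int faulty)
{
    unsigned long next = state * 31 + (unsigned long)i;

    if (faulty && i == 250)
        next ^= 1;
    return next;
}

int main(void)
{
    unsigned long a = 0, b = 0;
    long i;

    for (i = 1; i <= 1000; i++) {
        a = step(a, i, 0);
        b = step(b, i, 1);
        if (i % SYNC_INTERVAL == 0 && a != b) {
            printf("miscompare at sync event %ld\n", i / SYNC_INTERVAL);
            return 1;
        }
    }
    printf("replicas agree\n");
    return 0;
}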

However, a loose redundancy scheme is essentially the same as a
checkpoint scheme, except for latency.  A redundant process has been
out there fighting for cycles all along.  Checkpoint systems recover
by running a shadow process forwards from the last checkpoint.  So,
the Space Shuttle uses redundancy, because they don't want The Pause
That Refreshes to happen during reentry.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

donnh@ziggy.SanDiego.NCR.COM (Donn Holtzman) (02/06/90)

In article <13910010@hpisod2.HP.COM> dhepner@hpisod2.HP.COM (Dan Hepner) writes:
>
>Ideally FT would exist completely in the hardware, and present
>a platform to the OS which looks like a non-FT machine.
>
From my perspective, one problem with a "HW only" solution to FT is
the issue of SW failures. As the gentleman from Tandem pointed out,
there is a class of faults (Heisenbugs) which are very timing and
case dependent. A loosely coupled approach, such as Tandem's, will
recover from many SW-based faults simply because the timing and load
characteristics are different on another processor. If the bug causes
your kernel to hang, TMR or pair-and-spare approaches won't succeed.
Performance is certainly an issue, but one can trade checkpointing
overhead for recovery speed (at least in the OLTP arena).

On the other hand the "HW only" approaches are easier to explain and
sell. They certainly are conceptually simpler, if not actually simpler
to implement correctly. 

>The reality is that this can't be quite true.  FT vendors will
>always be required to supply whatever kernel support that their 
>idiosyncratic implementation requires, and an OS port to such a 
>machine will always be more difficult than on a non-FT platform.
>
This is a good point. I would be surprised if Tandem didn't have to
make kernel changes to make their machine work. But in this day of
standards and narrow market windows, it was probably easier to sell
this approach to management than a large SW effort (kernel changes or
no kernel changes) to support a loosely coupled approach.

Interesting stuff.

Donn Holtzman
NCR E&M San Diego

Donn.Holtzman@SanDiego.NCR.COM

dhepner@hpisod2.HP.COM (Dan Hepner) (02/07/90)

From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>
>>Ideally FT would exist completely in the hardware, and present
>>a platform to the OS which looks like a non-FT machine.
>
>I'm not so sure. How would this catch the software bugs that Tandem's
>scheme does catch?

Tandem's checkpointing scheme catches SW bugs by deliberate SW design,
not because the scheme provides some inherent resistance to them. That
resistance has proven to be limited by its own bugs. What is as yet
unestablished is whether the schemes described for detecting SW bugs
are the best available, or, if they are, whether they couldn't be
applied equally well to a redundant-CPU implementation.

We can assume we would prefer _all_ our applications to run reliably in
the face of HW failures.  In a checkpointing system, this means
we must incorporate checkpointing logic, as compared to running them
straight off on a redundant CPU machine.  If we need to "harden" certain 
SW, making it resistant to SW bugs while already being impervious to
HW failures, we are free to do so on a redundant CPU machine.

>Lockstep redundancy is very simple to build, but it cannot catch
>Heisenbugs - order dependencies that aren't supposed to be in the
>software, but are anyway.

Lockstep redundancy, or the alternatives for that matter, is not
designed to catch SW bugs at all, Heisenbugs or otherwise.  Deliberate
SW design aimed at such resistance is the only technique that catches
SW bugs.

This is not to deny the conceptual difference between "single memory
space" and "multiple memory space" machines, although the line can
be a bit blurry.  Intuitively, it sure seems more likely that the
global state will be irretrievably corrupted if there is only one
memory space.  But is there any actual evidence that those techniques
(they're all SW techniques) which minimize the potential for
irretrievable corruption due to SW bugs apply equally well to both
systems?

Ultimately we have to just trust the released SW.  If some programmer 
writes precisely the line of code which corrupts the entire system, and 
that line of code manages to get past whatever QA process that is in 
place, we have no defense.

>Looser redundancy schemes declare
>synchronization events at (say) a kilohertz. 
 [...]
>However, a loose redundancy scheme is essentially the same as a
>checkpoint scheme, except for latency.

Right.

Dan Hepner

dhepner@hpisod2.HP.COM (Dan Hepner) (02/07/90)

From: donnh@ziggy.SanDiego.NCR.COM (Donn Holtzman)

>If
>the bug causes your kernel to hang, TMR or pair-and-spare approaches
>won't succeed.

This is an interesting case. 

Kernel "Hangs" are actually double SW bugs: a combination of some original
bug and a lack of detection by failed assertion. 

There is a last defense TMR/QMR schemes have against such hangs: the
deadman-switch-initiated "fast reboot", which effectively converts
the hang into a "panic".
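
A minimal sketch of the deadman idea using a Unix alarm; the "fast
reboot" here is just an exit and the hang is simulated, but the shape
is the same: if the kernel stops petting the timer, the timer
converts the hang into a panic:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

#define TIMEOUT 2    /* seconds the kernel may go without a heartbeat */

/* Fires only if the main loop stops re-arming the alarm. */
static void deadman(int sig)
{
    (void)sig;
    write(STDERR_FILENO, "panic: deadman timeout, fast reboot\n", 36);
    _exit(1);
}

int main(void)
{
    int tick;

    signal(SIGALRM, deadman);
    for (tick = 0; ; tick++) {
        alarm(TIMEOUT);              /* pet the watchdog */
        printf("kernel heartbeat %d\n", tick);
        sleep(1);
        if (tick == 3)               /* simulate a kernel hang */
            for (;;)
                pause();
    }
}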

Which brings up the comparison of how each scheme reacts to the more
generic panic.

In order to make the case that loose redundancy is superior to
lock-step in its response time to panics, one must assert that
the backup loose processor will notice the failure of the primary,
and complete its takeover in less time than the primary could have
reset itself and achieved a similar state.  How does this comparison
turn out in real life?

The question of whether the state of the machine is sound enough to 
return to seems independent of the basic question; one can do a fast 
reboot and leave machine state mostly intact, maybe suffering a repeat.  
Alternatively one can bet on a checkpointed machine and suffer the 
same repeat.  Is there some fundamental difference that makes the
takeover from the checkpointed machine faster? 

>Performance is certainly an issue, but one can trade
>checkpointing overhead for recovery speed (at least in the OLTP
>arena).

Maybe you could elaborate here.

>>an OS port to such a 
>>machine will always be more difficult than on a non-FT platform.
>>
>This is a good point. I would be surprised if Tandem didn't have to
>make kernel changes to make their machine work.

There is a real dividing line: can you port the next kernel, or do
you have to retrofit the new functionality into your existing,
80%-proprietary-code kernel?

>Interesting stuff.

Yes!

>Donn Holtzman

Dan Hepner