[comp.arch] Fault Tolerance LONG

jimbo@tandem.com (Jim Lyon) (02/02/90)

In article <35300@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>My problem is that it WASN'T about technology, and we ought to turn
>it back into a technology discussion that might be useful:
>        a) Various people do fault-tolerance various ways.  How about
>        people who know posting some things to explain how they work,
>        and what the strengths and weaknesses of the various ways are?

Now that the discussion is back to technology, I'll be happy to put
in my two-cents worth.  The following is NOT to be taken as a pronouncement
on Tandem strategy, about which I know relatively little (and am willing
to say even less).  In general, the following represents merely the
opinions of one lowly grunt (me).

Before talking too much about fault tolerance, it is important to know
a little bit about a fault model.  In general, most people think of a
fault in a component causing one of the following two behaviors:
a)  It stops dead.  or
b)  It goes insane.
The latter case is VERY difficult to deal with.  People put in a great
deal of work trying to translate it into the first case.  TMR schemes
try to shoot the insane processor before it manages to poison the
outside world.  At higher levels, software is very frequently full
of tests for violated assertions (which are evidence of insanity), in
an attempt to kill the software component before the insanity spreads.
One is not always 100% successful at this.
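
To make the TMR idea concrete, here is a minimal sketch in C of the
majority vote that lets two sane modules outvote (and identify) an insane
one.  The module outputs and names are made up for illustration; real
voters live in hardware, per word or per bus cycle, not in a C routine.

#include <stdio.h>

/* Return the majority value and flag the dissenting module (if any)
 * so that it can be "shot" before it poisons the outside world. */
static int vote(int a, int b, int c, int *dissenter)
{
    *dissenter = -1;                  /* -1: all three agree */
    if (a == b && b == c) return a;
    if (a == b) { *dissenter = 2; return a; }
    if (a == c) { *dissenter = 1; return a; }
    if (b == c) { *dissenter = 0; return b; }
    *dissenter = -2;                  /* no majority: the really bad case */
    return a;
}

int main(void)
{
    int bad;
    int result = vote(42, 42, 17, &bad);   /* module 2 has gone insane */
    printf("voted result = %d, dissenting module = %d\n", result, bad);
    return 0;
}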

Hardware faults are frequently classed as either transient (a cosmic
ray flipped a bit in memory) or hard (a transistor is broken and a
bit in memory will always return zero).

Software faults are harder.  They are frequently classed as either
Bohr bugs or Heisenbugs.  A Bohr bug is a deterministic bug (every
time I try to run hack, the system fails).  A Heisenbug, on the other
hand, is a nondeterministic bug (if an interrupt occurs in a
particularly sensitive part of code, we'll corrupt a data structure
so that the next user of the data structure will die).  This breakdown
also applies to hardware, but people don't usually talk about it in
that context (I don't know why).

A good QA process will catch nearly 100% of the Bohr bugs and many,
but by no means all, of the Heisenbugs in a product.  It is realistic
to expect that a released product has more Heisenbugs left than
Bohr bugs.


These days, a typical complex system is built in lots of layers.  The
reliability of the system is the product of the reliabilities of each
of the layers [eg, hardware, microcode, operating system, database,
application, etc].  In a normal world, a failure in one layer will
cause the immediately higher layer to either:
a) notice the failure and correct for it, or
b) notice the failure and throw up its hands (because it doesn't
   know how to correct for it), or
c) fail to notice the failure, thereby failing itself.
In case (a), that's the end of it.  Your system has successfully tolerated
a fault.  In cases (b) and (c), the failure in one layer has just been
translated to the failure of the next layer up, and we need to repeat.
Notice that in most systems, the uppermost layer is a person (liveware?).
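
As a toy illustration of that product (the per-layer numbers below are
invented, not measurements of any real system):

#include <stdio.h>

int main(void)
{
    /* Invented per-layer reliabilities over some fixed interval:
     * hardware, microcode, operating system, database, application. */
    double layer[] = { 0.9999, 0.9995, 0.999, 0.998, 0.995 };
    double system = 1.0;
    int i;

    for (i = 0; i < 5; i++)
        system *= layer[i];
    printf("system reliability = %.4f\n", system);   /* about 0.9914 */
    return 0;
}

The product is dragged down by its weakest factor, which is why beefing
up only the hardware layer buys less than you might hope.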

So, a failure at a low layer propagates up and up, until we find a layer
smart enough to deal with the failure.  However, not every failure starts
at the bottom.  Operating systems fail.  Device drivers have bugs.  Database
managers have bugs.  Applications have bugs.

All of this having been said, the question still remains:
  Why is checkpointing good/bad?

The prime virtue of checkpointing is that you can do it again at each layer.
Conceptually, you introduce a new layer between each of your previous layers.
Where before, layer 3 made direct requests of layer 2, now we have layer
3 use a new layer 2a.  Layer 2a maintains interfaces to two (or more)
instances of layer 2.  Should one instance of layer 2 fail, layer 2a
transparently redirects requests to the other instance of layer 2.  The
client, layer 3, never sees the failure.
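
Schematically, a minimal C sketch of such a "layer 2a" (the layer-2
instance functions here are stand-ins, not anybody's real interface):

#include <stdio.h>

typedef int (*layer2_fn)(int request, int *result);

/* Two hypothetical instances of layer 2; pretend the primary has failed. */
static int layer2_primary(int request, int *result)
{
    (void)request; (void)result;
    return -1;                          /* stopped dead */
}

static int layer2_backup(int request, int *result)
{
    *result = request * 2;              /* does the real work */
    return 0;
}

/* Layer 2a: redirects to a surviving instance, so layer 3 never sees
 * the failure. */
static int layer2a(int request, int *result)
{
    static layer2_fn instance[] = { layer2_primary, layer2_backup };
    int i;

    for (i = 0; i < 2; i++)
        if (instance[i](request, result) == 0)
            return 0;
    return -1;                          /* both instances down: propagate up */
}

int main(void)
{
    int r;

    if (layer2a(21, &r) == 0)
        printf("layer 3 got %d and never saw the failure\n", r);
    return 0;
}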

Of course, this gives rise to a couple of requirements:
a)  All of the requests to a replicated layer must be idempotent.  If I ask
    an instance of layer 2 to debit my bank account by $100, and it fails
    after doing so but before reporting success, I don't want the other
    instance to debit another $100.  There are well-known schemes (using
    sequence numbers; see the sketch after this list) to turn non-idempotent
    requests into idempotent ones.
b)  If a layer maintains state about its clients, this state needs to be
    kept synchronized among the various instances of that layer.  Typically,
    they do this by informing each other whenever they change their state.
    This is what we call a checkpoint.
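
Here is a rough sketch of the sequence-number trick from (a), in C, with
invented names and amounts; a real server would keep one counter per
client, and keep it persistently:

#include <stdio.h>

static long balance = 500;            /* server-side state */
static long last_seq_applied = 0;     /* highest sequence number applied */

/* Apply the debit only if this sequence number hasn't been seen before,
 * so a retried request is recognized as a duplicate and ignored. */
static void debit(long seq, long amount)
{
    if (seq <= last_seq_applied)
        return;                       /* duplicate: already done */
    balance -= amount;
    last_seq_applied = seq;
}

int main(void)
{
    debit(1, 100);    /* original request */
    debit(1, 100);    /* retry, sent to the other instance after a failure */
    printf("balance = %ld\n", balance);    /* 400, not 300 */
    return 0;
}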

If we do this replication and checkpointing at every layer of the system,
we can achieve very high reliability.  It turns out that we can mask
all of the single hardware failures (both transient and hard), as well
as most of the software Heisenbugs.  This technique does not mask the
Bohr bugs; if a layer contains a Bohr bug such that a certain request
causes an instance of that layer to fail, then each instance of that
layer will end up failing, one at a time.  [One of these days I'll send
something to alt.computers.folklore about the bug that caused 34 CPUs
to fail sequentially, at 4-second intervals.]

In summary, checkpointing not only allows you to survive most of your
hardware failures, but also most of your operating system bugs, most
of your database manager bugs, most of your communication protocol bugs,
most of your transaction manager bugs, and even many of your application
bugs.

So, why doesn't everybody use checkpointing, all of the time?  In
particular, why didn't Tandem use checkpointing in the S2?

Well, ...
1)  It's hard.  No doubt about it, it's frequently twice as hard to design
    a piece of software with checkpointing as it is without it.
2)  If you already have a piece of software that was designed without
    checkpointing, it's VERY hard to add it as an afterthought.
3)  If you insist on retrofitting checkpointing into something that wasn't
    designed with it in mind, you are likely to see VERY poor performance.

Remember that the Tandem S2's mission in life is to run Unix.
If you want a machine to run Unix, you don't do checkpointing.  If
you really wanted to, you could, with a huge amount of work, put
checkpointing into the Unix kernel.  You couldn't, even if you wanted
to, manage to put checkpointing into any significant fraction of the
third-party software (like database managers, bizarre comm managers,
applications, etc.).  So, the only reliability that you could
add via checkpointing is the ability to mask some of the
Heisenbugs in the kernel.  There just aren't that many there.

So, what DO you do if you want a high-reliability Unix system?
You:
a)  Use TMR on the processor and memory.  We've just tolerated all of
    the single faults from these components.
b)  Duplex the disks.  We're now in a position to tolerate hard disk errors.
c)  Beef up the device drivers.  A large number of the panics that Unix
    systems experience are directly traceable to a transient error of
    one sort or another on a device.  Put in code to recover from these
    errors (a rough sketch appears below).  Use some aggressive test
    strategies to make sure that this code actually works.
d)  Clean up a small number of other places where the kernel just gives
    up (primarily due to resource exhaustion).
The result is a system which:
a)  Isn't perfect.
b)  Will fail far less often than a conventional Unix system.
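
To be concrete about (c), this is the flavor of change I mean; the
routine names are hypothetical, and a real driver is of course much
messier:

#include <stdio.h>

#define MAX_RETRIES 3

static int dev_read_block(int block, char *buf);   /* hypothetical device op */

/* Retry transient device errors instead of panicking on the first one;
 * only a persistent (hard) error is reported upward. */
static int robust_read_block(int block, char *buf)
{
    int tries;

    for (tries = 0; tries < MAX_RETRIES; tries++)
        if (dev_read_block(block, buf) == 0)
            return 0;                   /* transient error masked */
    return -1;                          /* hard error: let the caller decide */
}

/* Fake device for the sketch: fails once (a transient), then succeeds. */
static int dev_read_block(int block, char *buf)
{
    static int failed_once = 0;

    if (!failed_once) { failed_once = 1; return -1; }
    buf[0] = (char)block;
    return 0;
}

int main(void)
{
    char buf[512];

    printf("read %s\n", robust_read_block(7, buf) == 0 ? "ok" : "failed");
    return 0;
}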

SUMMARY:

If you want the highest degree of fault tolerance possible, design it
from the start to use checkpointing [If you come work for Tandem for
a few years, you'll learn how].

If you can't design it [or redesign it] from the start, don't use
checkpointing.  Depending on where the real reasons for failure are,
you may or may not benefit from running it on a system that uses
checkpointing at a lower level.

I hope this has been informative and hasn't sounded too much like a
Tandem commercial.  If not, well, I'll put on my asbestos suit now.

-- Jim Lyon
-- Tandem Computers
-- jimbo@tandem.com

dhepner@hpisod2.HP.COM (Dan Hepner) (02/07/90)

From: jimbo@tandem.com (Jim Lyon)

>In article <35300@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>        a) Various people do fault-tolerance various ways.  How about
>
>Now that the discussion is back to technology, I'll be happy to put
>in my two-cents worth. 

We'll thank John Mashey for his contribution.

>fault in a component causing one of the following two behaviors:
>a)  It stops dead.  or
>b)  It goes insane.
>The latter case is VERY difficult to deal with.  People put it very
>much work to try to translate it into the first case.  TMR schemes
>try to shoot the insane processor before it manages to poison the
>outside world.

Are you suggesting that the techniques available to a TMR/QMR
designer are not totally effective in isolating an insane processor?
Could you offer an example of a type of insanity which can
spread through a voting/comparison barrier?

[a lot of good stuff on software faults deleted]

>In summary, checkpointing not only allows you to survive most of your
>hardware failures, but also most of your operating system bugs, most
>of your database manager bugs, most of your communication protocol bugs,
>most of your transaction manager bugs, and even many of your application
>bugs.

and later:

>If you want the highest degree of fault tolerance possible, design it
>from the start to use checkpointing [If you come work for Tandem for
>a few years, you'll learn how].

Could you clarify the claim here?  It sure seems you are suggesting
that the checkpointing system, including HW and SW, is inherently
more reliable than would be a TMR system which had had an equivalent
amount of effort devoted to reliability enhancement of its SW. But
most of the excellent recommendations made WRT the SW are equally
applicable to both products, or even non-FT products.

What is unique to checkpointing is the notion that each SW layer has
available to it a backup process (or processes), and that the hardware
checkpointing mechanism can be used as a tool for abandoning work which
led to some failure, thereby avoiding the Heisenbug panic case.

As long as we're willing to pay substantially increased SW development
costs, we might consider what else we might get for our money.
There are other tools which can be used to attain high reliability,
and the basic save-state / fall-back-on-failure mechanism can be used 
in the absence of even a backup process, let alone a process in a different 
memory space. Is there really something offered by checkpointing to another 
memory space which makes such SW inherently more reliable?  And from
there, is there really something offered by completely checkpointing
HW/SW systems which is not achievable on TMR/QMR systems?

>I hope this has been informative and hasn't sounded too much like a
>Tandem commercial.  If not, well, I'll put on my asbestos suit now.
>
>-- Jim Lyon
>-- Tandem Computers
>-- jimbo@tandem.com

Thanks a lot.

Dan Hepner

dhepner@hpisod2.HP.COM (Dan Hepner) (02/08/90)

From: rtrauben@cortex.Sun.COM (Richard Trauben)

[hopefully someone can answer the excellent "just how do you get 
 it stopped" questions]

>Presumably no-one is interested in dumping the state of a failed PE-pairs
>write-back$; execution would resume from last process checkpoint. 

Hmm. If what you mean by "PE-pairs" is what is generally called
Quad Modular Redundancy (QMR), with two lockstep processors constituting
a PE, and two of those constituting a logical processor, there is
no requirement for checkpointing; the instruction will be successfully
executed.

If, however, what you mean by a PE-pair is two lockstepped processors
which, upon detecting a miscomparison, take themselves offline, then
indeed a checkpoint needs to be done to preserve the state for
some backup processor.

Part of the checkpointing mechanism is the necessity to abandon
all effects of processing done after the checkpoint by the failed  
processor, which includes any write-back state.

>How about resuming from the checkpoint and unintentionally resending 
>redundant mass store and datacom messages.

The checkpoint itself must be atomic, in that it must complete
fully or not at all.  "Half-checkpoints" must be seen as effects
of processing done after the last successful checkpoint, and
be abandoned.

The IO request atomicity can be addressed as part of the problem of 
checkpoint atomicity. Once the atomic checkpoint mechanism is developed, 
the initiation of IO requests can be incorporated, so that the initiation 
of an IO request happens only at the time of a successful checkpoint.
From the recovery processor's point of view, either the checkpoint/
IO request happened or it didn't, and that is discernible.
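
A rough sketch of that arrangement (the names are invented, and the
atomic commit of the checkpoint itself is waved away here):

#include <stdio.h>

#define MAX_PENDING 16

static int pending_io[MAX_PENDING];    /* IO requested since last checkpoint */
static int n_pending = 0;

/* During normal processing, IO requests are only queued, never issued. */
static void request_io(int req)
{
    pending_io[n_pending++] = req;
}

static void issue_io(int req)          /* stand-in for really starting the IO */
{
    printf("issuing IO request %d\n", req);
}

/* Commit the checkpoint atomically (not shown), and only then issue the
 * queued IO.  If the processor dies first, the backup resumes from the
 * previous checkpoint and the queued-but-unissued IO simply never happens. */
static void checkpoint(void)
{
    int i;

    for (i = 0; i < n_pending; i++)
        issue_io(pending_io[i]);
    n_pending = 0;
}

int main(void)
{
    request_io(1);
    request_io(2);
    checkpoint();      /* each request is issued exactly once, right here */
    return 0;
}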

This has covered the case of processor failure, and guaranteed
that the request has been issued once and only once.  As noted,
reissuing a disk write after an arbitrary amount of other activity
has happened could raise real havoc.

Left uncovered is the potential for the recipient of the IO request
to lose it, but that's a different question.

> I/O caching and TCPIP
>packet sequence numbers might conceal some of these problems but probably
>not all.

WRT disks, it seems essential to get it perfect.  Some comm protocols might be more forgiving.

>Richard

Dan Hepner

rtrauben@cortex.Sun.COM (Richard Trauben) (02/09/90)

Dan Hepner responds to a thread about redundant mass-store and datacom
requests wrt rolling back to a checkpoint after a PE-pair failure:

>> The IO request atomicity can be addressed as part of the problem of 
>> checkpoint atomicity. Once the atomic checkpoint mechanism is developed, 
>> the initiation of IO requests can be incorporated, so that the initiation 
>> of an IO request happens only at the time of a successful checkpoint.
>> From the recovery processor's point of view, either the checkpoint/
>> IO request happened or it didn't, and that is discernible.

A consequence of what you suggest is that a unique checkpoint must 
exist for every packet in a duplex conversation (over a link) where there
are dependencies between talker and listener (debit/credit): as in 
one checkpoint per TCP/IP or X.25 packet. While it works, I suspect it
becomes THE bottleneck in packet transmission rates and might lead to
a very high frequency of checkpoints per second. 

Richard 

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/09/90)

In article <38@exodus.Eng.Sun.COM> 
	rtrauben@cortex.EBay.Sun.COM (Richard Trauben) writes:
>Dan Hepner responds to a thread about redundant mass-store and datacom
>requests wrt rolling back to a checkpoint after a PE-pair failure:
>>> The IO request atomicity can be addressed as part of the problem of 
>>> checkpoint atomicity. Once the atomic checkpoint mechanism is developed, 
>>> the initiation of IO requests can be incorporated, so that the initiation 
>>> of an IO request happens only at the time of a successful checkpoint.
>>> From the recovery processor's point of view, either the checkpoint/
>>> IO request happened or it didn't, and that is discernible.
>
>A consequence of what you suggest is that a unique checkpoint must 
>exist for every packet in a duplex conversation (over a link) where there
>are dependencies between talker and listener (debit/credit): as in 
>one checkpoint per TCP/IP or X.25 packet. 


The checkpointing systems that I'm aware of do not perform a
checkpoint on every IO. Instead, they treat IO as a form of message
traffic.  Whenever a process receives a message (does a read), a copy
of the message is also put in a special queue. When the process is
checkpointed, the queue is cleared. So, yes, there is an overhead per
application-level IO operation. But, no, the overhead is not a
complete checkpoint. In the case of a read from a read-only file, I
suppose that the "message" could be a description of the read
request, instead of being a copy of the actual data.
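
Something like this, as a very small sketch in C (the names are mine,
not a description of any particular product):

#include <stdio.h>
#include <string.h>

#define QLEN   32
#define MSGLEN 64

static char saved[QLEN][MSGLEN];       /* messages since the last checkpoint */
static int  n_saved = 0;

/* Every message (read) delivered to the process is also logged here. */
static void deliver(const char *msg)
{
    strncpy(saved[n_saved], msg, MSGLEN - 1);
    saved[n_saved][MSGLEN - 1] = '\0';
    n_saved++;
}

/* A checkpoint makes the logged messages unnecessary, so clear the queue. */
static void checkpoint(void)
{
    n_saved = 0;
}

/* After a failure, a backup restarts from the checkpoint and replays
 * only the messages received since then. */
static void replay_to_backup(void)
{
    int i;

    for (i = 0; i < n_saved; i++)
        printf("replaying: %s\n", saved[i]);
}

int main(void)
{
    deliver("read block 7");
    checkpoint();
    deliver("read block 8");     /* only this one needs replaying */
    replay_to_backup();
    return 0;
}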

Reliability is never without a price, but the price can be a lot
lower in selected cases. For example: just ask the customer to try
again. Also, "end to end" is a more general concept than some people
seem to think. Suppose that a salesman sends orders to a central
system, but also keeps a copy in his local machine.  At intervals,
the salesman can have his machine prepare a summary, compress it, and
send it in when the telephone rates are low.  The central system can
use late-night cycles to check summaries against the online data.
This sort of lazy checksumming is really cheap, and _eventually_ the
files are as correct as any other method could get them.
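
A toy version of that comparison, with made-up order data and a made-up
summary function:

#include <stdio.h>

/* Cheap summary of a batch of order amounts; both sides compute it
 * independently from their own copies of the data. */
static unsigned long summarize(const int *orders, int n)
{
    unsigned long sum = 0;
    int i;

    for (i = 0; i < n; i++)
        sum = sum * 31 + (unsigned long)orders[i];
    return sum;
}

int main(void)
{
    int local[]   = { 100, 250, 75 };   /* the salesman's copy */
    int central[] = { 100, 250, 75 };   /* what the central system recorded */

    if (summarize(local, 3) == summarize(central, 3))
        printf("summaries agree; the batch is (eventually) consistent\n");
    else
        printf("mismatch: reconcile or re-send this batch\n");
    return 0;
}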

-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science