[comp.lsi] Synchronizer failure; MTBF

mark@mips.COM (Mark G. Johnson) (09/04/89)

It's impossible to build a perfect arbiter/synchronizer: one that will
unambiguously decide, in bounded time, which of two asynchronous inputs
arrived first.  However, with careful design the probability of failure
can be made quite small (though never zero).

Recently chip vendors have introduced "metastable hardened" flip-flops,
characterized arbiters, "synchronizer qualified" PALs, and so forth.
These manufacturers are working hard to provide reliable, easy-to-use,
inexpensive chips that have a low probability of failure.  UNFORTUNATELY
the synchronizer failure data they supply is given in a form that is
all too often misunderstood by customers (system designers).  Manufacturers'
datasheets {e.g. 74F786, 74AS3374} loudly trumpet the fact that, if you
operate their chip under such-and-such conditions, the Mean Time Between
Failures will be 1 century (100 years; about 3.15E9 seconds).

The problem is the phrase "Mean Time Between Failures"; it conjures up
an intuitive model that is comfortable but mathematically incorrect.
For example: a certain brand of wristwatch has an MTBF of 5 years;
what's the probability that a given watch will still work properly
after 5 years?  Intuitively the answer seems to be 50%; that's how
a "mean" ought to behave.  But 50% is incorrect; we've let the word
"Mean" mislead our intuition.  There is actually only a _37%_ probability
the watch will still work properly after 5 years (1.00 * MTBF) of time.
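To make the arithmetic concrete, here is a short Python sketch of the
wristwatch example (Python and the variable names are my choice, not
part of the original claim):

```python
import math

MTBF = 5.0   # wristwatch MTBF, years
t = 5.0      # elapsed time, years

# With exponentially distributed (random) failures, the probability
# that a unit is still working at time t is exp(-t/MTBF):
p_survive = math.exp(-t / MTBF)
print(round(100 * p_survive, 1))   # 36.8 -- i.e. ~37%, not 50%
```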

I hypothesize that when people think of a "mean", they imagine an
average value of a *distribution which is approximately symmetric*.
The very familiar Gaussian (bell curve) distribution is indeed
symmetric, as are those of many physical phenomena in everyday life.
Sadly, the distribution of random failures (parameterized by MTBF)
is distinctly *asymmetric*, so intuition is in this case misleading.
Intuition says 50% but the answer is 37%.


Philosophizing aside, the distribution of random failures is given by

      Probability( a failure will
                   occur between          =   1.0 - exp(-T/m)
                   time 0 and time T )

where the parameter m>0 is called the Mean Time Between Failures.
Note that the distribution is highly skewed toward early failures:
about 63% (i.e. 1 - 1/e) of all failures occur before T = m.
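The skew is easy to check numerically.  A small Python sketch (the
function name is mine, invented for illustration):

```python
import math

def p_fail_by(T, m):
    """P(a failure occurs between time 0 and time T), given MTBF m."""
    return 1.0 - math.exp(-T / m)

m = 100.0  # years
# By T = m, well over half of the failures have already occurred:
print(round(p_fail_by(m, m), 3))   # 0.632
# The intuitive "symmetric" picture would predict 0.5 at T = m.
```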
_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-


Now let's put this equation to use.  Suppose we are system designers
at Hewlett International Digital Sun Business Machines, Inc. (HIDSBM),
and we are going to build & sell a system that contains a synchronizer.
HIDSBM plans to sell 50,000 of these systems, and the systems have a
useful life of 5 years before they become obsolete and are thrown away.

First, let's compute the probability that one or more of the 50,000
systems will fail sometime within 5 years due to a synchronizer failure:

	P(at least one failure)  =  1 - [exp(-5/100)]^50000  =  1 - exp(-2500)
	P  =  1   (to any practical precision)

(Indeed, the *expected* number of failures, 50000 * [1 - exp(-5/100)],
is about 2,400.)  So it is virtually guaranteed that one or more customers
will have their HIDSBM system mysteriously fail at least once during the
5 year lifespan, if the synchronizer MTBF is 100 years.
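The same arithmetic in a Python sketch (the variable names are mine):

```python
import math

n = 50_000      # systems sold
life = 5.0      # useful life, years
mtbf = 100.0    # synchronizer MTBF, years

p_one = 1.0 - math.exp(-life / mtbf)   # one given system fails in 5 years
expected = n * p_one                   # expected failures, fleet-wide
p_any = 1.0 - (1.0 - p_one) ** n       # at least one failure somewhere

print(round(expected))   # ~2439 expected failures
print(p_any)             # 1.0, to machine precision
```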

The Big Boss at HIDSBM has handed down an edict.  She wants us to design
the system such that the entire customer base (50,000 systems) will
experience fewer than 100 synchronizer failures over the 5 year product
life.  We have to choose a synchronizer that gives this performance.

Let's decide what she really meant.  "There will be fewer than 100
synchronizer failures" means "the probability of 100 or more failures is
small".  Let's see what happens if we let the probability of 100 or more
failures be 33 percent.  (Thus we have a 1-in-3 chance of being fired by
the Big Boss because of >99 synchronization failures in the field).


           P(100 or more failures
 0.33  =     out of 50,000 units     =  (50,000 / 100) * [1 - exp(-5/m)]
             over a 5 year period)

(Strictly, the right-hand side is the *expected* number of failures
divided by 100; by Markov's inequality that quantity is an upper bound
on the probability of 100 or more failures, so this is a conservative
design target.)

solving,
    m  =  required MTBF of synchronizer  =  7,573 years
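The value of m can be checked in closed form; a Python sketch of my own:

```python
import math

# Solve  0.33 = (50_000 / 100) * (1 - exp(-5/m))  for m:
m = -5.0 / math.log(1.0 - 0.33 * 100 / 50_000)
print(round(m))   # ~7573 years
```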


So, in this approximately-real-world example, the chip manufacturer's
proud claim "1 century MTBF" isn't good enough.  We need a synchronizer
that's 76 times better.


To conclude:  Reliable systems require _very_ reliable synchronizers.
Rather than rely upon intuition, it's best to use the random failure
probability equation.  For some types of design problems, the required
MTBF of synchronizers can reach into tens of thousands of years.

Chip manufacturers aren't being deliberately misleading; it's just that
the phrase "Mean Time Between Failures is 100 years" has a mathematical
implication different from what intuition would predict.
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	(408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}

amos@taux01.UUCP (Amos Shapir) (09/05/89)

In article <26811@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
	[An excellent analysis of MTBF - if you have not read it yet,
	go back and read it now]
|Philosophizing aside, the distribution of random failures is given by
|
|      Probability( a failure will
|                   occur between          =   1.0 - exp(-T/m)
|                   time 0 and time T )
|
|where the parameter m>0 is called the Mean Time Between Failures.
|Note that the distribution is highly skewed, towards early failures
|(failures before T=m).

Maybe we should use the term Half Life, which physicists use for
radioactive decay: the time by which half the units have failed.
It is related to the MTBF by
	HL = ln(2)*MTBF  =  0.693*MTBF
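For the 100-year-MTBF datasheet example discussed above, the half life
works out as follows (a Python sketch of my own):

```python
import math

MTBF = 100.0                  # years, the datasheet figure
half_life = math.log(2) * MTBF
print(round(half_life, 1))    # 69.3 years: half the units fail by then
```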

-- 
	Amos Shapir		amos@taux01.nsc.com or amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522261  TWX: 33691, fax: +972-52-558322
34 48 E / 32 10 N			(My other cpu is a NS32532)