[comp.lsi] Arbiter / Synchronizer failure; MTBF

mark@mips.COM (Mark G. Johnson) (09/15/89)
  
In the previous posting <26811@obiwan.mips.COM> I used an approximation
formula to simplify the probability expressions, without explicitly calling
attention to the approximation.  Article <2280011@hpsal2.HP.COM> by
saxena@hpsal2.HP.COM (Nirmal Saxena) pointed out the inexactitude;
unfortunately, his modification was incorrect.

Sticklers for mathematical precision may be interested in the exact
expressions, without approximation formulae; they appear below.
Engineering approximations were given in <26811@obiwan.mips.COM>.
I recommend the engineering approach; among other advantages, it provides
expressions that are far easier to invert.



  A single part whose Mean Time Between Failures is "m" units of time:
***************************************************************************
*  Prob of a failure between time 0 and T is     P(fail) = 1 - exp(-T/m)  *
*  Prob of not-failure is                    P(not-fail) = exp(-T/m).     *
***************************************************************************


To compute the probability that one or more units out of a population of
50,000 will fail within 5 years, we simply compute the probability
that zero units will fail, and then realize that P(one or more failures)
is equal to 1.0 - P(no fails).

The probability of 0 failures among 50,000 units is just the
probability that the first one doesn't fail, times the probability the
second one doesn't fail, times ... (i.e. P(not-fail) to the 50,000 power,
assuming the units fail independently of one another).
If the MTBF is 100 years and we want to find the prob of 0 failures after
5 years:

    P(0 failures in 50,000 units) = [exp(-5/100)] ** 50000  ==  exp(-2500)

So the probability that there are one or more failures in the 50,000 units
is one minus P(no fails); that is, [1.0 - exp(-2500)], which is very
nearly 1.
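
For anyone who wants to check the arithmetic, here is a minimal Python
sketch of the two expressions above.  The function name p_any_failure is
my own invention for illustration, and the sketch assumes the units fail
independently.  Note that exp(-2500) is far below what double precision
can represent, so the fleet result should come out as 1.0 to machine
precision.

```python
import math

def p_any_failure(n_units, t, mtbf):
    """P(one or more of n_units fails by time t).

    Assumes each unit has an exponential lifetime with the given MTBF
    and that units fail independently (hypothetical helper, not from
    the original posting).
    """
    # P(no failures) = exp(-t/mtbf) ** n_units = exp(-n_units * t / mtbf).
    # -expm1(-x) evaluates 1 - exp(-x) without cancellation for small x.
    return -math.expm1(-n_units * t / mtbf)

# One part, 5 years, 100-year MTBF: P(fail) = 1 - exp(-5/100)
p_single = p_any_failure(1, 5, 100)

# 50,000 parts, 5 years, 100-year MTBF: 1 - exp(-2500)
p_fleet = p_any_failure(50_000, 5, 100)

print(p_single)
print(p_fleet)
```

The single-part figure is a little under 5 percent; it is only the
50,000-fold product that drives the fleet figure to 1.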




In general we want to know the probability of (fewer than K failures)
over a specified time interval.  The original article stipulated that the
Big Boss would fire the engineer if, during the 5-year product lifetime
there were 100 or more failures out of 50,000 installations in the field.
Thus the engineer wanted to have a large probability of (fewer than 100
failures).  In the example we solved for the MTBF that gave a probability
of (100 or more failures) equal to 0.33; that is, the probability of
(fewer than 100 failures) was 0.67.


    If each of N identical parts has an MTBF equal to "m" units of time,
*****************************************************************************
*                                                                           *
*     P(out of N parts, fewer than K failures from time 0 to time T)   =    *
*                                                                           *
* Sum from i=0 to i=(K-1)  {C(N,i) * (1 - exp(-T/m))^i * (exp(-T/m))^(N-i)} *
*                                                                           *
*****************************************************************************
 where the binomial coefficient C(N,i) is   N! / (i! (N-i)!)   and C(N,0)==1
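
The boxed sum can be evaluated directly; a rough Python sketch follows.
The name prob_fewer_than is hypothetical.  Working with logarithms and
math.lgamma is my own choice (not something from the earlier postings)
to keep C(N,i) from overflowing when N is 50,000.

```python
import math

def prob_fewer_than(k, n, t, m):
    """P(fewer than k failures among n parts from time 0 to time t).

    Each part is assumed to have an exponential lifetime with MTBF m,
    and parts are assumed to fail independently.
    """
    log_p = math.log(-math.expm1(-t / m))   # log P(fail) = log(1 - exp(-t/m))
    log_q = -t / m                          # log P(not-fail) = log(exp(-t/m))
    total = 0.0
    for i in range(k):
        # log C(n, i) = log n! - log i! - log (n-i)!, via lgamma
        log_c = (math.lgamma(n + 1) - math.lgamma(i + 1)
                 - math.lgamma(n - i + 1))
        total += math.exp(log_c + i * log_p + (n - i) * log_q)
    return total

# Vendor's "1 century MTBF": chance of fewer than 100 failures in 5 years
print(prob_fewer_than(100, 50_000, 5.0, 100.0))
```

With a 100-year MTBF the expected number of failures is in the
thousands, so the printed probability should be indistinguishable from
zero.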


So, in our example we set the probability equal to 0.67 and solve for m.
{Now you see why the engineering approximation is sometimes useful;
solving for m in the exact expression above is messy}.  Utilizing a
numerical solution method, we find that m = 2619.6 years is the required
MTBF to give a 0.67 probability of (fewer than 100 failures over 5 years
among 50,000 parts).
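
The posting does not say which numerical method was used.  One simple
possibility is bisection, sketched below in Python; the function names,
the bracket of 1,000 to 10,000 years, and the tolerance are all my own
assumptions.  The sum is built up term by term from the ratio of
successive binomial terms.  If the sketch is right, it should land close
to the 2619.6 years quoted above.

```python
import math

def prob_fewer_than(k, n, t, m):
    """P(fewer than k failures among n independent parts by time t)."""
    p = -math.expm1(-t / m)          # P(fail) for one part
    q = math.exp(-t / m)             # P(not-fail) for one part
    term = math.exp(-n * t / m)      # i = 0 term: q ** n
    total = term
    for i in range(1, k):
        # term(i) / term(i-1) = (n - i + 1) / i * p / q
        term *= (n - i + 1) / i * (p / q)
        total += term
    return total

def solve_mtbf(target, k, n, t, lo, hi, tol=1e-3):
    """Bisect for the MTBF m at which P(fewer than k failures) = target.

    Relies on the probability increasing with m, and on [lo, hi]
    bracketing the answer (both assumed here, not checked).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_fewer_than(k, n, t, mid) < target:
            lo = mid     # MTBF too small: too many failures likely
        else:
            hi = mid
    return 0.5 * (lo + hi)

m = solve_mtbf(0.67, 100, 50_000, 5.0, lo=1_000.0, hi=10_000.0)
print(f"required MTBF is roughly {m:.1f} years")
```

Bisection is slow but hard to get wrong, which seems a fair trade for a
one-off calculation like this.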

Recall that the chip vendors proudly boast "1 century MTBF".  So, using
the exact formula we find that this MTBF is 26 times too small; the Big
Boss will fire the design engineer.  The engineering solution agreed;
it was a bit more conservative, dictating an MTBF of 75.7 centuries to
achieve fewer than 100 failures among 50,000 parts over 5 years.
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	(408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}