[comp.misc] Summary: MTBF definition etc.

chris@yarra.oz.au (Chris Jankowski) (12/13/90)
The following is an edited summary of all the responses to my four questions.
I hope my editing has not distorted the answers I received.
Many thanks to all respondents.


------------------------------------------------------------------------------
1. 
Can somebody tell me what is the exact definition of MTBF?

--------------
From: Jim Kummer <kummer@pogo.den.mmc.com>
Mean Time Between Failures.  The failure rate over a product's life is
usually pictured as the "bathtub curve" - dropping steeply at first
(infant mortality), then leveling off (constant failure rate, i.e.
exponential mortality), and then arcing back up further out (wear-out).
MTBF is the reciprocal of the failure rate, usually represented by the
symbol "lambda".  This follows from the exponential failure model, in
which the survival probability is e raised to the power "minus lambda t".

--------------
From: north@manta.NOSC.MIL (Mark H. North)
It's a parameter (mean) of the exponential distribution. The probability
of failure is usually modeled as

p(t) = 1 - exp(-t/MTBF).

--------------
From: mark@mips.COM (Mark G. Johnson)
  *********************************************************************
  *                                                                   *
  *      Probability( a failure will                                  *
  *                   occur between          =   1.0 - exp(-T/m)      *
  *                   time 0 and time T )                             *
  *                                                                   *
  *   where the parameter m>0 is called the Mean Time Before Failure  *
  *                                                                   *
  *********************************************************************

Note that *the* *distribution* *is* *highly* *skewed*, towards early
failures (failures before the MTBF).  The mathematical definition
of MTBF implies the following nonintuitive result:

     WEIRD FACT: If a doodad-widget has an MTBF of 5 years, what is the
                 probability that it will fail sometime between time = 0
                 and time = 5 years?  Answer: 63.2 percent!!  [1 - exp(-1)]

This bugs me; the word "mean" in MTBF sure makes me want to say the
answer ought to be 50%.  But it isn't, too bad.  Moral: be careful.

  WEIRD FACT #2: If a doodad-widget has an MTBF of 5 years, what is the
                 probability that it will fail sometime between time = 0
                 and time = 3.46 years?  Answer: 50 percent.
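
Both "weird facts" can be checked numerically.  The following is a minimal
Python sketch, not part of the original reply; the function names are made up
for illustration, and it assumes the pure exponential model given in the box
above.

```python
import math

def failure_probability(t, mtbf):
    """Probability of a failure between time 0 and time t (exponential model)."""
    return 1.0 - math.exp(-t / mtbf)

def median_life(mtbf):
    """Time by which the failure probability reaches exactly 50 percent."""
    return mtbf * math.log(2)

mtbf_years = 5.0
p_by_mtbf = failure_probability(mtbf_years, mtbf_years)
t_half = median_life(mtbf_years)

print(f"P(fail within {mtbf_years} years) = {p_by_mtbf:.3f}")
print(f"50 percent point = {t_half:.3f} years")
```

The first value is 1 - exp(-1), about 0.632.  The second is 5 * ln(2), about
3.466 years, which is the figure quoted as 3.46 above.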
   
--------------
And another definition with subsequent correction from another respondent:

From: mcmahan@netcom.UUCP (Dave Mc Mahan)
It is the Mean Time Before Failure.  It is expressed in units of hours (usually)
and is the expected life of a product before you can expect half the units
to fail.  If your MTBF is 87,600 hours (ten years) and you have 1000 units,
you can expect 1/2 to fail within the first 10 years due to all causes.  It
is really a statistical approximation of how long something can be expected
to continue to work.

From: dtate@unix.cis.pitt.edu (David M Tate)
Actually, as you've defined it here, this is *median* time before failure.
For many lifetime distributions (including the frequently-invoked Weibull
and exponential distributions), the mean and the median are not the same.
For example, if individual product lifetimes are exponentially distributed
with mean 1 unit, then the *average* individual lifetime will converge to 1
as you look at more and more units, but the expected time before half have
failed will be t such that Pr{ lifetime <= t} = .5, which in this case would
be ln(2) = .693... time units.
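
The difference between mean and median can also be seen by simulation.  This
is a rough sketch only: it assumes exponentially distributed lifetimes with
mean 1, and the sample size and random seed are arbitrary choices of mine.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

mean_life = 1.0
n_units = 100_000

# random.expovariate takes the rate (1/mean), not the mean itself.
lifetimes = [random.expovariate(1.0 / mean_life) for _ in range(n_units)]

sample_mean = statistics.mean(lifetimes)
sample_median = statistics.median(lifetimes)

print(f"sample mean   = {sample_mean:.3f}  (theory: 1)")
print(f"sample median = {sample_median:.3f}  (theory: ln 2 = {math.log(2):.3f})")
```

With this many units the sample mean should land close to 1 and the sample
median close to 0.693, i.e. half the units have failed well before the mean
lifetime is reached.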


------------------------------------------------------------------------------
2.
And how can I calculate availability of a system as a % of uptime per year
knowing its MTBF? And what is availability anyway?

--------------
From: Jim Kummer <kummer@pogo.den.mmc.com>
Availability is simply the percentage of time that the system is 
available - uptime divided by total time.  Of course, "total time" is 
that time that the system should have been up - that is, if your system
is only supposed to be operating 16 hours a day, then that is the value
that should be used in computing the denominator, i.e.
    availability = uptime (hours)/(16*(number of days in sample))
also,
    availability = 1 - downtime/total_time.

To estimate availability, you need MTBF, and also MTTR - Mean Time To 
Repair.  Then
    availability = MTBF/(MTBF + MTTR).
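
As a worked example of the MTBF/MTTR formula, here is a small Python sketch.
The figures (10,000 hours MTBF, 4 hours MTTR) are invented for illustration,
and a year of 8760 scheduled hours is assumed, i.e. a system meant to run
around the clock.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_hours_per_year(mtbf_hours, mttr_hours, hours_per_year=8760.0):
    """Expected down time per year of scheduled operation."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * hours_per_year

a = availability(10_000.0, 4.0)
d = downtime_hours_per_year(10_000.0, 4.0)

print(f"availability = {100.0 * a:.2f} %")
print(f"expected down time = {d:.1f} hours/year")
```

For a system that is only scheduled for 16 hours a day, pass
hours_per_year=16*365 instead of the default.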

--------------
From: mcmahan@netcom.UUCP (Dave Mc Mahan)
Availability is a percentage that compares the up time with the down time due
to repair.  Usually down time for repair is expressed
as the amount of time it takes someone to figure out what board failed and
to plug in a new board, as this has proven to be the fastest method for
fixing something.  It also assumes that a good supply of spare boards that
work are available.  The math gets quite complicated if you assume a certain
probability of the spare boards not working, but a good approximation is
given by:

Availability = up_time / (up_time + down_time)

This assumes that down_time is very short as compared to up_time and that
the probability of a failure in a spare unit is very low.  This is usually
the case.

To figure out the availability, you need to know the expected failure rate
and the mean time to repair.  Expected failure rate is found from the MTBF.
It is 1/MTBF.  This value is usually expressed in FITs, i.e. in units of
failures per billion hours.  Something like a mica cap being used at .8 of
the rated voltage would have a failure rate of 4 FITs at 30 degrees C.
That means at that temp and voltage rating, you will see 4 failures every
billion hours.  That's a failure of about 1 per every 28,500 years.  Not
much for such a part, but it adds up quickly when you start throwing a
couple of hundred parts on a board and some of them are active.  A transistor
has a failure rate of something like 60 FITs, depending on its usage and
ambient temp.
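
The conversion from FITs to MTBF is simple enough to sketch in Python.  This
assumes the usual definition of 1 FIT as one failure per 10^9 device-hours and
a year of 8760 hours; the function names are my own.

```python
HOURS_PER_YEAR = 8760.0
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours

def fit_to_mtbf_hours(fits):
    """MTBF in hours for a part with the given failure rate in FITs."""
    return HOURS_PER_FIT_UNIT / fits

def fit_to_mtbf_years(fits):
    """Same as above, expressed in years."""
    return fit_to_mtbf_hours(fits) / HOURS_PER_YEAR

for name, fits in [("mica cap", 4), ("transistor", 60)]:
    print(f"{name}: {fits} FITs -> MTBF of about {fit_to_mtbf_years(fits):,.0f} years")
```

For the 4 FIT capacitor this gives roughly 28,500 years, matching the figure
quoted above.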


------------------------------------------------------------------------------
3.
And how can I calculate MTBF of a system built of components of known MTBF
ratings?

--------------
From: Jim Kummer <kummer@pogo.den.mmc.com> (Wed Dec  5 11:04:33 1990)
This can get complicated for a system with many components.  
Failure rates combine like electrical resistance.
Serial components add failure rates, or, in terms of MTBF, it
is the reciprocals that add.  
    1/MTBF(combined) = 1/MTBF(1) + 1/MTBF(2) + ...
And vice versa for the parallel redundant components.

You also must compute the combined MTTR, which is the weighted average 
of all the component MTTRs, where the weighting factor is inverse of 
the component MTBF, i.e.
    MTTR(system) = MTBF(system)*(MTTR(1)/MTBF(1) + MTTR(2)/MTBF(2) +
                                 MTTR(3)/MTBF(3) + ...).
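
The two serial formulas above can be sketched in Python as follows.  This
covers the serial case only (any single component failure takes the system
down); the three components and their figures are invented for illustration.

```python
def series_mtbf(mtbfs):
    """Combined MTBF of serial components: the reciprocals (failure rates) add."""
    return 1.0 / sum(1.0 / m for m in mtbfs)

def system_mttr(mtbfs, mttrs):
    """Average of component MTTRs, each weighted by 1/MTBF of its component."""
    weighted = sum(r / m for m, r in zip(mtbfs, mttrs))
    return series_mtbf(mtbfs) * weighted

# Hypothetical components, all figures in hours.
mtbfs = [10_000.0, 20_000.0, 40_000.0]
mttrs = [2.0, 4.0, 8.0]

sys_mtbf = series_mtbf(mtbfs)
sys_mttr = system_mttr(mtbfs, mttrs)
sys_avail = sys_mtbf / (sys_mtbf + sys_mttr)

print(f"system MTBF = {sys_mtbf:.1f} hours")
print(f"system MTTR = {sys_mttr:.2f} hours")
print(f"system availability = {100.0 * sys_avail:.3f} %")
```

Note that the system MTBF always comes out lower than that of the least
reliable component, since every added part contributes its own failure rate.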

There are computerized models to aid in these calculations.  One such 
was developed for NASA, called ARAM - Automated Reliability, 
Availability, Maintainability program.  It runs on a PC/AT.  Perhaps if
you inquire at one of the NASA offices, you can get ahold of this program.

--------------
From: mcmahan@netcom.UUCP (Dave Mc Mahan)
The aggregate MTBF of a system can be found by first calculating the FITs of
each part (1/MTBF).  Now, sum up all those FITs, and convert it back to
MTBF by using the formula 1/FITs.  If you have 1000 parts, each with
an MTBF of 100 years, all stuck in a system, you can expect that system
to have an MTBF of 876 hours, or 36.5 days.  This example assumes 365 days/year.
As you can see, all those little FITs add up!!!
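
The 1000-part example can be reproduced with a few lines of Python.  A sketch
only: it assumes, as the text does, 365 days per year and that any one part
failing counts as a system failure.

```python
HOURS_PER_YEAR = 365 * 24

n_parts = 1000
part_mtbf_hours = 100 * HOURS_PER_YEAR  # 100 years per part

# Failure rates of the individual parts simply add up.
system_failure_rate = n_parts * (1.0 / part_mtbf_hours)
system_mtbf_hours = 1.0 / system_failure_rate
system_mtbf_days = system_mtbf_hours / 24

print(f"system MTBF = {system_mtbf_hours:.0f} hours = {system_mtbf_days:.1f} days")
```

This gives the 876 hours, or 36.5 days, quoted above.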

------------------------------------------------------------------------------
4. Documentation references etc.

---------------
From: mcmahan@netcom.UUCP (Dave Mc Mahan)
The best reference I know of is Mil Handbook 217D (or 217E, or whatever version
they are currently on).  It has all kinds of parts listed, correction factors,
formulas, and deratings.  This handbook is about 4 inches thick and is
double sided.  It is published by some arm of the US Govt or DoD, I'm not sure
which.

---------------
From: kell@mprgate.mpr.ca (Dave Kell)
I believe the most commonly accepted (excluding commercial computer
manufacturers) definition of these is found in the US DoD document
MIL-HDBK-217E (I think E is the current revision).  It describes
all of what you ask.

------------------------------------------------------------------------------
Thanks again.

      -m-------   Chris Jankowski - Senior Systems Engineer   chris@yarra.oz.au
    ---mmm-----   Pyramid Technology Corporation Pty. Ltd.  fax  +61 3 521 3799
  -----mmmmm---   1st Floor, 553 St. Kilda Road             tel. +61 3 525 1730
-------mmmmmmm-   Melbourne, Victoria, 3004       AUSTRALIA       (03) 521 3799

socialism, n. - a tortuous way from capitalism to capitalism - (new Russian
                definition).