chris@yarra.oz.au (Chris Jankowski) (12/13/90)
The following is an edited summary of all responses to my four
questions. I hope my editing did not distort the answers I received.
Many thanks to all respondents.

------------------------------------------------------------------------------

1. Can somebody tell me what is the exact definition of MTBF?

--------------
From: Jim Kummer <kummer@pogo.den.mmc.com>

Mean Time Between Failures. Usually represented by a statistical
distribution called the "bathtub curve" - dropping steeply at first
(infant mortality), then leveling off (exponential mortality), and then
arcing back up further out (wear-out). MTBF is the reciprocal of the
failure rate, usually represented by the symbol "lambda". This is
because of the root in exponential failure models, e raised to the
power "minus lambda t".

--------------
From: north@manta.NOSC.MIL (Mark H. North)

It's a parameter (mean) of the exponential distribution. The
probability of failure is usually modeled as

    p(t) = 1 - exp(-t/MTBF).

--------------
From: mark@mips.COM (Mark G. Johnson)

*********************************************************************
*                                                                   *
*   Probability( a failure will                                     *
*                occur between       =  1.0 - exp(-T/m)             *
*                time 0 and time T )                                *
*                                                                   *
*   where the parameter m>0 is called the Mean Time Before Failure  *
*                                                                   *
*********************************************************************

Note that *the* *distribution* *is* *highly* *skewed*, towards early
failures (failures before the MTBF). The mathematical definition of
MTBF implies the following nonintuitive result:

WEIRD FACT: If a doodad-widget has an MTBF of 5 years, what is the
probability that it will fail sometime between time = 0 and
time = 5 years?

Answer: 63.2 percent!! [1 - exp(-1)]

This bugs me; the word "mean" in MTBF sure makes me want to say the
answer ought to be 50%. But it isn't, too bad. Moral: be careful.

WEIRD FACT #2: If a doodad-widget has an MTBF of 5 years, what is the
probability that it will fail sometime between time = 0 and
time = 3.46 years?
Answer: 50 percent.

--------------
And another definition with subsequent correction from another
respondent:

From: mcmahan@netcom.UUCP (Dave Mc Mahan)

It is the Mean Time Before Failure. It is expressed in units of hours
(usually) and is the expected life of a product before you can expect
half the units to fail. If your MTBF is 87,600 hours (ten years) and
you have 1000 units, you can expect 1/2 to fail within the first 10
years due to all causes. It is really a statistical approximation of
how long something can be expected to continue to work.

From: dtate@unix.cis.pitt.edu (David M Tate)

Actually, as you've defined it here, this is *median* time before
failure. For many lifetime distributions (including the
frequently-invoked Weibull and exponential distributions), the mean and
the median are not the same. For example, if individual product
lifetimes are exponentially distributed with mean 1 unit, then the
*average* individual lifetime will converge to 1 as you look at more
and more units, but the expected time before half have failed will be
t such that Pr{ lifetime <= t } = .5, which in this case would be
ln(2) = .693... time units.

------------------------------------------------------------------------------

2. And how can I calculate availability of a system as a % of uptime
   per year knowing its MTBF? And what is availability anyway?

--------------
From: Jim Kummer <kummer@pogo.den.mmc.com>

Availability is simply the percentage of time that the system is
available - uptime divided by total time. Of course, "total time" is
that time that the system should have been up - that is, if your system
is only supposed to be operating 16 hours a day, then that is the value
that should be used in computing the denominator, i.e.

 - availability = uptime (hours)/(16*(number of days in sample))

also,

 - availability = 1 - downtime/total_time.

To estimate availability, you need MTBF, and also MTTR - Mean Time To
Repair. Then

 - availability = MTBF/(MTBF + MTTR).
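
The formulas quoted so far (the exponential failure model from question
1 and the availability estimate just above) can be put into a minimal
Python sketch. The function names, the 365-day year and the sample
figures (5-year MTBF, 4-hour MTTR) are assumptions made for
illustration; none of them come from the respondents.

```python
import math

# Assumption: a 365-day year of continuous (24 hours/day) operation.
HOURS_PER_YEAR = 365 * 24


def failure_probability(t, mtbf):
    """Probability of a failure between time 0 and time t: 1 - exp(-t/MTBF)."""
    return 1.0 - math.exp(-t / mtbf)


def median_life(mtbf):
    """Time by which half the units are expected to have failed: MTBF * ln(2)."""
    return mtbf * math.log(2.0)


def availability(mtbf, mttr):
    """Availability estimate MTBF / (MTBF + MTTR), as a fraction of 1."""
    return mtbf / (mtbf + mttr)


def downtime_hours_per_year(mtbf, mttr, scheduled_hours=HOURS_PER_YEAR):
    """Expected downtime over the hours the system is supposed to be up."""
    return (1.0 - availability(mtbf, mttr)) * scheduled_hours


if __name__ == "__main__":
    # Hypothetical system: MTBF of 5 years, MTTR of 4 hours.
    mtbf_hours = 5 * HOURS_PER_YEAR
    mttr_hours = 4.0
    print("P(fail within one MTBF):", failure_probability(mtbf_hours, mtbf_hours))
    print("median life in years:   ", median_life(5.0))
    print("availability:           ", availability(mtbf_hours, mttr_hours))
    print("downtime hours per year:", downtime_hours_per_year(mtbf_hours, mttr_hours))
```

For a system that is only scheduled to run 16 hours a day, pass
scheduled_hours=16 * 365 instead of the default, in line with the
"total time" remark above.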

--------------
From: mcmahan@netcom.UUCP (Dave Mc Mahan)

Availability is a percentage expressed in the amount of down time due
to repair compared to the up time. Usually down time for repair is
expressed as the amount of time it takes someone to figure out what
board failed and to plug in a new board, as this has proven to be the
fastest method for fixing something. It also assumes that a good supply
of spare boards that work are available. The math gets quite
complicated if you assume a certain probability of the spare boards not
working, but a good approximation is given by:

    Availability = up_time / (up_time + down_time)

This assumes that down_time is very short as compared to up_time and
that the probability of a failure in a spare unit is very low. This is
usually the case.

To figure out the availability, you need to know the expected failure
rate and the mean time to repair. Expected failure rate is found from
the MTBF. It is 1/MTBF. This value is usually expressed in FITs, that
is, failures per billion (10^9) hours. Something like a mica cap being
used at .8 of the rated voltage would have a failure rate of 4 FITs at
30 degrees C. That means at that temp and voltage rating, you will see
4 failures every billion hours. That's a failure of about 1 per every
28,500 years. Not much for such a part, but it adds up quickly when you
start throwing a couple of hundred parts on a board and some of them
are active. A transistor has a failure rate of something like 60 FITs,
depending on its usage and ambient temp.

------------------------------------------------------------------------------

3. And how can I calculate MTBF of a system built of components of
   known MTBF ratings?

--------------
From kummer@pogo.den.mmc.com Wed Dec 5 11:04:33 1990

This can get complicated for a system with many components. Failure
rates combine like electrical resistance. Serial components add failure
rates, or, in terms of MTBF, it is the reciprocals that add.

    1/MTBF(combined) = 1/MTBF(1) + 1/MTBF(2) + ...

And vice versa for the parallel redundant components. You also must
compute the combined MTTR, which is the weighted average of all the
component MTTRs, where the weighting factor is the inverse of the
component MTBF, i.e.

 - MTTR(system) = MTBF(system)*(MTTR(1)/MTBF(1) + MTTR(2)/MTBF(2) +
                                MTTR(3)/MTBF(3) + ...).

There are computerized models to aid in these calculations. One such
was developed for NASA, called ARAM - Automated Reliability,
Availability, Maintainability program. It runs on a PC/AT. Perhaps if
you inquire at one of the NASA offices, you can get ahold of this
program.

--------------
From: mcmahan@netcom.UUCP (Dave Mc Mahan)

The aggregate MTBF of a system can be found by first calculating the
FITs of each part (1/MTBF). Now, sum up all those FITs, and convert it
back to MTBF by using the formula 1/FITs. If you have 1000 parts, each
with an MTBF of 100 years, all stuck in a system, you can expect that
system to have an MTBF of 876 hours, or 36.5 days. This example assumes
365 days/year. As you can see, all those little FITs add up!!!

------------------------------------------------------------------------------

4. Documentation references etc.

---------------
From: mcmahan@netcom.UUCP (Dave Mc Mahan)

The best reference I know of is Mil Handbook 217D (or 217E, or whatever
version they are currently on). It has all kinds of parts listed,
correction factors, formulas, and deratings. This handbook is about 4
inches thick and is double sided. It is published by some arm of the US
Govt or DoD, I'm not sure which.

---------------
From: kell@mprgate.mpr.ca (Dave Kell)

I believe the most commonly accepted (excluding commercial computer
manufacturers) definition of these is found in the US DoD document
MIL-HDBK-217E (I think E is the current revision). It describes all of
what you ask.

------------------------------------------------------------------------------

Thanks again.
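
The series-combination rule from question 3 (reciprocals of MTBF add,
and the system MTTR is the failure-rate-weighted average of the
component MTTRs) can be sketched in Python as below. This covers
components in series only; the function names and the sample MTTR
figures are assumptions for illustration, and the 1000-part case is the
one worked in the responses.

```python
def series_mtbf(mtbfs):
    """MTBF of components in series: 1/MTBF(combined) = sum of 1/MTBF(i)."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


def series_mttr(mtbfs, mttrs):
    """System MTTR = MTBF(system) * sum of MTTR(i)/MTBF(i)."""
    if len(mtbfs) != len(mttrs):
        raise ValueError("need one MTTR per component")
    return series_mtbf(mtbfs) * sum(r / m for m, r in zip(mtbfs, mttrs))


if __name__ == "__main__":
    hours_per_year = 365 * 24  # same 365 days/year assumption as in the text

    # 1000 parts, each with an MTBF of 100 years, all in series.
    parts = [100.0 * hours_per_year] * 1000
    print("system MTBF in hours:", series_mtbf(parts))

    # Hypothetical two-component system with different repair times.
    print("system MTTR in hours:", series_mttr([100.0, 200.0], [1.0, 4.0]))
```

All MTBF and MTTR values passed in must use the same time unit.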

         -m-------  Chris Jankowski - Senior Systems Engineer  chris@yarra.oz.au
      ---mmm-----   Pyramid Technology Corporation Pty. Ltd.   fax +61 3 521 3799
   -----mmmmm---    1st Floor, 553 St. Kilda Road              tel. +61 3 525 1730
-------mmmmmmm-     Melbourne, Victoria, 3004  AUSTRALIA       (03) 521 3799

socialism, n. - a tortuous way from capitalism to capitalism
                - (new Russian definition).