mark@mips.COM (Mark G. Johnson) (09/04/89)
It's impossible to build a perfect arbiter/synchronizer, one that will unambiguously decide, in bounded time, which of two asynchronous inputs arrived first. However, with careful design the probability of failure can be made quite small (but not zero). Recently chip vendors have introduced "metastable hardened" flip-flops, characterized arbiters, "synchronizer qualified" PALs, and so forth. These manufacturers are working hard to provide reliable, easy-to-use, inexpensive chips that have a low probability of failure.

UNFORTUNATELY the synchronizer failure data they supply is given in a form that is all too often misunderstood by customers (system designers). Manufacturers' datasheets {e.g. 74F786, 74AS3374} loudly trumpet the fact that, if you operate their chip under such-and-such conditions, the Mean Time Between Failures will be 1 century (100 years; about 3E9 seconds).

The problem is the phrase "Mean Time Between Failures"; it conjures up an intuitive model that is comfortable but mathematically incorrect. For example: a certain brand of wristwatch has an MTBF of 5 years; what's the probability that a given watch will still work properly after 5 years? Intuitively the answer seems to be 50%; this is the way "mean"s ought to work. But 50% is incorrect; we've let the word "mean" mislead our intuition. There is actually a _37%_ probability that the watch will still work properly after 5 years (1.00 * MTBF) of time.

I hypothesize that when people think of a "mean", they imagine the average value of a distribution which is approximately _symmetric_. The very familiar Gaussian (bell curve) distribution is indeed symmetric, as are those of many physical phenomena in everyday life. Sadly, the distribution of random failures (parameterized by MTBF) is distinctly _asymmetric_, so intuition is in this case misleading. Intuition says 50%, but the answer is 37%.
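The 37% figure follows directly from the exponential survival law exp(-T/m). A short Python sketch (mine, not part of the original analysis; the function name is just illustrative) checks the wristwatch arithmetic:

```python
import math

def survival_probability(t, mtbf):
    """Probability that a unit with exponentially distributed
    failures (mean time between failures = mtbf) is still
    working at time t. Same time units for both arguments."""
    return math.exp(-t / mtbf)

# The wristwatch example: MTBF = 5 years, checked at t = 5 years.
p = survival_probability(5.0, 5.0)
print(round(p, 2))   # 0.37 -- not the intuitive 0.50
```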
Philosophizing aside, the distribution of random failures is given by

    Probability( a failure will
                 occur between       =  1.0 - exp(-T/m)
                 time 0 and time T )

where the parameter m>0 is called the Mean Time Between Failures. Note that the distribution is highly skewed towards early failures: 63% of failures occur before T=m.

_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-

Now let's put this equation to use. Suppose we are system designers at Hewlett International Digital Sun Business Machines, Inc. (HIDSBM), and we are going to build & sell a system that contains a synchronizer. HIDSBM plans to sell 50,000 of these systems, and the systems have a useful life of 5 years before they become obsolete and are thrown away.

First, let's compute the probability that one or more of the 50,000 systems will fail sometime within 5 years due to a synchronizer failure. Assuming the systems fail independently,

    P  =  1 - [exp(-5/100)]^50,000  =  1 - exp(-2500)  ~  1

(and the _expected_ number of failures is 50,000 * [1 - exp(-5/100)], about 2,440). So it is virtually guaranteed that one or more customers will have their HIDSBM system mysteriously fail at least once during the 5 year lifespan, if the synchronizer MTBF is 100 years.

The Big Boss at HIDSBM has handed down an edict. She wants us to design the system such that the entire customer base (50,000 systems) will experience fewer than 100 synchronizer failures over the 5 year product life. We have to choose a synchronizer that gives this performance.

Let's decide what she really meant. "There will be fewer than 100 synchronizer failures" means "the probability of 100 or more failures is small". Let's see what happens if we let the probability of 100 or more failures be 33 percent. (Thus we have a 1-in-3 chance of being fired by the Big Boss because of >99 synchronization failures in the field.)
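The fleet-wide numbers can be reproduced from the same exponential law, assuming the 50,000 systems fail independently. A short Python sketch (mine; variable names are illustrative):

```python
import math

MTBF_YEARS = 100.0    # synchronizer MTBF claimed on the datasheet
LIFE_YEARS = 5.0      # useful life of each system
N_SYSTEMS = 50_000    # size of the installed base

# Probability that ONE system sees a synchronizer failure in 5 years.
p_one = 1.0 - math.exp(-LIFE_YEARS / MTBF_YEARS)       # about 0.049

# Probability that AT LEAST ONE of the 50,000 systems fails,
# assuming independent failures: 1 - P(all survive).
p_any = 1.0 - math.exp(-N_SYSTEMS * LIFE_YEARS / MTBF_YEARS)

# Expected number of failed systems over the product life.
expected = N_SYSTEMS * p_one                           # about 2,440

print(p_one, p_any, expected)
```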
    P( 100 or more failures
       out of 50,000 units     =  0.33  =  (50,000 / 100) * [1 - exp(-5/m)]
       over a 5 year period )

(The right-hand side is the conservative estimate "expected number of failures, divided by 100"; by Markov's inequality this is an upper bound on the probability of 100 or more failures, so setting it to 0.33 errs on the safe side.) Solving,

    m  =  required MTBF of synchronizer  =  7,573 years

So, in this approximately-real-world example, the chip manufacturer's proud claim "1 century MTBF" isn't good enough. We need a synchronizer that's 76 times better.

To conclude: Reliable systems require _very_ reliable synchronizers. Rather than rely upon intuition, it's best to use the random-failure probability equation. For some types of design problems, the required MTBF of synchronizers can reach into the tens of thousands of years. Chip manufacturers aren't being deliberately misleading; it's just that the phrase "Mean Time Between Failures is 100 years" has a mathematical implication which is different from what intuition would predict.
-- 
 -- Mark Johnson
MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086 (408) 991-0208
mark@mips.com  {or ...!decwrl!mips!mark}
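The 7,573-year figure in the post above can be checked by inverting the design equation 0.33 = (50,000/100) * [1 - exp(-5/m)] for m. A short Python sketch (mine; constant names are illustrative):

```python
import math

TARGET_P = 0.33      # acceptable probability of 100+ failures
N_SYSTEMS = 50_000   # installed base
MAX_FAILS = 100      # failure budget set by the Big Boss
LIFE_YEARS = 5.0     # product life

# Invert  TARGET_P = (N_SYSTEMS / MAX_FAILS) * [1 - exp(-5/m)]  for m.
p_per_unit = TARGET_P * MAX_FAILS / N_SYSTEMS        # 6.6e-4
m = -LIFE_YEARS / math.log(1.0 - p_per_unit)

print(round(m))      # 7573 years
```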
amos@taux01.UUCP (Amos Shapir) (09/05/89)
In article <26811@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:

[An excellent analysis of MTBF - if you have not read it yet, go back and
read it now]

|Philosophizing aside, the distribution of random failures is given by
|
|    Probability( a failure will
|                 occur between       =  1.0 - exp(-T/m)
|                 time 0 and time T )
|
|where the parameter m>0 is called the Mean Time Between Failures.
|Note that the distribution is highly skewed, towards early failures
|(failures before T=m).

Maybe we should use the term Half Life, which physicists use for radioactive decay; it should be computed as

    HL = ln(2) * MTBF
-- 
	Amos Shapir		amos@taux01.nsc.com or amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522261  TWX: 33691, fax: +972-52-558322
34 48 E / 32 10 N  (My other cpu is a NS32532)
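The proposed half-life figure is the time at which half the population is expected to have failed; it comes from solving 0.5 = exp(-t/MTBF) for t. A short Python sketch (mine, not from either posting):

```python
import math

def half_life(mtbf):
    # Time at which half the units are expected to have failed:
    # solve 0.5 = exp(-t/mtbf)  =>  t = ln(2) * mtbf
    return math.log(2.0) * mtbf

# A "100 year MTBF" synchronizer reaches 50% failures in ~69 years.
print(round(half_life(100.0), 1))   # 69.3
```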