[comp.sys.ibm.pc] engineering margins

rcd@ico.isc.com (Dick Dunn) (11/11/89)

Just because yours works...

It never ceases to amaze me how cavalier some people are about ignoring the
margins for proper operation engineered into products--or for that matter,
how much time they'll spend rationalizing their decisions to "live on the
edge" by doing so.

Yes, a 20-MHz 386 will probably run just fine at 25 MHz.  It may even run
at 30 MHz.  Yes, 100-ns DRAMs may run at 80 ns.  Yes, most MFM drives can
be run as RLL instead.  So why is the 20-MHz processor marked as 20
instead of 25, etc.?

It's so that it will be reliable!  That means a lot more than just saying
that it will work in your machine today...it means that it will work in any
properly designed machine, both now and years from now when some of the
components have drifted in value a bit.  It means that it will work in a
machine which just happens to have gotten the worst-case mix of parts at
the edges of their rated tolerances.  It means that it will work if the
machine gets warm or cool.  It means that it will work in an aging machine
with worst-case components in a room at 85 F.
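
To put rough numbers on the worst-case argument, here is a minimal sketch in
Python. Every figure in it is an assumption made up for illustration: the
34 ns critical-path delay, the 1.35 slowdown factor, and the helper names
period_ns and slack_ns do not come from any data sheet. It only shows how
slack that looks comfortable under typical conditions can go negative once
temperature, aging, and part tolerances are stacked up.

```python
# Hypothetical timing budget: every number below is an illustrative assumption.

def period_ns(clock_mhz):
    """Clock period in nanoseconds for a clock frequency in MHz."""
    return 1000.0 / clock_mhz

def slack_ns(clock_mhz, path_delay_ns, derating=1.0):
    """Time left in one clock period after the critical path has settled.

    derating > 1.0 models a path that has slowed down (heat, aging,
    slow-corner parts). Negative slack means the timing is violated.
    """
    return period_ns(clock_mhz) - path_delay_ns * derating

TYPICAL_PATH_NS = 34.0   # assumed critical-path delay of a typical part
WORST_CASE = 1.35        # assumed combined slowdown from all effects

for mhz in (20, 25):
    typical = slack_ns(mhz, TYPICAL_PATH_NS)
    worst = slack_ns(mhz, TYPICAL_PATH_NS, WORST_CASE)
    print(f"{mhz} MHz: typical slack {typical:+.1f} ns, worst case {worst:+.1f} ns")
```

With these made-up numbers the 20 MHz clock keeps positive slack in both
cases, while the 25 MHz clock has positive slack for the typical part and
negative slack in the worst case.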

The fact that you can run your 20 MHz chip at 25 really doesn't mean squat.
Individual examples of pushing tolerances don't warrant generalizations.
It may work that way forever, or it may fail tomorrow.  If it works, you're
lucky, not clever.  If it works for a year and then fails at the demo in
front of your biggest customer, you were unlucky...and stupid.  The
engineers who designed the chip and worked out the manufacturing process
planned for enough margin to allow things to work with normal variations.
You threw away the safety margin.  Now, maybe it's worthwhile to accept the
risk of system failure to get extra speed or capacity--that's a decision
you may want to make...just make it with your eyes open.  Don't think
you're getting something for nothing.

BUT WAIT, you say:  What about those motherboards we've been hearing about
that advertise 25 MHz but come with a 20 MHz processor?  Simple--there are
bad engineers who design boards.  Running a CPU at 25% over its rated clock
speed is just plain sloppy.  It's a way to make a cheaper, poorer product.
Some of them will fail eventually; the failure rate will be low, although
it will be a lot higher than that of a board built with a proper 25 MHz
processor.
If you get such a board, you're taking a higher chance that you'll have to
return the board or get them to exchange the processor for the right part,
leaving you in the meantime with a flaky or inoperative system.  Again,
that may be what you want, trying to get the most speed possible within a
limited budget and accepting a greater chance of problems as part of the
compromise...just go into it aware of the tradeoff.

My personal view on the motherboard business is that I wouldn't buy one
from a company that put in a processor rated slower than the motherboard
clock speed, EVEN IF they agreed to upgrade the processor on request.
Why not?  Trust.  It's easy to see the discrepancy in the processor clock
speed since it's stamped on the chip package.  It's not easy to find out
where else on the motherboard they may have cheated on proper timing
tolerances--too many gate delays or whatever...but if I see it once, I'll
suspect it's happened more than once.  Speed isn't the only thing that
matters to me.
-- 
Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
   ...Keep your day job 'til your night job pays.

pjh@mccc.uucp (Pete Holsberg) (11/12/89)

My experience in manufacturing electronic components (granted, it was a
long time ago) suggests that reliability is not guaranteed no matter
what.  A manufacturer tests a part, say for speed, and on the basis of
actual performance, places each device into a separate category.  For
example, if a 386 performs OK during the 25 MHz test, it's marked
"25MHz".  If not, it's tested for 20MHz, etc.  To be sure, some kind of
life testing is usually done to ensure that the test itself is reliable.
 But on the other hand, I'll bet that testing is done statistically,
i.e., a sample of a batch of devices is tested and on the basis of the
test results, the entire batch is categorized.

Perhaps the safest approach is to derate the parts: use a 25MHz part at
20MHz, just to be sure that "drift" never gets you in trouble during the
lifetime of the device.

-- 
Pete Holsberg                UUCP: {...!rutgers!}princeton!mccc!pjh
Mercer College               CompuServe: 70240,334
1200 Old Trenton Road        GEnie: PJHOLSBERG
Trenton, NJ 08690            Voice: 1-609-586-4800

poffen@chomolungma (Russ Poffenberger) (11/15/89)

In article <1989Nov12.154710.25669@mccc.uucp> pjh@mccc.UUCP (Pete Holsberg) writes:
>
>My experience in manufacturing electronic components (granted, it was a
>long time ago) suggests that reliability is not guaranteed no matter
>what.  A manufacturer tests a part say for speed, and on the basis of
>actual performance, places each device into a separate category.  For
>example, if a 386 performs OK during the 25 MHz test, it's marked
>"25MHz".  If not, it's tested for 20MHz, etc.  To be sure, some kind of
>life testing is usually done to ensure that the test itself is reliable.
> But on the other hand, I'll bet that testing is done statistically,
>i.e., a sample of a batch of devices is tested and on the basis of the
>test results, the entire batch is categorized.
>
>Perhaps the safest approach is to derate the parts: use a 25MHz part at
>20MHz, just to be sure that "drift" never gets you in trouble during the
>lifetime of the device.
>


OK. You guys asked for it; here is what most companies do about testing parts.
I think I am qualified because I have worked for 5 years with a company that
makes testers to test these types of devices.
Note that different manufacturers have different strategies, and these
strategies may vary between different types of parts (e.g., memory devices may
be tested differently from uPs or ASICs).

Virtually all parts are tested and categorized for speed (if applicable).
Sometimes manufacturers will run extended characterization tests on select
parts (every hundred or so). The speed categorization works much as described
above: the parts are tested at the most demanding speed, and if they fail, may
be downgraded to a slower test and tried there. This process repeats with the
standard speed categories until the part passes one of them, or fails all
tests, in which case it is trashed.
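
The downgrade loop described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the list of speed
grades, the names bin_part and part_with_limit, and the stand-in for the
tester are all hypothetical, and a real test program checks far more than a
single pass/fail per speed.

```python
# Sketch of speed binning; grades and the pass/fail test are hypothetical.

SPEED_GRADES_MHZ = [33, 25, 20, 16]   # assumed standard grades

def bin_part(passes_at, grades=SPEED_GRADES_MHZ):
    """Return the fastest grade the part passes, or None if it fails them all.

    passes_at is a callable standing in for the tester: it takes a speed
    in MHz and returns True if the part passed the test at that speed.
    """
    for mhz in sorted(grades, reverse=True):   # most demanding test first
        if passes_at(mhz):
            return mhz                         # mark the part with this grade
    return None                                # failed every test: scrap it

def part_with_limit(max_mhz):
    """Toy part model: passes any test at or below max_mhz."""
    return lambda mhz: mhz <= max_mhz

print(bin_part(part_with_limit(27.5)))   # prints 25
print(bin_part(part_with_limit(12.0)))   # prints None
```

Passing a shorter list, for example bin_part(part_with_limit(27.5),
grades=[20, 16]), returns 20: a part is only ever marked with a grade it was
actually tested at, whatever it might be capable of.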

Note that the testing often is more complicated than just speed categories.
Parts may also be tested at various speed and voltage settings. They may also
be tested at the wafer level (before packaging) to identify parts that may
meet Mil requirements.

Occasionally, parts may never be tested at the maximum speed possible due
to market conditions. If the latest demand is for 20MHz parts, they may skip
the 25MHz tests altogether in order to fill market demand. In this case, you
may get a part that can run at a faster speed, but you have no way of knowing.

Even when parts pass a certain speed category, the test is usually done under
more rigorous conditions than the part will see in its final application.
Usually they are tested at both elevated and chilled temperatures, and at both
high and low VCC.
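
As a rough illustration of why a part that runs fast on your desk can still
carry a slower marking, here is a small Python sketch of testing at the
temperature and voltage corners. The corner values, the function names, and
especially the toy_part model are assumptions invented for the example; real
limits depend on the part and the manufacturer.

```python
from itertools import product

# Hypothetical test corners; real values depend on the part and the vendor.
TEMPERATURES_C = [0, 70]      # assumed chilled and elevated temperatures
SUPPLY_VOLTS = [4.75, 5.25]   # assumed low and high VCC (5 V +/- 5%)

def passes_all_corners(passes_at, mhz):
    """True only if the part passes at every temperature/voltage corner."""
    return all(passes_at(mhz, temp_c, vcc)
               for temp_c, vcc in product(TEMPERATURES_C, SUPPLY_VOLTS))

def toy_part(mhz, temp_c, vcc):
    """Toy model (an assumption): slower when hot and when VCC is low."""
    max_mhz = 30.0 - 0.05 * temp_c - 8.0 * (5.25 - vcc)
    return mhz <= max_mhz

print(toy_part(25, 25, 5.0))              # bench conditions: 25 C, 5.0 V
print(passes_all_corners(toy_part, 25))   # all four corners at 25 MHz
print(passes_all_corners(toy_part, 20))   # all four corners at 20 MHz
```

In this toy model the part runs at 25 MHz under bench conditions but fails
the hot, low-voltage corner, so it would only qualify for the 20 MHz grade.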

My judgement is that a part should only be run at its rated speed. Running it
faster may work and never give problems, but it may also make the part run
hotter than normal, and the cooling of the computer may not handle it well
enough (note that all parts in the computer, not just the CPU, run faster and
generate more heat), so the parts may fail at these elevated temperatures.


Russ Poffenberger               DOMAIN: poffen@sj.ate.slb.com
Schlumberger Technologies       UUCP:   {uunet,decwrl,amdahl}!sjsca4!poffen
1601 Technology Drive		CIS:	72401,276
San Jose, Ca. 95110
(408)437-5254
-------------------------
In a dictatorship, people suffer without complaining.
In a democracy, people complain without suffering.