[comp.lsi] 80386 Multiply: quote from Intel

mark@mips.UUCP (Mark G. Johnson) (08/08/87)

In article <376@astroatc.UUCP>, johnw@astroatc.UUCP (John F. Wardale) writes:

>	A couple weeks ago, I asked about the infamous Intel 386 multiply
>	bug...  Does anyone REALLY KNOW, or is the world in as much
>	darkness as I am?  Should I request the "test/demo" program and
>	try to grok it?  (I don't speak Intel assembler, but this may be
>	the only way!)
>
>	Can anyone shed any more light on this?  We're involved in
>	designing a large computer thru extensive use of simulation, as
>	I'm sure Intel did.  I'm interested in the following:
>
>	* i386 multiply bug details
>		-- do all the failing parts ALWAYS fail the same way, or
>		   is it a transient failure
>	* why the simulations missed it (tricky 2nd order effect???)
>	* pitfalls of simulations (in general)
>

The July, 1987 issue of COMPUTER DESIGN magazine has an article on page 22
	"80386 Multiplier Problem Spotlights VLSI Testability Issues"
which you might find illuminating.  The following excerpt is taken
without permission:

{beginning of excerpt}
`Contrary to industry speculation, the 80386 multiplier errors result
from a layout problem, not from an error in logic design.  "We didn't
allow enough margin to catch the worst-case pattern in the multiplier
at the corners of our process," explains Dana Krelle, Intel's 80386
marketing manager.  "As a result, some chips, at some temperature/
voltage/frequency points, will produce errors from particular combinations
of 32-bit operands."  The error, apparently due to unintentional
coupling between adjacent cells in the multiplier, escaped Intel's
simulation and chip verification process until it was spotted in a
subsequent stress-testing program.'	{end of excerpt}

Another interesting facet of the problem hasn't been mentioned
yet --- the 80386 doesn't have a multiplier (!!).  The 386 uses
its general-purpose ALU, plus a shift-and-add microcode routine,
to perform multiplication operations.  Similarly, divide operations
use the ALU plus a shift-and-subtract microcode routine.

The confusing thing is, why do multiplys fail but divides and
adds (apparently) work??  Isn't it possible to supply the
appropriate "bad" ALU operands for a regular add, or to create
them during some step of a divide?
-- 
-Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***	
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-720-1700 x208
US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

ken@sci.UUCP (Ken Karakotsios) (08/21/87)

In article <576@obiwan.UUCP>, mark@mips.UUCP (Mark G. Johnson) writes:
> 
> The July, 1987 issue of COMPUTER DESIGN magazine has an article on page 22
> 	"80386 Multiplier Problem Spotlights VLSI Testability Issues"
> which you might find illuminating.  The following excerpt is taken
> without permission:
> 
> {beginning of excerpt}
> `Contrary to industry speculation, the 80386 multiplier errors result
> from a layout problem, not from an error in logic design.  "We didn't
> allow enough margin to catch the worst-case pattern in the multiplier
> at the corners of our process," explains Dana Krelle, Intel's 80386 [...]
> 
> Another interesting facet of the problem hasn't been mentioned
> yet --- the 80386 doesn't have a multiplier (!!).  The 386 uses
> its general-purpose ALU, plus a shift-and-add microcode routine,
> to perform multiplication operations.  Similarly, divide operations
> use the ALU plus a shift-and-subtract microcode routine.
> 
> The confusing thing is, why do multiplys fail but divides and
> adds (apparently) work??  Isn't it possible to supply the
> appropriate "bad" ALU operands for a regular add, or to create
> them during some step of a divide?

	This is just speculation, since I know nothing about the 386 design.
If they are using a shift and add for multiplication, then they may have a
special shift-by-two-bits latch.  This is useful in doing a what I think is
called a "modified Booth algorithm", where two bits worth of a multiply can
be done each clock cycle, if you can make 1x, 2x and 4x of one of the 
inputs available.  As far as I know, the shift by 2 bits isn't too useful
for much else.  The divide algorithms I know of which don't use parallel
multipliers or groups of adders only shift by 1 bit at a time, so such a 
divide wouldn't use the shift-by-two latch either.  So perhaps the pattern
sensitivity lies in this shift register.

	By the way, I think you can make this multiplication algorithm work for
N bits per clock cycle, if you can provide all the following multiples of one
of the inputs (call it Y) :  Y, 2Y,  ... (2^^N)Y .


	Ken Karakotsios				ken!sci
	Silicon Compiler Systems


	Disclaimer: These are my personal views and opinions, and don't
	reflect those of my employer.

ralf@b.gp.cs.cmu.edu (Ralf Brown) (08/26/87)

In article <7939@sci.UUCP> ken@sci.UUCP (Ken Karakotsios) writes:
>In article <576@obiwan.UUCP>, mark@mips.UUCP (Mark G. Johnson) writes:
>> 
>> Another interesting facet of the problem hasn't been mentioned
>> yet --- the 80386 doesn't have a multiplier (!!).  The 386 uses
>> its general-purpose ALU, plus a shift-and-add microcode routine,
>> to perform multiplication operations.  Similarly, divide operations
>> use the ALU plus a shift-and-subtract microcode routine.
>> 
>> The confusing thing is, why do multiplys fail but divides and
>> adds (apparently) work??  Isn't it possible to supply the
>> appropriate "bad" ALU operands for a regular add, or to create
>> them during some step of a divide?
>
>	This is just speculation, since I know nothing about the 386 design.
>If they are using a shift and add for multiplication, then they may have a
>special shift-by-two-bits latch. [..."modified Booth algorithm"...]

The 386 need not use a shift-by-two-bits latch.  One thing that needs to be
taken into account is that during normal instruction execution, time is needed
to decode the instruction and fetch the operands (even in a pipelined machine).
A microcode routine, on the other hand, needs no time to fetch and decode 
instructions, so the ALU can be run full-tilt.

While I don't know the internals of the 286, the instruction timings for
MUL and IMUL (and the following experience) lead me to believe that it uses a
simple shift-and-add routine in microcode, with one shift and one add per 
clock cycle.  My AT had an 8MHz 80286 being run at 9.83 :-) MHz, which went 
slightly flaky after about 6.5 months (I started seeing one line in the Turbo 
Pascal and Turbo C editors at the wrong place on the screen).  I tracked this 
down to an 8-bit MUL instruction which lost one or two carry bits when the 
high bit of AL was set and the other operand had sufficient 1 bits to cause
a cascade of carries on adding the last term (corresponding to the high bit)
to the result.  The CPU has since been replaced [under warrantee, this system
was sold as a 10 MHz system].

BTW, the specific operands that caused the problems in the Turbo editors were
0A0h in AL and 0Fh as the other operand. (line 15 on the screen--wound up in the
middle of line 4 when the MUL yielded 160h instead of 960h)   And yes, the
problem was temperature-dependent, at least initially--cranking up the air
conditioner made it go away.  Later it got to the point where, after the first
six or seven minutes while the CPU was warming up, even cranking up the air
conditioner or opening up the system didn't help.

-- 
-=-=-=-=-=-=-=-= {harvard,seismo,ucbvax}!b.gp.cs.cmu.edu!ralf =-=-=-=-=-=-=-=-
ARPAnet: RALF@B.GP.CS.CMU.EDU            BITnet: RALF%B.GP.CS.CMU.EDU@CMUCCVMA
AT&Tnet: (412) 268-3053 (school)         FIDOnet: Ralf Brown at 129/31
	        DISCLAIMER?  Who ever said I claimed anything? 
"I do not fear computers.  I fear the lack of them..." -- Isaac Asimov

earl@mips.UUCP (Earl Killian) (08/26/87)

In article <357@ohlone.UUCP>, nelson@ohlone.UUCP (Bron Nelson) writes:
> In article <7939@sci.UUCP>, ken@sci.UUCP (Ken Karakotsios) writes:
> > 	By the way, I think you can make this multiplication algorithm
> > work for N bits per clock cycle, if you can provide all the
> > following multiples of one of the inputs (call it Y) : Y, 2Y, ...
> > (2^^N)Y .
> If'n I remember correctly, in fact you need to generate fairly nasty
> values (like 3Y and 5Y) in order to get the 'N' bit modified Booth
> algorithm to work.  This is why you don't typically see orders higher
> than 2 bits.

Oh no, we're about to get another long series of messages saying "X
does Y-bit booth product" just like the "Z has a W-bit word" series.
I doubt that higher order booth products are uncommon.  To illustrate
how unspecial it is (and be the first in the long series :-) let me
point out that even a RISC microprocessor, the MIPS R2000, uses a
3-bit algorithm for integer multiply.  This requires computing 3Y
first, but that's no big deal.  Integer multiply takes 12 cycles.  A
little slow, but acceptable.  At least it's not 33 cycles, as in
architectures with multiply step instructions.