mark@mips.UUCP (Mark G. Johnson) (08/08/87)
In article <376@astroatc.UUCP>, johnw@astroatc.UUCP (John F. Wardale) writes: > A couple weeks ago, I asked about the infamous Intel 386 multiply > bug... Does anyone REALLY KNOW, or is the world in as much > darkness as I am? Should I request the "test/demo" program and > try to grok it? (I don't speak Intel assembler, but this may be > the only way!) > > Can anyone shed any more light on this? We're involved in > designing a large computer thru extensive use of simulation, as > I'm sure Intel did. I'm interested in the following: > > * i386 multiply bug details > -- do all the failing parts ALWAYS fail the same way, or > is it a transient failure > * why the simulations missed it (tricky 2nd order effect???) > * pitfalls of simulations (in general) > The July, 1987 issue of COMPUTER DESIGN magazine has an article on page 22 "80386 Multiplier Problem Spotlights VLSI Testability Issues" which you might find illuminating. The following excerpt is taken without permission: {beginning of excerpt} `Contrary to industry speculation, the 80386 multiplier errors result from a layout problem, not from an error in logic design. "We didn't allow enough margin to catch the worst-case pattern in the multiplier at the corners of our process," explains Dana Krelle, Intel's 80386 marketing manager. "As a result, some chips, at some temperature/ voltage/frequency points, will produce errors from particular combinations of 32-bit operands." The error, apparently due to unintentional coupling between adjacent cells in the multiplier, escaped Intel's simulation and chip verification process until it was spotted in a subsequent stress-testing program.' {end of excerpt} Another interesting facet of the problem hasn't been mentioned yet --- the 80386 doesn't have a multiplier (!!). The 386 uses its general-purpose ALU, plus a shift-and-add microcode routine, to perform multiplication operations. Similarly, divide operations use the ALU plus a shift-and-subtract microcode routine. The confusing thing is, why do multiplys fail but divides and adds (apparently) work?? Isn't it possible to supply the appropriate "bad" ALU operands for a regular add, or to create them during some step of a divide? -- -Mark Johnson *** DISCLAIMER: The opinions above are personal. *** UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark TEL: 408-720-1700 x208 US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
ken@sci.UUCP (Ken Karakotsios) (08/21/87)
In article <576@obiwan.UUCP>, mark@mips.UUCP (Mark G. Johnson) writes: > > The July, 1987 issue of COMPUTER DESIGN magazine has an article on page 22 > "80386 Multiplier Problem Spotlights VLSI Testability Issues" > which you might find illuminating. The following excerpt is taken > without permission: > > {beginning of excerpt} > `Contrary to industry speculation, the 80386 multiplier errors result > from a layout problem, not from an error in logic design. "We didn't > allow enough margin to catch the worst-case pattern in the multiplier > at the corners of our process," explains Dana Krelle, Intel's 80386 [...] > > Another interesting facet of the problem hasn't been mentioned > yet --- the 80386 doesn't have a multiplier (!!). The 386 uses > its general-purpose ALU, plus a shift-and-add microcode routine, > to perform multiplication operations. Similarly, divide operations > use the ALU plus a shift-and-subtract microcode routine. > > The confusing thing is, why do multiplys fail but divides and > adds (apparently) work?? Isn't it possible to supply the > appropriate "bad" ALU operands for a regular add, or to create > them during some step of a divide? This is just speculation, since I know nothing about the 386 design. If they are using a shift and add for multiplication, then they may have a special shift-by-two-bits latch. This is useful in doing a what I think is called a "modified Booth algorithm", where two bits worth of a multiply can be done each clock cycle, if you can make 1x, 2x and 4x of one of the inputs available. As far as I know, the shift by 2 bits isn't too useful for much else. The divide algorithms I know of which don't use parallel multipliers or groups of adders only shift by 1 bit at a time, so such a divide wouldn't use the shift-by-two latch either. So perhaps the pattern sensitivity lies in this shift register. By the way, I think you can make this multiplication algorithm work for N bits per clock cycle, if you can provide all the following multiples of one of the inputs (call it Y) : Y, 2Y, ... (2^^N)Y . Ken Karakotsios ken!sci Silicon Compiler Systems Disclaimer: These are my personal views and opinions, and don't reflect those of my employer.
ralf@b.gp.cs.cmu.edu (Ralf Brown) (08/26/87)
In article <7939@sci.UUCP> ken@sci.UUCP (Ken Karakotsios) writes: >In article <576@obiwan.UUCP>, mark@mips.UUCP (Mark G. Johnson) writes: >> >> Another interesting facet of the problem hasn't been mentioned >> yet --- the 80386 doesn't have a multiplier (!!). The 386 uses >> its general-purpose ALU, plus a shift-and-add microcode routine, >> to perform multiplication operations. Similarly, divide operations >> use the ALU plus a shift-and-subtract microcode routine. >> >> The confusing thing is, why do multiplys fail but divides and >> adds (apparently) work?? Isn't it possible to supply the >> appropriate "bad" ALU operands for a regular add, or to create >> them during some step of a divide? > > This is just speculation, since I know nothing about the 386 design. >If they are using a shift and add for multiplication, then they may have a >special shift-by-two-bits latch. [..."modified Booth algorithm"...] The 386 need not use a shift-by-two-bits latch. One thing that needs to be taken into account is that during normal instruction execution, time is needed to decode the instruction and fetch the operands (even in a pipelined machine). A microcode routine, on the other hand, needs no time to fetch and decode instructions, so the ALU can be run full-tilt. While I don't know the internals of the 286, the instruction timings for MUL and IMUL (and the following experience) lead me to believe that it uses a simple shift-and-add routine in microcode, with one shift and one add per clock cycle. My AT had an 8MHz 80286 being run at 9.83 :-) MHz, which went slightly flaky after about 6.5 months (I started seeing one line in the Turbo Pascal and Turbo C editors at the wrong place on the screen). I tracked this down to an 8-bit MUL instruction which lost one or two carry bits when the high bit of AL was set and the other operand had sufficient 1 bits to cause a cascade of carries on adding the last term (corresponding to the high bit) to the result. The CPU has since been replaced [under warrantee, this system was sold as a 10 MHz system]. BTW, the specific operands that caused the problems in the Turbo editors were 0A0h in AL and 0Fh as the other operand. (line 15 on the screen--wound up in the middle of line 4 when the MUL yielded 160h instead of 960h) And yes, the problem was temperature-dependent, at least initially--cranking up the air conditioner made it go away. Later it got to the point where, after the first six or seven minutes while the CPU was warming up, even cranking up the air conditioner or opening up the system didn't help. -- -=-=-=-=-=-=-=-= {harvard,seismo,ucbvax}!b.gp.cs.cmu.edu!ralf =-=-=-=-=-=-=-=- ARPAnet: RALF@B.GP.CS.CMU.EDU BITnet: RALF%B.GP.CS.CMU.EDU@CMUCCVMA AT&Tnet: (412) 268-3053 (school) FIDOnet: Ralf Brown at 129/31 DISCLAIMER? Who ever said I claimed anything? "I do not fear computers. I fear the lack of them..." -- Isaac Asimov
earl@mips.UUCP (Earl Killian) (08/26/87)
In article <357@ohlone.UUCP>, nelson@ohlone.UUCP (Bron Nelson) writes: > In article <7939@sci.UUCP>, ken@sci.UUCP (Ken Karakotsios) writes: > > By the way, I think you can make this multiplication algorithm > > work for N bits per clock cycle, if you can provide all the > > following multiples of one of the inputs (call it Y) : Y, 2Y, ... > > (2^^N)Y . > If'n I remember correctly, in fact you need to generate fairly nasty > values (like 3Y and 5Y) in order to get the 'N' bit modified Booth > algorithm to work. This is why you don't typically see orders higher > than 2 bits. Oh no, we're about to get another long series of messages saying "X does Y-bit booth product" just like the "Z has a W-bit word" series. I doubt that higher order booth products are uncommon. To illustrate how unspecial it is (and be the first in the long series :-) let me point out that even a RISC microprocessor, the MIPS R2000, uses a 3-bit algorithm for integer multiply. This requires computing 3Y first, but that's no big deal. Integer multiply takes 12 cycles. A little slow, but acceptable. At least it's not 33 cycles, as in architectures with multiply step instructions.