[comp.arch] Claimed bug in 80286

davidb@inmos.co.uk (David Boreham) (07/26/89)

The following appeared on the front page of a UK trade paper called
   "Electronics Weekly" on 7/19/89 :

 Headline (36pt bold) - "Design Fault Found in Processor Chip"  

Then six column-inches which I won't quote but which said the following:

  `A company called "Alsys" (an ADA compiler house) says that there is
  a bug in the 80286 involving interrupts. They have proved its existence
  using software.'

The article contained no technical information (not any which made any
sense anyhow) and said that Intel were denying that there was any problem.

Although the journalist (Leon Clifford) obviously regarded the possibility
of a bug in the 80286 as something to plaster over the front page (along
with ominous comments about planes falling out of the air due to processor
bugs), readers of this newsgroup will know that bugs in production revisions
of devices are very common and don't really merit all this frothing at the
mouth.

Now... Personally I think that you can't much lower than publishing
  unsubstantiated claims about someone's device in a trade paper. 

However... can anyone shed any informed light on  the matter (or
  comment on the ethics of hopping up and down madly about actual
  or possible bugs) ??
                
David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol,  England            |      (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c

rcd@ico.ISC.COM (Dick Dunn) (07/28/89)

In article <1717@brwa.inmos.co.uk>, davidb@inmos.co.uk (David Boreham) writes:

> The following appeared on the front page of a UK trade paper...
>...Headline (36pt bold) - "Design Fault Found in Processor Chip"  
. . .
>   `A company called "Alsys" (an ADA compiler house) says that there is
>   a bug in the 80286 involving interrupts. They have proved its existence
>   using software.'

I'm inclined to side with David's cynicism in what is reported here.  Gosh,
they found the bug with software!  Imagine that--the hardware isn't
working, so you write a program to find it!  (In the UK, it goes in a trade
rag...in the US they'd try to patent it as a debugging technique!)

There are some bugs in some versions ("steppings" they seem to call them)
of the 286.

In terms of interrupt handling, for example, I know of a bug which will
allow a one-instruction window at the start of interrupt processing during
which interrupts are not disabled.  The effect of this bug is that it can
allow a lower-priority interrupt to come in on top of a higher-priority
interrupt...so it certainly fits all the criteria David was able to supply
from the article to identify the bug.  Dealing with it causes major rectal
discomfort...because the solution is to study the stack before dispatching
to the interrupt handler, and if it has occurred, rearrange the stack to
make the interrupts appear to have occurred in a plausible order (in effect
pretending that the lower-priority interrupt occurred first).  The code
itself is trivial, but the process of debugging that code which rearranges
interrupt stack frames is seriously unpleasant.
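
A minimal sketch of the reordering idea described above, written in Python
rather than 286 assembly. It models only the ordering decision, not the real
stack layout. The Frame type, the priority numbering (larger means higher
priority) and the device names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical stand-in for one saved interrupt frame."""
    source: str    # label for the interrupting device
    priority: int  # larger number = higher priority (an assumption)


def fix_top_of_stack(stack):
    """Return a copy of `stack` (bottom first, top last) in a plausible order.

    If the newest frame has lower priority than the frame it landed on,
    swap the two so the lower-priority interrupt appears to have come first.
    """
    fixed = list(stack)
    if len(fixed) >= 2 and fixed[-1].priority < fixed[-2].priority:
        fixed[-1], fixed[-2] = fixed[-2], fixed[-1]
    return fixed


timer = Frame("timer", priority=7)
serial = Frame("serial", priority=3)

# The buggy window let the serial interrupt land on top of the timer interrupt.
observed = [timer, serial]
repaired = fix_top_of_stack(observed)
print([f.source for f in repaired])  # prints ['serial', 'timer']
```

After the swap the higher-priority frame is on top, so its handler runs first
and then returns into the lower-priority one. A real fix would have to move
the saved return address and flags for each frame, which is where the
debugging pain comes from.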

BUT--and here's the possible major objection to the article--when I had to
deal with this, it was early '87, and the bug was old news even then; I was
rewriting someone else's code to handle the bug.

> Although the journalist (Leon Clifford) obviously regarded the possibility
> of a bug in the 80286 as something to plaster over the front page (along
> with ominous comments about planes falling out of the air due to processor
> bugs), readers of this newsgroup will know that bugs in production revisions
> of devices are very common and don't really merit all this frothing at the
> mouth.

Trade rags are rags.  I'd be more interested in why Alsys is proclaiming
the bug.  In particular, I'd like to know enough about the nature of the
bug to see if it's one that's been known for a while.

> Now... Personally I think that you can't much lower than publishing
>   unsubstantiated claims about someone's device in a trade paper. 

That sentence no verb, but we know what you meant.  I think there are lots
of things lower, but I take your point.

> However... can anyone shed any informed light on  the matter (or
>   comment on the ethics of hopping up and down madly about actual
>   or possible bugs) ??

Let's hear more about the bug and we'll see if it's old or new and real or
not.

I think there are good reasons to hop up and down about some bugs, but it
depends on the bug.
-- 
Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
   ...Simpler is better.

scc@cl.cam.ac.uk (Stephen Crawley) (08/02/89)

"Computing", another UK trade rag, treats this somewhat more rationally.

From the issue of 27 July '89, page >>6<<:

<Quote>
           "Intel chip bug delays software plan".

   Software developers have discovered a bug in an early version
of the Intel 286 chip which has caused several weeks of delay
on a software development project.
   Ada specialist Alsys said it discovered the bug -- details of
which have not been published -- during work on development of a
specialist application for one of its customers.
   The bug means that problems arise when two interrupts -- signals
sent from the hardware to the operating system -- occur simultaneously,
[[Looks like the bug that Dick Dunn described ...]]
   `Our French colleagues already knew about it but we discovered
it in the UK' said Alsys marketing manager Martyn Jordan.  `We
consider it a serious issue -- everyone takes hardware for granted
when software doesn't work' he added.
   Although Jordan said the chances of two interrupts happening
simultaneously are low, it means his company has had to release extra
code to deal with the problem.
  'It would have been nice if we had known about the problem first'
said Jordan.
  Ian Wilson, product marketing manager at Intel said Alsys was
using a non-Intel emulator.  `Alsys is running a 286 which is over 
4 years old.  Errors were cured years ago.'  

<End quote>

(Typo's in the above are mine)

This whole thing sounds to me like Alsys' marketing making excuses
for a delayed project.

As to Leon Clifford (the Electronics Weekly journalist)'s mumblings
about airplanes falling out of the sky: if recent articles in comp.risks 
are to be believed, safety-critical realtime software doesn't use interrupts
anyhow; everything is done with polling.  Isn't technology wonderful!

Me?  Cynical?  Never!!

-- Steve

hammondr@sunroof.crd.ge.com (richard a hammond) (08/03/89)

In article <852@scaup.cl.cam.ac.uk> scc@cl.cam.ac.uk (Stephen Crawley) writes:
>"Computing", another UK trade rag, treats this somewhat more rationally. ...
>From the issue of 27 July '89, page >>6<<:
>...
>  Ian Wilson, product marketing manager at Intel said Alsys was
>using a non-Intel emulator.  `Alsys is running a 286 which is over 
>4 years old.  Errors were cured years ago.'  
>...

Note several things:

1) Intel did not claim that there were NO chips in the field with the bug!
   Just that NEW chips didn't have the bug.  Pretty shady attitude if you
   ask me.  This means that if you're writing software for a '286 then you
   need to know about the bug; as Martyn said, people assume it is
   your software's fault.

2) As pointed out in the elided part of the article and Dick Dunn's comments,
   the bug has been known for a while by some people, but I assume that it
   isn't documented by Intel?

3) Seems fair to say that the bug cost them time - whether the project would
   have been late for other reasons or not.

4) Is there a way to get a list of known (and suspected?) bugs for existing
   chips?  As a compiler writer it would be useful to know what to avoid.
   Note that compiler writers often aren't in the hardware design area that
   might have long known about a bug.

Rich Hammond

mod@masscomp.UUCP (Michael O'Donnell x2915) (08/05/89)

I was working at another company back in 1984 and if it wasn't us that
brought this problem to Intel's attention, I'm fairly certain we were the
first non-Intel folks to know about it.  I'd have to go back and look at
my notes for details, but it involved the NMI and INTR signals.  I seem to
recall that there is(was?) a window of time during the NMI acknowledgement
phase where, if INTR wiggles it is not noticed and gets lost.  Or maybe
it's the other way around.  Anyway, Intel provided a very simple (two
gates) circuit diagram as a hardware fix and also some suggestions for
software fixes, both of which seemed to do the trick.  I don't know if the
later steppings of the 80286 incorporate a real fix, but I presume that
they do.



 ----------------------------------------------------------------------------
   Michael O'Donnell
   Porting and Peripherals      (508)392-2915             home(508)937-0790
              ___________
             /  ________/__      ...!{harvard,uunet,petsd}!masscomp!mod
            /__/_______/  /      mod@westford.ccur.com
   Concurrent /__________/
     Computer Corporation        1 Technology Way
                                 Westford, MA  01886
 
  DISCLAIMER: My opinions only coincidentally resemble those of my employer.
 ----------------------------------------------------------------------------

davidsen@sungod.crd.ge.com (ody) (08/09/89)

In article <1473@crdgw1.crd.ge.com> hammondr@sunroof.crd.ge.com (richard a hammond) writes:

| 1) Intel did not claim that there were NO chips in the field with the bug!
|    Just that NEW chips didn't have the bug.  Pretty shady attitude if you
|    ask me.  This means that if you're writing software for a '286 then you
|    need to know about the bug; as Martyn said, people assume it is
|    your software's fault.
| 
| 2) As pointed out in the elided part of the article and Dick Dunn's comments,
|    the bug has been known for a while by some people, but I assume that it
|    isn't documented by Intel?

  My recollection is that the bug was documented by Intel at the time
they were selling the chip, and that they gave out a fix which was 4-5
gates (or so, 1 chip did it as I recall).
| 
| 3) Seems fair to say that the bug cost them time - whether the project would
|    have been late for other reasons or not.

  Seems fair to say that they didn't have the bug list which matched
their CPU. It undoubtedly did cost them time.
| 
| 4) Is there a way to get a list of known (and suspected?) bugs for existing
|    chips?  As a compiler writer it would be useful to know what to avoid.
|    Note that compiler writers often aren't in the hardware design area that
|    might have long known about a bug.

  Intel has been reasonably good about that, at least if you're a larger
organization than a garage.
	bill davidsen		(davidsen@crdos1.crd.GE.COM)
  {uunet | philabs}!crdgw1!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

urjlew@ecsvax.UUCP (Rostyk Lewyckyj) (08/11/89)

Please excuse my ignorance; however, this talk of a hardware/logic
bug in an older stepping of the 80286 raises some questions in
my mind.
1. Suppose I have a computer designed/built at the time this version
of the chip was being sold. Suppose the designers had the bug sheets
and included the proper hardware fix, so that the software writers
did not have to be aware of the bug in the chip. Suppose that now
for whatever reason I replace the old 80286 chip with a new one that
does not have the bug.  What is the effect of the extra hardware
from the hardware fix on the operation of the new chip in that
computer? Does it introduce a new bug?
2. Suppose that I have an old computer without the hardware fix,
an old chip, but the software writers of my system programmed
around the bug. What happens when I replace the chip? (Or How
likely is it that now the software won't work right??)
3. How paranoid does a software developer need to be in writing
his programs? Is it necessary to get the bug lists for all previous
versions of the processor being programmed and write code that
avoids the union of all the bugs? Consider that as a distributed
product the program may be used on many different computers
(assuming a chip as widely used as the 80286 and say MS DOS)
of different ages and uncertain designs.

-----------------------------------------------
  Reply-To:  Rostyslaw Jarema Lewyckyj
             urjlew@ecsvax.UUCP ,  urjlew@unc.bitnet
       or    urjlew@uncvm1.acs.unc.edu    (ARPA,SURA,NSF etc. internet)
       tel.  (919)-962-9107

rcd@ico.ISC.COM (Dick Dunn) (08/11/89)

scc@cl.cam.ac.uk (Stephen Crawley) writes:
> "Computing", another UK trade rag, treats this somewhat more rationally.
>...<Quote>
>            "Intel chip bug delays software plan".
> 
>    Software developers have discovered a bug in an early version
> of the Intel 286 chip which has caused several weeks of delay
> on a software development project.

Note that the bug is in an early version of the chip.  However, this
probably means that it's NOT safe to assume that you don't have to worry
about it any more, unless you are in a position to control what chips are
used.  (If you control the hardware, you just specify a hardware
requirement for chips recent enough.  If you're running on an AT, you
have to worry about it.)
...
>    The bug means that problems arise when two interrupts -- signals
> sent from the hardware to the operating system -- occur simultaneously,

It's not even that.  If the second interrupt request is present before the
time the first instruction of interrupt handling starts to execute, the bug
will manifest itself.  Since there's potentially a lot of stacking going
on, this is a big window.  When we were testing, we had no trouble causing
it once we knew what it took.  The bug doesn't cause any catastrophic
failure (like a screwed-up stack); things just happen in the wrong order.
This might cause you to miss timing windows, but my point is that you've
got some reasonable hope of identifying it.

>   Ian Wilson, product marketing manager at Intel said Alsys was
> using a non-Intel emulator.  `Alsys is running a 286 which is over 
> 4 years old.  Errors were cured years ago.'  

The $2^16 question for me is why someone would be writing new interrupt-
level code for the 286 in 1989!

> This whole thing sounds to me like Alsys' marketing making excuses
> for a delayed project.

Could be, but it shouldn't be hard enough to fix that you could use it as
an excuse for more than a minor delay.  Once you find you've got a bug
in interrupt handling, you try to trace it down.  If it looks like
hardware, you find a list of errata for the chip you're using.
(*However*, this may not be a trivial exercise.)
-- 
Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
   ...Are you making this up as you go along?

henry@utzoo.uucp (Henry Spencer) (08/13/89)

In article <7467@ecsvax.UUCP> urjlew@ecsvax.UUCP (Rostyk Lewyckyj) writes:
>1. Suppose I have a computer designed/built at the time this version
>of the chip was being sold. Suppose the designers had the bug sheets
>and included the proper hardware fix, so that the software writers
>did not have to be aware of the bug in the chip. Suppose that now
>for whatever reason I replace the old 80286 chip with a new one that
>does not have the bug.  What is the effect of the extra hardware
>from the hardware fix on the operation of the new chip in that
>computer? Does it introduce a new bug?

Quite possibly.  As they say, "your warranty is void".  Sensible designers
will try to come up with a fix that won't interfere with debugged chips,
but not all designers are sensible and such fixes sometimes aren't possible.

>2. Suppose that I have an old computer without the hardware fix,
>an old chip, but the software writers of my system programmed
>around the bug. What happens when I replace the chip? (Or How
>likely is it that now the software won't work right??)

Again, "your warranty is void".  Sensible software people will usually try
to avoid the trouble area completely, rather than finding a trick that will
make the buggy chip do the right thing, but this isn't always possible and
the software people sometimes aren't sensible.  ("You want us to meet your
deadline *and* do things right?  That's going to cost you extra...")

>3. How paranoid does a software developer need to be in writing
>his programs? Is it necessary to get the bug lists for all previous
>versions of the processor being programmed and write code that
>avoids the union of all the bugs? ...

Well, on sensible machines this is mostly the compiler writer's worry.
On a 286, of course, it bites everyone.  Getting old bug lists could
be very difficult, and working around them can be seriously hard.
Most software writers probably just take the easy way out, doing
workarounds for any that are known to be widespread and ignoring
the rest on the grounds that making the hardware work right is someone
else's problem.  (Some of us feel that the presence of a 286 CPU in a
machine is a bug, one which should be given the latter treatment... :-))
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

clyde@hitech.ht.oz (Clyde Smith-Stubbs) (08/14/89)

From article <1989Aug13.023601.594@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> In article <7467@ecsvax.UUCP> urjlew@ecsvax.UUCP (Rostyk Lewyckyj) writes:
> 	[Other stuff deleted]
>>3. How paranoid does a software developer need to be in writing
>>his programs? Is it necessary to get the bug lists for all previous
>>versions of the processor being programmed and write code that
>>avoids the union of all the bugs? ...

Apart from anything else programming around the union of all previous
bugs may be impossible since it is not uncommon for bugs in different
releases of a chip to be mutually incompatible, e.g. the workaround for
a bug in version ABC.123 may fall foul of a different bug in version
ABF.456. In this situation the best approach is usually to steer totally
clear of all areas suspected of being buggy. Mind you this gets hard when
some of the bugs are in areas which are totally indispensable.
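
The conflict described above can be checked mechanically once the errata are
written down as data. The sketch below is illustrative only: the stepping
names are the ones used in the post, while the feature names and the errata
themselves are invented.

```python
# Hypothetical errata data: the stepping names come from the post,
# the feature names are invented for illustration.
ERRATA = {
    "ABC.123": {"broken": {"feature_x"}, "workaround_uses": {"feature_y"}},
    "ABF.456": {"broken": {"feature_y"}, "workaround_uses": {"feature_z"}},
}


def conflicting_workarounds(errata):
    """Return (stepping, stepping_it_breaks_on, features) triples."""
    conflicts = []
    for name, entry in sorted(errata.items()):
        for other, other_entry in sorted(errata.items()):
            if other == name:
                continue
            # A workaround is unsafe if it leans on something broken elsewhere.
            clash = entry["workaround_uses"] & other_entry["broken"]
            if clash:
                conflicts.append((name, other, sorted(clash)))
    return conflicts


def features_to_avoid(errata):
    """The 'steer totally clear' option: the union of every suspect feature."""
    avoid = set()
    for entry in errata.values():
        avoid |= entry["broken"]
    return avoid


print(conflicting_workarounds(ERRATA))
print(sorted(features_to_avoid(ERRATA)))
```

With this made-up data the workaround for ABC.123 relies on something that is
broken in ABF.456, so no single code path is safe on both. Avoiding the whole
union works only as long as none of those features is indispensable.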
-- 
Clyde Smith-Stubbs
HI-TECH Software, P.O. Box 103, ALDERLEY, QLD, 4051, AUSTRALIA.
INTERNET:	clyde@hitech.ht.oz.au		PHONE:	+61 7 300 5011
UUCP:		uunet!hitech.ht.oz.au!clyde	FAX:	+61 7 300 5246

marcus@hp-ptp.HP.COM (Marcus_Liesching) (08/16/89)

/ hp-ptp:comp.arch / clyde@hitech.ht.oz (Clyde Smith-Stubbs) / 10:33 pm  Aug 13, 1989 /

mmm@cup.portal.com (Mark Robert Thorson) (08/17/89)

clyde@hitech.ht.oz.au says:
> 
> Apart from anything else programming around the union of all previous
> bugs may be impossible since it is not uncommon for bugs in different
> releases of a chip to be mutually incompatible, e.g. the workaround for
> a bug in version ABC.123 may fall foul of a different bug in version
> ABF.456. In this situation the best approach is usually to steer totally
> clear of all areas suspected of being buggy. Mind you this gets hard when
> some of the bugs are in areas which are totally indispensable.

The 386 and beyond have both device type and mask stepping numbers which
appear in one of the registers (DX, I believe) following reset initialization.
I think the 286 also had this.  Obviously, all computer architectures should
incorporate this feature.
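
As a rough sketch of how such a value might be decoded, assuming the device
type sits in the high byte (DH) and the stepping in the low byte (DL). That
layout should be checked against the manual for the part in question. The
function name and the sample value are made up for illustration, and the
value would have to be saved by startup code before DX is reused.

```python
def decode_reset_dx(dx):
    """Split a 16-bit DX value into (device_type, stepping).

    Assumes the high byte (DH) is the device type and the low byte (DL)
    is the stepping number.
    """
    if not 0 <= dx <= 0xFFFF:
        raise ValueError("DX is a 16-bit register")
    return (dx >> 8) & 0xFF, dx & 0xFF


# Example value only; not read from a real part.
device, stepping = decode_reset_dx(0x0305)
print(device, stepping)  # prints 3 5
```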

davidb@braa.inmos.co.uk (David Boreham) (08/21/89)

In article <21352@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>
> .... (deleted) .....
>The 386 and beyond have both device type and mask stepping numbers which
>appear in one of the registers (DX, I believe) following reset initialization.
>I think the 286 also had this.  Obviously, all computer architectures should
>incorporate this feature.

Yes, this is fine and we do it. Unfortunately it gives you the added 
problem that you need to update the ID on *every* change to the device.
Sometimes the changes required to change the ID are more work than the
actual bug-fix or whatever.

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol,  England            |      (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c

les@unicads.UUCP (Les Milash) (08/22/89)

In article <21352@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>The 386 and beyond have both device type and mask stepping numbers
                                              ^^^^^^^^^^^^^^^^^^^^^
do you mean the x,y of the individual die on the wafer!?!?!?  yikes!

i can see how that'd be useful, but seems like a royal pain if each die on
the wafer had to be different!

Les Milash, scheme zealot and recent S&M (shared memory) convert.

mcp@ziebmef.mef.org (Marc Plumb) (08/28/89)

les@unicads.UUCP (Les Milash) writes:
> mmm@cup.portal.com (Mark Robert Thorson) writes:
>> The 386 and beyond have both device type and mask stepping numbers
                                                ^^^^^^^^^^^^^^^^^^^^^
> Do you mean the x,y of the individual die on the wafer!?!?!?  Yikes!

> I can see how that'd be useful, but seems like a royal pain if each die on
> the wafer had to be different!

No, this has nothing to do with stepper motors.  When a revision is made
to a mask (a bugfix or feature added, that doesn't warrant a whole new part
number), it's called a "step."  It means about the same thing as "revision."

This lets you conditionalise the software so it uses funny workarounds only
when necessary.
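
A minimal sketch of that kind of conditionalising, assuming the stepping
number has already been read at startup. The table contents, stepping
numbers and function names are all hypothetical.

```python
# Hypothetical errata table: workaround name -> steppings that need it.
# The stepping numbers are invented for illustration.
NEEDS_WORKAROUND = {
    "reorder_interrupt_frames": {1, 2},
}


def dispatch_plain(event):
    return f"handled {event}"


def dispatch_with_reordering(event):
    # A real handler would inspect and fix the stack here first.
    return f"checked stack, then handled {event}"


def choose_dispatcher(stepping):
    """Pick the slow, careful path only on steppings known to need it."""
    if stepping in NEEDS_WORKAROUND["reorder_interrupt_frames"]:
        return dispatch_with_reordering
    return dispatch_plain


print(choose_dispatcher(1)("timer tick"))
print(choose_dispatcher(5)("timer tick"))
```

Choosing the dispatcher once at startup keeps the check out of the interrupt
path itself, so parts without the bug pay nothing for the workaround.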
-- 
	-Colin Plumb

chip@vector.Dallas.TX.US (Chip Rosenthal) (09/03/89)

mcp@ziebmef.mef.org (Colin Plumb) writes:
>>> The 386 and beyond have both device type and mask stepping numbers
>> Do you mean the x,y of the individual die on the wafer!?!?!?  Yikes!
>
>No, this has nothing to do with stepper motors.

Not only that, but his misconception had nothing to do with stepper motors
either.

Photolithography is often done with a mask which contains the image of
one die or a small number of dice, and this mask is stepped across the
wafer exposing the wafer one part at a time.  In the old days (and sometimes
still today), the mask will contain the image of a layer for the entire
wafer, and hence the wafer is exposed all at once.

One nice thing about step and repeat is you can get much better imaging,
i.e. by using a 10x reticle.  With projection, you generally use 1x.
After all, a 60 inch mask to make a 6" wafer would be a bit *ahem*
unwieldy.  On the other hand, a defect on a step-and-repeat mask is
catastrophic -- one bit of poop and you've blown the entire run.

In the old days, it wasn't always so obvious when a mask had a problem.
The technique was to throw a wafer on the xerox machine and make a
transparency.  You then place every wafer in the batch under the transparency
one at a time, and mark off the locations of good dice.  The locations
left without a mark were probably due to mask defects.
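
The overlay trick amounts to taking the union of the good-die maps and seeing
which positions never get marked. A small sketch, assuming each wafer is
recorded as a set of (row, column) positions of good dice; the grid size and
the data are made up.

```python
def suspect_mask_defects(positions, wafer_maps):
    """Return die positions that were never good on any wafer in the batch.

    `positions` is the set of all die locations on the wafer.
    `wafer_maps` is a list of sets, one per wafer, of good die locations.
    """
    marked = set()
    for good_dice in wafer_maps:
        marked |= good_dice  # same as marking the transparency
    return sorted(positions - marked)


# Made-up 2 x 3 wafer and a batch of three wafers.
all_positions = {(r, c) for r in range(2) for c in range(3)}
batch = [
    {(0, 0), (0, 1), (1, 0)},
    {(0, 1), (0, 2), (1, 2)},
    {(0, 0), (0, 2), (1, 0), (1, 2)},
]
print(suspect_mask_defects(all_positions, batch))  # prints [(1, 1)]
```

A position that fails on every wafer is only probably a mask defect; with a
small batch, random defects can line up by chance.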

-- 
Chip Rosenthal / chip@vector.Dallas.TX.US / Dallas Semiconductor / 214-450-5337
"I wish you'd put that starvation box down and go to bed" - Albert Collins' Mom