davidb@inmos.co.uk (David Boreham) (07/26/89)
The following appeared on the front page of a UK trade paper called
"Electronics Weekly" on 7/19/89:

Headline (36pt bold) - "Design Fault Found in Processor Chip"

Then six column-inches which I won't quote but which said the following:

`A company called "Alsys" (an Ada compiler house) says that there is
a bug in the 80286 involving interrupts. They have proved its existence
using software.'

The article contained no technical information (none that made any sense,
anyhow) and said that Intel were denying that there was any problem.

Although the journalist (Leon Clifford) obviously regarded the possibility
of a bug in the 80286 as something to plaster over the front page (along
with ominous comments about planes falling out of the air due to processor
bugs), readers of this newsgroup will know that bugs in production revisions
of devices are very common and don't really merit all this frothing at the
mouth.

Now... Personally I think that you can't much lower than publishing
unsubstantiated claims about someone's device in a trade paper.

However... can anyone shed any informed light on the matter (or
comment on the ethics of hopping up and down madly about actual
or possible bugs) ??

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol, England             |     (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c
rcd@ico.ISC.COM (Dick Dunn) (07/28/89)
In article <1717@brwa.inmos.co.uk>, davidb@inmos.co.uk (David Boreham) writes:
> The following appeared on the front page of a UK trade paper...
>...Headline (36pt bold) - "Design Fault Found in Processor Chip"
. . .
> `A company called "Alsys" (an Ada compiler house) says that there is
> a bug in the 80286 involving interrupts. They have proved its existence
> using software.'

I'm inclined to side with David's cynicism about what is reported here.
Gosh, they found the bug with software! Imagine that--the hardware isn't
working, so you write a program to find it! (In the UK, it goes in a trade
rag...in the US they'd try to patent it as a debugging technique!)

There are some bugs in some versions ("steppings" they seem to call them)
of the 286. In terms of interrupt handling, for example, I know of a bug
which leaves a one-instruction window at the start of interrupt processing
during which interrupts are not disabled. The effect of this bug is that
it can allow a lower-priority interrupt to come in on top of a
higher-priority interrupt...so it certainly fits all the criteria David was
able to supply from the article to identify the bug.

Dealing with it causes major rectal discomfort...because the solution is to
study the stack before dispatching to the interrupt handler, and if the bad
nesting has occurred, rearrange the stack to make the interrupts appear to
have occurred in a plausible order (in effect pretending that the
lower-priority interrupt occurred first). The code itself is trivial, but
the process of debugging code which rearranges interrupt stack frames is
seriously unpleasant.

BUT--and here's the possible major objection to the article--when I had to
deal with this, it was early '87, and the bug was old news even then; I was
rewriting someone else's code to handle the bug.

> Although the journalist (Leon Clifford) obviously regarded the possibility
> of a bug in the 80286 as something to plaster over the front page (along
> with ominous comments about planes falling out of the air due to processor
> bugs), readers of this newsgroup will know that bugs in production revisions
> of devices are very common and don't really merit all this frothing at the
> mouth.

Trade rags are rags. I'd be more interested in why Alsys is proclaiming
the bug. In particular, I'd like to know enough about the nature of the
bug to see if it's one that's been known for a while.

> Now... Personally I think that you can't much lower than publishing
> unsubstantiated claims about someone's device in a trade paper.

That sentence no verb, but we know what you meant. I think there are lots
of things lower, but I take your point.

> However... can anyone shed any informed light on the matter (or
> comment on the ethics of hopping up and down madly about actual
> or possible bugs) ??

Let's hear more about the bug and we'll see if it's old or new and real or
not. I think there are good reasons to hop up and down about some bugs,
but it depends on the bug.
--
Dick Dunn    rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd    (303)449-2870
   ...Simpler is better.
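The stack fix-up Dick Dunn describes can be modelled in a few lines. What follows is only a sketch under stated assumptions, not the code he worked on: the handler addresses, the priority table, and the names `Frame` and `dispatch` are all hypothetical, and a real 80286 frame also carries FLAGS and CS, which are left out here. The idea shown is that a low-priority handler, on entry, checks whether it pre-empted a higher-priority handler that had not yet run its first instruction, and if so swaps the order.

```python
from dataclasses import dataclass

# Hypothetical handler entry addresses mapped to priority (0 = highest).
HANDLER_PRIORITY = {0x1000: 0, 0x2000: 5}


@dataclass
class Frame:
    """Simplified interrupt frame: only the address IRET would resume at."""
    return_addr: int


def dispatch(stack, entry):
    """Decide which handler to run on arrival at `entry`.

    If the frame on top of the stack resumes at the first instruction of a
    higher-priority handler, that handler was pre-empted before it ran.
    Rewrite the frame so it resumes at `entry` instead, and run the
    higher-priority handler now.
    """
    top = stack[-1]
    pre_empted = top.return_addr
    if HANDLER_PRIORITY.get(pre_empted, 99) < HANDLER_PRIORITY[entry]:
        top.return_addr = entry      # low-priority handler runs afterwards
        return pre_empted            # high-priority handler runs first
    return entry


# Main code at 0x0400 is interrupted by the high-priority source (0x1000),
# then the low-priority source (0x2000) slips in before 0x1000 executes.
stack = [Frame(0x0400), Frame(0x1000)]
run_first = dispatch(stack, 0x2000)
```

After the call, the stack reads as if the low-priority interrupt had arrived first and been pre-empted at its own first instruction, which is the "plausible order" the post refers to.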
scc@cl.cam.ac.uk (Stephen Crawley) (08/02/89)
"Computing", another UK trade rag, treats this somewhat more rationally.
From the issue of 27 July '89, page >>6<<:

<Quote>
  "Intel chip bug delays software plan".

    Software developers have discovered a bug in an early version
  of the Intel 286 chip which has caused several weeks of delay
  on a software development project.
    Ada specialist Alsys said it discovered the bug -- details of
  which have not been published -- during work on development of
  a specialist application for one of its customers.
    The bug means that problems arise when two interrupts -- signals
  sent from the hardware to the operating system -- occur simultaneously,

  [[Looks like the bug that Dick Dunn described ...]]

    `Our French colleagues already knew about it but we discovered it
  in the UK' said Alsys marketting manager Martyn Jordan. `We consider
  it a serious issue -- everyone takes hardware for granted when
  software doesn't work' he added.
    Although Jordan added said the chances of two interrupts happening
  simultaneously are low, it means his company has had to release extra
  code to deal with the problem.
    'It would have been nice if we had known about the problem first'
  said Jordan.
    Ian Wilson, product marketting manager at Intel said Alsys was
  using a non-Intel emulator. `Alsys is running a 286 which is over
  4 years old. Errors were cured years ago.'
<End quote>

(Typo's in the above are mine)

This whole thing sounds to me like Alsys' marketing making excuses
for a delayed project.

As to Leon Clifford (the Electronics Weekly journalist)'s mumblings
about airplanes falling out of the sky: if recent articles in comp.risks
are to be believed, safety-critical realtime software doesn't use
interrupts anyhow; everything is done with polling. Isn't technology
wonderful!

Me? Cynical? Never!!

-- Steve
hammondr@sunroof.crd.ge.com (richard a hammond) (08/03/89)
In article <852@scaup.cl.cam.ac.uk> scc@cl.cam.ac.uk (Stephen Crawley) writes:
>"Computing", another UK trade rag, treats this somewhat more rationally.
...
>From the issue of 27 July '89, page >>6<<:
>...
>    Ian Wilson, product marketting manager at Intel said Alsys was
>using a non-Intel emulator. `Alsys is running a 286 which is over
>4 years old. Errors were cured years ago.'
>...

Note several things:

1) Intel did not claim that there were NO chips in the field with the bug!
   Just that NEW chips didn't have the bug. Pretty shady attitude if you
   ask me. This means that if you're writing software for a '286 then you
   need to know about the bug; as Martyn said, people assume it is
   your software's fault.

2) As pointed out in the elided part of the article and Dick Dunn's comments,
   the bug has been known for a while by some people, but I assume that it
   isn't documented by Intel?

3) Seems fair to say that the bug cost them time - whether the project would
   have been late for other reasons or not.

4) Is there a way to get a list of known (and suspected?) bugs for existing
   chips? As a compiler writer it would be useful to know what to avoid.
   Note that compiler writers often aren't in the hardware design area that
   might have long known about a bug.

Rich Hammond
mod@masscomp.UUCP (Michael O'Donnell x2915) (08/05/89)
I was working at another company back in 1984 and if it wasn't us
that brought this problem to Intel's attention, I'm fairly certain
we were the first non-Intel folks to know about it. I'd have to go
back and look at my notes for details, but it involved the NMI and
INTR signals. I seem to recall that there is (was?) a window of time
during the NMI acknowledgement phase where, if INTR wiggles, it is not
noticed and gets lost. Or maybe it's the other way around. Anyway,
Intel provided a very simple (two gates) circuit diagram as a hardware
fix and also some suggestions for software fixes, both of which seemed
to do the trick. I don't know if the later steppings of the 80286
incorporate a real fix, but I presume that they do.

----------------------------------------------------------------------------
Michael O'Donnell   Porting and Peripherals   (508)392-2915  home(508)937-0790
...!{harvard,uunet,petsd}!masscomp!mod        mod@westford.ccur.com
Concurrent Computer Corporation   1 Technology Way   Westford, MA  01886
DISCLAIMER: My opinions only coincidentally resemble those of my employer.
----------------------------------------------------------------------------
davidsen@sungod.crd.ge.com (ody) (08/09/89)
In article <1473@crdgw1.crd.ge.com> hammondr@sunroof.crd.ge.com (richard a hammond) writes:

| 1) Intel did not claim that there were NO chips in the field with the bug!
|    Just that NEW chips didn't have the bug. Pretty shady attitude if you
|    ask me. This means that if you're writing software for a '286 then you
|    need to know about the bug; as Martyn said, people assume it is
|    your software's fault.
|
| 2) As pointed out in the elided part of the article and Dick Dunn's comments,
|    the bug has been known for a while by some people, but I assume that it
|    isn't documented by Intel?

My recollection is that the bug was documented by Intel at the time
they were selling the chip, and that they gave out a fix which was 4-5
gates (or so; 1 chip did it, as I recall).

| 3) Seems fair to say that the bug cost them time - whether the project would
|    have been late for other reasons or not.

Seems fair to say that they didn't have the bug list which matched
their CPU. It undoubtedly did cost them time.

| 4) Is there a way to get a list of known (and suspected?) bugs for existing
|    chips? As a compiler writer it would be useful to know what to avoid.
|    Note that compiler writers often aren't in the hardware design area that
|    might have long known about a bug.

Intel has been reasonably good about that, at least if you're a larger
organization than a garage.

bill davidsen (davidsen@crdos1.crd.GE.COM)
{uunet | philabs}!crdgw1!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
urjlew@ecsvax.UUCP (Rostyk Lewyckyj) (08/11/89)
Please excuse my ignorance; however, this talk of a hardware/logic bug
in an older stepping of the 80286 raises some questions in my mind.

1. Suppose I have a computer designed/built at the time this version
of the chip was being sold. Suppose the designers had the bug sheets
and included the proper hardware fix, so that the software writers
did not have to be aware of the bug in the chip. Suppose that now
for whatever reason I replace the old 80286 chip with a new one that
does not have the bug. What is the effect of the extra hardware
from the hardware fix on the operation of the new chip in that
computer? Does it introduce a new bug?

2. Suppose that I have an old computer without the hardware fix,
an old chip, but the software writers of my system programmed
around the bug. What happens when I replace the chip? (Or how
likely is it that now the software won't work right??)

3. How paranoid does a software developer need to be in writing
his programs? Is it necessary to get the bug lists for all previous
versions of the processor being programmed and write code that
avoids the union of all the bugs? Consider that as a distributed
product the program may be used on many different computers
(assuming a chip as widely used as the 80286 and say MS DOS)
of different ages and uncertain designs.

-----------------------------------------------
Reply-To: Rostyslaw Jarema Lewyckyj
  urjlew@ecsvax.UUCP , urjlew@unc.bitnet
  or urjlew@uncvm1.acs.unc.edu (ARPA,SURA,NSF etc. internet)
  tel. (919)-962-9107
rcd@ico.ISC.COM (Dick Dunn) (08/11/89)
scc@cl.cam.ac.uk (Stephen Crawley) writes:
> "Computing", another UK trade rag, treats this somewhat more rationally.
>...<Quote>
>   "Intel chip bug delays software plan".
>
>     Software developers have discovered a bug in an early version
>   of the Intel 286 chip which has caused several weeks of delay
>   on a software development project.

Note that the bug is in an early version of the chip. However, this
probably means that it's NOT safe to assume that you don't have to worry
about it any more, unless you are in a position to control what chips are
used. (If you control the hardware, you just specify a hardware
requirement for chips recent enough. If you're running on an AT, you have
to worry about it.)
...
>     The bug means that problems arise when two interrupts -- signals
>   sent from the hardware to the operating system -- occur simultaneously,

It's not even that. If the second interrupt request is present before the
time the first instruction of interrupt handling starts to execute, the bug
will manifest itself. Since there's potentially a lot of stacking going
on, this is a big window. When we were testing, we had no trouble causing
it once we knew what it took.

The bug doesn't cause any catastrophic failure (like a screwed-up stack);
things just happen in the wrong order. This might cause you to miss timing
windows, but my point is that you've got some reasonable hope of
identifying it.

>     Ian Wilson, product marketting manager at Intel said Alsys was
>   using a non-Intel emulator. `Alsys is running a 286 which is over
>   4 years old. Errors were cured years ago.'

The $2^16 question for me is why someone would be writing new interrupt-
level code for the 286 in 1989!

> This whole thing sounds to me like Alsys' marketing making excuses
> for a delayed project.

Could be, but it shouldn't be hard enough to fix that you could use it as
an excuse for more than a minor delay. Once you find you've got a bug in
interrupt handling, you try to trace it down. If it looks like hardware,
you find a list of errata for the chip you're using. (*However*, this
may not be a trivial exercise.)
--
Dick Dunn    rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd    (303)449-2870
   ...Are you making this up as you go along?
henry@utzoo.uucp (Henry Spencer) (08/13/89)
In article <7467@ecsvax.UUCP> urjlew@ecsvax.UUCP (Rostyk Lewyckyj) writes:
>1. Suppose I have a computer designed/built at the time this version
>of the chip was being sold. Suppose the designers had the bug sheets
>and included the proper hardware fix, so that the software writers
>did not have to be aware of the bug in the chip. Suppose that now
>for whatever reason I replace the old 80286 chip with a new one that
>does not have the bug. What is the effect of the extra hardware
>from the hardware fix on the operation of the new chip in that
>computer? Does it introduce a new bug?

Quite possibly. As they say, "your warranty is void". Sensible designers
will try to come up with a fix that won't interfere with debugged chips,
but not all designers are sensible and such fixes sometimes aren't possible.

>2. Suppose that I have an old computer without the hardware fix,
>an old chip, but the software writers of my system programmed
>around the bug. What happens when I replace the chip? (Or how
>likely is it that now the software won't work right??)

Again, "your warranty is void". Sensible software people will usually
try to avoid the trouble area completely, rather than finding a trick
that will make the buggy chip do the right thing, but this isn't always
possible and the software people sometimes aren't sensible. ("You want
us to meet your deadline *and* do things right? That's going to cost
you extra...")

>3. How paranoid does a software developer need to be in writing
>his programs? Is it necessary to get the bug lists for all previous
>versions of the processor being programmed and write code that
>avoids the union of all the bugs? ...

Well, on sensible machines this is mostly the compiler writer's worry.
On a 286, of course, it bites everyone. Getting old bug lists could be
very difficult, and working around them can be seriously hard. Most
software writers probably just take the easy way out, doing workarounds
for any that are known to be widespread and ignoring the rest on the
grounds that making the hardware work right is someone else's problem.

(Some of us feel that the presence of a 286 CPU in a machine is a bug,
one which should be given the latter treatment... :-))
--
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
clyde@hitech.ht.oz (Clyde Smith-Stubbs) (08/14/89)
From article <1989Aug13.023601.594@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> In article <7467@ecsvax.UUCP> urjlew@ecsvax.UUCP (Rostyk Lewyckyj) writes:
> [Other stuff deleted]
>>3. How paranoid does a software developer need to be in writing
>>his programs? Is it necessary to get the bug lists for all previous
>>versions of the processor being programmed and write code that
>>avoids the union of all the bugs? ...

Apart from anything else, programming around the union of all previous
bugs may be impossible, since it is not uncommon for bugs in different
releases of a chip to be mutually incompatible, e.g. the workaround for
a bug in version ABC.123 may fall foul of a different bug in version
ABF.456. In this situation the best approach is usually to steer totally
clear of all areas suspected of being buggy. Mind you, this gets hard when
some of the bugs are in areas which are totally indispensable.
--
Clyde Smith-Stubbs
HI-TECH Software, P.O. Box 103, ALDERLEY, QLD, 4051, AUSTRALIA.
INTERNET: clyde@hitech.ht.oz.au          PHONE: +61 7 300 5011
UUCP:     uunet!hitech.ht.oz.au!clyde    FAX:   +61 7 300 5246
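The "mutually incompatible" case can be checked mechanically if an errata list is available. The sketch below is illustrative only: the table, the workaround names, and the `conflicts` helper are invented for the example, and the version strings are the placeholder ones from Clyde's post rather than real part revisions. It reports every workaround that is needed on one targeted version but breaks another targeted version.

```python
# Hypothetical errata table. "needed_on" lists chip versions that require the
# workaround; "breaks_on" lists versions where applying it trips another bug.
ERRATA = {
    "workaround_1": {"needed_on": {"ABC.123"}, "breaks_on": {"ABF.456"}},
    "workaround_2": {"needed_on": {"ABF.456"}, "breaks_on": set()},
}


def conflicts(errata, targets):
    """Return (workaround, [versions it breaks]) pairs within `targets`."""
    found = []
    for name, entry in sorted(errata.items()):
        if entry["needed_on"] & targets:          # some target needs it
            broken = entry["breaks_on"] & targets  # and some target suffers
            if broken:
                found.append((name, sorted(broken)))
    return found


both = conflicts(ERRATA, {"ABC.123", "ABF.456"})
only_old = conflicts(ERRATA, {"ABC.123"})
```

A non-empty result means no single unconditional build covers every targeted version, which is the situation where steering clear of the buggy area is the only uniform answer.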
mmm@cup.portal.com (Mark Robert Thorson) (08/17/89)
clyde@hitech.ht.oz.au says:
>
> Apart from anything else programming around the union of all previous
> bugs may be impossible since it is not uncommon for bugs in different
> releases of a chip to be mutually incompatible, e.g. the workaround for
> a bug in version ABC.123 may fall foul of a different bug in version
> ABF.456. In this situation the best approach is usually to steer totally
> clear of all areas suspected of being buggy. Mind you this gets hard when
> some of the bugs are in areas which are totally indispensable.

The 386 and beyond have both device type and mask stepping numbers which
appear in one of the registers (DX, I believe) following reset initialization.
I think the 286 also had this. Obviously, all computer architectures should
incorporate this feature.
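As a rough illustration of what reading such an identifier looks like, here is a sketch that splits a 16-bit reset value into its two fields. It assumes the layout documented for the 386, with the device type in the high byte (DH) and the stepping in the low byte (DL); whether the 286 provides anything comparable is left open in the post and is not assumed here. The function name and the sample value are made up for the example.

```python
def decode_reset_dx(dx):
    """Split a 16-bit reset-time DX value into (device_type, stepping).

    Assumed layout: high byte = device type, low byte = stepping.
    """
    device_type = (dx >> 8) & 0xFF
    stepping = dx & 0xFF
    return device_type, stepping


# Illustrative value only: device type 3, stepping 8.
device_type, stepping = decode_reset_dx(0x0308)
```

In practice the value has to be saved by the very first code that runs after reset, since DX is an ordinary register and is overwritten soon afterwards.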
davidb@braa.inmos.co.uk (David Boreham) (08/21/89)
In article <21352@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>
> .... (deleted) .....
>The 386 and beyond have both device type and mask stepping numbers which
>appear in one of the registers (DX, I believe) following reset initialization.
>I think the 286 also had this. Obviously, all computer architectures should
>incorporate this feature.

Yes, this is fine and we do it. Unfortunately it gives you the added
problem that you need to update the ID on *every* change to the device.
Sometimes the changes required to change the ID are more work than the
actual bug-fix or whatever.

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol, England             |     (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c
les@unicads.UUCP (Les Milash) (08/22/89)
In article <21352@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>The 386 and beyond have both device type and mask stepping numbers
                                              ^^^^^^^^^^^^^^^^^^^^^
do you mean the x,y of the individual die on the wafer!?!?!? yikes!
i can see how that'd be useful, but seems like a royal pain if each die on
the wafer had to be different!

Les Milash, scheme zealot and recent S&M (shared memory) convert.
mcp@ziebmef.mef.org (Marc Plumb) (08/28/89)
les@unicads.UUCP (Les Milash) writes:
> mmm@cup.portal.com (Mark Robert Thorson) writes:
>> The 386 and beyond have both device type and mask stepping numbers
                                                ^^^^^^^^^^^^^^^^^^^^^
> Do you mean the x,y of the individual die on the wafer!?!?!? Yikes!
> I can see how that'd be useful, but seems like a royal pain if each die on
> the wafer had to be different!

No, this has nothing to do with stepper motors. When a revision is made to
a mask (a bugfix or feature added, that doesn't warrant a whole new part
number), it's called a "step." It means about the same thing as "revision."

This lets you conditionalise the software so it uses funny workarounds only
when necessary.
--
	-Colin Plumb
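Conditionalising on the stepping might look something like the sketch below. Everything in it is hypothetical: the stepping numbers, the workaround names, and the table itself are invented to show the shape of the idea and do not describe any real errata list.

```python
# Hypothetical table: workaround name -> steppings that need it.
WORKAROUNDS = {
    "reorder_nested_interrupts": {1, 2},
    "pad_after_task_switch": {2},
}


def workarounds_for(stepping):
    """Return the workarounds to enable on the given stepping, sorted by name."""
    return sorted(
        name for name, steppings in WORKAROUNDS.items() if stepping in steppings
    )


enabled_on_2 = workarounds_for(2)
enabled_on_7 = workarounds_for(7)
```

A stepping that appears nowhere in the table gets no workarounds, so later parts pay nothing for fixes they do not need.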
chip@vector.Dallas.TX.US (Chip Rosenthal) (09/03/89)
mcp@ziebmef.mef.org (Colin Plumb) writes:
>>> The 386 and beyond have both device type and mask stepping numbers
>> Do you mean the x,y of the individual die on the wafer!?!?!? Yikes!
>
>No, this has nothing to do with stepper motors.

Not only that, but his misconception had nothing to do with stepper
motors either.

Photolithography is often done with a mask which contains the image of
one die or a small number of dice, and this mask is stepped across the
wafer, exposing the wafer one part at a time. In the old days (and
sometimes still today), the mask would contain the image of a layer for
the entire wafer, and hence the wafer is exposed all at once.

One nice thing about step and repeat is you can get much better
imaging, i.e. by using a 10x reticle. With projection, you generally
use 1x. After all, a 60 inch mask to make a 6" wafer would be a bit
*ahem* unwieldy.

On the other hand, a defect on a step-and-repeat mask is catastrophic --
one bit of poop and you've blown the entire run. In the old days, it
wasn't always so obvious when a mask had a problem. The technique was
to throw a wafer on the xerox machine and make a transparency. You
then place every wafer in the batch under the transparency one at a
time, and mark off the locations of good dice. The locations left
without a mark were probably due to mask defects.
--
Chip Rosenthal / chip@vector.Dallas.TX.US / Dallas Semiconductor / 214-450-5337
"I wish you'd put that starvation box down and go to bed" - Albert Collins' Mom
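The transparency trick amounts to a set operation: any die site that is never good on any wafer in the batch is a candidate mask defect. Here is a small sketch of that bookkeeping; the grid coordinates, the sample batch, and the function name are made up for illustration.

```python
def suspect_sites(all_sites, good_maps):
    """Return die sites never marked good on any wafer in the batch."""
    ever_good = set().union(*good_maps)
    return sorted(set(all_sites) - ever_good)


# Hypothetical 2x2 wafer grid and the good-die marks from three wafers.
sites = [(0, 0), (0, 1), (1, 0), (1, 1)]
batch = [{(0, 0), (1, 1)}, {(0, 0), (0, 1)}, {(1, 1)}]
suspects = suspect_sites(sites, batch)
```

As the post says, an unmarked site is only "probably" a mask defect; with a small batch, ordinary random yield loss can leave a site unmarked too.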