rbbb@RICE.ARPA (12/16/84)
My office-mate says that we replaced the L0003 module, not the L0001 module. He also says (in his dogmatic way) that several boards can cause that problem, since timing errors in general often show up in the tbuf. So I don't really know. Unless someone steps forward and says, "no, we did that, and things got worse for us", I still think the best route is to get the PCS and always run with the latest microcode. Failing that, try to convince your service droid to replace that board, and maybe others (we used the line, "look, if you say that board is fine, then how does it harm YOU to trade it for another working board?"). It is also possible that DEC came up with the microcode patches so that they could use the crufty boards, in which case board-swapping may no longer work so well (since you might get a crufty board). I'm tired of so many bogus hardware rumours and so much third-hand hearsay floating around on the net, and I'm sorry that I was wrong on that other message. drc
bruce@godot.UUCP (Bruce Nemnich) (12/17/84)
The L0003 should be the board to swap, since that's where the tbuf and cache are. We had an L0003 swapping session which lasted a week or so in September, I believe. DEC accidentally shipped us an extra L0003 with some other hardware (!). I tried it out, but it was considerably worse than the one I had been running. I had DEC bring in a new one, but it would halt the machine every 8 hours or so. Finally, two boards later, I got one which performed much better than any of them. I kept that until I got rev 7. So, there is a wide range of failure rates.

One of the guys installing it said the problem was caused by crosstalk between two layers of the L0003 board. One ECO (don't know which) put jumpers on the board to help but not cure the problem. The PCS microcode 98 claims to fix it by retrying after the first failure per macroinstruction, but trapping on subsequent errors.

How can you have "rev 7 hardware" without the PCS? That's part of the FCO. There are two versions, one for those with the user WCS and one for those without.

--
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa
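[Illustration] The retry policy described above for microcode 98 (retry after the first tbuf parity failure within a macroinstruction, trap on any later one) can be sketched roughly as below. This is only a Python illustration of that policy as stated in the post, not DEC's microcode; `TbufParityError`, `MachineCheck`, `run_macroinstruction` and the `step` callable are all hypothetical names made up for the sketch.

```python
class TbufParityError(Exception):
    """Hypothetical: raised when a tbuf parity error is seen."""


class MachineCheck(Exception):
    """Hypothetical: stands in for the trap taken when a retry does not help."""


def run_macroinstruction(step, max_retries=1):
    """Run one macroinstruction, retrying after the first parity failure.

    `step` is a hypothetical callable that executes the instruction and
    raises TbufParityError on a parity fault. The failure count is local
    to this call, so it is reset for every macroinstruction.
    """
    failures = 0
    while True:
        try:
            return step()
        except TbufParityError:
            failures += 1
            if failures > max_retries:
                # A second failure in the same instruction: give up and trap.
                raise MachineCheck("tbuf parity error persisted after retry")
```

Under this sketch a transient fault costs only one retry, while a board that fails repeatedly within the same instruction still surfaces as a trap.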
bruce@godot.UUCP (Bruce Nemnich) (12/17/84)
Oh, about why they went to PCS.... It's totally the right thing. It is *much* easier for them to send you a TU58 with the latest microcode than to send an FE to install new ROMs on your L0003. BTW, microcode 98 claims to fix a number of problems other than the tbuf parity errors, most to do with CI stuff. In the tbuf case, yes, they did alter the microcode to deal with the flakey boards.
mminnich@udel-ee.ARPA (12/19/84)
Well, I'll add my two cents: a few months back, we too underwent the PCS FCO on our 11/750. Since then, it has worked fine, and we have had no crashes due to tbuf parity faults. (We are using the new ucode.) mike