[net.unix-wizards] Oops, tbuf errors again

rbbb@RICE.ARPA (12/16/84)

My office-mate says that we replaced the L0003 module, not the L0001 module.

He also says (in his dogmatic way) that several boards can cause that
problem, since timing errors in general often show up in the tbuf.  So I
don't really know.  Unless someone will step forward and say, "no, we did
that, and things got worse for us," I still think that the best route
is to get the PCS and always run with the latest microcode.  Failing
that, try to convince your service droid to replace that board, and maybe
others (we used the line, "look, if you say that board is fine, then how
does it harm YOU to trade it for another working board?").

It is also possible that DEC came up with the microcode patches so that
they could use the crufty boards, and perhaps board-swapping will no
longer work so well (since you might get a crufty board).

I'm tired of so many bogus hardware rumours and so much third-hand hearsay
floating around on the net, and I'm sorry that I was wrong on that other
message.

drc

bruce@godot.UUCP (Bruce Nemnich) (12/17/84)

The L0003 should be the board to swap, since that's where the tbuf and
cache are.  We had an L0003 swapping session which lasted a week or so
in September, I believe.  DEC accidentally shipped us an extra L0003
with some other hardware (!).  I tried it out, but it was considerably
worse than the one I had been running.  I had DEC bring in a new one,
but it would halt the machine every 8 hours or so.  Finally, two boards
later, I got one which performed much better than any of them.  I kept
that until I got rev 7.  So, there is a wide range of failure rates.

One of the guys installing it said the problem was caused by crosstalk
between two layers of the L0003 board.  One ECO (I don't know which) put
jumpers on the board, which help but do not cure the problem.  The PCS
microcode 98 claims to fix it by retrying after the first failure per
macroinstruction, but trapping on subsequent errors.
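
For anyone who wants the retry rule spelled out, here is a rough sketch
of the policy as described above: absorb the first parity error in a
macroinstruction by retrying, and trap on the second.  This is only a
toy model in Python, not DEC's microcode; the names TbufParityTrap,
run_macroinstruction, and attempt are made up for illustration.

```python
class TbufParityTrap(Exception):
    """Hypothetical trap: a second parity error hit the same macroinstruction."""


def run_macroinstruction(attempt):
    """Run one macroinstruction under a retry-once policy.

    `attempt` is an assumed callable that returns True when the attempt
    completes and False when it hits a tbuf parity error.
    Returns the number of parity errors absorbed (0 or 1).
    """
    errors = 0  # local, so the count starts over for every macroinstruction
    while True:
        if attempt():
            return errors
        errors += 1
        if errors > 1:
            # The first failure was already retried; give up and trap.
            raise TbufParityTrap("second parity error in one macroinstruction")
```

Because the error count is per call, a single flaky access costs one
retry, while a board that fails twice in a row on the same
macroinstruction still traps.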

How can you have "rev 7 hardware" without the PCS?  That's part of the
FCO.  There are two versions, one for those with the user WCS and one
for those without.
-- 
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
  ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa

bruce@godot.UUCP (Bruce Nemnich) (12/17/84)

Oh, about why they went to PCS....

It's totally the right thing.  It is *much* easier for them to send you
a TU58 with the latest microcode than to send an FE to install new ROMs
on your L0003.  BTW, microcode 98 claims to fix a number of problems
other than the tbuf parity errors, most of them to do with CI stuff.

In the tbuf case, yes, they did alter the microcode to deal with the
flaky boards.

mminnich@udel-ee.ARPA (12/19/84)

Well, I'll add my two cents:

A few months back, we too underwent the PCS FCO on our 11/750.
Since then, it has worked fine, and we have had no crashes
due to tbuf parity faults.  (We are using the new ucode.)

mike