[net.unix-wizards] EMC 2MB memory boards on 780 crashes

irwin@uiucdcs.UUCP (01/28/85)

I would venture to say that the bug is that one of the emc2 boards has
a bad chip causing a steady stream of memory errors.

We have two 780s. Each was purchased with 4 meg of the 256k/board mem.
We moved the 4 meg module from one machine to the other in an expansion
cabinet, and purchased a 780E mem module to replace the vacated space
on the first machine.

The boards that came in ours from DEC were two 1 meg boards. We added
six one meg boards of emc2 to fill out the 8 meg. Since the unit has
an upper and lower controller with an interface board between them
to get them onto the mass bus, they have to be interleaved, so they
need to be balanced as to the amount in each.
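
To make the balancing rule concrete, here is a minimal sketch of the arithmetic. The function name and the way the eight boards are split between the two controllers are my own assumptions for illustration; the only facts taken from the above are the 8 meg total and that the two sides must match.

```python
def is_balanced(upper_mb, lower_mb):
    """Return True when both controllers hold the same amount of memory.

    upper_mb / lower_mb are lists of enabled board sizes in megabytes.
    (Hypothetical helper, not anything supplied by DEC or emc2.)
    """
    return sum(upper_mb) == sum(lower_mb)


# Assumed layout: eight 1 meg boards, four behind each controller.
upper = [1, 1, 1, 1]
lower = [1, 1, 1, 1]
print(is_balanced(upper, lower))      # True

# Disabling one board on one side only breaks the balance.
print(is_balanced(upper, lower[:-1])) # False
```

The same check applies to any board size: whatever you take out or disable on one side has to be matched on the other.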

When we first got ours, the Micro Diag #2 would not check the 780E mem.
DEC got me a Diag #3 floppy which could be run to test the new mem
and we still continued to use the #2 floppy to check the old mem on the
other machine.

I noted that the #2 took a considerable length of time to test the old
mem, but that the #3 floppy would whip through 8 meg of the 780E type
in no time flat. I thought <so what>, and ignored it until we had a board
go bad. It was showing up on the console as errors, but the #3 diag
would not show anything wrong. We lived with it until the errors got
bad enough that a steady stream of errors was present. We run 4.2BSD.

What happened then was that the memory controller would hang the mass bus
and not report <anything> to the console. This was because it was getting
an error while it was in the process of trying to correct one, which would
confuse the memory controller.

The problem here is that the mem test on the #3 floppy is not as thorough
as the one on the #2 floppy (that's why it gets done so fast) and it does
not catch the errors. I discussed this with DEC and they are aware. They have
stuff that can be run under VMS to do a better job, and are working on a
better mem test for the #3 diag floppy (so I am told).

I would suggest that you install your mem so that you have the DEC mem
in one side and the emc2 in the other. Disable the outermost emc2 board
with the disable switch and leave out 2 meg of DEC on the other side.
If it comes up and runs ok, the disabled board is the bad guy. If it 
still does not run, trade the two emc2 boards in their slots so that you
can disable the opposite one as the outermost board, and try it again. It
may be that only one of them is bad; this will prove it. If you pin it
down, call emc2. A new board will be at your door the next morning
(anywhere on the US mainland) when you go in, and you can return the bad one
in the same box. If you are across the pond somewhere it may take longer to
get the replacement. **<Don't forget to turn off the mem power supply>**
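
The swap-and-disable procedure above boils down to a small decision table. Here is a rough sketch of it; the function name, the labels "A" and "B" for the two emc2 boards, and the result strings are all made up for illustration, and it assumes the fault really is in one of those two boards.

```python
from typing import Optional


def suspect_board(ok_with_a_disabled: bool,
                  ok_with_b_disabled: Optional[bool] = None) -> str:
    """Interpret the results of the two trials.

    ok_with_a_disabled: did the system run OK with board A disabled
                        as the outermost board?
    ok_with_b_disabled: same for board B after trading slots, or None
                        if that trial has not been run yet.
    """
    if ok_with_a_disabled:
        return "A is bad"
    if ok_with_b_disabled is None:
        return "swap the boards and retest with B disabled"
    if ok_with_b_disabled:
        return "B is bad"
    # Neither trial cleared the errors: both boards may be bad,
    # or the fault is somewhere else entirely.
    return "inconclusive"


print(suspect_board(True))          # A is bad
print(suspect_board(False))         # swap the boards and retest with B disabled
print(suspect_board(False, True))   # B is bad
print(suspect_board(False, False))  # inconclusive
```

The "inconclusive" case is where dumping the controller registers would be the next step.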

If this works as it did in our case to locate the bad board, you will
be in good shape, except that you still will not have any diagnostics
from DEC that tell you anything......bug them!!

If this does not help, I can mail you the commands to dump the registers
in both upper and lower controllers, to see if the error bit is set, which
is probably the case in one of the two.

I might add, we have been running it several months now, with no additional
problems.