irwin@uiucdcs.UUCP (01/28/85)
I would venture to say that the bug is that one of the emc2 boards has a bad chip causing a steady stream of memory errors.

We have two 780s. Each was purchased with 4 meg of the 256k/board mem. We moved the 4 meg module from one machine to the other in an expansion cabinet, and purchased a 780E mem module to replace the vacated space on the first machine. The boards that came in ours from DEC were two 1 meg boards. We added six 1 meg boards of emc2 to fill out the 8 meg. Since the unit has an upper and a lower controller, with an interface board between them to get them onto the mass bus, they have to be interleaved, so they need to be balanced as to the amount of memory in each.

When we first got ours, the Micro Diag #2 would not check the 780E mem. DEC got me a Diag #3 floppy which could be run to test the new mem, and we continued to use the #2 floppy to check the old mem on the other machine. I noted that the #2 took a considerable length of time to test the old mem, but that the #3 floppy would whip through 8 meg of the 780E type in no time flat. I thought <so what>, and ignored it until we had a board go bad. It was showing up on the console as errors, but the #3 diag would not show anything wrong. We lived with it until the errors got bad enough that a steady stream of them was present.

We run 4.2BSD. What happened then was that the memory controller would hang the mass bus and not report <anything> to the console. This was because it was getting an error while it was in the process of trying to correct one, which would confuse the memory controller.

The problem here is that the mem test on the #3 floppy is not as thorough as the one on the #2 floppy (that's why it gets done so fast), and it does not catch the errors. I discussed this with DEC and they are aware of it. They have stuff that can be run under VMS to do a better job, and are working on a better mem test for the #3 diag floppy (so I am told).
I would suggest that you install your mem so that you have the DEC mem in one side and the emc2 in the other. Disable the outermost emc2 board with the disable switch and leave out 2 meg of DEC on the other side. If it comes up and runs ok, the disabled board is the bad guy. If it still does not run, trade the two emc2 boards in their slots so that you can disable the opposite one as the outermost board, and try it again. It may be that only one of them is bad; this will prove it.

If you pin it down, call emc2. A new board will be at your door the next morning when you go in (anywhere on the US mainland), and you can return the bad one in the same box. If you are across the pond somewhere it may take longer to get the replacement.

**<Don't forget to turn off the mem power supply>**

If this works as it did in our case to locate the bad board, you will be in good shape, except that you still will not have any diagnostics from DEC that tell you anything... bug them!!

If this does not help, I can mail you the commands to dump the registers in both the upper and lower controllers, to see if the error bit is set, which is probably the case in one of the two.

I might add, we have been running it several months now, with no additional problems.
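
For anyone who wants the swap-and-disable procedure written out as a decision routine, here is a rough sketch in Python. It is only an illustration: the function name, the board labels "emc2-A" and "emc2-B", and the trial callback are all made up for the example, and each "trial" stands for a manual power-down, board shuffle, and reboot. It assumes the machine fails with both boards enabled.

```python
from typing import Callable, Optional, Tuple


def isolate_bad_board(
    runs_ok_with_disabled: Callable[[str], bool],
    boards: Tuple[str, str] = ("emc2-A", "emc2-B"),  # hypothetical labels
) -> Optional[str]:
    """Return the board that, once disabled, lets the machine run clean.

    Assumes the machine misbehaves with both boards enabled.
    Returns None when neither single trial runs clean.
    """
    first, second = boards
    # Trial 1: the first board is outermost and switched off.
    if runs_ok_with_disabled(first):
        return first
    # Trial 2: trade slots so the other board is outermost, disable it, retry.
    if runs_ok_with_disabled(second):
        return second
    # Neither trial ran clean: both boards may be bad, or the fault is elsewhere.
    return None


def make_trial(bad_boards: set) -> Callable[[str], bool]:
    """Simulated trial: the machine runs ok only if every bad board is disabled."""
    return lambda disabled: bad_boards <= {disabled}


if __name__ == "__main__":
    print(isolate_bad_board(make_trial({"emc2-B"})))
```

A None result corresponds to the "if this does not help" case, where the next step is dumping the controller registers.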