[comp.unix.wizards] memory errors

bossert@Thalatta.COM (John Bossert) (02/21/88)

Rn wouldn't let me cancel my previous article.  Sorry.

The true error message I was getting on an 11/780 with
8 Mb of interleaved memory was:

	mcr0: soft ecc addr 110dd syn 26

Again, using the formula addr = 0110dd26 (0xxxxxyy), this
translates to an address way above 8 Megs.  Does the interleaving
have anything to do with this?  Where is the bad board?
-- 

In-Real-Life: John Bossert, Thalatta Corporation, (+1 206 643 7187)
Domain: bossert@Thalatta.COM   Path: uw-beaver!uw-entropy!thebes!bossert

parmelee@wayback.cs.cornell.edu (Larry Parmelee) (02/22/88)

In article <128@thebes.Thalatta.COM> bossert@Thalatta.COM 
(John Bossert) writes:
> The true error message I was getting on an 11/780 with
> 8 Mb of interleaved memory was:
> 
> 	mcr0: soft ecc addr 110dd syn 26
> 

Credentials:
We have an 11/780, with 8Mb of memory interleaved on two memory
controllers, using 256kb memory boards, running 4.3BSD unix.

Until recently, we had lots of memory errors.  Finally, with the
help of our DEC technician, we were able to figure out how to
interpret that message down to the board level.

First, "mcr#" is "Memory Controller #".  (That was easy, right?)
The "addr" part is a little more interesting.

We have 256kb memory boards, which works out to 0x40000 bytes
per board.  Memory error correction/detection works on 4-byte
"globs", so a 256kb memory board has

    0x40000(bytes) / 4 (bytes-per-glob) = 0x10000 (globs-per-board).

Now, the number following "addr" in the message above is the
"glob" number where the error occurred, so

    0x110dd'th glob / 0x10000 (globs-per-board) = board number 1

(Ignore the remainder for now).  For these purposes, the boards
are numbered starting with 0 (the memory in each memory controller
is considered individually).  Physically, board 0 is the leftmost
memory board in the given controller. 
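
The arithmetic above can be sketched in a few lines of Python.  This
is only an illustration of the steps described here, not anything taken
from the kernel; the names GLOB_BYTES, BOARD_BYTES and board_for_glob
are made up for the example.

```python
# Illustrative sketch of the board lookup described above.
# All names are hypothetical; nothing here comes from the 4.3BSD source.

GLOB_BYTES = 4            # ECC works on 4-byte "globs"
BOARD_BYTES = 0x40000     # one 256kb memory board

def board_for_glob(glob, board_bytes=BOARD_BYTES):
    """Return (board number, glob offset within that board)."""
    globs_per_board = board_bytes // GLOB_BYTES
    return divmod(glob, globs_per_board)

board, remainder = board_for_glob(0x110dd)
print("board %d, remainder 0x%x" % (board, remainder))
# prints: board 1, remainder 0x10dd
```

The board number is counted within the controller named in the
message, starting from 0 at the leftmost board.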

The remainder from above and the syndrome (the number following
"syn") can be used to figure out which chip had the problem, but
with 256kb memory going for $25 a board (used) nowadays, it wasn't
worth figuring out the rest.  We just replaced the board(s).

I think you said you had 1Mb boards.  In that case, it would work
out like this:  0x100000 bytes per board, or 0x40000 "globs" per
board.  Then 0x110dd / 0x40000 = board number 0.
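
If you want to run console lines through this mechanically, something
like the following might do.  It is a rough sketch: the pattern is
inferred from the one sample line in this thread, it assumes both
numbers are printed in hex, and locate() is a made-up name.

```python
import re

# Hypothetical helper.  The pattern is inferred from the single sample
# line in this thread; both numbers are assumed to be printed in hex.
MSG = re.compile(r"mcr(\d+): soft ecc addr ([0-9a-f]+) syn ([0-9a-f]+)")

def locate(line, board_bytes):
    """Return (controller, board, syndrome) for one console line, or None."""
    m = MSG.search(line)
    if m is None:
        return None
    controller = int(m.group(1))
    glob = int(m.group(2), 16)
    syndrome = int(m.group(3), 16)
    board = glob // (board_bytes // 4)   # 4 bytes per glob
    return controller, board, syndrome

line = "mcr0: soft ecc addr 110dd syn 26"
print(locate(line, 0x40000))    # 256kb boards -> (0, 1, 38)
print(locate(line, 0x100000))   # 1Mb boards   -> (0, 0, 38)
```

The only input besides the message is the board size, which is why
the same message points at board 1 with 256kb boards and board 0 with
1Mb boards.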

Have fun!
-Larry Parmelee
parmelee@wayback.cs.cornell.edu
parmelee@cornell.uucp