[comp.unix.questions] WRT Disk errors on 11/750 running 4.3 BSD

hacker@egrunix.UUCP (Thomas J Hacker) (07/21/89)

As promised....posting of responses.
Thanks to the following people for responding:

Larry Parmelee
parmelee@cs.cornell.edu

Guy Harris
guy@bootme.auspex.com
(Sorry if I forgot anyone else's name)

Re: Disk Problems on a 11/750 running 4.3 BSD


In article <115@egrunix.UUCP> you write:
> So, I thought I would wait a day or two to see if it would repeat,
> then this came up:
> 
> Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73

"mcr0" is "Memory ContRoller 0".  It is most likely not related to
your disk problems.  As long as you only see "soft" errors, and
they don't occur "too often", you can just ignore them forever.
("too often": we had a 780 that would routinely report 10-12 of
those mcr0 errors per hour, and other than wasting console paper,
caused no other apparent problems.  It was like this for years.)

Soft/Hard- "soft" means the memory "ecc" - Error Check/Correction
logic detected an error but was able to correct it (single bit error).
"hard" means the ecc detected an error but couldn't fix it (double 
bit error). 

"addr" and the following number, "1a72", can be used to figure
out which board was failing.  You need to know how much memory is
on each board, and multiply the "1a72" number by 4, since the ecc
logic looks at memory in 4-byte chunks:  (1a72*4) mod (bytes per board)
gives you the board number which had the error.  Unfortunately I'm
not sure how the boards are laid out in a 750.

The "syn" - Syndrome and following number "73" can be used to
figure out which chip on the board failed.

One last note:  I say "failed" above, but be aware that this generally
only means that one single bit out of a large number happened to 
change state.  With high density memory chips, this sort of thing is
not entirely unexpected, hence they build the boards with ecc logic
to correct the occasional expected bit flip.  Mcrx soft errors can
be ignored almost indefinitely, unless they start occurring in such
numbers that you think a whole chip has failed.  Even if a whole chip
fails, you can probably "limp along" for quite a while, assuming there
are no other problems on that memory board.  

-- 
Thomas Hacker               ...Weave a circle round him thrice,
Systems Programmer             And close your eyes with holy dread, 
Oakland University	       For he on honeydew hath fed, --"Kubla Khan" 
hackertj@unix.secs.oakland.edu And drunk the milk of Paradise. -- ST Coleridge

parmelee@wayback.cs.cornell.edu (Larry Parmelee) (07/22/89)

In article <117@egrunix.UUCP> hacker@egrunix.UUCP (Thomas J Hacker) writes:

> > Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73

> "addr" and the following number, "1a72", can be used to figure
> out which board was failing.  You need to know how much memory is
> on each board, and multiply the "1a72" number by 4, since the ecc
> logic looks at memory in 4-byte chunks:  (1a72*4) mod (bytes per board)
> gives you the board number which had the error.  Unfortunately I'm
> not sure how the boards are laid out in a 750.

Oops.  I just read what I wrote, and realized I meant "div" - Integer
division, not "mod":   (1a72*4) div (bytes per board).  Oh well.
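
As a rough sketch of the corrected arithmetic (Python used only for
illustration): the address is read as hex, since the message is printed
with %x, then scaled by 4 and integer-divided by the board size.  The
1 MB board size and the name board_index are assumptions, not facts
about the 750; substitute the real figure for your boards.  Whether the
result counts boards from zero, and how it maps to a physical slot,
depends on the machine.

```python
def board_index(addr_hex: str, bytes_per_board: int, word_size: int = 4) -> int:
    """Return (word address * word_size) div bytes_per_board."""
    word_addr = int(addr_hex, 16)          # "1a72" is hexadecimal
    byte_addr = word_addr * word_size      # ecc logic works in 4-byte chunks
    return byte_addr // bytes_per_board    # integer division ("div"), not "mod"


# Hypothetical board size of 1 MB, chosen only for the example.
ONE_MB = 1024 * 1024
print(board_index("1a72", ONE_MB))
```

With these assumed numbers the error lands on the first board (index 0),
as 0x1a72 * 4 is only about 27 KB into memory.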

-Larry Parmelee
parmelee@cs.cornell.edu

chris@mimsy.UUCP (Chris Torek) (07/24/89)

[re `mcr%d: soft ecc addr %x syn %x' errors]

In article <117@egrunix.UUCP> hacker@egrunix.UUCP (Thomas J Hacker) writes:
>... As long as you only see "soft" errors, and they don't occur "too
>often", you can just ignore them forever.

This is ill-advised.  The purpose behind error-detecting-and-correcting
memory is to fix the errors *and* provide a report so that failing chips
can be replaced when it is convenient to halt the machine, rather than
immediately after losing whatever was in progress.

>("too often": we had a 780 that would routinely report 10-12 of
>those mcr0 errors per hour, and other than wasting console paper,
>caused no other apparent problems.  It was like this for years.)

4BSD shuts off further error reports for ten minutes after each error,
so a machine that reports six errors per hour probably has at least one
hard failure (by this I mean `one chip that is really, truly bad':
both `soft' and `hard' ECC errors can be due to either `soft' or `hard'
hardware errors; a soft hardware error is like the noise your car makes
whenever it is *not* in the shop).  In this case a single stray cosmic
ray or alpha particle can bring the machine down with an uncorrectable
double-bit error, or, worse, corrupt two or more bits undetectably.
Running with a known hard failure is rather like driving your Honda
around when one cylinder is out---it works, but you should fix it as
soon as you possibly can.
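
To see why the logged rate says little about the true rate, here is a
small model of the ten-minute shutoff (a sketch in Python, not the
kernel's code; reports_logged and holdoff_s are names made up for the
example).  Each logged error suppresses further reports for holdoff_s
seconds, so the log tops out at about six entries per hour no matter
how many errors really occur.

```python
def reports_logged(error_times_s, holdoff_s=600):
    """Count errors that get logged when each report silences
    further reports for holdoff_s seconds."""
    logged = 0
    quiet_until = float("-inf")
    for t in sorted(error_times_s):
        if t >= quiet_until:
            logged += 1
            quiet_until = t + holdoff_s
    return logged


# One hour of errors: one per second versus one every ten minutes.
every_second = range(3600)
every_ten_min = range(0, 3600, 600)
print(reports_logged(every_second), reports_logged(every_ten_min))
```

Both cases log the same number of reports, so under this model a
steady six per hour cannot distinguish an occasional error from a chip
that fails on every access.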
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris