hacker@egrunix.UUCP (Thomas J Hacker) (07/21/89)
As promised....posting of responses.

Thanks to the following people for responding:

    Larry Parmelee    parmelee@cs.cornell.edu
    Guy Harris        guy@bootme.auspex.com

(Sorry if I forgot anyone else's name)

Re: Disk Problems on a 11/750 running 4.3 BSD

In article <115@egrunix.UUCP> you write:
> So, I thought I would wait a day or two to see if it would repeat,
> then this came up:
>
> Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73

"mcr0" is "Memory ContRoller 0". It is most likely not related to
your disk problems. As long as you only see "soft" errors, and they
don't occur "too often", you can just ignore them forever. ("too
often": we had a 780 that would routinely report 10-12 of those mcr0
errors per hour, and other than wasting console paper, caused no other
apparent problems. It was like this for years.)

Soft/Hard: "soft" means the memory "ecc" (Error Check/Correction)
logic detected an error but was able to correct it (single bit error).
"hard" means the ecc detected an error but couldn't fix it (double bit
error).

"addr" and the following number, "1a72", can be used to figure out
which board was failing. You need to know how much memory is on each
board, and multiply the "1a72" number by 4, since the ecc logic looks
at memory in 4-byte chunks: (1a72*4) mod (bytes per board) gives you
the board number which had the error. Unfortunately I'm not sure how
the boards are laid out in a 750.

The "syn" (Syndrome) and following number "73" can be used to figure
out which chip on the board failed.

One last note: I say "failed" above, but be aware that this generally
only means that one single bit out of a large number happened to
change state. With high density memory chips, this sort of thing is
not entirely unexpected, hence they build the boards with ecc logic to
correct the occasional expected bit flip. Mcrx soft errors can be
ignored almost indefinitely, unless they start occurring in such
numbers that you think a whole chip has failed.
Even if a whole chip fails, you can probably "limp along" for quite a
while, assuming there are no other problems on that memory board.

-- 
Thomas Hacker                    ...Weave a circle round him thrice,
Systems Programmer               And close your eyes with holy dread,
Oakland University               For he on honeydew hath fed,        --"Kubla Khan"
hackertj@unix.secs.oakland.edu   And drunk the milk of Paradise.     -- ST Coleridge
parmelee@wayback.cs.cornell.edu (Larry Parmelee) (07/22/89)
In article <117@egrunix.UUCP> hacker@egrunix.UUCP (Thomas J Hacker) writes:
> > Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73
> "addr" and the following number, "1a72", can be used to figure
> out which board was failing. You need to know how much memory is
> on each board, and multiply the "1a72" number by 4, since the ecc
> logic looks at memory in 4-byte chunks: (1a72*4) mod (bytes per board)
> gives you the board number which had the error. Unfortunately I'm
> not sure how the boards are laid out in a 750.

Oops. I just read what I wrote, and realized I meant "div" (integer
division), not "mod": (1a72*4) div (bytes per board). Oh well.

	-Larry Parmelee
	parmelee@cs.cornell.edu
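
The corrected arithmetic above, (addr*4) div (bytes per board), can be
sketched in a few lines of Python. This is only an illustration: the
names parse_mcr and board_index are made up here, the regular
expression covers only the "soft ecc" message quoted in the thread,
and the 1 MiB board size is a placeholder, since the thread does not
say how memory is laid out on a 750.

```python
import re

# Matches only the "soft ecc" console message quoted in the thread:
#   mcr%d: soft ecc addr %x syn %x
MCR_SOFT_RE = re.compile(r"mcr(\d+): soft ecc addr ([0-9a-fA-F]+) syn ([0-9a-fA-F]+)")

WORD_BYTES = 4  # per the post: the ecc logic looks at memory in 4-byte chunks


def parse_mcr(line):
    """Pull controller number, address and syndrome out of a log line."""
    m = MCR_SOFT_RE.search(line)
    if m is None:
        return None
    return {
        "controller": int(m.group(1)),
        "addr": int(m.group(2), 16),      # hex, counted in 4-byte words
        "syndrome": int(m.group(3), 16),  # hex
    }


def board_index(addr, bytes_per_board):
    """Board number = (addr * 4) div (bytes per board), integer division."""
    return (addr * WORD_BYTES) // bytes_per_board


if __name__ == "__main__":
    line = "Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73"
    rec = parse_mcr(line)
    # ASSUMPTION: 1 MiB per board, purely a placeholder value.
    print(board_index(rec["addr"], 1024 * 1024))  # prints 0
```

With "mod" instead of "div" the result would be a byte offset within a
board rather than a board number, which is why the correction matters.
Substitute the real per-board capacity of your machine before trusting
the answer.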
chris@mimsy.UUCP (Chris Torek) (07/24/89)
[re `mcr%d: soft ecc addr %x syn %x' errors]

In article <117@egrunix.UUCP> hacker@egrunix.UUCP (Thomas J Hacker) writes:
>... As long as you only see "soft" errors, and they don't occur "too
>often", you can just ignore them forever.

This is ill-advised.  The purpose behind error-detecting-and-correcting
memory is to fix the errors *and* provide a report so that failing
chips can be replaced when it is convenient to halt the machine, rather
than immediately after losing whatever was in progress.

>("too often": we had a 780 that would routinely report 10-12 of
>those mcr0 errors per hour, and other than wasting console paper,
>caused no other apparent problems.  It was like this for years.)

4BSD shuts off further error reports for ten minutes after each error,
so a machine that reports six errors per hour probably has at least one
hard failure (by this I mean `one chip that is really, truly bad': both
`soft' and `hard' ECC errors can be due to either `soft' or `hard'
hardware errors; a soft hardware error is like the noise your car makes
whenever it is *not* in the shop).  In this case a single stray cosmic
ray or alpha particle can bring the machine down with an uncorrectable
double-bit error, or, worse, corrupt two or more bits undetectably.

Running with a known hard failure is rather like driving your Honda
around when one cylinder is out---it works, but you should fix it as
soon as you possibly can.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
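
The "six errors per hour" figure follows from the ten-minute shutoff:
with reports suppressed for ten minutes after each one, at most six can
appear in any hour, so a machine that hits that ceiling is probably
erroring continuously. The Python sketch below models only that
arithmetic. The function name reported and the timestamp list are
invented for illustration, and the ten-minute window is taken from the
post, not from the kernel source; how 4BSD actually re-enables
reporting is not described in the thread.

```python
SUPPRESS_SECONDS = 10 * 60  # ten minutes, as stated in the post


def reported(error_times, suppress=SUPPRESS_SECONDS):
    """Return the error timestamps (in seconds) that would be reported
    if reporting is shut off for `suppress` seconds after each report.
    A simplified model, not the kernel's actual mechanism."""
    out = []
    next_allowed = None
    for t in sorted(error_times):
        if next_allowed is None or t >= next_allowed:
            out.append(t)
            next_allowed = t + suppress
    return out


if __name__ == "__main__":
    # A chip failing on every access: one error per second for an hour.
    constant = reported(range(3600))
    print(len(constant))  # prints 6

    # Two isolated bit flips twenty minutes apart are both reported.
    print(len(reported([0, 1200])))  # prints 2
```

Under this model the console cannot tell one error per second apart
from one error every ten minutes; both show up as six reports an hour.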