[net.unix-wizards] I have memory problems on my 750 4.1bsd

jim (03/31/83)

I forgot what I was going to say...

Oh yeah!  Last week I started getting Very Strange Behavior on my 750
running 4.1bsd.  Ordinary programs, like "cat" and "ls", started
dumping core.  The machine crashed twice in two hours.  Suspecting
hardware, I ran diagnostics and turned up a bad memory board.

Now I don't mind things breaking occasionally.  Beaver has been pretty
good to me, compared to some other machines around here (no names).
But it disturbs me that bad memory first manifested itself in flaky
behavior, rather than in some notification from the kernel of parity
errors.  Why didn't my Unix kernel catch this problem?

I would be interested in hearing from anyone who has had memory go out
on them, and what the kernel did (or did not) tell them about the
condition of the machine.

	Jim Rees
	Jim@uw-beaver (Arpa or uucp)

ptw (04/01/83)

     We had problems with some El-cheapo (non-DEC) memory in our 750; the
errors were being incorrectly reported as "tbuf par err".  Before we were on
USENET we were given, verbally, a patch to machdep.c (attributed to
S. Leffler) that was supposed to cure our translation buffer problems while
we waited for DEC to fix the hardware (we're still waiting).

     It turned out that we were not really having that many translation buffer
problems (the hint was that they always got worse when we put our flaky-puff
memory on line).  A careful examination of the hardware revealed that Unix was
not being careful enough: it was interpreting bus errors (memory double-bit
errors) as translation buffer errors.

     I suspect that if you had only the Leffler change, you might just
continue on after a memory error (thinking you had cleared the translation
error) and have funny things happen.

     We made the following change to Leffler's change, so that unrecoverable
memory errors crash, but real translation buffer errors are still cleared and
processing continues.  (There probably should be a limit on the number of
times you will clear and continue in a given time period.)  We have also
returned our El-cheapos.

583d582
< #define MC750_tbpar 2
623a623
> #define MC750_tbpar 4
646a647
> 		mtpr(TBIA, 0);                  /* assume bad, ala VMS */
690c691
< 		if ((type&0xf) == MC750_tbpar) {
---
> 		if ((mcf->mc5_mcesr&0xf) == MC750_tbpar) {
692d692
< 		    mtpr(TBIA, 0);
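
     The limit on clear-and-continue suggested in the parenthetical above
could look something like the following.  This is only a minimal sketch in
portable C, not code from 4.1bsd: the MC750_TBPAR value and the 0xf mask come
from the diff, but TB_MAX_CLEARS, TB_WINDOW_SEC, struct tb_limiter and
mc_decide() are hypothetical names and numbers invented for illustration.

```c
#define MC750_TBPAR   4    /* TB parity code, value taken from the diff above */
#define TB_MAX_CLEARS 3    /* hypothetical: clears allowed per window */
#define TB_WINDOW_SEC 60   /* hypothetical: window length in seconds */

enum mc_action { MC_CONTINUE, MC_CRASH };

/* Hypothetical bookkeeping for "how often have we cleared lately?" */
struct tb_limiter {
    long window_start;     /* time (seconds) the current window began */
    int  clears;           /* clear-and-continue count in this window */
};

/*
 * Decide what to do with a machine check.  mcesr stands in for the
 * error status value tested in the diff; now is the current time in
 * seconds.  Anything that is not a translation buffer parity error is
 * treated as unrecoverable, and so is a burst of TB errors.
 */
enum mc_action mc_decide(struct tb_limiter *lim, unsigned mcesr, long now)
{
    if ((mcesr & 0xf) != MC750_TBPAR)
        return MC_CRASH;             /* e.g. a memory error: do not continue */

    if (now - lim->window_start >= TB_WINDOW_SEC) {
        lim->window_start = now;     /* old window expired, start a new one */
        lim->clears = 0;
    }
    if (++lim->clears > TB_MAX_CLEARS)
        return MC_CRASH;             /* too many "recoverable" errors */

    return MC_CONTINUE;              /* caller would flush the TB and return */
}
```

     In a real kernel the MC_CONTINUE case would correspond to the "flushing
and returning" path and MC_CRASH to the panic; the window and count would
need tuning for the machine in question.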


     The complete change to machdep.c from the 4.1 sources is as follows:

572a573,574
>  * Except on translation buffer errors, which are recoverable by invalidating
>  * the buffer and continuing.
620a623
> #define MC750_tbpar 4
643a647
> 		mtpr(TBIA, 0);                  /* assume bad, ala VMS */
680c684
< 		printf("\tva %x errpc %x mdr %x smr %x tbgpar %x cacherr %x\n",
---
> 		printf("\tva %x errpc %x mdr %x smr %x rdtimo %x tbgpar %x cacherr %x\n",
686a691,694
> 		if ((mcf->mc5_mcesr&0xf) == MC750_tbpar) {
> 		    printf("tbuf par: flushing and returning\n");
> 		    return;
> 		    }

			     P. Tucker Withington
			     Automatix Incorporated
			     ...decvax!{wivax,genrad}!linus!vaxine!ptw
			     (617) 667-7900 x2044