jim (03/31/83)
I forgot what I was going to say... Oh yeah!

Last week I started getting Very Strange Behavior on my 750 running 4.1bsd. Ordinary programs, like "cat" and "ls", started dumping core. The machine crashed twice in two hours. Suspecting hardware, I ran diagnostics and turned up a bad memory board.

Now I don't mind things breaking occasionally. Beaver has been pretty good to me, compared to some other machines around here (no names). But it disturbs me that bad memory first manifested itself in flakey behavior, rather than in some notification from the kernel of parity errors. Why didn't my Unix kernel catch this problem?

I would be interested in hearing from anyone who has had memory go out on them, and what the kernel did (or did not) tell them about the condition of the machine.

Jim Rees
Jim@uw-beaver (Arpa or uucp)
ptw (04/01/83)
We had problems with some El-cheapo (non-DEC) memory in our 750, which were being incorrectly reported as "tbuf par err". We got a patch to machdep.c (attributed to S. Leffler) verbally before we were on USENET that was supposed to cure our translation buffer problems, while we waited for DEC to fix the hardware (we're still waiting).

It turned out that we were not really having that many translation buffer problems (the hint was that they always got worse when we put our flaky-puff memory on line). A careful examination of the hardware revealed that Unix was not being careful enough and was interpreting bus errors (memory double-bit errors) as translation buffer errors. I suspect that if you had only the Leffler change, you might just continue on after a memory error (thinking you had cleared the translation error) and have funny things happen.

We made the following change to Leffler's change, so that unrecoverable memory errors crash, but real translation buffer errors are still cleared and processing continues. (There probably should be a limit on the number of times you will clear and continue in a certain time period...) We have also returned our El-cheapo's.

583d582
< #define MC750_tbpar 2
623a623
> #define MC750_tbpar 4
646a647
> mtpr(TBIA, 0); /* assume bad, ala VMS */
690c691
< if ((type&0xf) == MC750_tbpar) {
---
> if ((mcf->mc5_mcesr&0xf) == MC750_tbpar) {
692d692
< mtpr(TBIA, 0);

The complete change to machdep.c from the 4.1 sources is as follows:

572a573,574
> * Except on translation buffer errors, which are recoverable by invalidating
> * the buffer and continuing.
620a623
> #define MC750_tbpar 4
643a647
> mtpr(TBIA, 0); /* assume bad, ala VMS */
680c684
< printf("\tva %x errpc %x mdr %x smr %x tbgpar %x cacherr %x\n",
---
> printf("\tva %x errpc %x mdr %x smr %x rdtimo %x tbgpar %x cacherr %x\n",
686a691,694
> if ((mcf->mc5_mcesr&0xf) == MC750_tbpar) {
> printf("tbuf par: flushing and returning\n");
> return;
> }

P. Tucker Withington
Automatix Incorporated
...decvax!{wivax,genrad}!linus!vaxine!ptw
(617) 667-7900 x2044
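
The post above suggests limiting how often a translation buffer error may be cleared within some time period, but gives no code for it. What follows is only a rough sketch of that idea as a standalone C function, not part of the posted patch or of 4.1bsd. The summary code 4 and the `&0xf` mask are taken from the diff; everything else (`mc_decide`, `struct tb_history`, `MAX_CLEARS`, `WINDOW_SECS`, and the choice of 3 clears per 60 seconds) is hypothetical and chosen purely for illustration.

```c
/* Summary code for a translation buffer parity error, per the posted diff. */
#define MC750_TBPAR 4

/* Hypothetical limits, not from the post: at most MAX_CLEARS recoveries
 * inside any WINDOW_SECS-second window. */
#define MAX_CLEARS   3
#define WINDOW_SECS 60

enum mc_action { MC_CONTINUE, MC_PANIC };

/* Hypothetical bookkeeping: times of the most recent recoveries. */
struct tb_history {
	long stamps[MAX_CLEARS];	/* ring buffer of clear times, in seconds */
	int  count;			/* total clears recorded so far */
};

/*
 * Decide what to do with a machine check.
 *   mcesr: error summary value from the machine check frame
 *   now:   current time in seconds, passed in by the caller
 * A real handler would invalidate the translation buffer before
 * continuing; this sketch only makes the decision.
 */
enum mc_action
mc_decide(struct tb_history *h, unsigned mcesr, long now)
{
	int slot;

	/* Anything but a TB parity error is treated as unrecoverable. */
	if ((mcesr & 0xf) != MC750_TBPAR)
		return MC_PANIC;

	/* Once the ring is full, this slot holds the oldest recorded clear. */
	slot = h->count % MAX_CLEARS;

	/* Oldest of the last MAX_CLEARS clears still in the window: give up. */
	if (h->count >= MAX_CLEARS && now - h->stamps[slot] < WINDOW_SECS)
		return MC_PANIC;

	h->stamps[slot] = now;
	h->count++;
	return MC_CONTINUE;
}
```

A caller would keep one zero-initialized `struct tb_history` and crash on `MC_PANIC`. Passing the time in as an argument keeps the function free of any clock dependency, which is an assumption made here only to keep the sketch easy to exercise.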