[comp.unix.questions] Help on deciphering crash

davem@sdcrdcf.UUCP (David Melman) (12/30/86)

Our Vax 750 running 4.2BSD has occasionally been crashing with:
----------------------------------
machine check 2: cp tbuf par fault
	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
	busserr 6 mcesr 9 pc 8000394e ps1 40c0008 mcsr 80016
panic: mchk
panic: sleep
----------------------------------
Where do I start to find the problem?

Thanks,
David Melman
UNISYS
UUCP: {hplabs, ihnp4, cbatt}!sdcrdcf!davem

chris@mimsy.UUCP (Chris Torek) (12/31/86)

In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>machine check 2: cp tbuf par fault
[lots of registers]
>panic: mchk
>panic: sleep

There are two interrelated fixes for this.  Both are already in
4.3BSD.  The first is that some tbuf parity errors can be corrected
by flushing the translation buffer.  As I recall, 4.2 has code to
do this, but has the wrong test to determine whether it will suffice,
masking with an 0xf somewhere where it should be masking with 0xe.
The second is a `jelloware' (writable control store) fix for a
timing problem in one CPU module.  The 4.3 boot program knows to
load the file `pcs750.bin' into the 750 patch store.  The code to
do this is not terribly large, and is all contained in /sys/stand/boot.c
at your nearest 4.3 site, which also has /pcs750.bin.
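
To see why the choice of mask matters, here is a minimal user-level
sketch.  Only the masks 0xf and 0xe come from the description above;
the flag value TBPAR_BIT and the function name are assumptions made up
for illustration, not copied from the 4.2 or 4.3 source.

```c
/* Hypothetical value for the "tbuf parity" flag in the error summary
 * register; an assumption for this sketch only. */
#define TBPAR_BIT 0x4u

/*
 * Return nonzero if, after masking, the error summary value shows the
 * tbuf parity flag and nothing else that the mask lets through.
 */
int tbpar_only(unsigned int mcesr, unsigned int mask)
{
    return (mcesr & mask) == TBPAR_BIT;
}
```

With these assumed values, a summary of 0x5 (flag plus bit 0) passes
the test under mask 0xe but fails it under mask 0xf, because 0xf keeps
bit 0 in the comparison.  So the wider mask rejects some errors that
the narrower one would treat as recoverable by a flush.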

Incidentally, the `panic: sleep' is due to a bug in sleep that
affects things only after a previous panic.  I fixed this in our
4.2 kernels back when Jim O'Toole and I were writing a kernel XNS.
I was rather amused to find the very same fix in the 4.3-alpha
kernel.  It helps considerably when you crash your machine several
times a day!

Also incidentally, the 4.3 boot program has no way to avoid loading
the /pcs750.bin file, something I consider a bug (now that I have
been bit by it).  We recently had a 750 go down for two weeks.
The long downtime was caused by three virtually simultaneous
failures.  First, one of two CDC9771 HDAs died suddenly.  Second,
our standby disk system (two RK07s) had some sort of controller
backplane problem (considering how often we use the RK07s, it may
have developed long ago).  Third, and only discovered last Friday,
our WCS board went out at the same time as the HDA.  As long as I
did not load the microcode update, the machine would boot.  With
the microcode in place, the machine would hang completely: not even
control-P did anything.  While this hardware failure might be quite
rare, it forced me to consider what would happen if part of
/pcs750.bin were overwritten.  I added another boot flag to
prevent the microcode update.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris@mimsy.umd.edu

mangler@cit-vax.Caltech.Edu (System Mangler) (01/04/87)

In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
> Our Vax 750 running 4.2BSD has occasionally been crashing with:
> machine check 2: cp tbuf par fault
>	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
>	busserr 6 mcesr 9 pc 8000394e ps1 40c0008 mcsr 80016

In article <4891@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> There are two interrelated fixes for this.  Both are already in
> 4.3BSD.  The first is that some tbuf parity errors can be corrected [...]

Read the registers.  This is a cache parity error, not a tbuf parity
error.  Never mind that 4.[23] doesn't distinguish between the two.

We get these all the time.  There are two ways to "fix" it:  swap
L0003 boards until you get a good one ($$$), or change the machine
check handler to flush the cache and return.  Now, can anyone tell
me how to flush the cache?

Don Speck   speck@vlsi.caltech.edu  {seismo,rutgers,ames}!cit-vax!speck

chris@mimsy.UUCP (Chris Torek) (01/04/87)

>In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
>>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>>machine check 2: cp tbuf par fault
>>	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
>>	busserr 6 mcesr 9 pc 8000394e ps1 40c0008 mcsr 80016

>In article <4891@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>>There are two interrelated fixes for this.  Both are already in
>>4.3BSD.  The first is that some tbuf parity errors can be corrected [...]

In article <1419@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu
(System Mangler) writes:
>Read the registers.  This is a cache parity error, not a tbuf parity
>error.  Never mind that 4.[23] doesn't distinguish between the two.

Sure enough.  I never bothered to read the bits, knowing that `this
occurs all the time and is always a tbuf error'.

>We get these all the time.  There are two ways to "fix" it:  swap
>L0003 boards until you get a good one ($$$), or change the machine
>check handler to flush the cache and return.  Now, can anyone tell
>me how to flush the cache?

Maybe the microcode fix helps this too?  I have never seen a cache
error here (but tb errors were extremely rare too: probably a
consequence of our ordering our 750s with Ultrix 1.0 way back when.)

Anyway, you could try disabling the cache:

	mtpr(CADR, 1);	/* CADR is register 0x25 */

but that will probably slow the machine to a crawl.  Disabling
and reenabling the cache might well flush it, though.  If

	mtpr(CADR, 1);
	mtpr(CADR, 0);

does not clear the problem, perhaps reenabling it after a long
delay will:

	mtpr(CADR, 1);
	timeout(cacheenable, (caddr_t) 0, 10*hz);
	...

cacheenable()
{

	mtpr(CADR, 0);
}

But according to the registers I can read above (DEC's latest VAX
Hardware Handbook does NOT include machine check frames---why?),
returning may not help too much in this case, because the machine
check error summary register (mcesr) has the 0x8 bit set, bus error.
Returning to the failed instruction may well not retry the failed
read.  Since it occurred in kernel mode, that might bring the
machine down anyway.
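
When reading a dump like the one quoted above, a small user-level
helper that lists the set bits of a register value saves some mental
arithmetic.  This is only a sketch: the function names are invented,
and it attaches no meaning to any bit, since that depends on the
register and the CPU.

```c
#include <limits.h>
#include <stdio.h>

/* Return 1 if bit number `bit` (0 = least significant) is set. */
int bit_is_set(unsigned long value, unsigned int bit)
{
    if (bit >= sizeof(unsigned long) * CHAR_BIT)
        return 0;               /* out of range: treat as clear */
    return (int)((value >> bit) & 1UL);
}

/* Print the register name, its value in hex, and the set bit numbers. */
void print_set_bits(const char *name, unsigned long value)
{
    unsigned int bit;

    printf("%s = 0x%lx: bits", name, value);
    for (bit = 0; bit < sizeof(unsigned long) * CHAR_BIT; bit++)
        if (bit_is_set(value, bit))
            printf(" %u", bit);
    printf("\n");
}
```

For example, print_set_bits("mcesr", 0x9) prints
"mcesr = 0x9: bits 0 3", that is, mask values 0x1 and 0x8.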
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris@mimsy.umd.edu

dave@onfcanim.UUCP (Dave Martindale) (01/05/87)

In article <1419@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>We get these all the time.  There are two ways to "fix" it:  swap
>L0003 boards until you get a good one ($$$), or change the machine
>check handler to flush the cache and return.  Now, can anyone tell
>me how to flush the cache?

At Waterloo, someone added code to set the "force cache miss" bits,
then access the address that got the parity fault; the idea being that
this might cause the bad cache entry to be cleared.  Without any way
to generate cache errors on demand it's hard to check whether the code
really works as designed.  However, it's been running on perhaps a
dozen machines for a year or two, so it is at least benign.
The code looks like this; it's for a 780 so the magic bits may be
somewhere else on a 750:

			/* Force Cache Miss and Replace */
			mtpr(SBIMT, mfpr(SBIMT) | 0x1e000);
			i = *(int *)mcf->mc8_vaviba;	/* Access address */

			/* Return to normal */
			mtpr(SBIMT, mfpr(SBIMT) & ~0x1e000);
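
The set-then-clear pattern in that fragment can be tried in user space
with an ordinary value standing in for the register.  Only the mask
0x1e000 is taken from the code above; the function names are invented
for this sketch and nothing here touches real hardware.

```c
/* Mask from the fragment above; what the bits mean is hardware-specific. */
#define FORCE_MISS_BITS 0x1e000UL

/* Value to write back so the masked bits are set and all others kept. */
unsigned long force_miss_on(unsigned long reg)
{
    return reg | FORCE_MISS_BITS;
}

/* Value to write back so the masked bits are cleared and all others kept. */
unsigned long force_miss_off(unsigned long reg)
{
    return reg & ~FORCE_MISS_BITS;
}
```

Note that the clear step zeroes all four bits unconditionally.  If any
of them were already set before the first write, "return to normal"
does not restore them; saving the old register value and writing it
back would.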

richards@uiucdcsb.UUCP (01/06/87)

Chris Torek asks:

>  (DEC's latest VAX Hardware Handbook does NOT include machine check
>   frames---why?)

Why?  I don't know.  But, I found them in the new hardbound
"VAX Architecture Reference Manual" (from Digital Press) in Appendix B,
Implementation Dependencies.

Paul Richards	University of Illinois at Urbana-Champaign, Dept of Comp Sci
	UUCP:	{pur-ee,convex,inhp4}!uiucdcs!richards
	ARPA:	richards@b.cs.uiuc.edu

tropp@cthct.UUCP (Ulf Tropp) (01/08/87)

In article <4914@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>>In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
>>>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>>>machine check 2: cp tbuf par fault
>>>	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
>>>	busserr 6 mcesr 9 pc 8000394e ps1 40c0008 mcsr 80016
>
>Anyway, you could try disabling the cache:
>
>	mtpr(CADR, 1);	/* CADR is register 0x25 */
>
>but that will probably slow the machine to a crawl.  Disabling
>and reenabling the cache might well flush it, though.  If
>
>	mtpr(CADR, 1);
>	mtpr(CADR, 0);
>
>does not clear the problem, perhaps reenabling it after a long
>delay will.

We had a lousy cache once that would cause a mchk approximately
once an hour. Since DEC couldn't supply a new board in a week,
I had plenty of time to test recovery code. What I did was essentially:

		mtpr(CADR,1);	/* disable the cache */
		if(mcf->mc5_cacherr&0xe){
			mtpr(CAER,0xf);
			/* fetch offending byte w/o cache */
			if(mcf->mc5_va&0x80000000)	/* system space */
				i = *((char *)mcf->mc5_va);
			else				/* user space */
				i = fubyte(mcf->mc5_va);
			if(mfpr(CAER)&0xe){
				return; /* run without cache */
			}
			printf("Cache reenabled\n");
			mtpr(CADR,0);	/* reenable the cache */
		}
		return;

Probably not entirely correct, but it did seem to work:
the system would mostly return in an orderly way to the aborted
instruction, sometimes going directly into a new mchk a couple of times.
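
The address test in the fragment above chooses between a direct read
and fubyte() by looking at the top bit of the faulting virtual address:
on the VAX, addresses with bit 31 set lie in system space.  A
user-level sketch of just that test (the helper name is invented):

```c
#include <stdint.h>

/* Return 1 if a 32-bit VAX virtual address has bit 31 set (system space). */
int is_system_va(uint32_t va)
{
    return (va & 0x80000000u) != 0;
}
```

The va from the original dump, 80039728, has the top bit set, so the
fragment would take the direct-read branch for it.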

Anyway, does anybody know which instructions can be restarted?
Shouldn't any instruction that can generate a page fault be restartable?

BTW, a comment in the 4.2 tbuf recovery code says "Should we use
pc or errpc.." (when looking at the instruction to return to).
Clearly it must be pc, since that is what we are returning to,
so I changed the 4.2 code.

In-Real-Life: 	Ulf Tropp
		Systems Administrator
		Dept. of Computer Engineering
		Chalmers Univ. of Technology
		S-412 96 Gothenburg
		Sweden

UUCP:	..mcvax!enea!chalmers!cthct!tropp
ARPA:	tropp%cthct.uucp@seismo.CSS.GOV (?)

mouse@mcgill-vision.UUCP (01/11/87)

In article <4914@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> Anyway, you could try disabling the cache:
> 	mtpr(CADR, 1);	/* CADR is register 0x25 */
> but that will probably slow the machine to a crawl.

In case anyone is curious....I just tried this.  To check how slow
the machine was, I used a user-level program that did

	while (! signaled)
	 { count ++;
	 }

and asked for a SIGALRM in ten seconds.  Then the speed measure is how
far count gets.  With the cache turned off it counted only half as far
(plus or minus maybe 10%) as with the cache on.
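
For anyone wanting to repeat the measurement, a complete version of
such a counting loop might look like the sketch below.  It is a guess
at the shape of the test program, not the original; the function and
handler names are invented.

```c
#include <signal.h>
#include <unistd.h>

/* Set by the SIGALRM handler to stop the counting loop. */
static volatile sig_atomic_t signaled;

static void on_alarm(int signo)
{
    (void)signo;
    signaled = 1;
}

/*
 * Spin for `seconds` seconds of wall-clock time and return how far the
 * counter got.  Returns 0 if seconds is 0 (alarm(0) would never fire)
 * or if the handler cannot be installed.
 */
unsigned long count_for_seconds(unsigned int seconds)
{
    volatile unsigned long count = 0;   /* volatile: keep the loop honest */

    if (seconds == 0)
        return 0;
    signaled = 0;
    if (signal(SIGALRM, on_alarm) == SIG_ERR)
        return 0;
    alarm(seconds);
    while (!signaled)
        count++;
    return count;
}
```

Calling count_for_seconds(10) once with the cache on and once with it
off, and dividing the two results, gives the kind of ratio reported
above.  The loop is tiny, so the figure says little about workloads
with larger working sets.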

This was a 750 running MtXinu 4.3+NFS.

					der Mouse

USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!mcgill-vision!mouse
     think!mosart!mcgill-vision!mouse
Europe: mcvax!decvax!utcsri!mcgill-vision!mouse
ARPAnet: think!mosart!mcgill-vision!mouse@harvard.harvard.edu