[net.unix-wizards] What is this panic?

spaf@stratus.UUCP (Gene Spafford) (11/28/84)

We've just recently brought up 4.2 BSD on 3 750 Vaxen.  Each Vax is
configured with 2 or 3Mbyte memory, Rev 7 CPU boards, DEUNA ethernet
drivers, a DZ-32 board, an RL02 disk, and a UDA50 disk controller
running 1, 2, or 3 RA81 disks.

All three machines keep dying with a (claimed) tbuf parity problem.
However, the value in the mcesr register indicates a bus error rather
than a tbuf error.  The PC of the last few faults was in some of the
UDA50 code (udrsp) or the hardclock routine, while others were in user
address space; this seems to rule out any likely direct correspondence
with any particular software module.

The problem appears whenever the machines are under load, but there is
no sure way to bring on the problem.  Rebuilding 2 or 3 copies of Unix
at the same time seems to bring it on regularly, but not all the time.
This problem is rather annoying, to say the least, and I've had little
success either tracking the problem down or getting much co-operation
out of some of our local DEC people ("If it isn't a problem that occurs
with DEC software, it isn't our problem.").

Has anybody out there seen this before?  Anybody have a fix or
suggestion where I go from here?  If so, please drop me some mail (I
don't always have time to read my news in this group).  I'm enclosing
some samples of the error summary printed on the console (and log)
whenever the problem occurs.

Thanks in advance.
--gene


machine check 2: cp tbuf par fault
	va 12f90 errpc 6dfc mdr aaaaaaaa smr b rdtimo 0 tbgpar 0 cacherr 5
	buserr 6 mcesr 9 pc 6df0 psl 3c00004 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
	va 7fffeb38 errpc 157d mdr 2d smr b rdtimo 0 tbgpar 0 cacherr 5
	buserr 6 mcesr 9 pc 157b psl 3c00004 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
	va 7fffec6c errpc c5f8 mdr 0 smr b rdtimo 0 tbgpar 0 cacherr 4
	buserr 6 mcesr 9 pc c5f8 psl 3c00000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
	va 8017dfd4 errpc 8001c1f8 mdr 7c smr 8 rdtimo 0 tbgpar 0 cacherr 5
	buserr 6 mcesr 9 pc 8001c1f3 psl c00000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
	va 800336c4 errpc 800271fc mdr 0 smr 8 rdtimo 0 tbgpar 0 cacherr 4
	buserr 6 mcesr 9 pc 800271f9 psl 4150004 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
	va 800336c4 errpc 800271fc mdr 0 smr 8 rdtimo 0 tbgpar 0 cacherr 5
	buserr 6 mcesr 8 pc 800271f9 psl 4150000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
	va 800336c4 errpc 800271fc mdr 0 smr 8 rdtimo 0 tbgpar 0 cacherr 4
	buserr 6 mcesr 8 pc 800271f9 psl 4150000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand

machine check 2: cp tbuf par fault
	va 7ffff1a4 errpc 8000b188 mdr 7ffff184 smr 8 rdtimo 0 tbgpar 0 cacherr 5
	buserr 6 mcesr 9 pc 8000b186 psl c00000 mcsr 80016
panic: mchk
trap type 2, code = 0, pc = 80000d76
panic: Reserved operand
-- 
Off the Wall of Gene Spafford
The Clouds Project, School of ICS, Georgia Tech, Atlanta GA 30332
CSNet:	Spaf @ GATech		ARPA:	Spaf%GATech.CSNet @ CSNet-Relay.ARPA
uucp:	...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-sally}!gatech!spaf
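
The console reports in the post above share a fixed layout: a header
line, then indented lines of alternating register names and hexadecimal
values. What follows is only a minimal sketch of tallying such reports,
assuming that layout holds for every report. The names
parse_machine_checks, pc_space and summarize are illustrative and do not
come from any BSD tool. On the VAX, system-space virtual addresses have
the high bit set (0x80000000 and up), so the pc field can be sorted into
user or kernel space without a symbol table.

```python
from collections import Counter

# VAX system (kernel) virtual addresses start at 0x80000000.
SYSTEM_SPACE_BASE = 0x80000000

# Two of the reports from the post, copied verbatim (tabs written as \t).
SAMPLE_LOG = """\
machine check 2: cp tbuf par fault
\tva 12f90 errpc 6dfc mdr aaaaaaaa smr b rdtimo 0 tbgpar 0 cacherr 5
\tbuserr 6 mcesr 9 pc 6df0 psl 3c00004 mcsr 80016
panic: mchk
machine check 2: cp tbuf par fault
\tva 8017dfd4 errpc 8001c1f8 mdr 7c smr 8 rdtimo 0 tbgpar 0 cacherr 5
\tbuserr 6 mcesr 9 pc 8001c1f3 psl c00000 mcsr 80016
panic: mchk
"""


def parse_machine_checks(text):
    """Return one dict of register name -> int per machine check report."""
    reports = []
    current = None
    for line in text.splitlines():
        if line.startswith("machine check"):
            current = {}
            reports.append(current)
        elif current is not None and line[:1] in ("\t", " "):
            tokens = line.split()
            # Indented dump lines are alternating "name hexvalue" pairs.
            for name, value in zip(tokens[::2], tokens[1::2]):
                current[name] = int(value, 16)
    return reports


def pc_space(report):
    """Classify the faulting pc as user or system space."""
    return "system" if report["pc"] >= SYSTEM_SPACE_BASE else "user"


def summarize(reports):
    """Count how often each pc space, mcesr and buserr value occurs."""
    return {
        "pc_space": Counter(pc_space(r) for r in reports),
        "mcesr": Counter(r["mcesr"] for r in reports),
        "buserr": Counter(r["buserr"] for r in reports),
    }


if __name__ == "__main__":
    for field, counts in summarize(parse_machine_checks(SAMPLE_LOG)).items():
        print(field, dict(counts))
```

Fed all eight reports from the post, the same tally gives a quick check
on the observation that the faulting PCs fall in both kernel and user
space while buserr and mcesr stay nearly constant.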

irwin@uiucdcs.UUCP (12/01/84)

We have 7 Vax-750s. Ours are all at Rev 3 CPU. We have the exact panic
you have described on 3 of the 7 quite frequently. We have a Ramtek
on one that can cause a crash from time to time, but this is a different
bug. We know that if the CPU were at Rev 7 the Ramtek bug would go away.

We have wondered whether the panic bug would disappear if our machines
were at Rev 7. Yours are at Rev 7 and still do it, so that answers the
question. What causes it, we still do not know.

Our disks are on CMI controllers, some CDC 300MB, some Eagles, some CDC 80MB.
Two of the 7 have hdwr floating point; no two of the seven are configured
alike.

jim@haring.UUCP (12/02/84)

There are rumoured to be problems with the Rev7 upgrade for 750's.
Since we had it done, one of our machines runs like a dream, no
tbuf panics, nothing. The other, which used to run without a crash
between maintenances, now crashes every now and then with a 'wtimo'
panic (write timeout to memory). The first machine has all DEC memory,
the second has 1M DEC, 3M National Semiconductor. Like Gene's machine,
it seems to die under heavy load.

Does anyone know what the junk pushed on the stack with a write-timeout
error means? Is it any use in finding which memory board was in error?

Jim McKie    Centrum voor Wiskunde en Informatica, Amsterdam    mcvax!jim

irwin@uiucdcs.UUCP (12/04/84)

What is this panic? I had the opportunity to sit down and have a talk
with the manager of our local branch of DEC.

To quote, "The 750s have a problem with translation buffer parity errors
when running 4.2BSD, if the Rev 7 has not been installed. (tbuf panics)
These errors go away if the machine is brought to Rev 7. In addition to
this, 4.2BSD also has problems with cache memory parity errors, which
also cause panic type crashes. These will be fixed with Rev 8, which will
be available in the spring of '85. There are fixes in our software, both
VMS and our version of UNIX, which gets around the bug, but not in 4.2."

I let him read the base note, to which this response is made, and pointed
out that the author stated that their machine was Rev 7 and was having
tbuf panics. He said that is the first case he knows of where that type
of crash was a problem on a Rev 7 machine running 4.2, and that there
may well be a hardware problem and a board may need to be replaced.

As to the memory time outs in the response before mine, it well may be
that the bug is a bad memory board. If there is a steady stream of errors,
the controller may be busy correcting errors, while at the same time getting
more errors, which hangs the controller so long that it does not respond
in time and a time out is declared. If 4.2 is being run on this machine,
it usually reports memory errors, and a look in /usr/adm/messages will
verify if there is a bad board. If there is, and the board can be removed,
it may be that the time out bug will go away.

bruce@godot.uucp (12/10/84)

In article <13700084@uiucdcs.UUCP> irwin@uiucdcs.UUCP writes:
>To quote, "The 750s have a problem with translation buffer parity errors
>when running 4.2BSD, if the Rev 7 has not been installed. (tbuf panics)
>These errors go away if the machine is brought to Rev 7. In addition to
>this, 4.2BSD also has problems with cache memory parity errors, which
>also cause panic type crashes. These will be fixed with Rev 8, which will
>be available in the spring of '85. There are fixes in our software, both
>VMS and our version of UNIX, which gets around the bug, but not in 4.2."

He's being a little hard by half-blaming 4.2bsd for what is a DEC
hardware bug.  Pre rev 7 750s have a problem with tbuf par errors
regardless of the software they are running.  Some of the errors are
recoverable and some are not; the proportion and frequency depend on the
particular board set.

VMS does go through more contortions trying to recover from it; the last
I looked, it tries to recover, and if it can't, it terminates the
running process if in user mode or crashes if in kernel mode.  4.2
panics when it can't immediately recover.  My machine likes to take
simultaneous tbuf parity, bus, and cache errors, which will also cause
VMS to give up the ghost.

I can't speak for rev 7; DEC has failed to show up twice this week to
install it (but what's another few days, I was originally told to expect
it in August).  Actually, after going through 4 different L0003 modules,
we finally have one which rarely gives the tbuf problem.

There was also a bug in the original 4.2bsd tape in which a status word
was masked incorrectly to determine whether it was a tbuf error, causing
it to fail to try to recover on about half of the tbuf errors.
-- 
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
  ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa
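
The post above does not say what the incorrect mask was. Purely as an
illustration of how a wrong mask can make a test miss about half of the
cases, here is a sketch with invented bit values. TB_PARITY, OTHER_BIT
and both masks are hypothetical; they are not the real 11/750 register
layout and not the 4.2BSD source. If the mask keeps one unrelated bit
that may or may not be set, the equality test fails whenever that bit is
set.

```python
# All bit values here are hypothetical, chosen only for illustration;
# they are NOT the real VAX-11/750 register layout or the 4.2BSD code.
TB_PARITY = 0x4   # hypothetical "translation buffer parity" bit
OTHER_BIT = 0x2   # hypothetical unrelated bit that may also be set

WRONG_MASK = TB_PARITY | OTHER_BIT  # mask keeps one bit too many
RIGHT_MASK = TB_PARITY              # mask keeps only the bit being tested


def is_tbuf_error(status, mask):
    """True if the masked status word equals the tbuf parity pattern."""
    return (status & mask) == TB_PARITY


def detection_rate(mask):
    """Fraction of 4-bit status words with TB_PARITY set that are caught."""
    tbuf_words = [s for s in range(16) if s & TB_PARITY]
    caught = [s for s in tbuf_words if is_tbuf_error(s, mask)]
    return len(caught) / len(tbuf_words)


if __name__ == "__main__":
    print("wrong mask:", detection_rate(WRONG_MASK))
    print("right mask:", detection_rate(RIGHT_MASK))
```

With these made-up values the wrong mask recognizes half of the status
words that have the parity bit set, and the right mask recognizes all of
them. Whether the actual bug had this shape is an assumption.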

irwin@uiucdcs.UUCP (12/12/84)

A note to Bruce, it was not my intent to make it sound as if it was the
software that is at fault. It is 100% the problem of DEC's hardware and
0% the fault of the 4.2 software. I was <not> one half blaming 4.2BSD.

If it came across that way, I guess I did not express myself correctly.
I did say the 750 was having problems (tbuf) and 4.2 was having problems
(cache). I should have said the 750 was having problems in both cases.
That was the intent, I did not realize that saying 4.2 is having problems
running on the 750 would be taken to mean that 4.2 was the bad guy, when
the same paragraph states that rev 8 (hdwr fix) will cure it.

My apologies for not making it clearer.

irwin@uiucdcs.UUCP (12/12/84)

I probably should have pointed out in my response to Bruce, that the
paragraph in question was a quote of the gentleman from DEC. I was
trying to put it down as he had stated it, so that was the reason it
was written the way it was.