[net.bugs.4bsd] mchk 2 --- tbuf error on 750 running 4.2 BSD

jeb@zeta.UUCP (John Berry) (07/23/85)

We are running VAX 11/750's with UNIX 4.2 BSD. We have just had DEC
install REV 7 of the L0003 board, which we hoped would clear up the
mchk 2 --- tbuf error problems. Well, it has not. Can anyone out there
in network land give me any insight into what is happening? DEC cannot
find any problems when they run diagnostics. I am fairly new to reading
the net, so if this has been discussed in the past, my apologies.

spaf@gatech.CSNET (Gene Spafford) (07/25/85)

In article <83@zeta.UUCP> jeb@zeta.UUCP (John Berry) writes:
>
>We are running VAX 11/750's with UNIX 4.2 BSD. We have just had DEC
>install REV 7 of the L0003 board, which we hoped would clear up the
>mchk 2 --- tbuf error problems. Well it has not. Can anyone out there
>in network land give me any insight to what is happening. DEC cannot
>find any problems when they run diagnostics. 

This is an old and frustrating problem.  I've had it show up on at least
4 750's I've worked with.  The problem is, indeed, with the L0003
board.  Let me tell you how it has been explained to me (if anyone has
a more detailed explanation, please let us know).

DEC obtains chips for the L0003 board from a couple of different
sources.  I'm not sure if they subcontract the board out to another
firm or not, but they end up with two different versions of the board
which are identical in stated specs and (almost) identical in
appearance.  As far as acceptance goes, both versions of the board
behave identically under VMS and all the regular field service
diagnostics.

HOWEVER, under Unix, due to the way certain things are done and timed,
one version of the board will repeatedly generate tbuf parity faults
that cannot be recovered from.  The fix is to replace the board with a
copy of the other version.  Once we did that, our 750's in the lab
which crashed an average of 10 times a day have only encountered one
tbuf fault in 6 months.  To get a good board may require many swaps and
trials, because I have heard someone claim that you can't identify one
of the bad boards except by unsoldering chips and looking at the lot
numbers on the underside.

I don't know the specific chips or how to identify which version of the
board you have.  Supposedly, this problem is well known in the Ultrix
support group and some field service offices (along with the RA81
read/write board glitch and the Rev4/RL02 problem, and others) as one
of the strange problems that only shows up when using Unix.  Have your
field service people contact the Ultrix support group.  It is possible
that the Ultrix group may even know of a supply of working L0003 boards
for exactly this situation.

Best of luck!
-- 
Gene "4 months and counting" Spafford
The Clouds Project, School of ICS, Georgia Tech, Atlanta GA 30332
CSNet:	Spaf @ GATech		ARPA:	Spaf%GATech.CSNet @ CSNet-Relay.ARPA
uucp:	...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!spaf

dcmartin@sun.uucp (David C. Martin) (07/27/85)

In article <654@gatech.CSNET> spaf@gatech.UUCP (Gene Spafford) writes:
>In article <83@zeta.UUCP> jeb@zeta.UUCP (John Berry) writes:
>>
>>We are running VAX 11/750's with UNIX 4.2 BSD. We have just had DEC
>>install REV 7 of the L0003 board, which we hoped would clear up the
>>mchk 2 --- tbuf error problems. Well it has not. Can anyone out there
>>in network land give me any insight to what is happening. DEC cannot
>>find any problems when they run diagnostics. 
>
>This is an old and frustrating problem.  I've had it show up on at least
>4 750's I've worked with.  The problem is, indeed, with the L0003
>board.  Let me tell you how it has been explained to me (if anyone has
>a more detailed explanation, please let us know).

Okay, I will.  I already mailed John, but perhaps this could be rehashed
one more time.  The problem does lie in the L0003 board, but the solution 
is easy.  VMS has microcode to alleviate these parity problems, and 
using the /boot program which reads microcode off the disk, the problem
can be easily solved.  Mike Karels wrote up a patch and we have been running
it at UC Berkeley for quite some time with favorable results.  If there is
sufficient need, I will dig this up for those of you who need it.  The
microcode-loading program was previously posted to the net, so check your
archives for that.

-- 

David C. Martin - Sun Microsystems / UC Berkeley
uucp: ..!ucbvax!sun!dcmartin                   usps: 2280 California St #8
arpa: dcmartin@Berkeley                              Mountain View, CA 94040
at&t: 415/960-7458 (O) - 415/967-0506 (H)

rees@apollo.uucp (Jim Rees) (07/30/85)

Actually, the bsd4.2 tape we got still didn't have the right stuff for
tbuf parity errors.  In machdep.c, the line

		if ((mcf->mc5_mcesr&0xf) == MC750_TBPAR) {

should be

		if ((mcf->mc5_mcesr&0xe) == MC750_TBPAR) {

The low-order bit indicates whether the error occurred in the execution
buffer or not.  Since we don't care where the error occurred, we just
mask out that bit.  See the section on "Machine Check Error Summary
Register" in the VAX Hardware Handbook.
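
The effect of the mask change can be sketched in isolation.  This is
only an illustration: the value given to MC750_TBPAR below is an
assumption made for the example (check the definition in your own
machdep.c), and the function names are hypothetical, not kernel code.

```c
#include <assert.h>

/* ASSUMED value, for illustration only; see your own machdep.c. */
#define MC750_TBPAR 0x4

/* Test as distributed: compares all four low bits of mcesr. */
int tbpar_distributed(unsigned mcesr)
{
    return (mcesr & 0xf) == MC750_TBPAR;
}

/* Corrected test: bit 0 (execution buffer indicator) is masked out. */
int tbpar_corrected(unsigned mcesr)
{
    return (mcesr & 0xe) == MC750_TBPAR;
}

/* Worked cases under the assumed constant. */
void tbpar_examples(void)
{
    /* Bit 0 clear: both tests recognize the fault. */
    assert(tbpar_distributed(0x4) && tbpar_corrected(0x4));
    /* Bit 0 set: only the corrected test recognizes it. */
    assert(!tbpar_distributed(0x5) && tbpar_corrected(0x5));
}
```

Under that assumption, a summary register value with the low bit set
is missed by the 0xf mask and caught by the 0xe mask, which is the
whole point of the one-character change.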

Some other early fixes caught the tbuf par error fine, but then failed
to flush the buffer before returning, resulting in an endless loop.

The 4.1 code was even worse.  As I recall, hard errors were mistakenly
reported as tbuf parity errors, and soft errors were ignored, but I
could have that the wrong way around.  As far as I know all versions
of 4.2 fix this problem.

I have all this on good authority, but if I'm wrong about the TB flush
also flushing the XB, please set me straight.

grandi@noao.UUCP (Steve Grandi) (07/30/85)

> Okay, I will.  I already mailed John, but perhaps this could be rehashed
> one more time.  The problem does lie in the L0003 board, but the solution 
> is easy.  VMS has microcode to alleviate these parity problems, and 
> using the /boot program which reads microcode off the disk, the problem
> can be easily solved.  Mike Karels wrote up a patch and we have been running

Unfortunately, loading the proper microcode is not the complete solution.
Witness the following console output--

Jul  4 02:10
machine check 2: cp tbuf par fault
	va 802246f4 errpc 8001433b mdr 505 smr 8 rdtimo 0 tbgpar 3 cacherr 1
	buserr 8 mcesr c pc 80014336 psl c00000 mcsr 80318
panic: mchk
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand

4.2 BSD UNIX #5: Mon Jun 24 17:12:19 MST 1985
real mem  = 5238784
avail mem = 4198400
using 231 buffers containing 524288 bytes of memory
etc.

Maybe the combination of microcode rev. 98 (which we are already using) and 
the rev. 7 L0003 board (which will be installed Someday, Real Soon Now)
will cure the problem and eliminate these irritating crashes.  But I doubt it.

Now the real question: Does anyone know why the system sometimes goes into
the mchk/Reserved operand panic loop shown above instead of trying its normal
recovery?  This happens on about half of our tbuf parity faults.
-- 
Steve Grandi, National Optical Astronomy Observatories, Tucson, AZ, 602-325-9228
{arizona,decvax,hao,ihnp4,seismo}!noao!grandi  noao!grandi@lbl-csam.ARPA

mp@allegra.UUCP (Mark Plotnick) (07/31/85)

I think the reserved operand panic loop can be fixed by changing
asm.sed so that spl1() [which is called in boot()] sets the priority
to something higher than 0; as 4.2bsd was distributed, spl1() does the
same thing as spl0().
I believe this is one of the old RWS@XX bug fixes.

spaf@gatech.CSNET (Gene Spafford) (08/01/85)

In article <2496@sun.uucp> dcmartin@sun.UUCP (David C. Martin) writes:
>Okay, I will.  I already mailed John, but perhaps this could be rehashed
>one more time.  The problem does lie in the L0003 board, but the solution 
>is easy.  VMS has microcode to alleviate these parity problems, and 
>using the /boot program which reads microcode off the disk, the problem
>can be easily solved.  Mike Karels wrote up a patch and we have been running
>it at UC Berkeley for quite some time with favorable results.  If there is
>sufficient need, I will dig this up for those of you who need it, the microcode
>loading program was previously posted to the NET, so check your archives for 
>that.
>
Nope, that isn't the whole fix.  The microcode fix only cures about 1/4
to 1/3 of the tbuf crashes (from our experience with the 3 750s in our
lab).  I installed the microcode-loading boot just about a week after
the machines came in, and it didn't cure the problem.  The new microcode
fixes a different bug that causes tbuf faults.

Also, before anyone posts something about how the whole thing can be
cured by a patch to the machine check processing code -- I know about
that patch too, and it doesn't fix the problem.

To repeat, the problem is a well known HARDWARE problem, and if your
field service people don't believe it, tell them to call the Ultrix
support center for confirmation; everybody there should know all about
the problem. Most of the old boards with the bad lot of chips (I have
been told that the only way to identify some of them is to unsolder the
chips and read the lot numbers off the bottom) have been replaced or
installed in VMS systems where the problem will go unnoticed.
Unfortunately, some field service people don't know about the problem,
or blame it on Unix (because they don't understand).  One site I know
of had the field engineer swap out the L0003 board twice, and the
problem didn't go away.  He claimed that it had to be Unix, and as a
non-supported product he was not responsible for anything else.  The
problem was that the two boards he swapped out were spares that had
been sitting at the local office for months, and they had the faulty
chips.  Don't let this happen to you!

-- 
Gene "4 months and counting" Spafford
The Clouds Project, School of ICS, Georgia Tech, Atlanta GA 30332
CSNet:	Spaf @ GATech		ARPA:	Spaf%GATech.CSNet @ CSNet-Relay.ARPA
uucp:	...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!spaf

chris@umcp-cs.UUCP (Chris Torek) (08/01/85)

There is another problem with panic: you can get bogus "panic: sleep"s.
I fixed this a while back.  In /sys/sys/kern_synch.c, change the top of
sleep() to look like this:

sleep(chan, pri)
	caddr_t chan;
	int pri;
{
	register struct proc *rp, **hp;
	register s;

	rp = u.u_procp;
	s = spl6();
	if (panicstr) {
		/*
		 * Let interrupts in for a moment, then just return.
		 * The splnet() really ought to be spl0(), but I'm
		 * too timid to do that.
		 */
		(void) splnet();
		splx(s);
		return;
	}
	if (chan == 0 || rp->p_stat != SRUN || rp->p_rlink)
		panic("sleep");
	.
	.
	.

(The splnet lets network interrupts through, so that the network
disk stuff (remote mount file systems) can finish syncing.  splnet
is also < spl6, so disk interrupts get through too.)
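
The control flow of that early return can be modelled in user space.
This is a rough sketch, not kernel code: the priority numbers are
illustrative stand-ins, and setipl(), model_sleep(), and window_ipl are
hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative priority levels; assumed, not taken from the kernel. */
enum { IPL_0 = 0, IPL_NET = 12, IPL_6 = 22 };

static int cur_ipl = IPL_0;          /* stand-in for the processor priority */
static const char *panicstr = NULL;  /* non-null once a panic is in progress */
static int window_ipl;               /* level held just before the restore */

/* Set a new priority and return the previous one, as the spl calls do. */
int setipl(int new_ipl) { int old = cur_ipl; cur_ipl = new_ipl; return old; }
int spl6(void)   { return setipl(IPL_6); }
int splnet(void) { return setipl(IPL_NET); }
void splx(int s) { cur_ipl = s; }

/* Returns 1 if the panic-time early return was taken, 0 otherwise. */
int model_sleep(void)
{
    int s = spl6();
    if (panicstr) {
        (void) splnet();       /* drop below spl6 so pending interrupts get in */
        window_ipl = cur_ipl;
        splx(s);               /* restore the caller's priority and give up */
        return 1;
    }
    window_ipl = cur_ipl;      /* the normal sleep path would continue here */
    splx(s);
    return 0;
}

/* During a panic the function returns early after opening the window. */
void sleep_model_example(void)
{
    panicstr = "mchk";
    assert(model_sleep() == 1 && window_ipl == IPL_NET);
    panicstr = NULL;
}
```

The model only shows the ordering: raise to spl6, test panicstr, drop
briefly to the network level, then restore whatever level the caller
held on entry.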
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland