[net.unix-wizards] VAX 750 tbuf machine checks

RSanders@USGS2-MULTICS.ARPA (10/29/85)

I've been mostly ignoring the stuff about mchk 2's on VAX 750s because
mine rarely showed more than 2 per week.  However, last week we added a
TC7000 Massbus tape controller, so I genned a new kernel to talk to it.
Magically, this new kernel (not currently being used) generates mchk
2's roughly every 2 minutes - quite unacceptable.

Now, the question is, can I avoid the Rev 7/load microcode upgrade and
still talk to my new tape drive?  Obviously, something about the new
kernel is really pounding on the tbuf parity problem.  Any suggestions?
What was the final resolution to the discussions, anyway?

-- Rex

p.s.  config files available on request.

p.p.s.  obviously, we are running 4.2 BSD.

speck%cit-vlsi@CIT-VAX.ARPA (Don Speck) (10/31/85)

>   550 rsanders@usgs2-multics.arpa... User unknown
Sigh, have to post to everybody.

    Make sure that you are actually getting tbuf errors.
There are about a dozen problems that are printed out under
a "cp tbuf par flt" banner.  If the cache err register is
non-zero, then you actually have a cache error, etc.  Get
out your VAX hardware handbook.
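
    The check described above can be sketched in C.  This is only an
illustration, not the 4.2BSD machine check handler: the struct, its field
names, the enum, and classify_mchk() are all hypothetical, and the tbuf
test in the second branch is an assumption added for completeness.  The
only rule taken from the text is that a non-zero cache error register
means a cache error rather than a tbuf fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of two registers saved at machine-check time.
 * Field names are illustrative; they are NOT the real 4.2BSD frame layout. */
struct mchk_regs {
    uint32_t cache_err;  /* cache error register */
    uint32_t tbuf_par;   /* tbuf parity error register (assumed name) */
};

enum mchk_cause { MCHK_CACHE, MCHK_TBUF, MCHK_OTHER };

/* Decide what a fault printed under the "cp tbuf par flt" banner really is. */
enum mchk_cause classify_mchk(const struct mchk_regs *r)
{
    assert(r != NULL);

    /* Rule from the post: non-zero cache error register => cache error. */
    if (r->cache_err != 0)
        return MCHK_CACHE;

    /* Assumption: only call it a tbuf fault if the tbuf register says so. */
    if (r->tbuf_par != 0)
        return MCHK_TBUF;

    /* Anything else needs the hardware handbook to decode. */
    return MCHK_OTHER;
}
```

The cache register is tested first on purpose, so that a frame with both
registers set is reported as a cache error, matching the advice above.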
    One of our 750's got cache errors about every 15 minutes
of non-idle time.  Yes, it was a Rev. 7 cpu, and yes, I had
the patch to correctly recognize tbuf par faults - neither of
which helped, because it wasn't the tbuf.  Swapping the L0003
board was the cure.  We went 'round and 'round with DEC Field
Circus, with them continually asserting "it's your software",
before we got them to consent to at least try swapping the
board.
	Don Speck	speck@cit-vax.arpa

brown@nicmad.UUCP (11/01/85)

In article <2535@brl-tgr.ARPA> RSanders@USGS2-MULTICS.ARPA writes:
[talks about tbuf parity faults]

Our final solution to the problem, which at first didn't appear very often
but grew to the point that we couldn't keep the thing up and running, was to
get a DEC certified board that was tested and run on UNIX 4.2BSD.  Previous
discussions talked about certain chip sets that gave problems.  Our local
DEC service guy was real helpful in getting us going again.  It took a few
tries at the board, but we got one that hasn't brought us down yet.

KNOCK ON WOOD :-)
-- 

Mr. Video   {seismo!uwvax!|!decvax|!ihnp4}!nicmad!brown

roy@phri.UUCP (Roy Smith) (11/02/85)

> Our final solution to the problem [...] was to get a DEC certified board
> that was tested and run on UNIX 4.2BSD.

	After 18 months of almost flawless operation, our 11/750 has
developed the dreaded "cp tbuf parity error" disease.  I've been going
around in circles with field service for 3 weeks now with no success (can
you say "let's swap the L0003 board again"?)

	DEC even went so far as to suggest that not having my uda-50 at the
end of the Unibus was to blame, and we spent half a day pulling it out of
the system box and moving it to the expansion box.  Note, as a peripheral
matter, that the uda-50 came factory installed in the wrong slot, and all
this bus shuffling was done after we got a hard failure running ECKAL with
the entire Unibus disconnected!  Naturally, now we've started to develop
the dreaded "SDI error; event 053" disease as well.

	Anyway, according to the field disservice branch manager here in
NY, DEC is no longer Unix-testing boards!  From what I can make out, the
group that was doing the Unix-testing doesn't exist any more.  Our F.S.
branch manager is looking into getting them to start up again and do
another batch of boards, but it doesn't look like that's going to happen.

	There seems to be Yet Another redesign of the L0003 module in the
works (the L0003-YA, I think) which will cure the problem, or so they say.
It looks like we're going to be a test site for the new board, so stay
tuned for further details.  Rumor has it that they are working on a rev 8,
which will *not* require people to load any new micro-code.  I don't know
if this new L0003 and rev 8 are the same thing, but I think they might be.

	If this article seems to be liberally sprinkled with phrases like
"seems to be", "looks like", "according to rumor", etc., there is a reason.
All of this is true to the best of my knowledge, but that knowledge is not
guaranteed to be very good.  Corrections and/or clarifications from more
knowledgeable people inside of DEC would be most appreciated.
-- 
Roy Smith <allegra!phri!roy>
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

piet@mcvax.UUCP (Piet Beertema) (11/08/85)

	>Our final solution to the problem ..... was to get a DEC certified
	>board that was tested and run on UNIX 4.2BSD.
	>It took a few tries at the board, but we got one that hasn't brought
	>us down yet.
It occurs to me then, that the only acceptable procedure for DEC to certify
a board should be to test it under 4.2BSD....

-- 
	Piet Beertema, CWI, Amsterdam
	(piet@mcvax.UUCP)

RSanders@USGS2-MULTICS.ARPA (11/13/85)

Our maintenance people claim that by bringing our 750 up to the highest
available rev level, we don't have to load microcode to fix the tbuf
problem.  Since this is not DEC maintenance, and I've never heard this
claim before, can anyone in net-land comment?  The key board swap
(almost everything is being replaced) seems to be from an L0005 board to
L0008 - if this means anything to anyone...

-- Rex
   RSanders.Pascalx@denver.arpa   (CASE is important [dumb mailer])
   ucbvax!menlo70!sanders          (slower than USPS)