[comp.unix.wizards] Ultrix1.2-uVaxII crashing - Help requested

hurf@batcomputer.tn.cornell.edu (Hurf Sheldon) (08/12/87)

 I have been getting the appended messages preceding a crash on
 a uVaxII running Ultrix1.2 with the following hardware:
 cpu
 8meg Mnemonix board
 1meg DEC board
 deqna
 dzv-11
 dhv-11
 dhv-11
 dhv-11
 sdc-rqd11-ec SI/sigma ESDI disc controller
 qdss
 tqk50
 rqdx3
 There is swap on both disk controllers.

 This problem does not correlate with the installation of any hardware or software.

 The only consistent things I see are the mser (memory system error
 register), the caer (cpu error address) and the daer (dma error address)

 The mser should be loaded with bits saying what the error is but I cannot
 find explanations in Ultrix for what they are - BSD has ka630.h (thanks
 Chris). The fact that the caer/daer are always the same makes me think
 there is a dma i/o problem, and that in turn points to a disk controller
 problem or the DEQNA, as the random times would seem to rule out the
 video and the dhv's, since the crashes don't correlate at all with their use.
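
A first step on the mser that needs no documentation is to see which bit
positions are set and how the two reported values (241 and 249) differ. The
Python sketch below assumes the kernel prints the register in hex. It
deliberately gives the bits no names: those have to come from ka630.h or the
KA630 manual, and the helper name set_bits is made up for this example.

```python
def set_bits(value):
    """Return the positions of the 1 bits in value, lowest bit first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

# mser values as printed in the machine check reports (hex is an assumption)
for text in ("241", "249"):
    value = int(text, 16)
    print(text, format(value, "012b"), set_bits(value))

# bit positions that are set in one value but not in the other
print(set_bits(0x241 ^ 0x249))
```

Once the set positions are known, each one can be looked up against the
register layout in the header file or manual.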

 I would appreciate:
 A; definitions of the terms in the error message (sumpar, etc.)

 B; hints on where in the documentation to find out more

 C; Any concrete interpretations of the data presented.

 D; Suggestions on how to approach a problem like this.

Aug 10 07:30

machine check 82: write bus error, VAP is virtual
	sumpar	= 82
	most recent virtual addr	=8005fdb0
	internal state	=12080000
	pc	= 8002f98a
	psl	= 4160008

	mcesr	= 0
	mser	= 241
	caer	= 3756
	daer	= 3756
panic: mchk
panic: sleep

Aug 10 10:50

machine check 80: read bus error, VAP is virtual
	sumpar	= 80
	most recent virtual addr	=8001bfd8
	internal state	=2000002
	pc	= 8001bfd1
	psl	= d60001

	mcesr	= 0
	mser	= 241
	caer	= 35a0
	daer	= 35a0
panic: mchk

Aug 12 02:10

machine check 80: read bus error, VAP is virtual
	sumpar	= 80
	most recent virtual addr	=8002c428
	internal state	=2080003
	pc	= 8002c41f
	psl	= 40c0008

	mcesr	= 0
	mser	= 241
	caer	= 36b7
	daer	= 36b7
panic: mchk
panic: sleep

Aug 12 11:00

machine check 80: read bus error, VAP is virtual
	sumpar	= 80
	most recent virtual addr	=7ffffe78
	internal state	=3080001
	pc	= 800018c1
	psl	= d80008

	mcesr	= 0
	mser	= 241
	caer	= 3587
	daer	= 3587
panic: mchk

machine check 80: read bus error, VAP is virtual
	sumpar	= 80
	most recent virtual addr	=7fffe480
	internal state	=2040006
	pc	= 8000993f
	psl	= 4160004

	mcesr	= 0
	mser	= 249
	caer	= 3587
	daer	= 3587
panic: mchk
-- 
     Hurf Sheldon			 Network: hurf@ionvax.tn.cornell.edu
     Lab of Plasma Studies		  Bitnet: hurf@CRNLION
     369 Upson Hall, Cornell University, Ithaca, N.Y. 14853  ph:607 255 7267
     I sold my Elan, got a job in science; Now, no one takes me seriously.

sullivan@marge.math.binghamton.edu (fred sullivan) (08/12/87)

In article <1988@batcomputer.tn.cornell.edu> hurf@tcgould.tn.cornell.edu (Hurf Sheldon) writes:
>
> I have been getting the appended messages preceding a crash on
> a uVaxII running Ultrix1.2 with the following hardware:
>
>machine check 82: write bus error, VAP is virtual

I don't have all the info you want, but I can tell you the problem.  Last
week our MVII crashed twice in 2 days with similar messages.  Our field
service engineer knows little or nothing about Ultrix, so he called his
support center, read the messages to them, and they not only told him there
was a bad memory board, but they told him which board.  He replaced the
board, and all is now well.

Fred Sullivan
Department of Mathematical Sciences
State University of New York at Binghamton
Binghamton, New York  13903
Email: sullivan@marge.math.binghamton.edu

karl@grebyn.UUCP (08/14/87)

In article <633@bingvaxu.cc.binghamton.edu>, sullivan@marge.math.binghamton.edu (fred sullivan) writes:
> In article <1988@batcomputer.tn.cornell.edu> hurf@tcgould.tn.cornell.edu (Hurf Sheldon) writes:
> >
> > I have been getting the appended messages preceding a crash on
> > a uVaxII running Ultrix1.2 with the following hardware:
> >
> >machine check 82: write bus error, VAP is virtual
> 

[Text of first reply recommending replacement of memory board deleted to
make postnews happy.]

Another way that I fixed this problem when I encountered it was to pull any
other boards (e.g., controllers, etc.) out of the memory slots.  Although
you are supposedly allowed to have other boards in slots 2 & 3 of the Qbus,
I found there were some glitches that caused this sort of behavior.  So, I
threw the controllers further down the bus, and haven't had this sort of
problem since.

-- Karl -- 
DDN:	nyberg@ada20.isi.edu
INET:   karl@grebyn.com - AKA - karl%grebyn.com@seismo.css.gov
uucp:   {decuac, seismo}!grebyn!karl

johnd@physiol.su.oz (John Dodson) (08/17/87)

In article <1988@batcomputer.tn.cornell.edu>, hurf@batcomputer.tn.cornell.edu (Hurf Sheldon) writes:
> 
>  I have been getting the appended messages preceding a crash on
>  a uVaxII running Ultrix1.2 with the following hardware:
.
.
.
>  The only consistent things I see are the mser (memory system error
>  register), the caer (cpu error address) and the daer (dma error address)

But they never have the same value from one crash to the next!
A repeating value would have indicated a bad board or location.
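
A quick way to check this sort of thing on a longer console log is to pull
the caer/daer pair out of every report and count repeats. The Python sketch
below is only an illustration: the LOG excerpt is trimmed from the reports in
the original post, the values are assumed to be hex, and parse_reports is a
made-up helper name.

```python
import re
from collections import Counter

# Trimmed excerpt of the five reports in the original post: only the
# report header and the caer/daer lines are kept.
LOG = """
machine check 82: write bus error, VAP is virtual
    caer    = 3756
    daer    = 3756
machine check 80: read bus error, VAP is virtual
    caer    = 35a0
    daer    = 35a0
machine check 80: read bus error, VAP is virtual
    caer    = 36b7
    daer    = 36b7
machine check 80: read bus error, VAP is virtual
    caer    = 3587
    daer    = 3587
machine check 80: read bus error, VAP is virtual
    caer    = 3587
    daer    = 3587
"""

def parse_reports(text):
    """Return one dict of register values (parsed as hex) per machine check."""
    reports = []
    for line in text.splitlines():
        if line.startswith("machine check"):
            reports.append({})
            continue
        m = re.match(r"\s*(\w+)\s*=\s*([0-9a-f]+)\s*$", line)
        if m and reports:
            reports[-1][m.group(1)] = int(m.group(2), 16)
    return reports

reports = parse_reports(LOG)

# Is caer equal to daer inside every single report?
pairs_match = all(r["caer"] == r["daer"] for r in reports)

# Which caer values turn up in more than one report?
counts = Counter(r["caer"] for r in reports)
repeated = [format(v, "x") for v, n in counts.items() if n > 1]

print(pairs_match, repeated)
```

On the posted reports, caer equals daer every time, and the only repeat is
3587, in the two back-to-back reports under Aug 12 11:00; the other three
crashes each give a different value.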

>  The mser should be loaded with bits saying what the error is but I cannot
>  find explanations in Ultrix for what they are - BSD has ka630.h (thanks
>  Chris). The fact that the caer/daer are always the same makes me think
>  there is a dma i/o problem, and that in turn points to a disk controller
>  problem or the DEQNA, as the random times would seem to rule out the
>  video and the dhv's, since the crashes don't correlate at all with their use.
> 
>  I would appreciate:
>  A; definitions of the terms in the error message (sumpar, etc.)

Read the Architecture section of the KA630-AA CPU Module User's Guide
(DEC ref. EK-KA630-UG-001). (I don't know what "sumpar" is, though!)

>  B; hints on where in the documentation to find out more

as above

>  C; Any concrete interpretations of the data presented.

When there are random memory errors I immediately suspect LONG Private
Memory Interconnect cables.
By long I mean anything longer than will only just fit between the
boards (3cm between connectors seems to be the maximum length).
This problem is particularly prevalent with OEM memories and early
versions of the KA630 (the ones with NEC memory chips),
at least in my experience!

>  D; Suggestions on how to approach a problem like this 

as above


WHILE I'M HERE...
Is anyone aware of a problem with early KA630s where the TOY clock,
after a power fail, leaves the VRT bit set but clears the clock memory?
It means 4.3 comes up with a date near the epoch (Jan 1970)
and fsck rebuilds all the "SUMMARY" information. (We have added a check
in the ka630.c file to detect a zeroed clock and use the file system time
instead, but it would be nice to fix it properly!)
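
The shape of that workaround, sketched in Python rather than kernel C: trust
the clock only if the VRT bit is set and the time fields are not all zero,
otherwise fall back to the file system time. This is not the actual ka630.c
change; the function names and the dictionary of fields are hypothetical.

```python
def clock_is_zeroed(fields):
    """True when every time-of-year field reads back as zero."""
    return all(v == 0 for v in fields.values())

def choose_boot_time(vrt_set, fields, fs_time):
    """Return ("toy", fields) if the clock looks trustworthy, else ("fs", fs_time).

    vrt_set -- the clock's valid bit as read at boot
    fields  -- raw clock fields, e.g. {"year": 87, "month": 8, "day": 17, ...}
    fs_time -- time stamp recovered from the root file system
    """
    if not vrt_set or clock_is_zeroed(fields):
        return ("fs", fs_time)
    return ("toy", fields)

# The failure described above: valid bit still set, clock memory cleared.
zeroed = {"year": 0, "month": 0, "day": 0, "hour": 0, "min": 0, "sec": 0}
print(choose_boot_time(True, zeroed, 556000000))
```

An all-zero test is enough to spot the cleared clock because a real calendar
date never has both day and month equal to zero.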

DEC Australia are currently charging $10,000 (Australian) for a KA630
board swap, so getting it fixed that way is out of the question!

John Dodson

ACSnet:				johnd@physiol.su.oz.au

most other places ! :		seismo!munnari!physiol.su.oz!johnd

sullivan@marge.math.binghamton.edu (fred sullivan) (08/17/87)

In article <4682@grebyn.COM> karl@grebyn.COM (Karl A. Nyberg) writes:
>In article <633@bingvaxu.cc.binghamton.edu>, sullivan@marge.math.binghamton.edu (fred sullivan) writes:
>> In article <1988@batcomputer.tn.cornell.edu> hurf@tcgould.tn.cornell.edu (Hurf Sheldon) writes:
>> >
>> > I have been getting the appended messages preceding a crash on
>> > a uVaxII running Ultrix1.2 with the following hardware:
>> >
>> >machine check 82: write bus error, VAP is virtual
>> 
>Another way that I fixed this problem when I encountered it was to pull any
>other boards (e.g., controllers, etc.) out of the memory slots.  Although
>you are supposedly allowed to have other boards in slots 2 & 3 of the Qbus,
>I found there were some glitches that caused this sort of behavior.  So, I
>threw the controllers further down the bus, and haven't had this sort of
>problem since.
>
I think it's appropriate for me to give an update here.  My first reply was
to the effect that replacing a memory board fixed the problem.  Literally
ten minutes after I posted that reply, our machine crashed again (with
similar machine checks).  This time DEC replaced the CPU board and both
memory boards.  It stayed up long enough for Field Service to leave and then
crashed twice in 10 minutes.  After a few more crashes, they came back and
substituted a single 8 MEG board for the two 4 MEG boards.  It stayed up over
the weekend, so I'm beginning to believe that slot 2 is an unusable slot.
On the other hand, we have a second (identical) machine, which has had no 
problems whatsoever.

Fred Sullivan
Department of Mathematical Sciences
State University of New York at Binghamton
Binghamton, New York  13903
Email: sullivan@marge.math.binghamton.edu