hurf@batcomputer.tn.cornell.edu (Hurf Sheldon) (08/12/87)
I have been getting the appended messages preceding a crash on
a uVaxII running Ultrix1.2 with the following hardware:

    cpu
    8meg Mnemonix board
    1meg DEC board
    deqna
    dzv-11
    dhv-11
    dhv-11
    dhv-11
    sdc-rqd11-ec SI/sigma ESDI disc controller
    qdss
    tqk50
    rqdx3

There is swap on both disk controllers.
This problem does not correlate with the installation of any
hardware or software.

The only consistent things I see are the mser (memory system error
register), the caer (cpu error address) and the daer (dma error address).
The mser should be loaded with bits saying what the error is but I cannot
find explanations in Ultrix for what they are - BSD has ka630.h (thanks
Chris). The fact that the caer/daer are always the same makes me think
there is a dma i/o problem and that in turn points to a disk controller
problem or the DEQNA, as the random times would seem to rule out the
video and the dhv's as the crashes don't correlate at all with their use.

I would appreciate:
    A; definitions of the terms in the error message - i.e. sumpar, etc.
    B; hints on where in the documentation to find out more
    C; Any concrete interpretations of the data presented.
    D; Suggestions on how to approach a problem like this

Aug 10 07:30
machine check 82: write bus error, VAP is virtual
    sumpar = 82
    most recent virtual addr =8005fdb0
    internal state =12080000
    pc = 8002f98a
    psl = 4160008
    mcesr = 0
    mser = 241
    caer = 3756
    daer = 3756
panic: mchk
panic: sleep

Aug 10 10:50
machine check 80: read bus error, VAP is virtual
    sumpar = 80
    most recent virtual addr =8001bfd8
    internal state =2000002
    pc = 8001bfd1
    psl = d60001
    mcesr = 0
    mser = 241
    caer = 35a0
    daer = 35a0
panic: mchk

Aug 12 02:10
machine check 80: read bus error, VAP is virtual
    sumpar = 80
    most recent virtual addr =8002c428
    internal state =2080003
    pc = 8002c41f
    psl = 40c0008
    mcesr = 0
    mser = 241
    caer = 36b7
    daer = 36b7
panic: mchk
panic: sleep

Aug 12 11:00
machine check 80: read bus error, VAP is virtual
    sumpar = 80
    most recent virtual addr =7ffffe78
    internal state =3080001
    pc = 800018c1
    psl = d80008
    mcesr = 0
    mser = 241
    caer = 3587
    daer = 3587
panic: mchk

machine check 80: read bus error, VAP is virtual
    sumpar = 80
    most recent virtual addr =7fffe480
    internal state =2040006
    pc = 8000993f
    psl = 4160004
    mcesr = 0
    mser = 249
    caer = 3587
    daer = 3587
panic: mchk

-- 
Hurf Sheldon                    Network: hurf@ionvax.tn.cornell.edu
Lab of Plasma Studies           Bitnet:  hurf@CRNLION
369 Upson Hall, Cornell University, Ithaca, N.Y. 14853
ph:607 255 7267
I sold my Elan, got a job in science; Now, no one takes me seriously.
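The register values in the five machine-check records above can be tabulated mechanically. What follows is only a rough sketch, not anything from Ultrix or DEC: it copies the mser, caer and daer values from the post (abridging each record to those three fields), lists which bit positions are set in each mser, and checks whether caer and daer agree within a record and across records. The names `LOG`, `RECORD`, `set_bits` and `parse` are made up for illustration. The meaning of each mser bit is not given in this thread and would have to come from ka630.h or the KA630 documentation.

```python
import re

# Values copied from the five records in the post, abridged to the
# three registers of interest. All numbers are hexadecimal.
LOG = """
machine check 82: mser = 241 caer = 3756 daer = 3756
machine check 80: mser = 241 caer = 35a0 daer = 35a0
machine check 80: mser = 241 caer = 36b7 daer = 36b7
machine check 80: mser = 241 caer = 3587 daer = 3587
machine check 80: mser = 249 caer = 3587 daer = 3587
"""

# Hypothetical pattern for illustration; it also matches an unabridged
# record as long as the whole record sits on one line.
RECORD = re.compile(
    r"machine check ([0-9a-f]+):.*?"
    r"mser = ([0-9a-f]+) caer = ([0-9a-f]+) daer = ([0-9a-f]+)"
)


def set_bits(value):
    """Return the positions of the 1 bits in value, lowest first."""
    return [n for n in range(value.bit_length()) if (value >> n) & 1]


def parse(log):
    """Turn the log text into a list of dicts of integer register values."""
    records = []
    for code, mser, caer, daer in RECORD.findall(log):
        records.append({
            "code": int(code, 16),
            "mser": int(mser, 16),
            "caer": int(caer, 16),
            "daer": int(daer, 16),
        })
    return records


records = parse(LOG)
for r in records:
    same = "yes" if r["caer"] == r["daer"] else "no"
    print(f"mchk {r['code']:x}  mser={r['mser']:#x} bits={set_bits(r['mser'])}  "
          f"caer={r['caer']:#x}  caer==daer: {same}")

distinct = sorted({r["caer"] for r in records})
print("distinct caer values:", [f"{v:#x}" for v in distinct])
```

Run on these values, mser 241 has bits 0, 6 and 9 set, and mser 249 has bit 3 set as well. caer equals daer in every record, but the address itself takes four different values over the five crashes.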
sullivan@marge.math.binghamton.edu (fred sullivan) (08/12/87)
In article <1988@batcomputer.tn.cornell.edu> hurf@tcgould.tn.cornell.edu (Hurf Sheldon) writes:
>
> I have been getting the appended messages preceding a crash on
> a uVaxII running Ultrix1.2 with the following hardware:
>
>machine check 82: write bus error, VAP is virtual

I don't have all the info you want, but I can tell you the problem.
Last week our MVII crashed twice in 2 days with similar messages. Our
field service engineer knows little or nothing about Ultrix, so he called
his support center, read the messages to them, and they not only told him
there was a bad memory board, but they told him which board. He replaced
the board, and all is now well.

Fred Sullivan
Department of Mathematical Sciences
State University of New York at Binghamton
Binghamton, New York 13903
Email: sullivan@marge.math.binghamton.edu
karl@grebyn.UUCP (08/14/87)
In article <633@bingvaxu.cc.binghamton.edu>, sullivan@marge.math.binghamton.edu (fred sullivan) writes:
> In article <1988@batcomputer.tn.cornell.edu> hurf@tcgould.tn.cornell.edu (Hurf Sheldon) writes:
> >
> > I have been getting the appended messages preceding a crash on
> > a uVaxII running Ultrix1.2 with the following hardware:
> >
> >machine check 82: write bus error, VAP is virtual
>

[Text of first reply recommending replacement of memory board deleted to
make postnews happy.]

Another way that I fixed this problem when I encountered it was to pull any
other boards (e.g., controllers, etc.) out of the memory slots. Although
you are supposedly allowed to have other boards in slots 2 & 3 of the Qbus,
I found there were some glitches that caused this sort of behavior. So, I
threw the controllers further down the bus, and haven't had this sort of
problem since.

-- Karl --

DDN:  nyberg@ada20.isi.edu
INET: karl@grebyn.com - AKA - karl%grebyn.com@seismo.css.gov
uucp: {decuac, seismo}!grebyn!karl
johnd@physiol.su.oz (John Dodson) (08/17/87)
In article <1988@batcomputer.tn.cornell.edu>, hurf@batcomputer.tn.cornell.edu (Hurf Sheldon) writes:
>
> I have been getting the appended messages preceding a crash on
> a uVaxII running Ultrix1.2 with the following hardware:
. . .
> The only consistent things I see are the mser (memory system error
> register), the caer (cpu error address) and the daer (dma error address)

but never with the same value are they ! ...
'cos that would have indicated a bad board or location

> The mser should be loaded with bits saying what the error is but I cannot
> find explanations in Ultrix for what they are - BSD has ka630.h (thanks
> Chris). The fact that the caer/daer are always the same makes me think
> there is a dma i/o problem and that in turn points to a disk controller
> problem or the DEQNA, as the random times would seem to rule out the
> video and the dhv's as the crashes don't correlate at all with their use.
>
> I would appreciate:
> A; definitions of the terms in the error message - i.e. sumpar, etc.

read the KA630-AA CPU Module Users Guide (DEC ref. EK-KA630-UG-001),
Architecture section. (don't know what "sumpar" is tho' !)

> B; hints on where in the documentation to find out more

as above

> C; Any concrete interpretations of the data presented.

when there are random memory errors I immediately suspect LONG
Private Memory Interconnect cables...
when I say long I mean they should be so short they will only just fit
between the boards. (3cm between connectors seems the max length)
This problem is particularly prevalent with OEM memories and
early (NEC memory chips) versions of the KA630...
at least that is "in my experience" !

> D; Suggestions on how to approach a problem like this

as above

WHILE I'M HERE...
is anyone aware of a problem with early KA630's where the TOY clock,
after a power fail, leaves the VRT bit set but clears the clock memory ?
it means 4.3 comes up with a date near the epoch ! (Jan 1970)
and fsck rebuilds all the "SUMMARY" information.
(we have added a check in the ka630.c file to check for a zero'ed clock &
use file system time but it would be nice to fix it properly !)

DEC Australia are currently charging $10,000 (Australian) for a KA630
board swap ! so getting it fixed that way is out of the question !

John Dodson

ACSnet: johnd@physiol.su.oz.au
most other places ! : seismo!munnari!physiol.su.oz!johnd
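The workaround described in the post above (detect a zeroed clock and fall back to the file system time) can be sketched as follows. This is an illustration of the decision logic only, written in Python rather than kernel C, and it is not the actual change made to ka630.c. The function name `choose_boot_time` and its parameters are hypothetical, as is the assumption that the clock contents have already been decoded to seconds since the Unix epoch, with a cleared clock decoding to 0.

```python
def choose_boot_time(vrt_set, toy_seconds, fs_seconds):
    """Pick the time to boot with, in seconds since the Unix epoch.

    vrt_set     -- True if the clock chip reports its contents as valid
    toy_seconds -- time decoded from the TOY clock (assumed 0 if cleared)
    fs_seconds  -- last-modified time recorded in the root file system
    """
    if not vrt_set:
        # The clock itself says its contents are not trustworthy.
        return fs_seconds
    if toy_seconds == 0:
        # The failure described above: VRT is still set, but the clock
        # memory was cleared, so the reading would land near Jan 1970.
        return fs_seconds
    return toy_seconds


# A cleared clock with VRT set falls back to the file system time.
print(choose_boot_time(True, 0, 555_000_000))
```

The file system time is only a lower bound on the real time, so the date would still need to be set by hand after such a boot.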
sullivan@marge.math.binghamton.edu (fred sullivan) (08/17/87)
In article <4682@grebyn.COM> karl@grebyn.COM (Karl A. Nyberg) writes:
>In article <633@bingvaxu.cc.binghamton.edu>, sullivan@marge.math.binghamton.edu (fred sullivan) writes:
>> In article <1988@batcomputer.tn.cornell.edu> hurf@tcgould.tn.cornell.edu (Hurf Sheldon) writes:
>> >
>> > I have been getting the appended messages preceding a crash on
>> > a uVaxII running Ultrix1.2 with the following hardware:
>> >
>> >machine check 82: write bus error, VAP is virtual
>>
>Another way that I fixed this problem when I encountered it was to pull any
>other boards (e.g., controllers, etc.) out of the memory slots. Although
>you are supposedly allowed to have other boards in slots 2 & 3 of the Qbus,
>I found there were some glitches that caused this sort of behavior. So, I
>threw the controllers further down the bus, and haven't had this sort of
>problem since.
>

I think it's appropriate for me to give an update here. My first reply
was to the effect that replacing a memory board fixed the problem. Literally
ten minutes after I posted that reply, our machine crashed again (similar
machine checks). This time DEC replaced the CPU board and both memory
boards. It stayed up long enough for Field Service to leave and then crashed
twice in 10 minutes. After a few more crashes, they came back and substituted
a single 8 MEG board for the two 4 MEG boards. It stayed up over the weekend,
so I'm beginning to believe that slot 2 is an unusable slot. On the other
hand, we have a second (identical) machine, which has had no problems
whatsoever.

Fred Sullivan
Department of Mathematical Sciences
State University of New York at Binghamton
Binghamton, New York 13903
Email: sullivan@marge.math.binghamton.edu