davem@sdcrdcf.UUCP (David Melman) (12/30/86)
Our Vax 750 running 4.2BSD has occasionally been crashing with:

----------------------------------
machine check 2: cp tbuf par fault
	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
	busserr 6 mcesr 9 pc 8000394e ps1 40c0008 mcsr 80016
panic: mchk
panic: sleep
----------------------------------

Where do I start to find the problem?

Thanks,
David Melman
UNISYS
UUCP: {hplabs, ihnp4, cbatt}!sdcrdcf!davem
chris@mimsy.UUCP (Chris Torek) (12/31/86)
In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>machine check 2: cp tbuf par fault [lots of registers]
>panic: mchk
>panic: sleep

There are two interrelated fixes for this.  Both are already in 4.3BSD.

The first is that some tbuf parity errors can be corrected by flushing the translation buffer.  As I recall, 4.2 has code to do this, but has the wrong test to determine whether it will suffice, masking with 0xf somewhere where it should be masking with 0xe.

The second is a `jelloware' (writable control store) fix for a timing problem in one CPU module.  The 4.3 boot program knows to load the file `pcs750.bin' into the 750 patch store.  The code to do this is not terribly large, and is all contained in /sys/stand/boot.c at your nearest 4.3 site, which also has /pcs750.bin.

Incidentally, the `panic: sleep' is due to a bug in sleep that affects things only after a previous panic.  I fixed this in our 4.2 kernels back when Jim O'Toole and I were writing a kernel XNS.  I was rather amused to find the very same fix in the 4.3-alpha kernel.  It helps considerably when you crash your machine several times a day!

Also incidentally, the 4.3 boot program has no way to avoid loading the /pcs750.bin file, something I consider a bug (now that I have been bitten by it).  We recently had a 750 go down for two weeks.  The long downtime was caused by three virtually simultaneous failures.  First, one of two CDC9771 HDAs died suddenly.  Second, our standby disk system (two RK07s) had some sort of controller backplane problem (considering how rarely we use the RK07s, it may have developed long ago).  Third, and only discovered last Friday, our WCS board went out at the same time as the HDA.  As long as I did not load the microcode update, the machine would boot.  With the microcode in place, the machine would hang completely: not even control-P did anything.

While this hardware failure might be quite rare, it forced me to consider what would happen if part of /pcs750.bin were overwritten.  I added another boot flag to prevent the microcode update.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris
ARPA/CSNet:	chris@mimsy.umd.edu
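
To see why the choice between 0xf and 0xe can matter, here is a minimal C sketch.  The post only recalls that the wrong mask is used "somewhere"; the equality test, the constant TBUF_PARITY and its value 0x4, and all function names below are assumptions made up for illustration, not code from the 4.2 or 4.3 kernel.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical code meaning "tbuf parity error"; an assumption, not
 * a value taken from the kernel source. */
#define TBUF_PARITY 0x4u

/* Returns 1 if the masked status equals the assumed tbuf-parity code. */
int matches_tbuf_parity(unsigned status, unsigned mask)
{
    return (status & mask) == TBUF_PARITY;
}

/* Print how the two masks classify one status value. */
void compare_masks(unsigned status)
{
    printf("status %#x: mask 0xf -> %d, mask 0xe -> %d\n",
           status,
           matches_tbuf_parity(status, 0xfu),
           matches_tbuf_parity(status, 0xeu));
}

/* Worked cases: bit 0 defeats the 0xf test but is ignored by 0xe. */
void self_check(void)
{
    assert(matches_tbuf_parity(0x4u, 0xfu) == 1);
    assert(matches_tbuf_parity(0x5u, 0xfu) == 0); /* bit 0 also set */
    assert(matches_tbuf_parity(0x5u, 0xeu) == 1); /* bit 0 masked off */
    assert(matches_tbuf_parity(0x9u, 0xeu) == 0); /* not the 0x4 code */
}
```

Under these assumptions, a status with an extra low bit set is rejected by the 0xf mask and accepted by the 0xe mask, which is the kind of difference the post describes.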
mangler@cit-vax.Caltech.Edu (System Mangler) (01/04/87)
In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
> Our Vax 750 running 4.2BSD has occasionally been crashing with:
> machine check 2: cp tbuf par fault
>	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
>	busserr 6 mcesr 9 pc 8000394e ps1 40c0008 mcsr 80016

In article <4891@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> There are two interrelated fixes for this.  Both are already in
> 4.3BSD.  The first is that some tbuf parity errors can be corrected [...]

Read the registers.  This is a cache parity error, not a tbuf parity error.  Never mind that 4.[23] doesn't distinguish between the two.

We get these all the time.  There are two ways to "fix" it: swap L0003 boards until you get a good one ($$$), or change the machine check handler to flush the cache and return.  Now, can anyone tell me how to flush the cache?

Don Speck   speck@vlsi.caltech.edu  {seismo,rutgers,ames}!cit-vax!speck
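
Reading the dump by hand comes down to seeing which bits are set in each status register.  A minimal C sketch of that step follows.  The function names are made up for illustration, the values are copied from the posted dump (they are hexadecimal), and no meaning is assigned to individual bits because the thread does not spell out the register layouts.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Store the positions of the set bits of v in out[]; return the count. */
int set_bits(uint32_t v, int out[32])
{
    int n = 0;

    assert(out != NULL);
    for (int bit = 0; bit < 32; bit++)
        if (v & (UINT32_C(1) << bit))
            out[n++] = bit;
    return n;
}

/* Print one register name, its value, and the bits that are set. */
void print_set_bits(const char *name, uint32_t v)
{
    int bits[32];
    int n = set_bits(v, bits);

    printf("%-8s %#lx:", name, (unsigned long)v);
    if (n == 0)
        printf(" no bits set");
    for (int i = 0; i < n; i++)
        printf(" bit %d", bits[i]);
    printf("\n");
}

/* Status registers from the dump in the original post. */
void print_dump(void)
{
    print_set_bits("tbgpar", 0x0);
    print_set_bits("cacherr", 0x5);
    print_set_bits("busserr", 0x6);
    print_set_bits("mcesr", 0x9);
}
```

Calling print_dump() shows tbgpar with nothing set and cacherr with bits set, which is the observation behind "read the registers".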
chris@mimsy.UUCP (Chris Torek) (01/04/87)
>In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
>>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>>machine check 2: cp tbuf par fault
>>	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
>>	busserr 6 mcesr 9 pc 8000394e ps1 40c0008 mcsr 80016

>In article <4891@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>>There are two interrelated fixes for this.  Both are already in
>>4.3BSD.  The first is that some tbuf parity errors can be corrected [...]

In article <1419@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>Read the registers.  This is a cache parity error, not a tbuf parity
>error.  Never mind that 4.[23] doesn't distinguish between the two.

Sure enough.  I never bothered to read the bits, knowing that `this occurs all the time and is always a tbuf error'.

>We get these all the time.  There are two ways to "fix" it: swap
>L0003 boards until you get a good one ($$$), or change the machine
>check handler to flush the cache and return.  Now, can anyone tell
>me how to flush the cache?

Maybe the microcode fix helps this too?  I have never seen a cache error here (but tb errors were extremely rare too: probably a consequence of our ordering our 750s with Ultrix 1.0 way back when.)

Anyway, you could try disabling the cache:

	mtpr(CADR, 1);		/* CADR is register 0x25 */

but that will probably slow the machine to a crawl.  Disabling and reenabling the cache might well flush it, though.  If

	mtpr(CADR, 1);
	mtpr(CADR, 0);

does not clear the problem, perhaps reenabling it after a long delay will:

	mtpr(CADR, 1);
	timeout(cacheenable, (caddr_t) 0, 10*hz);
	...

	cacheenable()
	{

		mtpr(CADR, 0);
	}

But according to the registers I can read above (DEC's latest VAX Hardware Handbook does NOT include machine check frames---why?), returning may not help too much in this case, because the machine check error summary register (mcesr) has the 0x8 bit set, bus error.  Returning to the failed instruction may well not retry the failed read.  Since it occurred in kernel mode, that might bring the machine down anyway.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris
ARPA/CSNet:	chris@mimsy.umd.edu
dave@onfcanim.UUCP (Dave Martindale) (01/05/87)
In article <1419@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>We get these all the time.  There are two ways to "fix" it: swap
>L0003 boards until you get a good one ($$$), or change the machine
>check handler to flush the cache and return.  Now, can anyone tell
>me how to flush the cache?

At Waterloo, someone added code to set the "force cache miss" bits, then access the address that got the parity fault; the idea being that this might cause the bad cache entry to be cleared.

Without any way to generate cache errors on demand it's hard to check whether the code really works as designed.  However, it's been running on perhaps a dozen machines for a year or two, so it is at least benign.

The code looks like this; it's for a 780 so the magic bits may be somewhere else on a 750:

	/* Force Cache Miss and Replace */
	mtpr(SBIMT, mfpr(SBIMT) | 0x1e000);

	i = *(int *)mcf->mc8_vaviba;	/* Access address */

	/* Return to normal */
	mtpr(SBIMT, mfpr(SBIMT) & ~0x1e000);
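
One detail of the set-then-clear pattern above: the final write clears the force-miss bits unconditionally, so the register returns to its earlier value only if those bits were clear to begin with.  A save-and-restore variant avoids depending on that.  The sketch below runs in user space against a mock register array; fake_ipr, the array size, the use of 0x33 as an array index, and touch_with_forced_miss() are all assumptions for illustration, and only the 0x1e000 mask comes from the post.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS           64
#define SBIMT           0x33      /* index into the mock array only */
#define FORCE_MISS_BITS 0x1e000u  /* mask quoted in the post (780) */

static uint32_t fake_ipr[NREGS];  /* stand-in for processor registers */

/* Mock "move from processor register". */
uint32_t mfpr(int reg)
{
    assert(reg >= 0 && reg < NREGS);
    return fake_ipr[reg];
}

/* Mock "move to processor register". */
void mtpr(int reg, uint32_t val)
{
    assert(reg >= 0 && reg < NREGS);
    fake_ipr[reg] = val;
}

/* Read *addr with the force-miss bits set, then put the register
 * back exactly as it was found. */
uint32_t touch_with_forced_miss(volatile const uint32_t *addr)
{
    uint32_t saved = mfpr(SBIMT);
    uint32_t value;

    mtpr(SBIMT, saved | FORCE_MISS_BITS);
    value = *addr;                /* the access that should miss */
    mtpr(SBIMT, saved);           /* restore, do not just clear */
    return value;
}
```

Whether any of those bits are ever set in normal operation is not stated in the thread; if they never are, the two versions behave identically.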
richards@uiucdcsb.UUCP (01/06/87)
Chris Torek asks:
> (DEC's latest VAX Hardware Handbook does NOT include machine check
> frames---why?)

Why?  I don't know.  But I found them in the new hardbound "VAX Architecture Reference Manual" (from Digital Press) in Appendix B, Implementation Dependencies.

Paul Richards
University of Illinois at Urbana-Champaign, Dept of Comp Sci
UUCP: {pur-ee,convex,inhp4}!uiucdcs!richards
ARPA: richards@b.cs.uiuc.edu
tropp@cthct.UUCP (Ulf Tropp) (01/08/87)
In article <4914@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>>In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
>>>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>>>machine check 2: cp tbuf par fault
>>>	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
>>>	busserr 6 mcesr 9 pc 8000394e ps1 40c0008 mcsr 80016
>
>Anyway, you could try disabling the cache:
>
>	mtpr(CADR, 1);		/* CADR is register 0x25 */
>
>but that will probably slow the machine to a crawl.  Disabling
>and reenabling the cache might well flush it, though.  If
>
>	mtpr(CADR, 1);
>	mtpr(CADR, 0);
>
>does not clear the problem, perhaps reenabling it after a long
>delay will.

We had a lousy cache once that would cause a mchk approximately once an hour.  Since DEC couldn't supply a new board in a week, I had plenty of time to test recovery code.  What I did was essentially:

	mtpr(CADR,1);			/* disable the cache */
	if(mcf->mc5_cacherr&0xe){
		mtpr(CAER,0xf);
		/* fetch offending byte w/o cache */
		if(mcf->mc5_va&0x80000000)
			i = *((char *)mcf->mc5_va);
		else
			i = fubyte(mcf->mc5_va);
		if(mfpr(CAER)&0xe){
			return;		/* run without cache */
		}
		printf("Cache reenabled\n");
		mtpr(CADR,0);		/* reenable the cache */
	}
	return;

Probably not entirely correct, but it did seem to work: the system would mostly return in an orderly way to the aborted instruction, sometimes going directly into a new mchk a couple of times.

Anyway, does anybody know which instructions can be restarted?  Shouldn't any instruction that can generate a page fault be restartable?

BTW, a comment in the 4.2 tbuf recovery code says "Should we use pc or errpc.." (when looking at the instruction to return to).  Clearly it must be pc, since that is what we are returning to, so I changed the 4.2 code.

In-Real-Life:	Ulf Tropp
		Systems Administrator
		Dept. of Computer Engineering
		Chalmers Univ. of Technology
		S-412 96 Gothenburg
		Sweden
UUCP:		..mcvax!enea!chalmers!cthct!tropp
ARPA:		tropp%cthct.uucp@seismo.CSS.GOV (?)
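
The test against 0x80000000 in the recovery code picks between a direct dereference and fubyte().  On the VAX, addresses with bit 31 set lie in system space, so the kernel can read them directly, while a lower address belongs to a process and goes through the user-space fetch routine.  A small C sketch of that classification, with a made-up function name, checked against the va from the original dump:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if va has bit 31 set, i.e. lies in VAX system space. */
int is_system_va(uint32_t va)
{
    return (va & UINT32_C(0x80000000)) != 0;
}

/* Worked cases; 0x80039728 is the va reported in the posted dump. */
void self_check(void)
{
    assert(is_system_va(UINT32_C(0x80039728)) == 1);
    assert(is_system_va(UINT32_C(0x7ffffffc)) == 0);
    assert(is_system_va(UINT32_C(0x00000200)) == 0);
}
```

For the crash in the original post the faulting address is in system space, so the direct-dereference branch is the one that would be taken.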
mouse@mcgill-vision.UUCP (01/11/87)
In article <4914@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> Anyway, you could try disabling the cache:
>	mtpr(CADR, 1);		/* CADR is register 0x25 */
> but that will probably slow the machine to a crawl.

In case anyone is curious....I just tried this.  To check how slow the machine was, I used a user-level program that did

	while (! signaled) {
		count ++;
	}

and asked for a SIGALRM in ten seconds.  The speed measure is then how far count gets.  With the cache turned off it counted only half as far (plus or minus maybe 10%) as with the cache on.  This was a 750 running MtXinu 4.3+NFS.

					der Mouse

USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!mcgill-vision!mouse
     think!mosart!mcgill-vision!mouse
Europe: mcvax!decvax!utcsri!mcgill-vision!mouse
ARPAnet: think!mosart!mcgill-vision!mouse@harvard.harvard.edu
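
The post gives only the inner loop of the timing program.  A complete version might look like the following C sketch; the function names, the volatile counter, and the use of signal() and alarm() are my assumptions about how the rest was written, not the original program.

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/* Set by the SIGALRM handler to stop the counting loop. */
static volatile sig_atomic_t signaled;

static void on_alarm(int signo)
{
    (void)signo;
    signaled = 1;
}

/* Count loop iterations until SIGALRM arrives `seconds` from now. */
unsigned long count_for(unsigned seconds)
{
    /* volatile keeps the increment as a real memory access */
    volatile unsigned long count = 0;

    assert(seconds > 0);
    signaled = 0;
    signal(SIGALRM, on_alarm);
    alarm(seconds);
    while (!signaled)
        count++;
    return count;
}
```

Calling count_for(10) once with the cache on and once with it off, then comparing the two results, reproduces the measurement described above.  The absolute numbers mean nothing on their own; only the ratio does.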