jeb@zeta.UUCP (John Berry) (07/23/85)
We are running VAX 11/750's with UNIX 4.2 BSD. We have just had DEC install REV 7 of the L0003 board, which we hoped would clear up the mchk 2 --- tbuf error problems. Well, it has not. Can anyone out there in network land give me any insight into what is happening? DEC cannot find any problems when they run diagnostics. I am fairly new to reading the net, so if this has been discussed in the past, my apologies.
spaf@gatech.CSNET (Gene Spafford) (07/25/85)
In article <83@zeta.UUCP> jeb@zeta.UUCP (John Berry) writes:
>
>We are running VAX 11/750's with UNIX 4.2 BSD. We have just had DEC
>install REV 7 of the L0003 board, which we hoped would clear up the
>mchk 2 --- tbuf error problems. Well it has not. Can anyone out there
>in network land give me any insight to what is happening. DEC cannot
>find any problems when they run diagnostics.

This is an old and frustrating problem. I've had it show up on at least 4 750's I've worked with. The problem is, indeed, with the L0003 board. Let me tell you how it has been explained to me (if anyone has a more detailed explanation, please let us know).

DEC obtains chips for the L0003 board from a couple of different sources. I'm not sure if they subcontract the board out to another firm or not, but they end up with two different versions of the board which are identical in stated specs and (almost) identical in appearance. As far as acceptance goes, both versions of the board behave identically under VMS and all the regular field service diagnostics. HOWEVER, under Unix, due to the way certain things are done and timed, one version of the board will repeatedly generate tbuf parity faults that cannot be recovered from.

The fix is to replace the board with a copy of the other version. Once we did that, our 750's in the lab, which had crashed an average of 10 times a day, encountered only one tbuf fault in 6 months. Getting a good board may require many swaps and trials, because I have heard someone claim that you can't identify one of the bad boards except by unsoldering chips and looking at the lot numbers on the underside. I don't know the specific chips or how to identify which version of the board you have.

Supposedly, this problem is well known in the Ultrix support group and some field service offices (along with the RA81 read/write board glitch and the Rev4/RL02 problem, and others) as one of the strange problems that only shows up when using Unix.
Have your field service people contact the Ultrix support group. It is possible that the Ultrix group may even know of a supply of working L0003 boards for exactly this situation.

Best of luck!
--
Gene "4 months and counting" Spafford
The Clouds Project, School of ICS, Georgia Tech, Atlanta GA 30332
CSNet: Spaf @ GATech ARPA: Spaf%GATech.CSNet @ CSNet-Relay.ARPA
uucp: ...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!spaf
dcmartin@sun.uucp (David C. Martin) (07/27/85)
In article <654@gatech.CSNET> spaf@gatech.UUCP (Gene Spafford) writes:
>In article <83@zeta.UUCP> jeb@zeta.UUCP (John Berry) writes:
>>
>>We are running VAX 11/750's with UNIX 4.2 BSD. We have just had DEC
>>install REV 7 of the L0003 board, which we hoped would clear up the
>>mchk 2 --- tbuf error problems. Well it has not. Can anyone out there
>>in network land give me any insight to what is happening. DEC cannot
>>find any problems when they run diagnostics.
>
>This is an old and frustrating problem. I've had it show up on at least
>4 750's I've worked with. The problem is, indeed, with the L0003
>board. Let me tell you how it has been explained to me (if anyone has
>a more detailed explanation, please let us know).

Okay, I will. I already mailed John, but perhaps this could be rehashed one more time. The problem does lie in the L0003 board, but the solution is easy. VMS has microcode to alleviate these parity problems, and using the /boot program which reads microcode off the disk, the problem can be easily solved. Mike Karels wrote up a patch and we have been running it at UC Berkeley for quite some time with favorable results. If there is sufficient need, I will dig this up for those of you who need it. The microcode loading program was previously posted to the NET, so check your archives for that.
--
David C. Martin - Sun Microsystems / UC Berkeley
uucp: ..!ucbvax!sun!dcmartin    usps: 2280 California St #8
arpa: dcmartin@Berkeley               Mountain View, CA 94040
at&t: 415/960-7458 (O) - 415/967-0506 (H)
rees@apollo.uucp (Jim Rees) (07/30/85)
Actually, the bsd4.2 tape we got still didn't have the right stuff for tbuf parity errors. In machdep.c, the line

	if ((mcf->mc5_mcesr&0xf) == MC750_TBPAR) {

should be

	if ((mcf->mc5_mcesr&0xe) == MC750_TBPAR) {

The low order bit indicates whether the error occurred in the execution buffer or not. Since we don't care where the error occurred, we just mask out that bit. See the section on "Machine Check Error Summary Register" in the Vax Hardware Handbook.

Some other early fixes caught the tbuf par error fine, but then failed to flush the buffer before returning, resulting in an endless loop.

The 4.1 code was even worse. As I recall, hard errors were mistakenly reported as tbuf parity errors, and soft errors were ignored, but I could have that the wrong way around. As far as I know all versions of 4.2 fix this problem.

I have all this on good authority, but if I'm wrong about the TB flush also flushing the XB, please set me straight.
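A minimal sketch of why the mask matters. It assumes MC750_TBPAR is 0x4; that value is not stated anywhere in this thread, so check your own machdep.c. The two helper function names are made up for illustration and are not kernel routines.

```c
/* ASSUMPTION: illustrative value only; the thread does not give it. */
#define MC750_TBPAR 0x4

/* Comparison as distributed: all four low bits must match exactly,
 * so a set low-order (execution buffer) bit defeats the match. */
static int tbpar_as_shipped(unsigned mcesr)
{
    return (mcesr & 0xf) == MC750_TBPAR;
}

/* Corrected comparison: the low-order bit is masked out, so the
 * test matches whether or not that bit is set. */
static int tbpar_fixed(unsigned mcesr)
{
    return (mcesr & 0xe) == MC750_TBPAR;
}
```

Under that assumption, an mcesr of 0x5 (same fault with the low-order bit set) is missed by the 0xf mask and caught by the 0xe mask, while 0x4 is caught by both.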
grandi@noao.UUCP (Steve Grandi) (07/30/85)
> Okay, I will. I already mailed John, but perhaps this could be rehashed
> one more time. The problem does lie in the L0003 board, but the solution
> is easy. VMS has microcode to alleviate these parity problems, and
> using the /boot program which reads microcode off the disk, the problem
> can be easily solved. Mike Karels wrote up a patch and we have been running

Unfortunately, loading the proper microcode is not the complete solution. Witness the following console output--

Jul 4 02:10
machine check 2: cp tbuf par fault
	va 802246f4 errpc 8001433b mdr 505 smr 8 rdtimo 0 tbgpar 3 cacherr 1
	buserr 8 mcesr c pc 80014336 psl c00000 mcsr 80318
panic: mchk
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand
trap type 2, code = 0, pc = 80000fa2
panic: Reserved operand

4.2 BSD UNIX #5: Mon Jun 24 17:12:19 MST 1985
real mem = 5238784
avail mem = 4198400
using 231 buffers containing 524288 bytes of memory
etc.

Maybe the combination of microcode rev. 98 (which we are already using) and the rev. 7 L0003 board (which will be installed Someday, Real Soon Now) will cure the problem and eliminate these irritating crashes. But I doubt it.

Now the real question: Does anyone know why the system sometimes goes into the mchk/Reserved operand panic loop shown above instead of trying its normal recovery? This happens on about half of our tbuf parity faults.
--
Steve Grandi, National Optical Astronomy Observatories, Tucson, AZ, 602-325-9228
{arizona,decvax,hao,ihnp4,seismo}!noao!grandi noao!grandi@lbl-csam.ARPA
mp@allegra.UUCP (Mark Plotnick) (07/31/85)
I think the reserved operand panic loop can be fixed by changing asm.sed so that spl1() [which is called in boot()] sets the priority to something higher than 0; as 4.2bsd was distributed, spl1() does the same thing as spl0(). I believe this is one of the old RWS@XX bug fixes.
spaf@gatech.CSNET (Gene Spafford) (08/01/85)
In article <2496@sun.uucp> dcmartin@sun.UUCP (David C. Martin) writes:
>Okay, I will. I already mailed John, but perhaps this could be rehashed
>one more time. The problem does lie in the L0003 board, but the solution
>is easy. VMS has microcode to alleviate these parity problems, and
>using the /boot program which reads microcode off the disk, the problem
>can be easily solved. Mike Karels wrote up a patch and we have been running
>it at UC Berkeley for quite some time with favorable results. If there is
>sufficient need, I will dig this up for those of you who need it, the microcode
>loading program was previously posted to the NET, so check your archives for
>that.
>

Nope, that isn't the whole fix. The microcode fix only cures about 1/4 to 1/3 of the tbuf crashes (from our experience with the 3 750s in our lab). I installed the microcode-loading boot just about a week after the machines came in, and it didn't cure the problem. The new microcode fixes a different bug that causes tbuf faults.

Also, before anyone posts something about how the whole thing can be cured by a patch to the machine check processing code -- I know about that patch too, and it doesn't fix the problem.

To repeat, the problem is a well known HARDWARE problem, and if your field service people don't believe it, tell them to call the Ultrix support center for confirmation; everybody there should know all about the problem. Most of the old boards with the bad lot of chips (I have been told that the only way to identify some of them is to unsolder the chips and read the lot numbers off the bottom) have been replaced or installed in VMS systems where the problem will go unnoticed.

Unfortunately, some field service people don't know about the problem, or blame it on Unix (because they don't understand). One site I know of had the field engineer swap out the L0003 board twice, and the problem didn't go away. He claimed that it had to be Unix, and as a non-supported product he was not responsible for anything else. The problem was that the two boards he swapped out were spares that had been sitting at the local office for months, and they had the faulty chips. Don't let this happen to you!
--
Gene "4 months and counting" Spafford
The Clouds Project, School of ICS, Georgia Tech, Atlanta GA 30332
CSNet: Spaf @ GATech ARPA: Spaf%GATech.CSNet @ CSNet-Relay.ARPA
uucp: ...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!spaf
chris@umcp-cs.UUCP (Chris Torek) (08/01/85)
There is another problem with panic: you can get bogus "panic: sleep"s.
I fixed this a while back. In /sys/sys/kern_synch.c, change the top of
sleep() to look like this:
sleep(chan, pri)
	caddr_t chan;
	int pri;
{
	register struct proc *rp, **hp;
	register s;

	rp = u.u_procp;
	s = spl6();
	if (panicstr) {
		/*
		 * Let interrupts in for a moment, then just return.
		 * The splnet() really ought to be spl0(), but I'm
		 * too timid to do that.
		 */
		(void) splnet();
		splx(s);
		return;
	}
	if (chan == 0 || rp->p_stat != SRUN || rp->p_rlink)
		panic("sleep");
.
.
.
(The splnet lets network interrupts through, so that the network
disk stuff (remote mount file systems) can finish syncing. splnet
is also < spl6, so disk interrupts get through too.)
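The control flow of that early return can be modelled in user space. This is only a sketch: every name prefixed sim_ or SIM_ is hypothetical, the numeric levels are placeholders rather than real VAX interrupt priority levels, and nothing here touches actual hardware priority.

```c
/* Placeholder levels; the only property relied on is NET < 6. */
#define SIM_LEVEL_NET 2
#define SIM_LEVEL_6   6

static int sim_ipl = 0;            /* current simulated priority level */
static const char *sim_panicstr;   /* non-null once a panic has begun */
static int sim_window_level = -1;  /* level during the window, -1 if none */

/* Set a new level and return the previous one, like the spl*() family. */
static int sim_spl(int level)
{
    int old = sim_ipl;
    sim_ipl = level;
    return old;
}

/* Returns 1 if the caller would go on to sleep, 0 if it bailed out. */
static int sim_sleep(void)
{
    int s = sim_spl(SIM_LEVEL_6);       /* stands in for s = spl6() */

    if (sim_panicstr) {
        (void) sim_spl(SIM_LEVEL_NET);  /* briefly admit lower-priority work */
        sim_window_level = sim_ipl;
        (void) sim_spl(s);              /* restore caller's level, like splx(s) */
        return 0;                       /* do not sleep during a panic */
    }
    (void) sim_spl(s);                  /* normal path elided in this model */
    return 1;
}
```

With sim_panicstr set, sim_sleep() returns 0, leaves the caller's level as it found it, and records that the window was opened at the network level rather than at level 6.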
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP: seismo!umcp-cs!chris
CSNet: chris@umcp-cs ARPA: chris@maryland