info-vax@ucbvax.ARPA (10/22/84)
From: RONNIE%MIT-EECS@MIT-MC.ARPA

I am currently working on a 780 with 8 meg. We have been experiencing machine checks quite frequently, often during system startup, which causes a horrible loop of machine checks. Does anyone out there know what should be on the stack when this halt occurs? It doesn't really resemble any of the normal exception stacks. I would really appreciate an answer as quickly as possible, before this field circus guy replaces all of our memory boards without even looking at the dump or the errorlog.

Thanks,
-Ron (Not to be confused with the other -Ron)

-Note: If you can get to me today, I will be at (315) 423-3876
info-vax@ucbvax.ARPA (10/22/84)
From: NEWMAN%SAV@LLL-MFE.ARPA

Ron:

What follows is an example from a machine check we had here at SAIC Oak Ridge just recently. When we got it I went looking through the hardware reference manual and the Architecture handbook to find out what the stack looked like. There is no documentation other than the fiche. This is a copy of what I sent out to some of the other people in our company. Hope this helps.

gkn
------------------------------------------
Arpa: Newman%SAV@LLL-MFE.Arpa
USPS: Gerard K. Newman
      Science Applications International
      800 Oak Ridge Turnpike
      Oak Ridge, TN 37830
AT&T: (615) 482-9031
--------------------------------------------------------------------------------
From: GKN 15-OCT-1984 16:21
Subj: Machine checks on the 11/780

Having just had a machine check crash and discovered that the documentation for the contents of the machine check logout stack is virtually nonexistent, I thought I'd share with you what the stack looks like and a little about what kinds of things cause a machine check. All of this information is 11/780 specific; I haven't looked at the machine check exception handlers for the 11/730 or 11/750. I suspect (though I don't know for sure) that this information is valid for the 11/785 also.

Stack format:

    @SP:  00000028   Number of bytes pushed onto the stack
          008800F6   Machine check summary parameter
          00010204   CPU error status
          0000025B   Trapped micro PC
          800813E7   Virtual address at fault time
          F4001D4B   CPU "D" register
          00000A01   Translation buffer status register 0
          00000000   Translation buffer status register 1
          00000000   Physical address causing SBI timeout
          00001533   Cache parity error status
          00004000   SBI error status
          800054CE   PC of instruction causing machine check
          00C70000   PSL at fault time

These parameters are actually taken from our recent machine check. We're most interested in the machine check summary parameter, the second longword on the stack. As far as I can tell, only the low order two bytes are significant. The low order byte contains the fault type in the low order 4 bits; the high order 4 bits seem to be 1111 always. The next byte contains the 'timeout pending flag', whatever that is.

The fault type codes are:

    0 - CPU timeout/SBI error confirmation
    1 - Control store parity error
    2 - Translation buffer parity error
    3 - Cache parity error
    4 - Not used
    5 - Read data substitute error
    6 - 'Microcode can't get here' error
    7 - Not used
    8 - Not used
    9 - Not used
    A - IB detected translation buffer parity error
    B - Not used
    C - IB detected memory error
    D - IB detected CPU timeout or SBI error confirmation
    E - Not used
    F - IB detected cache problem

As you can see, we just had a 'microcode can't get here' error, which gives me one of those "warm fuzzy" feelings.

Anyway, the next time you get a machine check I hope this info is of some use.

gkn
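For anyone who would rather not pick the nibbles apart by hand, the layout gkn describes can be sketched in Python. This is only an illustration built from the message above: the field names, `decode_summary`, and `parse_frame` are made up for the example, and the bit positions are exactly the ones gkn reports "as far as I can tell", not anything taken from DEC documentation.

```python
# Sketch only: decodes a VAX-11/780 machine check frame using the layout
# reported in gkn's message. All identifiers here are hypothetical.

# The 13 longwords in stack order, starting at @SP.
FRAME_FIELDS = (
    "byte_count", "summary", "cpu_error_status", "trapped_micro_pc",
    "virtual_address", "d_register", "tb_status_0", "tb_status_1",
    "sbi_timeout_pa", "cache_parity_status", "sbi_error_status",
    "pc", "psl",
)

# Fault type codes from the message; codes not listed there are "Not used".
FAULT_TYPES = {
    0x0: "CPU timeout/SBI error confirmation",
    0x1: "Control store parity error",
    0x2: "Translation buffer parity error",
    0x3: "Cache parity error",
    0x5: "Read data substitute error",
    0x6: "'Microcode can't get here' error",
    0xA: "IB detected translation buffer parity error",
    0xC: "IB detected memory error",
    0xD: "IB detected CPU timeout or SBI error confirmation",
    0xF: "IB detected cache problem",
}


def decode_summary(summary):
    """Split the machine check summary parameter into its reported parts."""
    low_byte = summary & 0xFF
    code = low_byte & 0x0F
    return {
        "fault_code": code,
        "fault_type": FAULT_TYPES.get(code, "Not used"),
        "high_nibble": low_byte >> 4,               # reported as always 1111
        "timeout_pending": (summary >> 8) & 0xFF,   # second byte
    }


def parse_frame(longwords):
    """Label the 13 longwords of a frame and decode the summary parameter."""
    if len(longwords) != len(FRAME_FIELDS):
        raise ValueError("expected %d longwords" % len(FRAME_FIELDS))
    frame = dict(zip(FRAME_FIELDS, longwords))
    frame["decoded_summary"] = decode_summary(frame["summary"])
    return frame


# The sample frame from the message. The count 0x28 is 40 bytes, which
# matches the ten longwords between the count and the PC.
SAMPLE_FRAME = [
    0x00000028, 0x008800F6, 0x00010204, 0x0000025B, 0x800813E7,
    0xF4001D4B, 0x00000A01, 0x00000000, 0x00000000, 0x00001533,
    0x00004000, 0x800054CE, 0x00C70000,
]

if __name__ == "__main__":
    decoded = parse_frame(SAMPLE_FRAME)["decoded_summary"]
    print("fault %X: %s" % (decoded["fault_code"], decoded["fault_type"]))
```

Fed the sample frame, the low nibble of 008800F6 comes out as 6, the same 'microcode can't get here' code gkn read off by eye.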
info-vax@ucbvax.ARPA (10/23/84)
From: Richard Garland <OC.GARLAND@CU20B.ARPA>
Date: Mon 22 Oct 84 22:04:35-EDT
Subject: Machine checks
To: ronnie%mit-eecs@MIT-MC.ARPA
cc: OC.GARLAND@CU20B.ARPA

-Ron

Short of the fiche, the Internals Manual is fairly good. The new issue is the black-covered one from Digital Press, "VAX/VMS Internals and Data Structures". The chapter (8) mentions special cases and what happens if a certain number of checks happens in a certain time.

There is also a field service book (a small format handbook they are given) that is full of numbers and codes.

The guys at RDC are also usually *very* good at picking apart stacks and codes. Often the local guy is too "proud" (stupid?) to call for help, but a call to Colorado can often speed things up enormously.

Rg
KVC%CIT-VAX@%SRI-KL.ARPA,%SRI-CSL:engvax.UUCP (10/23/84)
From: engvax!KVC@cit-vax

We've occasionally had some real sticky hardware problems... The kind where everything passes the diagnostics or different things fail the diags at different times. One of the things I've found out from sitting with the FE all night while he swapped boards is that quite often the problem ends up being a flaky power supply. This is true for CPU power supplies and disk drive power supplies.

What I wanna know is, how come these guys don't check the %*&$%&*$ power supplies first?!?! I really don't like sitting in a cold machine room to all hours of the morning watching someone yank my VAX apart only to discover, several hours later, that a flaky power supply was the cause of it all!

Now I tell them to check the damn things first, then they can start pulling logic arrays...

/Kevin Carosso
engvax!kvc @ CIT-VAX.ARPA
Hughes Aircraft Co.
info-vax@ucbvax.ARPA (10/25/84)
From: engvax!KVC@cit-vax

Well, thanks to the recent discussion on INFO-VAX, I just got a machine check... Well, at least it couldn't have happened at a better time! Here I am all ready to apply my new-found expertise! Enough of that...

Anyway, I picked the error out of the interrupt stack, and found I got a "Read Data Substitute" error. I got this same error listed in the error log just before the crash got logged. The sequence in the error log was:

    memory error: RDS
    machine check in exec mode: RDS (system didn't crash)
    machine check in kernel mode: RDS (down she went...)

The timestamps on the above three errors indicate they happened closer together than the resolution of system time (.01 secs).

Anyway, does anyone out there know what a "Read Data Substitute" error is? I understand all about machine checks now, but what's the error mean!?

/Kevin Carosso
engvax!kvc @ CIT-VAX.ARPA
Hughes Aircraft Co.
padpowell@wateng.UUCP (PAD Powell) (10/25/84)
If you think that the little service book is available, have another thought. I was informed by our local field service people that they could get CANNED for even letting customers copy (Xerox) a page out of the book. DEC Proprietary Information, trade secrets, and all that junk...

Patrick ("Can I at least fondle it?") Powell