info-vax@ucbvax.ARPA (10/22/84)
From: RONNIE%MIT-EECS@MIT-MC.ARPA
I am currently working on a 780 with 8 meg. We have been getting machine
checks quite frequently, often during system startup, which causes a
horrible loop of machine checks. Does anyone out there know what should be
on the stack when this halt occurs? It doesn't really resemble any of the
normal exception stacks.
I would really appreciate an answer as quickly as possible, before this
field circus guy replaces all of our memory boards without even looking at
the dump or the error log.
Thanks,
-Ron (Not to be confused with the other -Ron)
-Note: If you can get to me today, I will be at (315) 423-3876
-------
info-vax@ucbvax.ARPA (10/22/84)
From: NEWMAN%SAV@LLL-MFE.ARPA
Ron:
What follows is an example from a machine check we had here at SAIC Oak Ridge
just recently. When we got it I went looking through the hardware reference
manual and the Architecture handbook to find out what the stack looked like.
There is no documentation other than the fiche. This is a copy of what I sent
out to some of the other people in our company.
Hope this helps.
gkn
------------------------------------------
Arpa: Newman%SAV@LLL-MFE.Arpa
USPS: Gerard K. Newman
Science Applications International
800 Oak Ridge Turnpike
Oak Ridge, TN 37830
AT&T: (615) 482-9031
--------------------------------------------------------------------------------
From: GKN 15-OCT-1984 16:21
Subj: Machine checks on the 11/780
Having just had a machine check crash and discovering that the documentation
for the contents of the machine check logout stack is virtually non-existent,
I thought I'd share with you what the stack looks like and a little about
what kinds of things cause a machine check.
All of this information is 11/780 specific; I haven't looked at the machine
check exception handlers for the 11/730 or 11/750. I suspect (though I
don't know for sure) that this information is valid for the 11/785 also.
Stack format:
@SP:  00000028  Number of bytes pushed onto the stack
      008800F6  Machine check summary parameter
      00010204  CPU error status
      0000025B  Trapped micro PC
      800813E7  Virtual address at fault time
      F4001D4B  CPU "D" register
      00000A01  Translation buffer status register 0
      00000000  Translation buffer status register 1
      00000000  Physical address causing SBI timeout
      00001533  Cache parity error status
      00004000  SBI error status
      800054CE  PC of instruction causing machine check
      00C70000  PSL at fault time
These parameters are actually taken from our recent machine check.
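As a minimal sketch (in Python, and not anything from DEC), the dump above can be labeled programmatically. The field names are invented here as shorthand for the descriptions in the listing, and the layout is assumed to be exactly the one shown: a byte count, that many bytes of logout data, then the PC and PSL.

```python
# Longwords copied from the @SP dump above (11/780 machine check).
FRAME = [
    0x00000028, 0x008800F6, 0x00010204, 0x0000025B, 0x800813E7,
    0xF4001D4B, 0x00000A01, 0x00000000, 0x00000000, 0x00001533,
    0x00004000, 0x800054CE, 0x00C70000,
]

# Hypothetical labels: shorthand for the descriptions in the dump,
# not official register names.
LOGOUT_LABELS = [
    "summary", "cpu_error_status", "trapped_micro_pc", "virtual_address",
    "d_register", "tb_status_0", "tb_status_1", "sbi_timeout_address",
    "cache_parity_status", "sbi_error_status",
]


def label_frame(frame):
    """Pair each longword with a label, checking the byte count first."""
    byte_count = frame[0]
    n_longwords = byte_count // 4
    if byte_count % 4 or n_longwords != len(LOGOUT_LABELS):
        raise ValueError(f"unexpected byte count {byte_count:#x}")
    if len(frame) != 1 + n_longwords + 2:
        raise ValueError("expected count + logout data + PC + PSL")
    fields = dict(zip(LOGOUT_LABELS, frame[1:1 + n_longwords]))
    fields["pc"] = frame[1 + n_longwords]
    fields["psl"] = frame[2 + n_longwords]
    return fields


for name, value in label_frame(FRAME).items():
    print(f"{name:22s} {value:08X}")
```

The byte count of 28 (hex) is 40 bytes, i.e. the ten longwords between the count and the PC, which agrees with the dump.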
We're most interested in the machine check summary parameter, the second
longword on the stack. As far as I can tell, only the low order two bytes
are significant. The low order byte contains the fault type in the low
order 4 bits; the high order 4 bits seem to be 1111 always. The next byte
contains the 'timeout pending flag', whatever that is.
The fault type codes are:
0 - CPU timeout/SBI error confirmation
1 - Control store parity error
2 - Translation buffer parity error
3 - Cache parity error
4 - Not used
5 - Read data substitute error
6 - 'Microcode can't get here' error
7 - Not used
8 - Not used
9 - Not used
A - IB detected translation buffer parity error
B - Not used
C - IB detected memory error
D - IB detected CPU timeout or SBI error confirmation
E - Not used
F - IB detected cache problem
As you can see, we just had a 'microcode can't get here' error, which gives
me one of those "warm fuzzy" feelings.
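The decoding described above can be sketched in a few lines of Python. This is only an illustration of the bit layout as reported in this message (fault type in bits 0-3, the apparently constant 1111 in bits 4-7, the 'timeout pending' byte in bits 8-15); the function and table names are made up for the example.

```python
# Fault type codes from the table above; codes listed as "Not used" are omitted.
FAULT_TYPES = {
    0x0: "CPU timeout/SBI error confirmation",
    0x1: "Control store parity error",
    0x2: "Translation buffer parity error",
    0x3: "Cache parity error",
    0x5: "Read data substitute error",
    0x6: "'Microcode can't get here' error",
    0xA: "IB detected translation buffer parity error",
    0xC: "IB detected memory error",
    0xD: "IB detected CPU timeout or SBI error confirmation",
    0xF: "IB detected cache problem",
}


def decode_summary(summary):
    """Split the machine check summary longword into the fields described above."""
    code = summary & 0xF                  # fault type, low 4 bits
    high_nibble = (summary >> 4) & 0xF    # observed to be 1111 (0xF)
    timeout_byte = (summary >> 8) & 0xFF  # byte holding the 'timeout pending flag'
    return code, FAULT_TYPES.get(code, "Not used"), high_nibble, timeout_byte


code, name, high_nibble, timeout_byte = decode_summary(0x008800F6)
print(f"fault type {code:X}: {name}")
print(f"high nibble {high_nibble:X}, timeout byte {timeout_byte:02X}")
```

For the summary parameter in the dump, 008800F6, the low nibble is 6, which is the 'microcode can't get here' entry.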
Anyway, the next time you get a machine check I hope this info is of some
use.
gkn
info-vax@ucbvax.ARPA (10/23/84)
From: Richard Garland <OC.GARLAND%CU20B@COLUMBIA.ARPA>
---------------
Mail-From: OC.GARLAND created at 22-Oct-84 22:04:35
Date: Mon 22 Oct 84 22:04:35-EDT
From: Richard Garland <OC.GARLAND@CU20B.ARPA>
Subject: Machine checks
To: ronnie%mit-eecs@MIT-MC.ARPA
cc: OC.GARLAND@CU20B.ARPA
-Ron
Short of the fiche, the Internals Manual is fairly good. The
new issue is the black-covered one from Digital Press, "VAX/VMS Internals and
Data Structures". Chapter 8 mentions special cases and what happens
if a certain number of checks happen within a certain time.
There is also a field service book (a small-format handbook they are given)
that is full of numbers and codes.
The guys at RDC are also usually *very* good at picking apart stacks
and codes. Often the local guy is too "proud" (stupid?) to call for
help, but a call to Colorado can often speed things up enormously.
Rg
-------
-------
KVC%CIT-VAX@%SRI-KL.ARPA,%SRI-CSL:engvax.UUCP (10/23/84)
From: engvax!KVC@cit-vax
We've occasionally had some real sticky hardware problems... The kind where
everything passes the diagnostics or different things fail the diags at
different times. One of the things I've found out from sitting with the FE
all night while he swapped boards is that quite often the problem ends up
being a flakey power supply. This is true for CPU power supplies and disk
drive power supplies.
What I wanna know is, how come these guys don't check the %*&$%&*$ power
supplies first?!?!?!?!?!?!?!? I really don't like sitting in a cold machine
room to all hours of the morning watching someone yank my VAX apart only to
discover, several hours later, that a flakey power supply was the cause of
it all!
Now I tell them to check the damn things first, then they can start pulling
logic arrays...
/Kevin Carosso engvax!kvc @ CIT-VAX.ARPA
Hughes Aircraft Co.
info-vax@ucbvax.ARPA (10/25/84)
From: engvax!KVC@cit-vax
Well, thanks to the recent discussion on INFO-VAX, I just got a machine
check... Well, at least it couldn't have happened at a better time! Here I
am all ready to apply my new-found expertise! Enough of that...
Anyway, I picked the error out of the interrupt stack, and found I got a
"Read Data Substitute" error. I got this same error listed in the error log
just before the crash got logged. The sequence in the error log was:
    memory error: RDS
    machine check in exec mode: RDS (system didn't crash)
    machine check in kernel mode: RDS (down she went...)
The timestamps on the above three errors indicate they happened closer
together than the resolution of system time (.01 secs).
Anyway, does anyone out there know what a "Read Data Substitute" error is?
I understand all about machine checks now, but what's the error mean!?
/Kevin Carosso engvax!kvc @ CIT-VAX.ARPA
Hughes Aircraft Co.
padpowell@wateng.UUCP (PAD Powell) (10/25/84)
If you think that the little service book is available, have another
thought. I was informed by our local field service people that they
could get CANNED for even letting customers copy (Xerox) a page out of
the book. DEC proprietary information, trade secrets, and all that
junk...
Patrick ("Can I at least fondle it?") Powell