[fa.info-vax] Machine checks

info-vax@ucbvax.ARPA (10/22/84)

From: RONNIE%MIT-EECS@MIT-MC.ARPA

I am currently working on a 780 with 8 MB of memory.  We have been getting
machine checks quite frequently, often during system startup, where they
cause a horrible loop of machine checks.  Does anyone out there know what
should be in the stack when this halt occurs?  It doesn't really resemble
any of the normal exception stacks.  I would really appreciate an answer
as quickly as possible before this field circus guy replaces all of our
memory boards without even looking at the dump or the errorlog.


					Thanks,
					-Ron (Not to be confused with the
							other -Ron)

-Note:  If you can get to me today, I will be at (315) 423-3876

-------

info-vax@ucbvax.ARPA (10/22/84)

From: NEWMAN%SAV@LLL-MFE.ARPA

Ron:

What follows is an example from a machine check we had here at SAIC Oak Ridge
just recently.  When we got it I went looking through the hardware reference
manual and the Architecture Handbook to find out what the stack looked like.

There is no documentation other than the fiche. This is a copy of what I sent
out to some of the other people in our company.

Hope this helps.


        gkn

------------------------------------------
Arpa:   Newman%SAV@LLL-MFE.Arpa
USPS:   Gerard K. Newman
        Science Applications International
        800 Oak Ridge Turnpike
        Oak Ridge, TN  37830
AT&T:   (615) 482-9031


--------------------------------------------------------------------------------
From:   GKN            15-OCT-1984 16:21
Subj:   Machine checks on the 11/780


Having just had a machine check crash and discovering that the documentation
for the contents of the machine check logout stack is virtually nonexistent,
I thought I'd share with you what the stack looks like and a little about
what kinds of things cause a machine check.

All of this information is 11/780 specific; I haven't looked at the machine
check exception handlers for the 11/730 or 11/750.  I suspect (though I
don't know for sure) that this information is valid for the 11/785 also.

Stack format:

        @SP:            00000028        Number of bytes pushed onto the stack
                        008800F6        Machine check summary parameter
                        00010204        CPU error status
                        0000025B        Trapped micro PC
                        800813E7        Virtual address at fault time
                        F4001D4B        CPU "D" register
                        00000A01        Translation buffer status register 0
                        00000000        Translation buffer status register 1
                        00000000        Physical address causing SBI timeout
                        00001533        Cache parity error status
                        00004000        SBI error status
                        800054CE        PC of instruction causing machine check
                        00C70000        PSL at fault time


These parameters are actually taken from our recent machine check.
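
For anyone picking a frame like this out of a dump by hand, here is a minimal
Python sketch that labels the thirteen longwords in the order listed above.
The field names are my own shorthand for the descriptions in the listing, not
DEC's names, and the layout is only as good as the one example shown here.

```python
# Labels for the 11/780 machine check frame, in the order shown in the
# listing above. The identifiers are illustrative shorthand, not DEC names.
FIELDS = [
    "byte_count",           # number of bytes pushed onto the stack
    "summary",              # machine check summary parameter
    "cpu_error_status",
    "trapped_upc",          # trapped micro PC
    "virtual_address",      # virtual address at fault time
    "d_register",           # CPU "D" register
    "tbsr0",                # translation buffer status register 0
    "tbsr1",                # translation buffer status register 1
    "sbi_timeout_pa",       # physical address causing SBI timeout
    "cache_parity_status",
    "sbi_error_status",
    "pc",                   # PC of instruction causing machine check
    "psl",                  # PSL at fault time
]

# The longwords from the example above, starting at @SP.
DUMP = """
00000028 008800F6 00010204 0000025B 800813E7 F4001D4B 00000A01
00000000 00000000 00001533 00004000 800054CE 00C70000
"""


def label_frame(longwords):
    """Pair each longword with its label; refuse frames of the wrong size."""
    if len(longwords) != len(FIELDS):
        raise ValueError("expected %d longwords, got %d"
                         % (len(FIELDS), len(longwords)))
    return dict(zip(FIELDS, longwords))


frame = label_frame([int(word, 16) for word in DUMP.split()])
logout_longwords = frame["byte_count"] // 4

for name in FIELDS:
    print("%-20s %08X" % (name, frame[name]))
```

In this dump the byte count of 28 (hex) is 40 bytes, or ten longwords, which
matches the ten entries between the count and the PC.  That suggests the count
covers neither itself nor the PC/PSL pair, but that is an inference from this
one example.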

We're most interested in the machine check summary parameter, the second
longword on the stack.  As far as I can tell, only the low-order two bytes
are significant.  The low-order byte contains the fault type in its low-order
4 bits; the high-order 4 bits seem to be 1111 always.  The next byte
contains the 'timeout pending flag', whatever that is.

The fault type codes are:

        0       - CPU timeout/SBI error confirmation
        1       - Control store parity error
        2       - Translation buffer parity error
        3       - Cache parity error
        4       - Not used
        5       - Read data substitute error
        6       - 'Microcode can't get here' error
        7       - Not used
        8       - Not used
        9       - Not used
        A       - IB detected translation buffer parity error
        B       - Not used
        C       - IB detected memory error
        D       - IB detected CPU timeout or SBI error confirmation
        E       - Not used
        F       - IB detected cache problem


As you can see, we  just had a 'microcode can't get here' error, which gives
me one of those "warm fuzzy" feelings.
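
The decoding described above can be sketched in a few lines of Python.  This
only implements the reading given in this note (fault type in the low 4 bits
of the low byte, timeout pending flag somewhere in the next byte); the
function and dictionary names are made up for illustration, and the upper two
bytes are ignored because their meaning is not known here.

```python
# Fault type codes as listed in the table above; codes not listed there
# are reported as "Not used".
FAULT_TYPES = {
    0x0: "CPU timeout/SBI error confirmation",
    0x1: "Control store parity error",
    0x2: "Translation buffer parity error",
    0x3: "Cache parity error",
    0x5: "Read data substitute error",
    0x6: "'Microcode can't get here' error",
    0xA: "IB detected translation buffer parity error",
    0xC: "IB detected memory error",
    0xD: "IB detected CPU timeout or SBI error confirmation",
    0xF: "IB detected cache problem",
}


def decode_summary(summary):
    """Split the machine check summary parameter as described in the text."""
    low_byte = summary & 0xFF
    code = low_byte & 0x0F
    return {
        "code": code,
        "fault": FAULT_TYPES.get(code, "Not used"),
        # Observed to be 1111 in the example; treated here as a sanity check.
        "high_nibble": (low_byte >> 4) & 0x0F,
        # Byte said to hold the 'timeout pending flag'; exact bit unknown.
        "timeout_byte": (summary >> 8) & 0xFF,
    }


result = decode_summary(0x008800F6)
print("fault type %X: %s" % (result["code"], result["fault"]))
```

Feeding it the summary longword from the dump above (008800F6) gives fault
type 6, with the high nibble of the low byte equal to F as expected.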

Anyway, the next time  you get a machine  check I  hope this info is of some
use.



gkn

info-vax@ucbvax.ARPA (10/23/84)

From: Richard Garland <OC.GARLAND%CU20B@COLUMBIA.ARPA>

                ---------------

Mail-From: OC.GARLAND created at 22-Oct-84 22:04:35
Date: Mon 22 Oct 84 22:04:35-EDT
From: Richard Garland <OC.GARLAND@CU20B.ARPA>
Subject: Machine checks
To: ronnie%mit-eecs@MIT-MC.ARPA
cc: OC.GARLAND@CU20B.ARPA

-Ron
	Short of the fiche, the Internals Manual is fairly good.  The
new edition is the black-covered book from Digital Press, "VAX/VMS Internals
and Data Structures".  Chapter 8 mentions special cases and what happens
if a certain number of checks happen within a certain time.

There is also a field service book (A small format handbook they are given)
that is full of numbers and codes.  

The guys at RDC are also usually *very* good at picking apart stacks
and codes.  Often the local guy is too "proud" (stupid?) to call for
help, but a call to Colorado can often speed things up enormously.

					Rg
-------
-------

KVC%CIT-VAX@%SRI-KL.ARPA,%SRI-CSL:engvax.UUCP (10/23/84)

From: engvax!KVC@cit-vax

We've occasionally had some real sticky hardware problems...  The
kind where everything passes the diagnostics or different things
fail the diags at different times.  One of the things I've found out
from sitting with the FE all night while he swapped boards is that
quite often the problem ends up being a flakey power supply.  This
is true for CPU power supplies and disk drive power supplies.

What I wanna know is, how come these guys don't check the %*&$%&*$
power supplies first?!?!?!?!?!?!?!?   I really don't like sitting
in a cold machine room to all hours of the morning watching someone
yank my VAX apart only to discover, several hours later, that a flakey
power supply was the cause of it all!

Now I tell them to check the damn things first, then they can start
pulling logic arrays...

	/Kevin Carosso             engvax!kvc @ CIT-VAX.ARPA
	 Hughes Aircraft Co.

info-vax@ucbvax.ARPA (10/25/84)

From: engvax!KVC@cit-vax

Well, thanks to the recent discussion on INFO-VAX, I just got
a machine check...  Well, at least it couldn't have happened
at a better time!  Here I am all ready to apply my new-found
expertise!

Enough of that...

Anyway, I picked the error out of the interrupt stack, and found
I got a "Read Data Substitute" error.  I got this same error
listed in the error log just before the crash got logged.   The
sequence in the error log was:

	memory error:   RDS
	machine check in exec mode:  RDS  (system didn't crash)
	machine check in kernel mode: RDS (down she went...)

The timestamps on the above three errors indicate they happened
closer together than the resolution of system time (.01 secs).

Anyway, does anyone out there know what a "Read Data Substitute"
error is?  I understand all about machine checks now, but what's
the error mean!?

	/Kevin Carosso              engvax!kvc @ CIT-VAX.ARPA
	 Hughes Aircraft Co.

padpowell@wateng.UUCP (PAD Powell) (10/25/84)

If you think that the little service book is available, have another
thought.  I was informed by our local field service people that they
could get CANNED for even letting customers copy (Xerox) a page out of
the book.  DEC Proprietary Information, trade secrets, and all that
junk...

Patrick ("Can I at least fondle it?") Powell
