[net.unix] Machine check, type 0

alan@drivax.UUCP (Alan Fargusson) (08/13/85)

Does anyone know what 'Machine check, type 0' is on a VAX 780?
The message also says 'CP read timeout fault' and prints some
registers, but I can't find any documentation around here that
describes things like that. We have gotten four of them since
last Wed.

We are running 5.2.2 with 4M, one rm03 and one rp07.
-- 

Alan Fargusson.

{ ihnp4, amdahl, mot }!drivax!alan

mcferrin@inuxc.UUCP (P McFerrin) (09/02/85)

> Does anyone know what 'Machine check, type 0' is on a VAX 780?
> The message also says 'CP read timeout fault' and prints some
> registers, but I can't find any documentation around here that
> describes things like that. We have gotten four of them since
> last Wed.
> 
> We are running 5.2.2 with 4M, one rm03 and one rp07.
> -- 
> 
> Alan Fargusson.
> 
> { ihnp4, amdahl, mot }!drivax!alan

This problem is caused by a device on the Unibus.
It occurs when a device fails to reply with its vector address when requested
to do so during a bus-request/bus-grant protocol.  This can be a tough problem
to isolate.  I know of no procedure to find the culprit except by
replacing one board at a time.

Another cause of this problem can be external wiring to a DR-11 interface
(parallel i/o).  Noise on the request A or B lines can cause the DR11
to withdraw in the middle of a bus-request/bus-grant protocol.

It is a hardware problem.

aps@decwrl.UUCP (Armando P. Stettner) (09/03/85)

> > Does anyone know what 'Machine check, type 0' is on a VAX 780?
> > The message also says 'CP read timeout fault' and prints some
> > registers, but I can't find any documentation around here that
> > describes things like that. We have gotten four of them since
> > last Wed.
> > 
> > We are running 5.2.2 with 4M, one rm03 and one rp07.
> > -- 
> > 
> > Alan Fargusson.
> > 
> > { ihnp4, amdahl, mot }!drivax!alan
> 
> This problem is caused by a device on the Unibus.
> It occurs when a device fails to reply with its vector address when requested
> to do so during a bus-request/bus-grant protocol.  This can be a tough problem
> to isolate.  I know of no procedure to find the culprit except by
> replacing one board at a time.
> 
> Another cause of this problem can be external wiring to a DR-11 interface
> (parallel i/o).  Noise on the request A or B lines can cause the DR11
> to withdraw in the middle of a bus-request/bus-grant protocol.
> 
> It is a hardware problem.

Hi there.
There may have been more information in Alan Fargusson's message
that I did not see but P McFerrin is wrong in saying that the problem
is a UNIBUS device.  Don't confuse "read timeout" with UNIBUS timeouts
(which are handled somewhat differently on VAX-11/780's than PDP-11's).
A "CP read timeout fault" occurs when the CPU tries to do a data
reference and the bus control logic cannot gain access to the SBI,
or receives no response for some number of cycles (or the requested
location [nexus?] responds with Busy, or something to that effect,
for that many cycles).
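
To make the distinction concrete, here is a minimal sketch in C of the
timeout logic described above.  It is a toy model, not the 11/780's actual
bus control logic: the names cp_read, bus_reply, and the three responder
functions are made up for illustration, and TIMEOUT_CYCLES is an arbitrary
placeholder rather than the real SBI limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: what the requester sees on one bus cycle. */
enum bus_reply { NO_RESPONSE, BUSY, ACK };

enum read_status { READ_OK, READ_TIMEOUT };

/* Placeholder cycle limit; the real value is fixed by the hardware. */
#define TIMEOUT_CYCLES 8u

/* A responder stands in for whatever sits at the addressed location. */
typedef enum bus_reply (*responder_fn)(unsigned cycle);

/*
 * Retry the read once per cycle until the target answers ACK.
 * Silence (NO_RESPONSE) and BUSY are treated alike: keep waiting.
 * If the limit is reached first, report a read timeout.
 */
static enum read_status cp_read(responder_fn target, unsigned *cycles_used)
{
    assert(target != NULL && cycles_used != NULL);

    for (unsigned cycle = 0; cycle < TIMEOUT_CYCLES; cycle++) {
        if (target(cycle) == ACK) {
            *cycles_used = cycle + 1;
            return READ_OK;
        }
    }
    *cycles_used = TIMEOUT_CYCLES;
    return READ_TIMEOUT;
}

/* Busy for two cycles, then answers. */
static enum bus_reply healthy_memory(unsigned cycle)
{
    return cycle < 2 ? BUSY : ACK;
}

/* Answers Busy forever. */
static enum bus_reply stuck_busy(unsigned cycle)
{
    (void)cycle;
    return BUSY;
}

/* Nothing at that address answers at all. */
static enum bus_reply missing_nexus(unsigned cycle)
{
    (void)cycle;
    return NO_RESPONSE;
}
```

In this model cp_read(healthy_memory, &n) succeeds on the third cycle, while
stuck_busy and missing_nexus both end in READ_TIMEOUT.  Nothing in it involves
a UNIBUS interrupt or vector, which is the distinction being drawn here.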

In the case of a failed UNIBUS access, the UBA will cause an interrupt
and in 4.2 (Ultrix-32), an error message will be printed, and the system
will attempt to reset the UBA.  System V's actions do not seem
straightforward to me (but I only looked at it for about 10 min), but it appears
to print out the contents of some registers and then return.

This is not to say that there couldn't be a problem with Alan's UBA ...
	aps.

mcferrin@inuxc.UUCP (P McFerrin) (09/06/85)

> > > Does anyone know what 'Machine check, type 0' is on a VAX 780?
> > > ....
> > > ....
> > > 
> > > { ihnp4, amdahl, mot }!drivax!alan
> > 
> > This problem is caused by a device on the Unibus.
> > It occurs when a device fails to reply with its vector address when requested
> > to do so during a bus-request/bus-grant protocol.  This can be a tough problem
> > to isolate.  I know of no procedure to find the culprit except by
> > replacing one board at a time.
> > 
> > Another cause of this problem can be external wiring to a DR-11 interface
> > (parallel i/o).  Noise on the request A or B lines can cause the DR11
> > to withdraw in the middle of a bus-request/bus-grant protocol.
> > 
> > It is a hardware problem.
> 
> Hi there.
> There may have been more information in Alan Fargusson's message
> that I did not see but P McFerrin is wrong in saying that the problem
> is a UNIBUS device.  Don't confuse "read timeout" with UNIBUS timeouts
> (which are handled somewhat differently on VAX-11/780's than PDP-11's).
> A "CP read timeout fault" occurs when the CPU tries to do a data
> reference and the bus control logic cannot gain access to the SBI,
> or receives no response for some number of cycles (or the requested
> location [nexus?] responds with Busy, or something to that effect,
> for that many cycles).
> 
> In the case of a failed UNIBUS access, the UBA will cause an interrupt
> and in 4.2 (Ultrix-32), an error message will be printed, and the system
> will attempt to reset the UBA.  System V's actions do not seem
> straightforward to me (but I only looked at it for about 10 min), but it appears
> to print out the contents of some registers and then return.
> 
> This is not to say that there couldn't be a problem with Alan's UBA ...
> 	aps.

We have several VAXen here, most of them configured with a DR11-C
parallel interface connected to Datakit VCS.  If the cable connector from
the DR11-C is not plugged in all the way, a "CP Read time-out" fault is
guaranteed.  Unix then attempts a "warm-restart" but can't, and prints "Power Fail
error".  Now the question: is this a Unibus device causing the error??
It is reproducible on several of our systems.

jfs@ih1ap.UUCP (Jesse Fred Shumway) (09/06/85)

> > Does anyone know what 'Machine check, type 0' is on a VAX 780?
> > The message also says 'CP read timeout fault' and prints some
> > registers, but I can't find any documentation around here that
> > describes things like that. We have gotten four of them since
> 
> This problem is caused by a device on the Unibus.
> It occurs when a device fails to reply with its vector address when requested
> to do so during a bus-request/bus-grant protocol.  This can be a tough problem
> to isolate.  I know of no procedure to find the culprit except by
> replacing one board at a time.
> 
> It is a hardware problem.


Yes, it is definitely a hardware problem. UNIX simply takes a
microcode machine check interrupt and prints several of the
privileged machine registers.

I don't want to be argumentative, but I've never seen this
problem's locus to be a UNIBUS device. Usually, focusing on the
memory subsystem seems to effectively isolate the offending board.
Although, on one machine I know of, DEC resorted to replacing the
SBI backplane in their attempts to get rid of these errors, which
regrettably returned the following spring with the temperature
fluctuations that accompanied the annual air conditioning
shakedown.

Quoting from DEC's "VAX Maintenance Handbook", 1983 edition,
EK-VAXV2-HB-002: "CP refers to memory references explicitly requested
by microcode and whose address comes from VA". VA is the
microcode's virtual address register. While you're at it, get
yourself a copy of the "UNIX System V Release 2, Error Message
Reference Manual, DEC Processors", 307-114 issue 2, from your AT&T
UNIX sales rep. With it, and a copy of the VAX processor register
layouts, you can get a good feel for how the hardware is
misbehaving. It's really nice to graciously forgo the handwaving a
VAX maintenance person often delivers when asked, "Ah-gee, any
idea what's wrong?".  No? :-)

Jesse Fred Shumway	AT&T 	ihnp4!ih1ap!jfs	  (312) 510-7880

alan@drivax.UUCP (Alan Fargusson) (09/12/85)

> > > Does anyone know what 'Machine check, type 0' is on a VAX 780?
> > > The message also says 'CP read timeout fault' and prints some
> > > registers, but I can't find any documentation around here that
> > > describes things like that. We have gotten four of them since

It looks like some of my postings didn't make it to the net. The problem
went away after I recompiled the device drivers. It looks like the System V
distribution has a bug in it. The makefile for the device drivers does the
following:
	cc -O -I/usr/include -S gd.c
	ed - gd.s <../spl.ed
	/lib/c2 -y gd.s gd.os
	as -o gd.o gd.os
	rm -f gd.s gd.os
	ar rv ../lib2 gd.o

The 'cc' line has the -O flag set which seems to cause the optimiser to
insert some instructions which don't work correctly on device registers
on the VAX. This is obviously a mistake anyway because the optimiser is
used again on gd.s with the magic -y flag. Removing the -O from CFLAGS in
the makefile seems to fix all kinds of funny things with disk and tape drives.
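
As a rough illustration of why optimised code can misbehave on device
registers, here is a sketch in C.  It assumes the volatile qualifier from
later standard C, which the compiler discussed here most likely did not have.
The register layout, the bit values, and the names dev_regs, dev_go, and
dev_wait_ready are hypothetical and are not taken from gd.c.  Without
volatile, an optimiser may assume that memory does not change on its own, so
it can read the status word once and reuse the value, or merge or resize
accesses in ways a device does not tolerate.

```c
#include <stdint.h>

/* Hypothetical register block; a real driver would map this onto the
 * device's bus address instead of ordinary memory. */
struct dev_regs {
    volatile uint16_t csr;   /* control and status register */
    volatile uint16_t data;  /* data buffer register */
};

/* Made-up bit assignments for the sketch. */
#define CSR_GO    0x0001u
#define CSR_READY 0x0080u

/* Start an operation.  Because csr is volatile, the compiler must emit
 * this store and may not drop it or fold it into a later access. */
static void dev_go(struct dev_regs *regs)
{
    regs->csr = CSR_GO;
}

/*
 * Poll for READY, giving up after max_polls tries.
 * Returns 0 when READY was seen, -1 otherwise.
 * The volatile qualifier forces a fresh read of csr on every pass;
 * without it the load could legally be hoisted out of the loop.
 */
static int dev_wait_ready(struct dev_regs *regs, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (regs->csr & CSR_READY)
            return 0;
    }
    return -1;
}
```

On a hosted system the block can be exercised against a plain struct standing
in for the device.  Whether this matches what the System V optimiser actually
did to gd.c is a guess; the thread only says that -O produced instructions
that did not work on device registers.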
-- 

Alan Fargusson.

{ ihnp4, amdahl, mot }!drivax!alan