[mod.computers.vax] SBI Faults

SIT.BUSH@CU20B.COLUMBIA.EDU.UUCP (03/02/87)

Has anyone seen this problem?  We have been getting a fair number of SBI faults
logged in the error log on a 785.  There do not seem to be any effects from
this except the error log entries.  The entries claim that the error is
an unexpected read fault by TR#8 (a massbus adapter).  The only massbus device
is a single TM78 which is not in use when the errors occur.  Digital Field
Service has tried replacing the RH780 boards and the memory boards.  This
has not made the problem go away.  Of course, it is an intermittent problem,
so the diagnostics never see it.  Has this happened to any systems you know of?
If so, how was it resolved?

- Nick Bush
  Sterling-Winthrop Research Institute

ARPA: SIT.BUSH@CU20B.COLUMBIA.EDU
BITNET: SIT.BUSH@CU20B
-------

dp@JASPER.PALLADIAN.COM.UUCP (03/03/87)

    Date: Mon 2 Mar 87 13:25:23-EST
    From: Nick Bush <SIT.BUSH@CU20B.COLUMBIA.EDU>

    Has anyone seen this problem?  We have been getting a fair number of SBI faults
    logged in the error log on a 785.  There do not seem to be any effects from
    this except the error log entries.  The entries claim that the error is
    an unexpected read fault by TR#8 (a massbus adapter).  The only massbus device
    is a single TM78 which is not in use when the errors occur.  Digital Field
    Service has tried replacing the RH780 boards and the memory boards.  This
    has not made the problem go away.  Of course, it is an intermittent problem,
    so the diagnostics never see it.  Has this happened to any systems you know of?
    If so, how was it resolved?

    - Nick Bush
      Sterling-Winthrop Research Institute

    ARPA: SIT.BUSH@CU20B.COLUMBIA.EDU
    BITNET: SIT.BUSH@CU20B
    -------


Have them check the coaxial ribbon jumpers on the backplane. They fail eventually (the
usual corrosion problems, and a hard limit of 12 insertions on wiggling them). The first
machine I had it happen on was showing bizarre memory errors (not so benign; the
machine crashed instead). Adding to the short life was the fact that the crews in "Touch
Up" used to remove them to give the backplane a final dusting before shipping the
machines. They seemed to have about a 2.5-year life in my machine room... (After I
quit, I ran into my FS tech, who told me when the second machine needed new ones.)
These cables are, as far as I know (not having seen the back of a BI machine),
only present on 780/782/785 series machines.

<dp>

tencati@JPL-VLSI.ARPA.UUCP (03/03/87)

Nick,

Please promise you won't laugh..

I had a problem where my 780 would bugcheck about 3 times a week.  DEC
Field Service escalated it to the point where I had 2 District guys sitting
in the computer room waiting for the system to crash so they could run dumps
and look at stuff.

It was very interesting because the device that was causing the problem was
an RM03 that was spun down and had been for about a month.

Here's what the problem was:

Our computer room is extra-cold due to another computer room sharing the A/C 
with us.  It is also a water-cooled system as opposed to freon or some other 
chemical.  This caused the humidity to be higher than usual in the computer
room.  Over an extended period of time, this humidity caused microscopic MOLD
to grow on the gold plate of a couple pins on the Massbus adapter for the RM03.

This mold would then cause electrical variations on the board which would
cause it to write a bogus value into memory.  VMS would come along and try
to execute the instruction...BUGCHECK...

It took PAINSTAKING diligence on the part of DEC.  They showed me the mold
and I believed...  They replaced the "moldy" board and my system worked
fine (until the next problem), but they had fixed the bugcheck problem.

So as weird as it may seem, have them clean the pins on your boards.  It 
may solve your problem.  It couldn't hurt in any case.

Good Luck,

Ron Tencati
System Mgr, JPL-VLSI.ARPA

art@MITRE.ARPA.UUCP (03/04/87)

>Our computer room is extra-cold due to another computer room sharing the A/C 
>with us.  It is also a water-cooled system as opposed to freon or some other 
>chemical.  This caused the humidity to be higher than usual in the computer
>room.  Over an extended period of time, this humidity caused microscopic MOLD
 The use of water-cooled A/C should not in and of itself cause the
 humidity to be higher.  Sharing the A/C with another computer room is
 probably the reason that your room is colder.  Being colder, your room
 will have a higher relative humidity than the other room, since cooler
 air holds less moisture before it saturates.  You really should work on
 correcting your environment.  If you have a poor environment you are
 just asking for trouble.  It is possible to balance the two rooms, but
 it is not easy.  You need to determine the heat loads in the two rooms
 and control the air flow.  I have seen some automatic systems that use
 a second thermostat to control air flow to the second room.
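
 To put rough numbers on the colder-room point, here is a minimal
 sketch.  It assumes both rooms are fed air with the same moisture
 content and uses the Magnus approximation for saturation vapor
 pressure, which is a standard textbook formula and not something
 from this thread.  The function names, temperatures and the 50%
 starting humidity are all hypothetical, chosen only for illustration.

```python
import math


def saturation_vapor_pressure_hpa(temp_c):
    """Saturation vapor pressure over liquid water, in hPa.

    Magnus approximation; temp_c is in degrees Celsius.
    """
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def relative_humidity_after_cooling(rh_percent, temp_from_c, temp_to_c):
    """Relative humidity of the same air after a temperature change.

    Assumes the moisture content (vapor pressure) stays constant, i.e.
    nothing is added or condensed out along the way.  A result above 100
    would mean the air has passed its dew point and water condenses.
    """
    vapor_pressure = rh_percent / 100.0 * saturation_vapor_pressure_hpa(temp_from_c)
    return 100.0 * vapor_pressure / saturation_vapor_pressure_hpa(temp_to_c)


if __name__ == "__main__":
    # Hypothetical rooms: one held at 22 C and 50% RH, the other
    # over-cooled to 17 C by the shared A/C.
    warm_c, cold_c, rh_warm = 22.0, 17.0, 50.0
    rh_cold = relative_humidity_after_cooling(rh_warm, warm_c, cold_c)
    print(f"{rh_warm:.0f}% RH at {warm_c:.0f} C becomes {rh_cold:.0f}% RH at {cold_c:.0f} C")
```

 Under these assumptions a few degrees of extra cooling raises the
 relative humidity noticeably, whatever the A/C uses as its coolant.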


     
*
*---Art
*
*Arthur T. McClinton Jr.     ARPA: ART@MITRE.ARPA
*Mitre Corporation MS-Z305   Phone: 703-883-6356
*1820 Dolley Madison Blvd    Internal Mitre: ART@MWVMS or M10319@MWVM
*McLean, Va. 22102           DECUS DCS: MCCLINTON
*