[comp.sys.ibm.pc.rt] recurring machine checks

dyer@spdcc.COM (Steve Dyer) (06/30/90)

After my RT 125 died (7c in LEDs at power up), I moved the
disks and peripherals to a desktop APC system.  It would crash
every few hours with a machine check.  I swapped and replaced
the APC CPU and memory boards without any material effect--it
would still crash unpredictably after a few hours of use.
I moved the disk and peripheral cards to a new floor standing
model and it's managed to crash twice with a machine check again.

I am truly stymied.  What *IS* common is the AT bus cards, specifically
the Megapel adapter, an ethernet card (which is essentially unused
right now), an Adaptec AHA1542A SCSI disk controller, and a buffered
4-port async card.  The EESDI controller has been different in each
machine, as has been the APC, and (occasionally) the memory, and these
seem unrelated to the problem.  I've rebuilt and replaced the kernel
image, just in case I was dealing with some odd form of image
corruption.  All the boards I mention seem to be working OK in
general--I mean, the SCSI controller is very heavily used, and it seems
to operate OK without munging disks.

Does anyone have any clues?  I'm about to shoot myself, since I'm
leaving in 36 hours for 5 weeks.



-- 
Steve Dyer
dyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer
dyer@arktouros.mit.edu, dyer@hstbme.mit.edu

honey@citi.umich.edu (Peter Honeyman) (07/02/90)

bill webb in palo alto understands what's going on here, and offers the
following patch to ca/trap.c:

        switch (mcs_pcs) {
+       case MCS_CHECK|USER:
+       case MCS_CHECK:
+               printf("spurious machine check ignored\n");
+               return;
        default:
                if (mcs_pcs & MCS_CHECK) {      /* machine check */

after seeing "spurious machine check ignored" printed a sufficient number
of times, you'll want to lose the printf and increment a counter.
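
a rough sketch of what that counter version might look like, pulled out
of trap.c into a standalone function so it can be compiled by itself.
the values given for USER and MCS_CHECK, the counter name
spurious_mchecks, and the function name ignore_spurious_mcheck are all
made up for illustration; the real definitions are in the kernel
headers, and in the kernel the two cases would sit in the existing
switch rather than in a separate function.

```c
#include <stdio.h>

/* Hypothetical bit values for illustration only; the real
 * definitions come from the kernel headers. */
#define USER      0x01
#define MCS_CHECK 0x02

/* Hypothetical counter: machine checks that were silently ignored. */
static unsigned long spurious_mchecks;

/*
 * Returns 1 if the status word is exactly MCS_CHECK (with or without
 * USER) and was counted as spurious, 0 if it should fall through to
 * the normal machine check handling.
 */
static int ignore_spurious_mcheck(int mcs_pcs)
{
    switch (mcs_pcs) {
    case MCS_CHECK | USER:
    case MCS_CHECK:
        spurious_mchecks++;     /* count it instead of printing */
        return 1;
    default:
        return 0;
    }
}
```

the counter can then be looked at with a kernel debugger, which says how
often the condition fires without filling the console with messages.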

	peter