dyer@spdcc.COM (Steve Dyer) (06/30/90)
After my RT 125 died (7c in LEDs at power up), I moved the disks and peripherals to a desktop APC system. It would crash every few hours with a machine check. I swapped and replaced the APC CPU and memory boards without any material effect--it would still crash unpredictably after a few hours of use. I moved the disk and peripheral cards to a new floor-standing model and it's managed to crash twice with a machine check again.

I am truly stymied. What *IS* common is the AT bus cards, specifically the Megapel adapter, an ethernet card (which is essentially unused right now), an Adaptec AHA1542A SCSI disk controller, and a buffered 4-port async card. The EESDI controller has been different in each machine, as has been the APC, and (occasionally) the memory, and these seem unrelated to the problem. I've rebuilt and replaced the kernel image, just in case I was dealing with some odd form of image corruption.

All the boards I mention seem to be working OK in general--I mean, the SCSI controller is very heavily used, and it seems to operate OK without munging disks.

Does anyone have any clues? I'm about to shoot myself, since I'm leaving in 36 hours for 5 weeks.

--
Steve Dyer
dyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer
dyer@arktouros.mit.edu, dyer@hstbme.mit.edu
honey@citi.umich.edu (Peter Honeyman) (07/02/90)
bill webb in palo alto understands what's going on here, and offers the following patch to ca/trap.c:

	switch (mcs_pcs) {
+	case MCS_CHECK|USER:
+	case MCS_CHECK:
+		printf("spurious machine check ignored\n");
+		return;
	default:
		if (mcs_pcs & MCS_CHECK) {	/* machine check */

after seeing "spurious machine check ignored" printed a sufficient number of times, you'll want to lose the printf and increment a counter.

	peter
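
The counter variant suggested above might look roughly like the sketch below. It is a standalone illustration, not the actual ca/trap.c code: the bit values given to MCS_CHECK and USER, the counter name spurious_mchecks, and the function handle_mcs_pcs are all made up here so the fragment compiles on its own. The real definitions live in the kernel headers.

```c
/* Hypothetical bit values, chosen only so this sketch compiles. */
#define USER      0x01
#define MCS_CHECK 0x02

/* Hypothetical counter that replaces the printf. */
static unsigned long spurious_mchecks;

/*
 * Returns 1 if the trap was swallowed as a spurious machine check,
 * 0 if it should fall through to the normal handling in the default case.
 */
static int handle_mcs_pcs(int mcs_pcs)
{
    switch (mcs_pcs) {
    case MCS_CHECK | USER:
    case MCS_CHECK:
        spurious_mchecks++;   /* count it quietly instead of printing */
        return 1;
    default:
        return 0;             /* existing machine check path takes over */
    }
}
```

As in the patch, only a bare machine check (with or without the user-mode bit) is ignored. A machine check with any other status bits set still reaches the default case.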