roger@cogsci.ed.ac.uk (Roger Burroughes) (02/19/91)
PROBLEM: We have an apparent incompatibility between an MCP board and a Ciprico Rimfire 3220 SMD Disk Controller (supporting a CDC Sabre 1.2Gb disk) on a Sun 4/330 running SunOS 4.0.3, causing repeated "panic: Text fault ... BAD TRAP" crashes.

DETAILED DESCRIPTION: The original set-up was all Sun equipment (i.e. internal disk plus MCP hardware), which ran reasonably smoothly. A few weeks later, the Rimfire controller and associated software were added (7.Dec.89), and all seemed well. Shortly after this (19.Dec.89), I got round to installing the MCP software (version 6.0), together with the X.25 software (version 6.0) and the Sun Coloured Book software (version 2.0.1). Again, everything seemed to run smoothly afterwards.

Our first related error occurred on 5.Jan.90, when we lost external comms with the error message "mcph0: xmit hung..." and, at the same time, a single "rfintr: Hard error..." report. This wasn't repeated, so we put it down to a stray cosmic ray :-).

Our first crash happened on 8.Jan.90, with:

    le0: Transmission stopped
    le0: csr: 2e3<TINT,...
    BAD TRAP
    rlogin: Text fault
    kernel read fault...

At this point, Sun suggested that we might need a newer revision level CPU (as an aside, the Lance Ethernet bug seen at the same time was later fixed). I believe the Sun was in the middle of dumping to a remote Exabyte drive when it crashed.

The next crash was on 10.Jan.90, again while dumping to Exabyte:

    panic: Text fault
    BAD TRAP
    nfsd: Text fault
    kernel read fault at addr=0x0,...

[A possible red herring emerged at this point, since we had another 4/330 which crashed with similar errors:

    BAD TRAP
    nfsd: Data fault
    kernel read fault at addr=0x1d2c,...

THIS 4/330 had a Rimfire controller (3223/3224) plus disks, but *NO* MCP board.]

The system was now crashing every day or two, and there appeared to be a high, but not exclusive, correlation between crashes and dumps.
We sent a core dump to Sun, who could find no obvious problem; there was some SLIGHT indication that the problem lay with the MCP/X.25 software, plus a TENTATIVE suggestion that the MCP board may not work well on a 16MIPS machine. A Sun engineer checked over the hardware (CPU & MCP boards), moved the MCP board to slot 3 and the Rimfire controller to slot 2, and installed an extra 16Mb of memory (25.Jan.90). We had the Rimfire controller replaced (8.Feb.90), as well as the CPU board, but the crashes still continued, so we assumed that the problem wasn't due to faulty hardware.

That left only one thing to do: remove all third-party hardware (i.e. the Rimfire controller and disk). This was done on 6.Mar.90, and the situation improved. No more crashes.

The next thing Sun suggested was to reinstall the Rimfire controller & disk and remove the MCP board; however, this was not a viable option with our setup, so it was not done. Unfortunately, we did not try the slightly newer Rimfire controller (3223/3224) that we had on another machine.

The system has been reasonably stable in the MCP-but-no-Rimfire configuration, so we accepted the situation and left well alone. However, we now want to add an extra medium/large disk to the machine in question, and the obvious thing to do seems to be to re-install the Rimfire controller and add a disk to that; we'd rather not have to buy another SCSI disk.

Has anyone seen similar interaction problems between an MCP board and a Rimfire controller? Has anyone found a fix? Does anyone know if newer versions of the hardware and/or software have cured this problem?

Please reply by email, and I'll summarise if there's any interest.

Roger Burroughes
Phone: +44 31 650 4447                         | University of Edinburgh
UUCP: ...!uunet!mcvax!ukc!its63b!cogsci!roger  | Centre for Cognitive Science
ARPA: roger%cogsci.ed.ac.uk@nsfnet-relay.ac.uk | 2 Buccleuch Place
JANET: roger@uk.ac.ed.cogsci                   | Edinburgh EH8 9LW Scotland