info-vax@ucbvax.UUCP (11/13/85)
I hate being misinformed. I'm sure you do too. My original note
contained some very wrong and misleading information. I'm sorry.
Herewith an explanation...

>Approximately one month after upgrading to VMS 4.2, system started crashing
>at sporadic intervals with the fatal bugcheck: "unknown signal in ACP". Within
>a few days, that fatal bugcheck came more often, and the system was also
>crashing with the fatal bugcheck "illegal page fault with IPL too high",
>although occasionally system would just hang. Currently, the system
>crashes with a fatal bugcheck during or immediately after rebooting from
>a previous fatal bugcheck.

All too true.

>Local diagnostics as well as remote diagnostics (by DDC at Colorado
>Springs) showed no evidence of hardware problems. DDC indicated this is a
>VMS 4.2 problem, and suggested the work-around was to -- on a weekly basis
>-- completely backup the systems disk (and other disks as required), then
>rebuild by restoring from backup tapes.

That's what I heard initially from DDC, and erroneously passed on. And
our local Digital field engineers took their cue from DDC ...

What happened was that DDC didn't initially run a complete set of
diagnostics on our system, and too quickly diagnosed our problem as
the wrong one. There was a bug -- quite familiar to DDC -- called the
"XQP" bug, in VMS4 up until VMS4.2, that exhibits similar (though
usually far less drastic) symptoms. It is indeed related to disk
fragmentation, and the workaround they usually recommend is a restore
from backups. However, none of the people I dealt with at DDC had been
told the problem was fixed in 4.2. (More details on the "XQP" and
related bugs below.)

The real problem with our 750 finally turned out to be a rare,
intermittent CPU hardware problem, which took DDC quite a while to get
a fix on ... after they were able to verify that it wasn't a VMS bug.

>Note: SYS$UPDATE:STABACKIT.COM in VMS 4.2 as distributed will not build
>a stand-alone backup kit on TU58 cartridges on a VAX-11/750 ...

True. The work-around has been pointed out in INFO-VAX several times
over the past few weeks, and every DEC field office should also know
by now. Hopefully, this should now be common knowledge, and it's
guaranteed to be fixed in VMS4.3.

More details ...

>The fatal bugchecking is apparently a serious VMS 4.2 bug related to disk
>fragmentation. A similar problem may appear in VMS 4.1, although the
>symptoms are repeated simple hangs of the operating system, rather than
>the fatal bugchecks...

Here is a more complete explanation, as I understand it:

In VMS4.0 & 4.1 (possibly also in 3.x?) there exists a bug in the
FILES-11 XQP (the extended QIO processor, which acts like an ACP
though it is actually a kernel-mode thread) which handles multi-header
(fragmented) files poorly. Under certain conditions with a fragmented
disk, this bug may cause fatal bugchecks "unknown signal name in ACP"
at random intervals, perhaps as often as once or twice a day. The
system may also occasionally hang instead of bugchecking. A
work-around is to rebuild the offending disk from backups, which
(since files get written onto and read back from the tape
contiguously) fixes the fragmentation.

DDC has seen a lot of this "XQP" bug, and several modifications to VMS
were made earlier this year, the last one in June, in time to be
incorporated into production distributions of VMS4.2 (although not
everyone at DDC knew it had been fixed...).

If you're running VMS4.2 and getting an occasional "unknown signal
name in ACP", the most likely explanation is not the "XQP" bug, but a
currently outstanding problem with the lock manager. The problem
occurs in all VMS 4.x versions, but shows up more often in 4.2, which
does more extensive locking/unlocking.

In brief: the lock manager sometimes fails to properly upconvert a
particular lock to a higher mode, but signals success anyway. Current
versions of VMS trap possibly-suspicious occurrences with the "unknown
signal name in ACP" bugcheck. The workaround is a simple increase of
the LOCKIDTBL and RESHASHTBL system parameters. Of course, there are
other possible reasons for that bugcheck ... so I definitely recommend
checking with Digital before trying to fix it yourself by fiddling
with those parameters.

Lessons I have learned from all this ...

Various people at Digital do read Info-VAX (or one of its subsidiary
distributions). While you may not see an official reply here, they do
indeed take notice of what we're discussing.

The DDC is capable of doing a very comprehensive diagnosis on a
machine with even minimal functionality (the last time they had to go
in, the machine had gotten very buggy and VMS wouldn't stay up at
all). However, I suspect that they're too often rushed and may give
too superficial a diagnosis.

For our site, self maintenance (SMS) on software is not good enough if
it looks like we've run across a serious VMS bug. We're going to BSS
on software; I need the telephone support for our Digital software as
well as for the hardware (especially if there's a doubt about which
end the problem is on).

Bob Cunningham
{dual|vortex|ihnp4}!islenet!bob
Hawaii Institute of Geophysics, University of Hawaii