[mod.computers.vax] catch-22

info-vax@ucbvax.UUCP (11/13/85)
I hate being misinformed.  I'm sure you do too.  My original note
contained some very wrong and misleading information.

I'm sorry.

Herewith an explanation...

>Approximately one month after upgrading to VMS 4.2, system started crashing
>at sporadic intervals with the fatal bugcheck: "unknown signal in ACP".  Within
>a few days, that fatal bugcheck came more often, and the system was also
>crashing with the fatal bugcheck  "illegal page fault with IPL too high",
>although occasionally system would just hang.  Currently, the system
>crashes with a fatal bugcheck during or immediately after rebooting from
>a previous fatal bugcheck.

All too true.

>Local diagnostics as well as remote diagnostics (by DDC at Colorado
>Springs) showed no evidence of hardware problems.  DDC indicated this is a
>VMS 4.2 problem, and suggested the work-around was to -- on a weekly basis
>-- completely backup the systems disk (and other disks as required), then
>rebuild by restoring from backup tapes.

That's what I heard initially from DDC, and erroneously passed on.  And our
local Digital field engineers took their cue from DDC ...

What happened was that DDC didn't initially run a complete set of diagnostics
on our system, and too quickly settled on the wrong diagnosis.

There was a bug -- quite familiar to DDC -- called the "XQP" bug, in
VMS4 up until VMS4.2.  It exhibited similar (though usually far less drastic)
symptoms and was indeed related to disk fragmentation; the workaround
they usually recommend is a restore from backups.  However, none of the
people I dealt with at DDC had been told the problem was fixed in 4.2.

(More details on the "XQP" and related bugs below.)

The real problem with our 750 finally turned out to be a rare, intermittent
CPU hardware problem, which took DDC quite a while to get a fix on
...after they were able to verify that it wasn't a VMS bug.

>Note: SYS$UPDATE:STABACKIT.COM in VMS 4.2 as distributed will not build
>a stand-alone backup kit on TU58 cartridges on a VAX-11/750 ...

True.  The work-around has been pointed out in INFO-VAX several times over
the past few weeks, and every DEC field office should also know by now.
Hopefully, this should now be common knowledge, and it's guaranteed
to be fixed in VMS4.3.

More details ...

>The fatal bugchecking is apparently a serious VMS 4.2 bug related to disk
>fragmentation.  A similar problem may appear in VMS 4.1, although the
>symptoms are repeated simple hangs of the operating system, rather than
>the fatal bugchecks...

Here is a more complete explanation, as I understand it:

In VMS4.0 & 4.1 (possibly also in 3.x?) there exists a bug in the FILES-11
XQP (the extended QIO processor, which acts like an ACP though actually a
kernel-mode thread) which makes it handle multi-header (fragmented) files poorly.
Under certain conditions with a fragmented disk, this bug may cause fatal
bugchecks "unknown signal name in ACP" at random intervals, perhaps as
often as once or twice a day.  Possibly the system may hang occasionally
instead of bugchecking.  A work-around is to rebuild the offending disk
from backups which (since files get written onto and read back from the
tape contiguously) fixes the fragmentation.

DDC has seen a lot of this "XQP" bug, and several modifications to VMS were
made earlier this year, the final one in June, in time to be
incorporated into production distributions of VMS4.2 (although not everyone
at DDC knew it had been fixed...).

If you're running VMS4.2 and getting an occasional "unknown signal name in
ACP", the most likely explanation is not the "XQP" bug, but a currently-
outstanding problem with the lock manager.  The problem occurs in all VMS
4.x versions, but shows up more often in 4.2 which does more extensive
locking/unlocking.   In brief: the lock manager sometimes fails to properly
upconvert a particular lock to a higher mode, but signals success anyway.
Current versions of VMS trap possibly-suspicious occurrences with the
"unknown signal name in ACP" bugcheck.  The workaround is a simple increase
of the LOCKIDTBL and RESHASHTBL system parameters.  Of course, there are
other possible reasons for that bugcheck ... so I definitely recommend
checking with Digital before trying to fix it yourself by fiddling with
those parameters.

Lessons I have learned from all this ...

	Various people at Digital do read Info-VAX (or one of its
	subsidiary distributions).  While you may not see an
	official reply here, they do indeed take notice of what
	we're discussing.

	The DDC is capable of doing a very comprehensive diagnosis
	on a machine with even minimal functionality (the last time
	they had to go in, the machine had gotten very buggy and VMS
	wouldn't stay up at all).  However, I suspect that they're
	too often rushed and may give too superficial a diagnosis.

	For our site, self maintenance (SMS) on software is not good
	enough if it looks like we've run across a serious VMS bug.
	We're going to BSS on software; I need the telephone support
	for our Digital software as well as for the hardware (especially
	if there's a doubt about which end the problem is on).

Bob Cunningham  {dual|vortex|ihnp4}!islenet!bob
Hawaii Institute of Geophysics, University of Hawaii