[mod.computers.vax] VMS 4.4 going away

jmleonar@CRDC.ARPA ("Dr. Joseph M. Leonard") (07/18/86)

We installed VMS V4.4 a few weeks ago, and have been noticing some unusual
behavior: the system "goes away" about every 10-15 days...  The machine IS
100% utilized (and is a 730), but there seems to be no unusual load
condition when it happens.

The symptom is that all tty's become unresponsive, and no tty can initiate
the login sequence (<cr>'s are ignored).  The 100% load has been present
since VMS 3.0, but the hangs are much more frequent under VMS 4.4...

Is there some method of monitoring the system to watch for unusual situations?
Is there something that must be done that is NOT indicated in the SYSGEN
stuff?  Has anybody else noticed this problem?

                                                  Joe Leonard
                                              <jmleonar@crdc.arpa>

carl@CITHEX.CALTECH.EDU.UUCP (07/20/86)

The symptoms you describe sound very much like those we observed in connection
with a bug in a third-party device driver.  In our case, the driver went into
an infinite loop, and until it was somehow persuaded to exit from the loop,
nothing else on the system got any cpu time, including the terminal driver.
To find out if your problem is of this nature (software looping at high
priority), you should do the following the next time the machine hangs:

1)  Put the console subsystem in LOCAL mode and type a CONTROL-P on the
    console in order to halt the system (at least I think the CONTROL-P
    halts the 750; if not, then halt the system with the 'H' command).
    When the system halts, it displays the contents of the PC on the
    console.
2)  Use the 'C' command to have the VAX continue execution
3)  Repeat steps 1) and 2) until either:
    A)  You conclude that the problem is not a tight loop (the value of
        the PC is not restricted to a handful of addresses when you halt
        the system).
    B)  You have a good idea of the limits of the loop in which the machine
        is stuck.  For a tight loop in a device driver, we're talking about
        on the order of ten to fifteen instructions in the loop; for other
        looping behavior, use your own judgment.
4)  Halt the system again and examine all the registers; on a 780, you can
    do this with the console commands:
	EXAMINE PSL
	EXAMINE/INTERN/NEXT:4 0
5)  Force the machine to bugcheck.  The console commands:
	DEPOSIT PC=-1
	DEPOSIT PSL=1F0000
	CONTINUE
    will do this for you.
6)  Reboot the system and use the ANALYZE/CRASH_DUMP utility to find out
    in which image the loop was occurring, if the problem was a loop, or
    to find out what, if anything, was weird about the system when it hung.
7)  If the software that's causing the problem originated at DEC, submit
    an SPR; for third-party software, use the supplier's corresponding
    bug-reporting scheme.
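
The decision in step 3 (tight loop or not) can be made less subjective by
writing down each PC value shown at a halt and summarizing the set offline.
The following is a minimal Python sketch of that bookkeeping, not part of
any VMS tooling.  The function name, the 64-byte threshold, and the sample
addresses are all assumptions made for illustration; the threshold is only
a rough guess at how many bytes ten to fifteen VAX instructions might span.

```python
def summarize_pc_samples(samples, max_span=64):
    """Summarize program-counter values noted at successive console halts.

    samples  -- PC values as hex strings, as read off the console
    max_span -- assumed byte span below which we call it a tight loop
    """
    if not samples:
        raise ValueError("need at least one PC sample")
    pcs = sorted(int(s, 16) for s in samples)
    low, high = pcs[0], pcs[-1]
    span = high - low
    return {
        "low": low,                      # lowest PC seen (loop lower limit)
        "high": high,                    # highest PC seen (loop upper limit)
        "span": span,                    # distance between them in bytes
        "distinct": len(set(pcs)),       # number of different addresses
        "tight_loop": span <= max_span,  # case B) if True, case A) if False
    }


# Hypothetical samples, invented for illustration only.
report = summarize_pc_samples(["80012A40", "80012A4C", "80012A44", "80012A52"])
print("loop limits: %08X..%08X" % (report["low"], report["high"]))
print("tight loop suspected:", report["tight_loop"])
```

The low and high values give the limits of the suspected loop to look up
later in the crash dump.  A handful of samples is weak evidence, so the
more halts recorded, the more the verdict is worth.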

chris@mimsy.umd.edu.UUCP (07/20/86)

Another thing that can cause these symptoms is a continuous series
of interrupts.  This will usually show up as a small set of different
PC addresses, since the machine gets to run the interrupt handler,
but not much else.
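
One way to tell this case from a single tight loop is to group the recorded
PC values into clusters of nearby addresses: a lone loop would tend to give
one cluster, while an interrupt storm would more likely give a few (the
handler plus whatever it keeps interrupting).  Below is a rough Python
sketch of that grouping.  The function name, the 64-byte gap, and the
sample addresses are hypothetical and chosen only for illustration.

```python
def cluster_pc_samples(samples, max_gap=64):
    """Group sampled PC values (hex strings) into runs of nearby addresses.

    Two addresses land in the same cluster when they are no more than
    max_gap bytes apart; max_gap is an assumed, tunable threshold.
    Returns a list of (lowest, highest) address pairs, one per cluster.
    """
    pcs = sorted(set(int(s, 16) for s in samples))
    clusters = []
    for pc in pcs:
        if clusters and pc - clusters[-1][1] <= max_gap:
            clusters[-1][1] = pc        # extend the current cluster
        else:
            clusters.append([pc, pc])   # start a new cluster
    return [(lo, hi) for lo, hi in clusters]


# Hypothetical samples, invented for illustration only.
hypothetical = ["80004210", "80004218", "8001F3A0", "80004214", "8001F3A6"]
groups = cluster_pc_samples(hypothetical)
for lo, hi in groups:
    print("%08X..%08X" % (lo, hi))
```

Each printed range is a candidate region to identify in the crash dump; the
clustering itself says nothing about which region is the interrupt handler.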

Incidentally, on a 780, you must indeed use the HALT command (`h'
will do) to stop the processor; on the 750, the console loop is
run by the main CPU, so simply typing ^P causes a halt.  Also, on
a 780, the easiest way to obtain a crash dump is to use `@crash'.
On a 750, you cannot `DEPOSIT PC -1'; you must use `D/G F FFFFFFFF'
instead.

Chris