jmleonar@CRDC.ARPA ("Dr. Joseph M. Leonard") (07/18/86)
We installed VMS V4.4 a few weeks ago, and have been noticing some unusual behavior: the system "goes away" about every 10-15 days... The machine IS 100% utilized (and is a 730), but there seems to be no unusual load condition when it happens. The symptom that is noticed is that all tty's become unresponsive, and no tty can initiate the login sequence (<cr>'s are ignored). The 100% load has been present since VMS 3.0, but the crashes are much more frequent under VMS 4.4...

Is there some method of monitoring the system to watch for unusual situations? Is there something that must be done that is NOT indicated in the SYSGEN stuff? Has anybody else noticed this problem?

Joe Leonard <jmleonar@crdc.arpa>
carl@CITHEX.CALTECH.EDU.UUCP (07/20/86)
The symptoms you describe sound very much like those we observed in connection with a bug in a third-party device driver. In our case, the driver went into an infinite loop, and until it was somehow persuaded to exit from the loop, nothing else on the system got any CPU time, including the terminal driver. To find out if your problem is of this nature (software looping at high priority), do the following the next time the machine hangs:

1) Put the console subsystem in LOCAL mode and type a CONTROL-P on the console in order to halt the system (at least I think the CONTROL-P halts the 750; if not, then halt the system with the 'H' command). When the system halts, it displays the contents of the PC on the console.

2) Use the 'C' command to have the VAX continue execution.

3) Repeat steps 1) and 2) until either:
   A) You conclude that the problem is not a tight loop (the value of the PC is not restricted to a handful of addresses when you halt the system).
   B) You have a good idea of the limits of the loop in which the machine is stuck. For a tight loop in a device driver, we're talking about on the order of ten to fifteen instructions in the loop; for other looping behavior, use your own judgment.

4) Halt the system again and examine all the registers; on a 780, you can do this with the console commands:
      EXAMINE PSL
      EXAMINE/INTERN/NEXT:4 0

5) Force the machine to bugcheck. The console commands:
      DEPOSIT PC=-1
      DEPOSIT PSL=1F0000
      CONTINUE
   will do this for you.

6) Reboot the system and use the ANALYZE/CRASH_DUMP utility to find out in which image the loop was occurring, if the problem was a loop, or to find out what, if anything, was weird about the system when it hung.

7) If the software that's causing the problem originated at DEC, submit an SPR; for third-party software, use the supplier's corresponding bug-reporting scheme.
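The decision in step 3) comes down to whether the PC values you wrote down at each halt cluster in a small address range. If you would rather not eyeball a page of hex, the tally can be sketched off-line, for example in Python on another machine. This is only an illustration: the function name, the sample addresses, and both thresholds (a 64-byte span standing in for "ten to fifteen instructions", and eight distinct addresses standing in for "a handful") are assumptions of mine, not anything VMS or the console provides.

```python
from collections import Counter


def summarize_pc_samples(samples, max_span=64, max_distinct=8):
    """Summarize PC values noted at successive console halts.

    samples: hex strings as copied from the console, e.g. "80004A10".
    max_span and max_distinct are arbitrary, assumed thresholds.
    """
    pcs = [int(s, 16) for s in samples]
    counts = Counter(pcs)
    low, high = min(pcs), max(pcs)
    span = high - low

    if span <= max_span:
        # Every sample falls inside one short stretch of code.
        verdict = "tight loop"
    elif len(counts) <= max_distinct:
        # Wide range, but the same few addresses keep recurring.
        verdict = "few distinct addresses"
    else:
        verdict = "no loop evident"

    return {
        "low": low,
        "high": high,
        "span": span,
        "distinct": len(counts),
        "verdict": verdict,
    }


if __name__ == "__main__":
    # Hypothetical samples, invented for illustration only.
    noted = ["80004A10", "80004A1C", "80004A14", "80004A10", "80004A22"]
    result = summarize_pc_samples(noted)
    print("range %08X-%08X, %d distinct, %s" % (
        result["low"], result["high"], result["distinct"], result["verdict"]))
```

The low and high values give the approximate limits of the loop to look for in the crash dump. With only a few halts almost any set of samples looks like "few distinct addresses", so take a couple of dozen samples before trusting the verdict.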
chris@mimsy.umd.edu.UUCP (07/20/86)
Another thing that can cause these symptoms is a continuous series of interrupts. This will usually show up as a small set of different PC addresses, since the machine gets to run the interrupt handler, but not much else.

Incidentally, on a 780, you must indeed use the HALT command (`h' will do) to stop the processor; on the 750, the console loop is run by the main CPU, so simply typing ^P causes a halt. Also, on a 780, the easiest way to obtain a crash dump is to use `@crash'. On a 750, you cannot `DEPOSIT PC -1'; you must use `D/G F FFFFFFFF' instead.

Chris