[mod.computers.vax] hung

cetron%utah-ced@UTAH-CS.ARPA (Ed Cetron) (11/06/86)

	I have got some very strange happenings on my microvax:

	

	I have several processes which do a lot of math as well as
a fair bit of graphics using the UIS$xxxx calls for my GPX...

	Every so often, these processes just hang - no ctrl-y, no
ctrl-t, nothing..... I can use sho proc/cont/id=... to observe these
and they ARE using up cpu time, but no i/o or page faults.
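
As a rough illustration of the symptom just described (CPU time climbing while I/O and page-fault counts stay flat), the check can be sketched in Python. The Snapshot class and its field names are hypothetical stand-ins for numbers read by eye off two successive SHOW PROCESS/CONTINUOUS displays; nothing here talks to VMS.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of a process's counters (hypothetical field names)."""
    cpu_ticks: int
    direct_io: int
    buffered_io: int
    page_faults: int


def looks_like_cpu_spin(before, after):
    """True if CPU time advanced while I/O and page-fault counts stayed flat."""
    cpu_advanced = after.cpu_ticks > before.cpu_ticks
    io_flat = (after.direct_io == before.direct_io
               and after.buffered_io == before.buffered_io)
    faults_flat = after.page_faults == before.page_faults
    return cpu_advanced and io_flat and faults_flat
```

A process that matches this pattern is most likely spinning in a compute loop rather than waiting on a device or thrashing memory, which fits a hang at a single PC.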

	All of the programs are written in straightforward Fortran,
no AST or sys calls (except calls to UIS$ccccc).  But they all hang
at the same place in the program (but at random times through the loop)
and all show a pc of 8008E204  which is obviously in some shareable
library since the map of the main offending program has no addresses
this high.  

	I tried to use the debugger once and when the process hung, I
could never get control to the debugger either.....

	I am totally stumped (too bad it's not on RSX, 'cause then I 
could zap the image and get a postmortem dump at least....)....

	I can't even stop these processes with stop/id=xxx since 
then neither sho proc/id= or sho users can find them, but sho sys
and monitor still can and they continue to run (eating CPU time).
My only solution is to reboot!!!!!

	If anyone has ANY ideas, I would be ecstatic - if you 
want maps or the like, no problem....


	thanks in advance,

		ed cetron
		center for engineering design
		univ of utah

garry@cadif.DECnet ("CADIF::GARRY") (11/07/86)

In a recent article Ed Cetron said:
>
>	I have got some very strange happenings on my microvax:
>
>	I have several processess which do a lot of math as well as
>a fair bit of graphics using the UIS$xxxx calls for my gpx...

The first release of the GPX/UIS software is nutty as a fruitcake.

It is not worth trying to figure out what in the world it's doing. (It's
rumored that a second release is on the way!)

garry wiegand   (garry%cadif-oak@cu-arpa.cs.cornell.edu)

(The opinions expressed above are indeed those of my company.)
------

carl@CITHEX.CALTECH.EDU (Carl J Lydick) (11/08/86)

   >  	All of the programs are written in straight forward fortran,
   >  no AST or sys calls, (except calls to UIS$ccccc).  But they all hang
   >  at the same place in the program (but at random times through the loop)
   >  and all show a pc of 8008E204  which is obviously in some shareable
   >  library since the map of the main offending program has no addresses
   >  this high.  
The first thing you should do, since you DO have the offending address
available, is to use the System Dump Analyzer (SDA, invoked in this case
by ANALYZE/SYSTEM) to find out what might be there.  I do notice on my
VAXStation, though, that the address you specified sits between the
value of the global symbol RMS ( = 8007740) and the address pointed to
by the global symbol EXE$GL_SYSMSG ( = 80002C04 and points to 80090C00).
Furthermore, the symbol SYS$GL_UIS ( = 80000EE8 and points to 800B3000)
would seem to put the UIS code above your PC.  So, to the extent the two
machines have similar hardware (the UIS stuff has to get loaded early on,
so the rest of the software you've got shouldn't affect this), your problem
is not with UIS but with RMS.  In particular, I suspect it has something to
do with RMS's handling of terminal mailboxes.
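
The address comparison above can be sketched in Python as a sanity check. This is only an illustration using the hex values quoted in this post; the RMS value is left out because, as printed, it has only seven digits, and the names HUNG_PC, REGION_STARTS and regions_above are made up for this sketch. The one outside fact assumed is that VAX system (S0) space begins at 80000000.

```python
# Addresses quoted in the thread (hex). The RMS value is omitted because
# the figure as printed has only seven digits.
HUNG_PC = 0x8008E204
REGION_STARTS = {
    "target of EXE$GL_SYSMSG": 0x80090C00,
    "target of SYS$GL_UIS": 0x800B3000,
}

S0_BASE = 0x80000000  # start of VAX system (S0) space


def is_system_space(addr):
    """True if addr lies at or above the start of S0 space."""
    return addr >= S0_BASE


def regions_above(pc, starts):
    """Names of regions whose start lies above pc, nearest first."""
    above = [(start, name) for name, start in starts.items() if start > pc]
    return [name for start, name in sorted(above)]


if __name__ == "__main__":
    print("PC in system space:", is_system_space(HUNG_PC))
    for name in regions_above(HUNG_PC, REGION_STARTS):
        gap = REGION_STARTS[name] - HUNG_PC
        print(f"{name} starts {gap:#x} bytes above the PC")
```

Both quoted targets start above the hung PC, so on a machine laid out like this one the PC falls below the UIS region. That is consistent with the guess above but does not prove it; SDA on the affected machine is still the real test.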

I've had similar problems using a plain old VT52 emulator on a 780 when
I do much with programs that want to grab broadcast messages before they
can get to your terminal.  There are two modes of hanging that I've seen:
	1)  The process that is getting between your terminal and broadcast
	    messages (in my case, generally TPU, in yours, UIS) suddenly
	    goes crazy, ignoring attempts to communicate with it, and generally
	    (in every case that I've checked) with an I/O request pending
	    on a mailbox, and user-mode AST's disabled.
	2)  Similar to 1), except that the situation arises when a subprocess
	    terminates, but the parent process doesn't wake up.  This seems
	    to have something to do with the parent disabling terminal
	    interrupts (control-C's, T's, and/or Y's).  In disabling these,
	    the parent process manages to disable ALL user-mode AST's, and
	    doesn't want to wake up when the child dies (or maybe the part
	    of the program grabbing broadcast messages grabs a process
	    termination message instead, and doesn't handle it properly).
The workaround I've used in these situations is:
	A)  Log in on another terminal and use the SDA to figure out 
	    which mailbox is the culprit;
	B)  Do a "SET HOST 0" to log in to a job you don't mind losing;
	C)  With this new process, spawn a SYNCHRONOUS (do NOT use the
	    /NOWAIT qualifier) subprocess;
	D)  Look for the definition of DCL$ATTACH_xxxxxxxx in the job
	    logical name table, and replace that definition with one
	    that references the mailbox that is causing the original job
	    to hang;
	E)  Log out from the subprocess; at this point, the parent of
	    that job (the remote job you created with the "SET HOST"
	    command) will be ignoring interrupts from the terminal, will
	    be, in fact, acting just the way the original job was, with
	    one exception: the process that watches for control-Y's in
	    remote jobs will see the control-Y's and let you abort the 
	    remote job.
	F)  Verify that the original process is now responding, and log
	    out from the job you started to deal with the problem.