[comp.unix.ultrix] Emacs hanging on DEC3100, possibly in "rmail"

alg@venture.cs.cornell.edu (Anne Louise Gockel) (07/31/90)
A user in our department has a problem that causes emacs to hang regularly
(though we cannot reproduce it on demand).  The problem is possibly associated
with using "rmail" in emacs, or possibly with starting a shell in emacs.

If you have seen this problem in similar circumstances, or can shed any light
on it, please let me know.  I do not know whether the problem is unique to
this single user or widespread.

Configuration:
	DEC3100, Ultrix UWS 2.2, MIT's X11R4 (server, twm, and clients), 
		emacs 18.55.2 (happened with 18.54 also).
	/usr/spool/mail NFS mounted from Sun 4.0 file system
	/usr NFS mounted from a Sun 4.0 file system
	emacs run from /usr/local, NFS mounted from Sun 4.0 filesystem.
	emacs lock files in /tmp, local to DECstation
	DECstation is a YP client
	emacs compiled with X11 support; it comes up in its own X window.

Symptoms:

One emacs process starts chewing up CPU (70-80%) and cannot be stopped or
interrupted.  A parent emacs process is hung in disk wait.  I cannot kill
these processes except with "kill -9".  I've tried to get a core dump of them,
but cannot get one that's very meaningful.

The listings below show the emacs-related processes, first from "ps -auxww"
and then from "ps -clxa".  "emacs-debug" is a version of emacs built with "-g"
and no "-O".

It appears that the parent process (18010) is the one hung in disk wait and
the child (18014) is spinning away (maybe in a spin lock?).  This setup does
not make sense to me.  Is it typical of "rmail" in emacs?


USER       PID %CPU %MEM   SZ  RSS TT STAT  TIME COMMAND
rz       18014 79.5  0.7 2172   28 co R    27:49 emacs-debug
rz       18011  0.0  1.4  388   64 p1 I     0:00 /usr/local/lib/emacs/etc/loadst -n 60
rz       18010  0.0  0.0    0    0 co DW    0:00 ???? (emacs-debug)
-----------------------
      F UID   PID  PPID CP PRI NI ADDR  SZ  RSS WCHAN STAT TT  TIME COMMAND
1300c000 442 18010     1  8  -1  0    0   0    0 97e8c DW   co  0:00 emacs-debug
12008201 442 18011 18010  1  15  0  7e3 296   28 fc000 I    p1  0:00 loadst
2009001 442 18014 18010 195 73  0  5921600   20       R    co 27:40 emacs-debug
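
For anyone comparing against their own machines, here is a minimal sketch of
how the "ps -auxww" listing above could be scanned for the same pattern (one
runnable process eating CPU, one process in disk wait).  The function names and
the 50% CPU threshold are my own assumptions, not anything from ps or emacs,
and the parser assumes the columns are separated by whitespace, which the
"ps -clxa" listing shows is not always the case.

```python
# Sketch only: pick out "spinning" and "disk wait" processes from ps aux text.
# parse_ps_aux, classify, and the 50% threshold are illustrative assumptions.

PS_AUX = """\
USER       PID %CPU %MEM   SZ  RSS TT STAT  TIME COMMAND
rz       18014 79.5  0.7 2172   28 co R    27:49 emacs-debug
rz       18011  0.0  1.4  388   64 p1 I     0:00 /usr/local/lib/emacs/etc/loadst -n 60
rz       18010  0.0  0.0    0    0 co DW    0:00 ???? (emacs-debug)
"""


def parse_ps_aux(text):
    """Turn whitespace-separated ps output into a list of dicts keyed by header."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Split into at most len(header) fields so COMMAND keeps its spaces.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


def classify(rows, cpu_threshold=50.0):
    """Return (spinning_pids, disk_wait_pids) based on STAT and %CPU."""
    spinning = [
        int(r["PID"])
        for r in rows
        if r["STAT"].startswith("R") and float(r["%CPU"]) >= cpu_threshold
    ]
    disk_wait = [int(r["PID"]) for r in rows if r["STAT"].startswith("D")]
    return spinning, disk_wait


if __name__ == "__main__":
    spinning, disk_wait = classify(parse_ps_aux(PS_AUX))
    print("spinning:", spinning)
    print("disk wait:", disk_wait)
```

On the listing above this reports 18014 as spinning and 18010 as in disk wait,
which matches the reading given earlier.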

Looking through the mail logs, it is doubtful, but possible, that the user
received mail at the same time as he issued the "rmail" command.

We tried to figure out what process 18010 was "disk waiting" on.  We guessed
that it was an NFS file and tried to trace the Ethernet packets.  The packets
we looked at showed the machine issuing NFS RFS_READLINK and RFS_GETATTR calls
for /usr, /usr/spool, /usr/spool/mail, and /usr/spool/mail/rz.  These seemed
to repeat at regular intervals of a few seconds.
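
The sequence of paths looks like what an ordinary walk down to the mail file
would touch, one prefix at a time.  As a rough way to test that guess, the
sketch below checks each prefix of a path with lstat() and, for symbolic links,
readlink(); on an NFS client those calls should roughly correspond to attribute
and readlink requests, though that mapping is my assumption and not something
we confirmed from the packets.  The names path_prefixes and probe are made up
for this example.  Note that on a mount that is truly hung, the lstat() call
itself may block.

```python
# Sketch only: stat every prefix of a path and time each step.
# path_prefixes and probe are hypothetical helper names for illustration.
import os
import stat
import time


def path_prefixes(path):
    """Return each leading directory prefix of an absolute path, shortest first."""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def probe(path):
    """lstat each prefix, read symlink targets, and record the elapsed time."""
    results = []
    for prefix in path_prefixes(path):
        start = time.time()
        try:
            st = os.lstat(prefix)
            # Only symbolic links have a target to read.
            target = os.readlink(prefix) if stat.S_ISLNK(st.st_mode) else None
            status = "ok"
        except OSError as exc:
            target, status = None, exc.strerror
        results.append((prefix, status, target, time.time() - start))
    return results


if __name__ == "__main__":
    for prefix, status, target, elapsed in probe("/usr/spool/mail/rz"):
        print("%-22s %-28s link=%s %.3fs" % (prefix, status, target, elapsed))
```

Running it every few seconds while watching the wire would show whether the
same four requests appear; a slow or failing step would point at the component
worth a closer look.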

We have seen NFS caching problems between Suns that are sometimes solved by
trying to unmount the bad filesystem (even though the umount fails, the cache
is cleared).  That trick did not change anything here.

If anyone has any insights or has experienced similar problems, please let me
know.
						Thanks,
						Anne Louise Gockel
						Cornell Computer Science
Internet: alg@cs.cornell.edu		UUCP: cornell!alg