[comp.sys.apollo] Hanging CSH

geof@imagen.UUCP (Geoffrey Cooper) (08/18/88)

Since we went to SR9.7, we've had a recurring problem, but one that we
have not been able to repeat on demand.  Occasionally, one types a Unix
command and either gets a "segmentation violation" or the Cshell
abruptly terminates.  Then the Cshell and all other Cshells active
hang.  New CSH's also hang, as do new BSH's.  The Aegis shell still
runs, and can be used to run at least some unix commands.  A "ps ax"
so derived indicates that the CSH's are in state "S".

Sometimes logging out and logging in helps.  Usually it is necessary to
shut down the node.

It seems to me that the problem is loosely related to the use of Vt100 and
TCP, but I have no hard evidence.

Has anyone else had this problem?  Is there a fix?

- Geof
-- 
UUCP: {decwrl,sun}!imagen!geof ARPA: imagen!geof@decwrl.dec.com

rees@CITI.UMICH.EDU (08/18/88)

    Since we went to SR9.7, we've had a recurring problem, but one that we
    have not been able to repeat on demand.  Occasionally, one types a Unix
    command and either gets a "segmentation violation" or the Cshell
    abruptly terminates.  Then the Cshell and all other Cshells active
    hang.  New CSH's also hang, as do new BSH's.  The Aegis shell still
    runs, and can be used to run at least some unix commands.  A "ps ax"
    so derived indicates that the CSH's are in state "S".

This sounds like a corrupted acl cache.  Get a traceback of one of the
hung shells and see if it's in something that sounds like acl cache code.

To really fix this, you need to flush the cache (with /etc/flush_cache).
I don't recommend doing this unless you're sure there is a problem.

Again, this is fixed in sr10 (which doesn't have an acl cache).
-------

dbfunk@ICAEN.UIOWA.EDU (David B. Funk) (08/19/88)

> Since we went to SR9.7, we've had a recurring problem, but one that we
> have not been able to repeat on demand.  Occasionally, one types a Unix
> command and either gets a "segmentation violation" or the Cshell
> abruptly terminates.  Then the Cshell and all other Cshells active
> hang.  New CSH's also hang, as do new BSH's.  The Aegis shell still
> runs, and can be used to run at least some unix commands.  A "ps ax"
> so derived indicates that the CSH's are in state "S".

    This sounds like an "/etc/passwd.map" problem.
Many Domain/IX utilities, including the shells, need to be able to
read the /etc/passwd.map file.
This file provides a mapping between Aegis UIDs (PPO files) and
Unix user IDs. If this file is unavailable (due to a node crash
or network problem) or is out of sync (not updated with /etc/crpasswd),
then it can cause the problems described above.
A Unix shell (bsh, csh), when started, takes a read lock on the
passwd.map file. If the node with the real "/etc" directory crashes,
the lock may be lost. If the passwd files (/etc/group, /etc/passwd,
& /etc/passwd.map) are then updated with crpasswd, the stream
to the old passwd.map file may be lost. This can happen easily
with users who "never" log out (i.e., who log in on Monday and leave
the same shells active all week).
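
The "lost stream" situation can be pictured with a generic POSIX
analogy. The sketch below is not Domain/IX code and does not claim to
show how crpasswd updates its files; it assumes only that a long-lived
process holds a file open while the file at that path is replaced by a
new copy. The helper name handle_is_stale and the temporary file are
hypothetical, made up for illustration.

```python
import os
import tempfile


def handle_is_stale(fileobj, path):
    """True if `path` no longer names the file that `fileobj` has open."""
    try:
        current = os.stat(path)
    except FileNotFoundError:
        return True  # the path is gone entirely
    held = os.fstat(fileobj.fileno())
    # Same device and inode means the handle and the path still agree.
    return (held.st_dev, held.st_ino) != (current.st_dev, current.st_ino)


def demo():
    """Simulate a long-lived reader and a regenerated map file (POSIX only)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "passwd.map")  # stand-in name, not the real file
        with open(path, "w") as f:
            f.write("old map\n")
        reader = open(path)  # plays the shell that "never" logs out
        try:
            before = handle_is_stale(reader, path)
            # Regenerate the file: write a new copy, then rename it into place.
            new_path = path + ".new"
            with open(new_path, "w") as f:
                f.write("new map\n")
            os.replace(new_path, path)
            after = handle_is_stale(reader, path)
        finally:
            reader.close()
        return before, after


if __name__ == "__main__":
    print(demo())
```

The reader's handle still points at the old copy after the rename, which
is the same kind of mismatch a week-old shell would have with a freshly
rebuilt map file.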
    To verify that this is your problem, do a "ps agu" in an
Aegis shell and see whether the user names show up in column 1.
The ps listing will have blank spaces in place of the user names
if ps can't "see" /etc/passwd.
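
The column-1 check can be automated over a saved listing. The sketch
below is a rough illustration in Python, assuming a BSD "u"-style
layout: a header line beginning with USER, and a left-justified user
name starting in the first character position, so a blank user name
appears as leading whitespace. The function name and the sample listing
are hypothetical; the real Domain/IX ps output may be laid out
differently.

```python
def lines_missing_user(ps_output):
    """Return process lines whose USER column (column 1) is blank.

    Assumption: the user name is left-justified in the first character
    position, so a missing name shows up as leading whitespace.
    """
    lines = ps_output.splitlines()
    # Skip everything up to and including the header line, if there is one.
    for i, line in enumerate(lines):
        if line.lstrip().startswith("USER"):
            body = lines[i + 1:]
            break
    else:
        body = lines  # no header found; treat every line as a process line
    return [ln for ln in body if ln.strip() and ln[0].isspace()]


# Hypothetical sample listing, made up for illustration only.
SAMPLE = """\
USER       PID %CPU %MEM   SZ  RSS TT STAT  TIME COMMAND
user1      101  0.0  1.2  120   64 p0 S     0:01 -csh
           102  0.0  1.1  118   60 p1 S     0:00 -csh
"""

if __name__ == "__main__":
    for line in lines_missing_user(SAMPLE):
        print("no user name:", line)
```

Any line it reports is a process whose owner ps could not resolve.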
    It is possible to have multiple copies of "/etc" on a
system to increase availability, but this increases
sys_admin overhead. Because the links pointing to "/etc" are static,
this still will not help active shells when a copy goes away.
Again, this is fixed at sr10.