geof@imagen.UUCP (Geoffrey Cooper) (08/18/88)
Since we went to SR9.7, we've had a recurring problem, but one that we have not been able to repeat on demand. Occasionally, one types a Unix command and either gets a "segmentation violation" or the Cshell abruptly terminates. Then the Cshell and all other Cshells active hang. New CSH's also hang, as do new BSH's. The Aegis shell still runs, and can be used to run at least some unix commands. A "ps ax" so derived indicates that the CSH's are in state "S".

Sometimes logging out and logging in helps. Usually it is necessary to shut down the node.

It seems to me that the problem is mildly related to use of Vt100 and TCP, but I have no hard evidence.

Has anyone else had this problem? Is there a fix?

- Geof
--
UUCP: {decwrl,sun}!imagen!geof
ARPA: imagen!geof@decwrl.dec.com
rees@CITI.UMICH.EDU (08/18/88)
> Since we went to SR9.7, we've had a recurring problem, but one that we
> have not been able to repeat on demand. Occasionally, one types a Unix
> command and either gets a "segmentation violation" or the Cshell
> abruptly terminates. Then the Cshell and all other Cshells active
> hang. New CSH's also hang, as do new BSH's. The Aegis shell still
> runs, and can be used to run at least some unix commands. A "ps ax"
> so derived indicates that the CSH's are in state "S".

This sounds like a corrupted acl cache. Get a traceback of one of the hung shells and see if it's in something that sounds like acl cache code.

To really fix this, you need to flush the cache (with /etc/flush_cache). I don't recommend doing this unless you're sure there is a problem.

Again, this is fixed in sr10 (which doesn't have an acl cache).
dbfunk@ICAEN.UIOWA.EDU (David B. Funk) (08/19/88)
> Since we went to SR9.7, we've had a recurring problem, but one that we
> have not been able to repeat on demand. Occasionally, one types a Unix
> command and either gets a "segmentation violation" or the Cshell
> abruptly terminates. Then the Cshell and all other Cshells active
> hang. New CSH's also hang, as do new BSH's. The Aegis shell still
> runs, and can be used to run at least some unix commands. A "ps ax"
> so derived indicates that the CSH's are in state "S".

This sounds like an "/etc/passwd.map" problem. Many Domain/IX utilities, including the shells, need to be able to read the /etc/passwd.map file. This file provides a mapping between Aegis UIDs (PPO files) and Unix user IDs. If this file is unavailable (due to a node crash or network problem) or is out of sync (not updated with /etc/crpasswd), it can cause the problems described above.

A Unix shell (bsh, csh) puts a read lock on the passwd.map file when it starts. If the node with the real "/etc" directory crashes, the lock may be lost. If the passwd files (/etc/groups, /etc/passwd, & /etc/passwd.map) are then updated with crpasswd, the stream to the old passwd.map file may be lost. This can happen easily with users who "never" log out (i.e., log in on Monday and leave the same shells active all week).

To verify that this is your problem, do a "ps agu" in an Aegis shell and see if the user names show up in column 1. The ps listing will have blank spaces for the user names if ps can't "see" /etc/passwd.

It is possible to have multiple copies of "/etc" on a system to increase availability, but this increases sys_admin overhead. Because the links pointing to "/etc" are static, this will still not help active shells when a copy goes away.

Again, this is fixed at sr10.
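
For anyone who wants to script the "ps agu" check above, here is a rough sketch in Python. It is only an illustration. It assumes the listing has a header row and that the user name is left-aligned in column 1, so a row that begins with whitespace is taken to have a blank user name. The function name rows_missing_user and the sample listing are invented for the example. The real Domain/IX "ps agu" layout may differ, so adjust the parsing to whatever your node actually prints.

```python
def rows_missing_user(ps_output):
    """Return the ps rows whose first (user name) column is blank.

    Assumption: the first line is a header and the user name is
    left-aligned in column 1, so a data row that starts with
    whitespace has no user name.
    """
    suspect = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue  # ignore empty lines
        if line[0].isspace():
            suspect.append(line)
    return suspect


# Hypothetical listing, made up for illustration only.
SAMPLE = (
    "USER       PID COMMAND\n"
    "user1      101 csh\n"
    "           102 csh\n"
    "           103 vt100\n"
)

if __name__ == "__main__":
    for row in rows_missing_user(SAMPLE):
        print("no user name:", row.strip())
```

If every row comes back with a blank user name, that points toward the passwd.map explanation. If the names are all present, the cause is more likely something else.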