enteles@tahoe.unr.edu (Philip Enteles) (10/30/90)
We are having some trouble at our site and I would like to know if anyone has had a similar problem.

We are running 4.3 BSD on a Sperry 7000-40. This system is the backbone of our undergraduate computer science department and has about 500 users. The system itself seems to run fine; however, left to its own devices it accumulates a number of rlogind processes that become hung. We are not sure where these processes are coming from, but they only affect ptys. They stay hung until they are killed. If they are not killed, at some point they hold all the ptys, and error messages start appearing when users try to do things like use script or log in from a dial-up port (like a sytec box). The error message is

    'no more pty's'

When we reboot or manually kill those processes everything is fine. Checking the device table in /dev, the pty nodes that are hung have their permissions unset. The following is a partial listing:

crw-rw-rw-  1 root     wheel      9,   0 Oct 26 10:59 /dev/ttyp0
c---------  1 root     wheel      9,   1 Oct 22 15:45 /dev/ttyp1
c---------  1 root     wheel      9,   2 Oct 22 11:43 /dev/ttyp2
c---------  1 root     wheel      9,   3 Oct 20 14:15 /dev/ttyp3
c---------  1 root     wheel      9,   4 Oct 19 00:20 /dev/ttyp4
c---------  1 root     wheel      9,   5 Oct 19 00:25 /dev/ttyp5
c---------  1 root     wheel      9,   6 Oct 23 08:06 /dev/ttyp6
c---------  1 root     wheel      9,   7 Oct 25 21:44 /dev/ttyp7
crw--w----  1 fran     tty        9,   8 Oct 26 12:11 /dev/ttyp8
crw--w----  1 ed       tty        9,   9 Oct 26 11:27 /dev/ttyp9
c---------  1 root     wheel      9,  10 Oct 23 09:42 /dev/ttypa
crw--w----  1 garav    tty        9,  11 Oct 26 12:11 /dev/ttypb
c---------  1 root     wheel      9,  12 Oct 22 18:40 /dev/ttypc
crw--w----  1 melanie  tty        9,  13 Oct 26 12:05 /dev/ttypd
c---------  1 root     wheel      9,  14 Oct 24 15:40 /dev/ttype
crw-rw-rw-  1 root     wheel      9,  15 Oct 26 12:10 /dev/ttypf
c---------  1 root     wheel      9,  16 Oct 23 10:55 /dev/ttyq0
crw-rw-rw-  1 root     wheel      9,  17 Oct 26 12:11 /dev/ttyq1
crw-rw-rw-  1 root     wheel      9,  18 Oct 26 12:10 /dev/ttyq2
crw-rw-rw-  1 root     wheel      9,  19 Oct 26 12:09 /dev/ttyq3
crw-rw-rw-  1 root     wheel      9,  20 Oct 26 12:09 /dev/ttyq4
crw--w----  1 hong     tty        9,  21 Oct 26 12:11 /dev/ttyq5
crw--w----  1 cheng    tty        9,  22 Oct 26 12:11 /dev/ttyq6
crw--w----  1 woods    tty        9,  23 Oct 26 12:11 /dev/ttyq7
crw-rw-rw-  1 root     wheel      9,  24 Oct 26 12:03 /dev/ttyq8

The lines owned by root with mode crw-rw-rw- are not being used, the lines with user names on them are in use, and the lines owned by root with no permissions set are the hung ptys. An example of the process status follows:

root     11576  0.0  0.0   47    3 p5 IW    0:00 rlogind
root      5505  0.0  0.0   48    3 p2 IW    0:00 (rlogind)
root     26592  0.0  0.0   48    3 q0 IW    0:00 (rlogind)
root     22903  0.0  0.0   48    3 pa IW    0:00 (rlogind)
root     17988  0.0  0.0   48    3 p3 IW    0:00 (rlogind)
root     20143  0.0  0.0   48    3 p6 IW    0:00 (rlogind)
root     21751  0.0  0.0   48    3 pc IW    0:00 (rlogind)
root     11498  0.0  0.0   47    3 p4 IW    0:00 rlogind
Fri Oct 26 10:47:36 PDT 1990

The processes are idle and waiting, but I don't know what they are waiting for. They aren't taking any resources except the use of a pty. As long as the system is rebooted regularly they don't present a problem, but I would like to know what is causing them and how to fix it so that the system can be allowed to run for extended periods with minimal maintenance.

Please reply by e-mail and I will summarize for the net. I would like to hear from anyone who has a clue about this.

thanks

Philip Enteles
enteles@tahoe.unr.edu
weimer@ssd.kodak.com (Gary Weimer) (11/08/90)
In article <4852@tahoe.unr.edu> enteles@tahoe.unr.edu (Philip Enteles) writes:
>
> We are having some trouble at our site and I would like to know
>if anyone has had a similar problem.
> We are running 4.3 BSD on a Sperry 7000-40. This system is the
>backbone of our undergraduate computer science department and has about
>500 users. The system itself seems to run fine however left to its own
>devices it accumulates a number of rlogind processes that become hung.
>We are not sure where these processes are coming from but they only
>effect ptys. These stay hung until they are killed. If they are not
>killed at some point they begin to hold all the ptys and error messages
>start appearing when users try to do things like use script or login in
>from a dial-up port(like a sytec box). The error message is
> 'no more pty's'
>When we reboot or manually kill those process everything is fine.

This doesn't help you with the PROBLEM, but it may function as a temporary fix. It is a shell script that will automatically find and kill these errant rlogind processes (based on the info you supplied).

Here is the main script, to be called by cron (or manually). Note that the program defaults to a test mode (see 'set TEST'); move the comment marker to actually perform the kill.

------------------------- CUT HERE -------------------------
#!/bin/csh -fb
#
# FILE: fxlogin
# DESC: fix errant rlogind processes
#

# find path and name for this program
set PROGRAM = $0
set PATH = $PROGRAM:h
set PROGRAM = $PROGRAM:t
if ("$PATH" == "$PROGRAM") set PATH=$cwd

# assume PARSER (used for awk) has same name as this prog with .parse ext
set PARSER = $PATH/$PROGRAM.parse

set SIG = -9            # signal for killing processes
set TMP = "/tmp/tmp$$"  # tmp file

set TEST = 1            # test run of program, don't kill processes
#set TEST = 0           # the real thing, DO kill processes

# find which terminals have errant rlogind's and put the list in LIST
# (the last two characters of the device name, e.g. "p5", to match the
# tty column printed by ps; widen the glob, e.g. /dev/tty[pq]?, if hung
# ptys also show up beyond ttyp*)
set LIST = `ls -l /dev/ttyp? | awk '{if ($3 == "root" && $1 == "c---------") print substr($10,length($10)-1)}'`
# in case your mailer truncates here is a copy of above line
# set LIST = `ls -l /dev/ttyp? | awk '{if ($3 == "root" && $1 == "c---------")
#       print substr($10,length($10)-1)}'`

# if no processes to kill, then exit
if ("$LIST" == "") exit

# find all rlogind processes and put them in TMP in '<tty> <pid>' format,
# sorted by <tty>
ps -aux | grep "rlogind" | grep -v "grep" | awk '{print $7 " " $2}' | sort >$TMP

# store pid's of errant rlogind's in PIDS
set PIDS = `(echo $LIST; cat $TMP) | awk -f $PARSER`
rm $TMP

if ("$TEST") then
    echo "In test mode. Would have killed pids:"
    echo " $PIDS"
else
    kill $SIG $PIDS
endif
------------------------- CUT HERE -------------------------

and here is the parser used by the above program:

------------------------- CUT HERE -------------------------
#
# FILE: fxlogin.parse
# DESC: awk script used by fxlogin
#
# first line of input (nt == 0) is sorted list of tty's to kill processes on
# remaining lines have <tty> <pid>, lines sorted in tty order
#
# nt    = number of tty's to kill processes of
# ct    = current tty to kill processes of
# tty[] = array of tty's to kill processes of
#
# NOTE: does not assume one process per tty

# get tty's to kill processes on
{if (nt == 0) {
    if (NF == 0) exit;
    for (nt=1; nt <= NF; nt++) {
        tty[nt] = $nt
    }
    nt--; ct = 1; next
 }
}

# tty[ct] has no more processes to kill; skip ahead to the first
# candidate tty that is not sorted before this line's tty
{while (ct <= nt && tty[ct] < $1) ct++ }

# found a process for tty[ct]
{if (ct <= nt && tty[ct] == $1) {
    print $2; next
 }
}

# this tty not in kill list
------------------------- CUT HERE -------------------------

I hope you've found the problem and don't need this.

Gary
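For anyone who wants to check the matching logic before pointing it at live ptys, here is a minimal sketch of the same idea as a single awk pass over saved output. It is not Gary's script: the function name hung_rlogind_pids and the two file arguments are hypothetical, and it assumes the column layout shown in the original post (tty device name in the last field of ls -l, tty in field 7 and command in field 10 of ps -aux). It only prints PIDs; it never calls kill.

```shell
#!/bin/sh
# Sketch only: print PIDs of rlogind processes sitting on hung ptys.
# Hypothetical usage: hung_rlogind_pids LS_FILE PS_FILE
#   LS_FILE  saved output of: ls -l /dev/tty[pq]?
#   PS_FILE  saved output of: ps -aux
# Assumes the ls and ps column layouts shown earlier in this thread.
hung_rlogind_pids() {
    awk '
        # First file: remember ttys owned by root with mode c---------,
        # keyed by the last two characters of the device name (e.g. "p5").
        NR == FNR {
            if ($3 == "root" && $1 == "c---------")
                hung[substr($NF, length($NF) - 1)] = 1
            next
        }
        # Second file: print the PID of each rlogind on a remembered tty.
        $10 ~ /rlogind/ && ($7 in hung) { print $2 }
    ' "$1" "$2"
}
```

Run it against captured output first, compare the PIDs with what ps shows, and only then decide whether to hand the list to kill. Because the hung ttys are held in an awk array, neither input needs to be sorted.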