ianb@ocf.Berkeley.EDU (Ian Barkley) (05/23/91)
Hello all. We have been having a problem recently with our tcpd: connections to the outside world have a bad habit of dying or freezing for several minutes at a time. When I looked at the process list on typhoon (our gateway) during one of these cutoffs, I noticed that tcpd's PRI was down at 8 or less. I tried running tcpd at lower nice numbers, down to -15, and while this seems to have sped up net connections in general, it hasn't stopped the not-so-occasional net freeze; a process listing on typhoon again reveals that tcpd's PRI is at 5 during these freezes.

One time I tried running tcpd's nice number up to -18, so that it couldn't get a priority less than 3 (that of most of the other processes on typhoon). It worked fine for a while, then BLAM: the priority went to 3 and tcpd started eating 100% of the CPU time. The disk was accessible (slowly), but I couldn't get on and finally had to white-button the machine.

These freezes seem to happen mostly under heavy loads. They last from a few seconds to several minutes. Even when tcpd is working, its PRI tends to bounce between 1 and 3, even with a nice of -15.

Typhoon is a 4500 and our only Ethernet gateway. The 16 other machines in the ring are 3500s; typhoon and 8 others have disks. We are running 10.3 in its BSD4.3-like mode on all machines.
Relevant listings follow. When tcpd is working:

[typhoon:~] 39} netstat -T
167/336 mbufs in use:
        28/112 ( 80-byte) mbufs used/allocated
        63/112 (1560-byte) mbufs used/allocated
        76/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (65% in use)
1757248 requests for memory denied
Name  Mtu   Network     Address       Ipkts     Ierrs  Opkts     Oerrs  Collis  Timer
lo0   9216  127.0.0     localhost     65        0      65        0      0       0
dr0   1268  128.32.234  typhoon.Berk  10405601  0      12147883  458    0       0
eth0  1500  128.32.184  ocf.Berkeley  11450577  0      9618549   11413  0       0
Host             Stat  Ref  Pct  Mapaddr
tornado.Berkele  U     0    0    00:01:fe:92:00:00
monsoon.Berkele  U     0    0    00:01:ff:33:00:00
cyclone.Berkele  U     0    0    00:02:01:36:00:00
whirlwind.Berke  U     0    0    00:02:01:32:00:00
hailstorm.Berke  U     0    0    00:01:ff:22:00:00
bigbang.Berkele  U     0    0    00:01:fe:97:00:00
headcrash.Berke  U     0    0    00:01:ed:17:00:00
tempest.Berkele  U     1    0    00:02:02:1d:00:00
maelstrom.Berke  U     1    0    00:01:f7:c3:00:00
sandstorm.Berke  U     0    0    00:01:f8:b5:00:00
earthquake.Berk  U     0    0    00:01:fb:ca:00:00
inr-1-ocf.Berke  UG    2    0    00:00:93:e0:20:60
avalanche.Berke  U     0    0    00:01:fc:0e:00:00
tsunami.Berkele  U     0    0    00:02:00:e7:00:00
locusts.Berkele  U     0    0    00:02:00:ef:00:00
plague.Berkeley  U     0    0    00:02:24:5b:00:00
lightning.Berke  U     0    0    00:02:01:42:00:00
Routing tables
Destination      Gateway          Flags  Hops  Ref  Use      Interface
<default>        inr-1-ocf.Berke  G      1     2    9389386  eth0
routing:
        0 bad routing redirects
        0 dynamically created routes
        0 new gateways due to redirects
        0 destinations found unreachable
        0 uses of a wildcard route
ip:
        191 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with header length < data size
        0 with data length < header length
icmp:
        0 calls to icmp_error
        0 errors not generated 'cuz old message too short
        0 errors not generated 'cuz old message was icmp
        Output histogram:
                echo reply: 853697
                destination unreachable: 16953
                echo: 23057
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        853697 message responses generated
        Input histogram:
                echo reply: 22981
                destination unreachable: 41
                source quench: 4
                echo: 853697
                time stamp reply: 22758
tcp:
        1 bad header checksum
        0 bad header offset fields
        0 incomplete headers
udp:
        0 bad header checksums
        0 incomplete headers
        0 bad data length fields
Active connections (including servers)
PCB       Proto  Recv-Q  Send-Q  Local Address      Foreign Address    Via Gateway   (state)
3b7315ac  ip       1044       0  *.1025             *.*                *
3b73ef30  tcp         0       0  ocf.Berkel.2895    violet.Ber.domain  inr-1-ocf.Be  ESTABLISHED
3b73d914  tcp         0       0  ocf.Berkel.2894    violet.Ber.domain  inr-1-ocf.Be  TIME_WAIT
3b740b90  tcp       276       0  ocf.Berkel.telnet  iris.eecs..2898    *             CLOSE_WAIT
3b7398ec  tcp         0       0  *.printer          *.*                *             LISTEN
3b743a20  tcp         0       0  typhoon.Be.1723    maelstrom..finger  maelstrom.Be  FIN_WAIT_2
3b735df8  tcp         0       0  typhoon.Be.1721    tempest.Be.finger  tempest.Berk  FIN_WAIT_2
3b73265c  tcp         0       0  *.smtp             *.*                *             LISTEN
3b730910  tcp         0       0  *.finger           *.*                *             LISTEN
3b7302cc  tcp         0       0  *.ftp              *.*                *             LISTEN
3b730110  tcp         0       0  *.exec             *.*                *             LISTEN
3b727f54  tcp         0       0  *.shell            *.*                *             LISTEN
3b727d98  tcp         0       0  *.login            *.*                *             LISTEN
3b727bc0  tcp         0       0  *.telnet           *.*                *             LISTEN
3b73ee08  udp         0       0  ocf.Berkel.2893    violet.Ber.domain  *
3b731d14  udp       609       0  *.syslog           *.*                *
3b7311b4  udp         0       0  *.1024             *.*                *
3b731040  udp         0       0  *.ncs              *.*                *
3b730ee4  udp         0       0  *.timed            *.*                *
3b730dbc  udp         0       0  *.who              *.*                *
3b730a08  udp         0       0  *.route            *.*                *
3b730564  udp         0       0  *.biff             *.*                *
3b730420  udp         0       0  *.ntalk            *.*                *

[typhoon:~] 40} ps alx
      F   UID    PID   PPID  CP  PRI   NI  ADDR    SZ   RSS  STAT  TTY    TIME     COMMAND
1008001     0      1      1   0    1  -20     1   992    13  S <   ?      0:19     /etc/init
      3     0      2      0   0   17    0     0     0     0  R     ?      3626:15  null
      3     0      3      0   0    1    0     0     0     0  S     ?      0:28     purifier
      3     0      4      0   0    1    0     0     0     0  S     ?      0:04     purifier
      3     0      5      0   0    1    0     0     0     0  S     ?      17:47    unwired_dxm
      3     0      6      0   0    1    0     0     0     0  S     ?      0:00     pinger
      3     0      7      0   0    1    0     0     0     0  S     ?      52:29    netreceive
      3     0      8      0   0    1    0     0     0     0  S     ?      59:25    netpaging
      3     0      9      0   0    1    0     0     0     0  S     ?      2:00     wired_dxm
      3     0     10      0   0    1    0     0     0     0  S     ?      84:43    netrequest
      3     0     15      0   0    1    0     0     0     0  S     ?      0:05     netreceive
1008001     0   1046      1   0    1  -15     5  1888    93  S <   ?      875:34   /etc/tcpd
1008001     0   1055      1   0    3    0     6  1920    58  S     ?      0:31     /etc/inetd
1008001     0   1061      1   0    1  -10     8  1920    55  S <   ?      2:39     /etc/routed -f
1008001     0   1065      1   0    4    0     9  1952   180  S     ?      6:06     /etc/rwhod
1008001     0   1068      1   0    5    0     a  2016   361  S     ?      9:17     /etc/timed -M
1008001     0   1071      1   0    3    0     4  2080    37  S     ?      0:02     /etc/ncs/llbd
1008001     0   1074      1   0    1   -6     b  6624  1337  S <   ?      276:02   /etc/rgyd
1008001     0   1080      1   0    3    0     e   608   130  S     ?      2:31     /etc/cron
1008001     0   1092      1   0    3    0     3  2080    27  S     ?      0:02     /usr/lib/sendmail -bd
1008001    14   1095      1   0    4    0     c   768   224  S     ?      0:17     /sys/spm/spm
1008001     0   1097      1   0    3    0     d   704   599  S     ?      0:15     /sys/net/netman
1008001    14   1099      1   0    3    0    11  2208    29  S     ?      0:31     /sys/ns/ns_helper
1008001    14   1102      1   0    3    0    13   736     9  S     ?      0:00     /etc/writed
1008001    14   1106      1   0    3    0    14   928    62  S     ?      0:12     /sys/mbx/mbx_helper
1008001     0   2124      1   0    3    0    15  2048    73  S     ?      0:00     /usr/lib/lpd
1008001     0  29632      1   0    3    0    22  2272   272  S     ?      1:43     /etc/syslogd
1008001  1132   2542   1095   0    4    0     2  1216   297  S     crp00  0:17     /bin/tcsh
1008001  1132   2596   2542   0    3    0     7  1152   200  R     crp00  0:00     ps alx
1008001  1132   2592   1095   0    3    0    12  1184   303  S     crp01  0:02     /bin/tcsh
1008001  5821   2195  28550   0    3    0     f   672    39  S     ttyp1  0:00     sz tape-2.29
1008001     0   2196   1055   0    3    0    10  2048    61  S     ttyp1  0:00     telnetd
1008001  5821  28550      1   0    3    0    20   928   145  S     ttyp1  0:05     -csh

When tcpd is failing:

[typhoon:~] 41} ps alx
      F   UID    PID   PPID  CP  PRI   NI  ADDR    SZ   RSS  STAT  TTY    TIME     COMMAND
1008001     0      1      1   0    1  -20     1   992    13  S <   ?      0:19     /etc/init
      3     0      2      0   0   17    0     0     0     0  R     ?      3627:27  null
      3     0      3      0   0    1    0     0     0     0  S     ?      0:28     purifier
      3     0      4      0   0    2    0     0     0     0  S     ?      0:04     purifier
      3     0      5      0   0    1    0     0     0     0  S     ?      17:47    unwired_dxm
      3     0      6      0   0    1    0     0     0     0  S     ?      0:00     pinger
      3     0      7      0   0    1    0     0     0     0  S     ?      52:29    netreceive
      3     0      8      0   0    1    0     0     0     0  S     ?      59:26    netpaging
      3     0      9      0   0    1    0     0     0     0  S     ?      2:00     wired_dxm
      3     0     10      0   0    1    0     0     0     0  S     ?      84:45    netrequest
      3     0     15      0   0    1    0     0     0     0  S     ?      0:05     netreceive
1008001     0   1046      1   0    5  -15     5  1888    71  R <   ?      876:30   /etc/tcpd
1008001     0   1055      1   0    3    0     6  1920    60  S     ?      0:31     /etc/inetd
1008001     0   1061      1   0    1  -10     8  1920    55  S <   ?      2:39     /etc/routed -f
1008001     0   1065      1   0    3    0     9  1952   179  S     ?      6:06     /etc/rwhod
1008001     0   1068      1   0    5    0     a  2016   354  S     ?      9:17     /etc/timed -M
1008001     0   1071      1   0    3    0     4  2080    37  S     ?      0:02     /etc/ncs/llbd
1008001     0   1074      1   0    2   -6     b  6624  1403  S <   ?      276:03   /etc/rgyd
1008001     0   1080      1   0    3    0     e   608   129  S     ?      2:31     /etc/cron
1008001     0   1092      1   0    3    0     3  2080    27  S     ?      0:02     /usr/lib/sendmail -bd
1008001    14   1095      1   0    3    0     c   768   221  S     ?      0:17     /sys/spm/spm
1008001     0   1097      1   0    3    0     d   704   599  S     ?      0:15     /sys/net/netman
1008001    14   1099      1   0    3    0    11  2208    28  S     ?      0:31     /sys/ns/ns_helper
1008001    14   1102      1   0    3    0    13   736     9  S     ?      0:00     /etc/writed
1008001    14   1106      1   0    3    0    14   928    62  S     ?      0:12     /sys/mbx/mbx_helper
1008001     0   2124      1   0    3    0    15  2048    73  S     ?      0:00     /usr/lib/lpd
1008001     0  29632      1   0    3    0    22  2272   261  S     ?      1:43     /etc/syslogd
1008001  1132   2542   1095   0    3    0     2  1216   306  S     crp00  0:17     /bin/tcsh
1008001  1132   2604   2542   0    3    0     7  1152   215  R     crp00  0:00     ps alx
1008001  5821   2195  28550   0    3    0     f   672    39  S     ttyp1  0:00     sz tape-2.29
1008001     0   2196   1055   0    3    0    10  2048    61  S     ttyp1  0:00     telnetd
1008001  5821  28550      1   0    3    0    20   928   145  S     ttyp1  0:05     -csh

[typhoon:~] 42} netstat -nT
335/336 mbufs in use:
        111/112 ( 80-byte) mbufs used/allocated
        112/112 (1560-byte) mbufs used/allocated
        112/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (99% in use)
1792761 requests for memory denied
Name  Mtu   Network     Address       Ipkts     Ierrs  Opkts     Oerrs  Collis  Timer
lo0   9216  127.0.0     127.0.0.1     65        0      65        0      0       0
dr0   1268  128.32.234  128.32.234.1  10407469  0      12150184  458    0       0
eth0  1500  128.32.184  128.32.184.2  11452816  0      9620314   11418  0       0
Host           Stat  Ref  Pct  Mapaddr
128.32.234.15  U     0    0    00:01:fe:92:00:00
128.32.234.13  U     0    0    00:01:ff:33:00:00
128.32.234.26  U     0    0    00:02:01:36:00:00
128.32.234.27  U     0    0    00:02:01:32:00:00
128.32.234.24  U     0    0    00:01:ff:22:00:00
128.32.234.30  U     0    0    00:01:fe:97:00:00
128.32.234.31  U     0    0    00:01:ed:17:00:00
128.32.234.28  U     1    0    00:02:02:1d:00:00
128.32.234.29  U     1    0    00:01:f7:c3:00:00
128.32.234.18  U     0    0    00:01:f8:b5:00:00
128.32.234.16  U     0    0    00:01:fb:ca:00:00
128.32.184.1   UG    0    2    00:00:93:e0:20:60
128.32.234.22  U     0    0    00:01:fc:0e:00:00
128.32.234.23  U     0    0    00:02:00:e7:00:00
128.32.234.20  U     0    0    00:02:00:ef:00:00
128.32.234.21  U     0    0    00:02:24:5b:00:00
128.32.234.10  U     0    0    00:02:01:42:00:00
Routing tables
Destination  Gateway       Flags  Hops  Ref  Use      Interface
0.0.0        128.32.184.1  G      1     0    9391104  eth0
routing:
        0 bad routing redirects
        0 dynamically created routes
        0 new gateways due to redirects
        0 destinations found unreachable
        0 uses of a wildcard route
ip:
        191 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with header length < data size
        0 with data length < header length
icmp:
        0 calls to icmp_error
        0 errors not generated 'cuz old message too short
        0 errors not generated 'cuz old message was icmp
        Output histogram:
                echo reply: 853896
                destination unreachable: 16956
                echo: 23074
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        853896 message responses generated
        Input histogram:
                echo reply: 22998
                destination unreachable: 41
                source quench: 4
                echo: 853896
                time stamp reply: 22758
tcp:
        1 bad header checksum
        0 bad header offset fields
        0 incomplete headers
udp:
        0 bad header checksums
        0 incomplete headers
        0 bad data length fields
Active connections (including servers)
PCB       Proto  Recv-Q  Send-Q  Local Address        Foreign Address     Via Gateway    (state)
3b7315ac  ip       7884       0  *.1025               *.*                 *
3b740b90  tcp       276       0  128.32.184.254.23    128.120.57.20.2898  *              CLOSE_WAIT
3b7398ec  tcp         0       0  *.515                *.*                 *              LISTEN
3b743a20  tcp         0       0  128.32.234.14.1723   128.32.234.29.79    128.32.234.29  FIN_WAIT_2
3b735df8  tcp         0       0  128.32.234.14.1721   128.32.234.28.79    128.32.234.28  FIN_WAIT_2
3b73265c  tcp         0       0  *.25                 *.*                 *              LISTEN
3b730910  tcp         0       0  *.79                 *.*                 *              LISTEN
3b7302cc  tcp         0       0  *.21                 *.*                 *              LISTEN
3b730110  tcp         0       0  *.512                *.*                 *              LISTEN
3b727f54  tcp         0       0  *.514                *.*                 *              LISTEN
3b727d98  tcp         0       0  *.513                *.*                 *              LISTEN
3b727bc0  tcp         0       0  *.23                 *.*                 *              LISTEN
3b73ee08  udp         0       0  128.32.184.254.2893  128.32.136.22.53    *
3b731d14  udp      1507       0  *.514                *.*                 *
3b7311b4  udp         0       0  *.1024               *.*                 *
3b731040  udp         0       0  *.135                *.*                 *
3b730ee4  udp         0       0  *.525                *.*                 *
3b730dbc  udp         0       0  *.513                *.*                 *

(netstat freezes after printing that line)

Thank you for your time.

-Ian Barkley
OCF staff
ianb@ocf.berkeley.edu
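[A small script can keep an eye on listings like the ones above. The sketch below is illustrative only: it parses a sample string in the `netstat -T` / `netstat -m` format shown above (copied from the failing-state numbers) rather than live command output, and the 90% alarm threshold is an arbitrary choice, not anything the system defines.]

```shell
#!/bin/sh
# Watchdog sketch for mbuf pool exhaustion. The sample below copies the
# failing-state numbers from the listing above; in real use you would
# pipe live `netstat -T` (or `netstat -m`) output in instead.
sample='335/336 mbufs in use:
 111/112 ( 80-byte) mbufs used/allocated
 112/112 (1560-byte) mbufs used/allocated
 112/112 (9216-byte) mbufs used/allocated'

echo "$sample" | awk '
  /used\/allocated/ {
    split($1, a, "/")              # a[1] = used, a[2] = allocated
    pct = 100 * a[1] / a[2]
    if (pct >= 90)                 # 90% is an arbitrary alarm threshold
      printf "WARNING: pool %s is %d%% full\n", $1, pct
  }'
```

[Run from cron against real output, a warning here should coincide with the "requests for memory denied" counter starting to climb.]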
dbfunk@ICAEN.UIOWA.EDU (David B Funk) (05/23/91)
In posting <IANB.91May22203326@maelstrom.ocf.Berkeley.EDU>, Ian Barkley writes:
>Hello all. We have been having a problem recently with our tcpd. Recently, the
>connections to the outside world have been having a bad habit of dying or
>freezing for several minutes at a time. When I looked at the process list of
 [stuff deleted]
>One time I tried running tcpd's nice number up to -18, so that it couldn't
>get a priority less than 3 (that of most of the other processes on typhoon).
>It worked fine for a while, then BLAM, the priority went to 3 and tcpd started
>eating 100% of the CPU time; the disk was accessible (slowly), but I couldn't
>get on and finally had to white-button the machine.
>
>These freezes seem to happen mostly under heavy loads. They last from a few
>seconds to several minutes. Even when tcpd is working, its PRI tends to
>bounce between 1 and 3, even with a nice of -15.

Yes, we have seen this exact same problem. It is caused by packet buffer
exhaustion (those little buffers work real hard ;). The clue is the output
from a "netstat -m" (or the memory part of the -T output). Referring back to
the output listing from your posting:

> 335/336 mbufs in use:
>     111/112 ( 80-byte) mbufs used/allocated
>     112/112 (1560-byte) mbufs used/allocated
>     112/112 (9216-byte) mbufs used/allocated
> 1187 Kbytes allocated to network (99% in use)
> 1792761 requests for memory denied

Note the "99% in use" statistic. All the packet buffers are being used by
somebody, so when a packet comes in from the net it gets dropped on the floor,
as there is no place to put it. When you reach this condition, the value of
the "requests for memory denied" statistic will start going up like crazy.

When this condition is reached, the tcp daemon (/etc/tcpd) goes off into a
routine that tries to expand the memory buffer pool. (If you do a "tb"
traceback on the running tcpd, you can see this; it should be in
"expand_mbufpool".) For some reason this routine goes into a spin loop and
eats up lots of CPU time without doing much. This is the cause of your node
lockup when you jacked up the tcpd priority. (Don't bother to jack the
priority; for some reason the routine never seems to do anything but eat up
CPU time.)

The only cure that I've found for this condition is to find the offending
processes that are holding the buffers and kill them. Once the processes die,
their buffers are released and things usually get back to normal. (In severe
cases, reboot, or kill all tcp-related stuff and restart tcpd from scratch.)
The "Active connections" list from netstat can provide a clue as to who the
real culprit is. Look for connections that have a large Recv-Q or Send-Q
value. (The values seem to saturate at 9100; above 7500 is possible trouble.)
Then comes the fun of trying to match the process to the netstat connection.

Some of the problem processes that I have seen are things like:

  A download via a program such as "sz", when the process at the far end dies
  or becomes unreachable due to network problems (causes a large Send-Q).

  A telnet session where the user is running a program producing lots of
  output and then the remote user gives his telnet program a stop fault (^Z)
  (Send-Q).

  Incoming ftp data, where the disk is full (large Recv-Q).

If it is just occasional peaks in the network load, you may just have to live
with it (or reduce the process load on your gateway). It is probably being
caused by a process executing on that node. Pure gateway operations (i.e.
traffic coming from one remote node and going to some other node) don't seem
to eat up lots of buffers. All the cases that I've seen were due to some
user's program executing on the afflicted node.

We have one machine that is our primary gateway/mail-server/name-server, which
handles lots of traffic, but it doesn't have any users on it. This machine
handles millions of packets a day with no "requests for memory denied" errors.
We have another machine that hosts a popular "MUD" game, which can generate a
million "requests for memory denied" errors in a few hours when it has a heavy
load (30+ active users).

Good luck.
Dave Funk
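[The Recv-Q/Send-Q hunt described above can also be scripted. A hedged sketch: the rows below are made-up sample data laid out in the "Active connections" column order shown earlier (PCB, Proto, Recv-Q, Send-Q, addresses), and the 7500 cutoff is simply the trouble figure quoted in the post.]

```shell
#!/bin/sh
# Flag sockets whose Recv-Q or Send-Q is above the ~7500 trouble level.
# The rows are sample data in the column order of the "Active
# connections" listing (PCB Proto Recv-Q Send-Q Local Foreign ...).
conns='3b7315ac ip   7884 0 *.1025             *.*                *
3b740b90 tcp   276 0 128.32.184.254.23  128.120.57.20.2898 * CLOSE_WAIT
3b731d14 udp  1507 0 *.514              *.*                *'

echo "$conns" | awk '$3 + 0 > 7500 || $4 + 0 > 7500 {
  # "+ 0" forces a numeric comparison on the queue columns
  printf "suspect socket %s (%s): Recv-Q=%s Send-Q=%s\n", $1, $2, $3, $4
}'
```

[Against live output you would skip the header lines and then go match the flagged PCB to a process, which, as noted, is the fun part.]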
neilb@spectrum.cs.unsw.oz.au (Neil Brown) (05/24/91)
We too have had a number of problems with tcpd spinning. It invariably
involves the mbufs being all used up, as in:

335/336 mbufs in use:
    111/112 ( 80-byte) mbufs used/allocated
    112/112 (1560-byte) mbufs used/allocated
    112/112 (9216-byte) mbufs used/allocated

As David Funk mentioned, this can happen when some process has a socket open
but is not processing incoming packets, in which case killing the process
should free things up. When it happens here, though, the reason the process
isn't processing incoming packets is that it has exited! For some reason this
doesn't always close the socket properly, so we have the situation where tcpd
thinks packets for some particular port are wanted, but no process actually
wants them. Result: packet buildup and loss of mbufs. Cure: reboot.

We have told the Australasian Response Center and they have been looking into
the problem. The latest patch tape (May, I think) has a new version of tcpd,
and installing it has reduced the problem significantly, but it hasn't gone
away completely. We await further developments.

Neil Brown
neilb@cs.unsw.oz.au
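[The orphaned-socket state described above could in principle be detected by comparing the ports netstat reports as accumulating data against the ports live processes actually own. A toy sketch of that comparison follows; both port lists are made-up sample data, since how you enumerate a process's sockets on this system is exactly the part the report leaves open.]

```shell
#!/bin/sh
# Toy orphan check: ports with data queued (per netstat) that no live
# process owns. Both lists are made-up sample data for illustration.
queued_ports="1025 514 23"
owned_ports=" 23 514 "          # space-delimited, padded for matching

for p in $queued_ports; do
  case "$owned_ports" in
    *" $p "*) ;;                # a live process owns this port
    *) echo "orphaned port $p: data queued but no owner" ;;
  esac
done
```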
system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (05/25/91)
In article <NEILB.91May24130705@chocolate.spectrum.cs.unsw.oz.au>
neilb@spectrum.cs.unsw.oz.au (Neil Brown) writes:
>
>We too have had a number of problems with tcpd spinning.
>It invariably involves the mbufs being all used up, as in:
>
>335/336 mbufs in use:
>    111/112 ( 80-byte) mbufs used/allocated
>    112/112 (1560-byte) mbufs used/allocated
>    112/112 (9216-byte) mbufs used/allocated

Our DN10000 has always shown

162/336 mbufs in use:
    112/112 ( 80-byte) mbufs used/allocated
     50/112 (1560-byte) mbufs used/allocated
      0/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (7% in use)
0 requests for memory denied

or even

257/336 mbufs in use:
    112/112 ( 80-byte) mbufs used/allocated
    111/112 (1560-byte) mbufs used/allocated
     34/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (40% in use)
0 requests for memory denied

for the last 2.5 years, which may go part way towards explaining its TCP
hesitations/delays/hangs (the 0 requests for memory denied is probably because
it has just been rebooted after the latest hang). We regularly have 16
telnetd/rlogind sessions active at once, plus X Windows, BSD printing, NFS,
rwhod, named, sendmail, NNTP (news), and the Cray satellite station.

We are also getting approximately 5 VME Ethernet card resets a day (we are
running SR10.3.0.6, which has a patch to kick the VME Enet card when it
hangs), which seems to cause TCP sessions to be dropped at random, especially
rlogins from other Apollos.

Mike.
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094   Fax: (416) 978-8775