ianb@ocf.Berkeley.EDU (Ian Barkley) (05/23/91)
Hello all. We have been having a problem recently with our tcpd: the
connections to the outside world have developed a bad habit of dying or
freezing for several minutes at a time. When I looked at the process list of
typhoon (our gateway) during one of these cutoffs, I noted that tcpd's
PRI was down at 8 or less. I tried running tcpd at lower nice numbers,
down to -15, and while this seems to have sped up net connections in
general, it hasn't stopped the not-so-occasional net freeze; and a process
listing of typhoon again reveals that tcpd's PRI is at 5 during these freezes.
One time I tried taking tcpd's nice number down to -18, so that it couldn't
get a priority less than 3 (that of most of the other processes on typhoon).
It worked fine for a while, then BLAM, the priority went to 3 and tcpd started
eating 100% of the CPU time; the disk was accessible (slowly), but I couldn't
get on and finally had to white-button the machine.
These freezes seem to happen mostly under heavy loads. They last from a few
seconds to several minutes. Even when tcpd is working, its PRI tends to
bounce between 1 and 3, even with a nice of -15.
Typhoon is a 4500, and our only ethernet gateway. The 16 other machines in
the ring are 3500's; typhoon and 8 others have disks. We are running 10.3
in its BSD4.3-like mode on all machines.
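(For the record, the priority experiments above boil down to finding tcpd's
PID and renicing it. The sketch below parses a captured "ps alx" line instead
of live ps output so the extraction can be checked; the availability of a
BSD-style renice in the 4.3-like environment is an assumption.)

```shell
#!/bin/sh
# Sketch: pull tcpd's PID out of "ps alx" output, then renice it.
# The sample line below is captured from the listing in this posting;
# on a live node, replace it with the real `ps alx` output.
sample='1008001     0  1046     1  0  1-15     5  1888    93 S <     ?   875:34 /etc/tcpd'
# PID is the third column in this ps format; match on the command field.
pid=$(printf '%s\n' "$sample" | awk '$NF == "/etc/tcpd" {print $3}')
echo "tcpd pid: $pid"
# On the live system (as root):  renice -15 -p "$pid"
```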
Relevant listing:
When tcpd is working:
[typhoon:~] 39} netstat -T
167/336 mbufs in use:
28/112 ( 80-byte) mbufs used/allocated
63/112 (1560-byte) mbufs used/allocated
76/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (65% in use)
1757248 requests for memory denied
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Collis Timer
lo0 9216 127.0.0 localhost 65 0 65 0 0 0
dr0 1268 128.32.234 typhoon.Berk 10405601 0 12147883 458 0 0
eth0 1500 128.32.184 ocf.Berkeley 11450577 0 9618549 11413 0 0
Host Stat Ref Pct Mapaddr
tornado.Berkele U 0 0 00:01:fe:92:00:00
monsoon.Berkele U 0 0 00:01:ff:33:00:00
cyclone.Berkele U 0 0 00:02:01:36:00:00
whirlwind.Berke U 0 0 00:02:01:32:00:00
hailstorm.Berke U 0 0 00:01:ff:22:00:00
bigbang.Berkele U 0 0 00:01:fe:97:00:00
headcrash.Berke U 0 0 00:01:ed:17:00:00
tempest.Berkele U 1 0 00:02:02:1d:00:00
maelstrom.Berke U 1 0 00:01:f7:c3:00:00
sandstorm.Berke U 0 0 00:01:f8:b5:00:00
earthquake.Berk U 0 0 00:01:fb:ca:00:00
inr-1-ocf.Berke UG 2 0 00:00:93:e0:20:60
avalanche.Berke U 0 0 00:01:fc:0e:00:00
tsunami.Berkele U 0 0 00:02:00:e7:00:00
locusts.Berkele U 0 0 00:02:00:ef:00:00
plague.Berkeley U 0 0 00:02:24:5b:00:00
lightning.Berke U 0 0 00:02:01:42:00:00
Routing tables
Destination Gateway Flags Hops Ref Use Interface
<default> inr-1-ocf.Berke G 1 2 9389386 eth0
routing:
0 bad routing redirects
0 dynamically created routes
0 new gateways due to redirects
0 destinations found unreachable
0 uses of a wildcard route
ip:
191 bad header checksums
0 with size smaller than minimum
0 with data size < data length
0 with header length < data size
0 with data length < header length
icmp:
0 calls to icmp_error
0 errors not generated 'cuz old message too short
0 errors not generated 'cuz old message was icmp
Output histogram:
echo reply: 853697
destination unreachable: 16953
echo: 23057
0 messages with bad code fields
0 messages < minimum length
0 bad checksums
0 messages with bad length
853697 message responses generated
Input histogram:
echo reply: 22981
destination unreachable: 41
source quench: 4
echo: 853697
time stamp reply: 22758
tcp:
1 bad header checksum
0 bad header offset fields
0 incomplete headers
udp:
0 bad header checksums
0 incomplete headers
0 bad data length fields
Active connections (including servers)
PCB Proto Recv-Q Send-Q Local Address Foreign Address Via Gateway
(state)
3b7315ac ip 1044 0 *.1025 *.* *
3b73ef30 tcp 0 0 ocf.Berkel.2895 violet.Ber.domain inr-1-ocf.Be
ESTABLISHED
3b73d914 tcp 0 0 ocf.Berkel.2894 violet.Ber.domain inr-1-ocf.Be
TIME_WAIT
3b740b90 tcp 276 0 ocf.Berkel.telnet iris.eecs..2898 *
CLOSE_WAIT
3b7398ec tcp 0 0 *.printer *.* *
LISTEN
3b743a20 tcp 0 0 typhoon.Be.1723 maelstrom..finger maelstrom.Be
FIN_WAIT_2
3b735df8 tcp 0 0 typhoon.Be.1721 tempest.Be.finger tempest.Berk
FIN_WAIT_2
3b73265c tcp 0 0 *.smtp *.* *
LISTEN
3b730910 tcp 0 0 *.finger *.* *
LISTEN
3b7302cc tcp 0 0 *.ftp *.* *
LISTEN
3b730110 tcp 0 0 *.exec *.* *
LISTEN
3b727f54 tcp 0 0 *.shell *.* *
LISTEN
3b727d98 tcp 0 0 *.login *.* *
LISTEN
3b727bc0 tcp 0 0 *.telnet *.* *
LISTEN
3b73ee08 udp 0 0 ocf.Berkel.2893 violet.Ber.domain *
3b731d14 udp 609 0 *.syslog *.* *
3b7311b4 udp 0 0 *.1024 *.* *
3b731040 udp 0 0 *.ncs *.* *
3b730ee4 udp 0 0 *.timed *.* *
3b730dbc udp 0 0 *.who *.* *
3b730a08 udp 0 0 *.route *.* *
3b730564 udp 0 0 *.biff *.* *
3b730420 udp 0 0 *.ntalk *.* *
[typhoon:~] 40} ps alx
F UID PID PPID CP PRI NI ADDR SZ RSS STAT TTY TIME COMMAND
1008001 0 1 1 0 1-20 1 992 13 S < ? 0:19 /etc/init
3 0 2 0 0 17 0 0 0 0 R ? 3626:15 null
3 0 3 0 0 1 0 0 0 0 S ? 0:28 purifier
3 0 4 0 0 1 0 0 0 0 S ? 0:04 purifier
3 0 5 0 0 1 0 0 0 0 S ? 17:47 unwired_dx
m
3 0 6 0 0 1 0 0 0 0 S ? 0:00 pinger
3 0 7 0 0 1 0 0 0 0 S ? 52:29 netreceive
3 0 8 0 0 1 0 0 0 0 S ? 59:25 netpaging
3 0 9 0 0 1 0 0 0 0 S ? 2:00 wired_dxm
3 0 10 0 0 1 0 0 0 0 S ? 84:43 netrequest
3 0 15 0 0 1 0 0 0 0 S ? 0:05 netreceive
1008001 0 1046 1 0 1-15 5 1888 93 S < ? 875:34 /etc/tcpd
1008001 0 1055 1 0 3 0 6 1920 58 S ? 0:31 /etc/inetd
1008001 0 1061 1 0 1-10 8 1920 55 S < ? 2:39 /etc/route
d -f
1008001 0 1065 1 0 4 0 9 1952 180 S ? 6:06 /etc/rwhod
1008001 0 1068 1 0 5 0 a 2016 361 S ? 9:17 /etc/timed
-M
1008001 0 1071 1 0 3 0 4 2080 37 S ? 0:02 /etc/ncs/l
lbd
1008001 0 1074 1 0 1 -6 b 6624 1337 S < ? 276:02 /etc/rgyd
1008001 0 1080 1 0 3 0 e 608 130 S ? 2:31 /etc/cron
1008001 0 1092 1 0 3 0 3 2080 27 S ? 0:02 /usr/lib/s
endmail -bd
1008001 14 1095 1 0 4 0 c 768 224 S ? 0:17 /sys/spm/s
pm
1008001 0 1097 1 0 3 0 d 704 599 S ? 0:15 /sys/net/n
etman
1008001 14 1099 1 0 3 0 11 2208 29 S ? 0:31 /sys/ns/ns
_helper
1008001 14 1102 1 0 3 0 13 736 9 S ? 0:00 /etc/write
d
1008001 14 1106 1 0 3 0 14 928 62 S ? 0:12 /sys/mbx/m
bx_helper
1008001 0 2124 1 0 3 0 15 2048 73 S ? 0:00 /usr/lib/l
pd
1008001 0 29632 1 0 3 0 22 2272 272 S ? 1:43 /etc/syslo
gd
1008001 1132 2542 1095 0 4 0 2 1216 297 S crp00 0:17 /bin/tcsh
1008001 1132 2596 2542 0 3 0 7 1152 200 R crp00 0:00 ps alx
1008001 1132 2592 1095 0 3 0 12 1184 303 S crp01 0:02 /bin/tcsh
1008001 5821 2195 28550 0 3 0 f 672 39 S ttyp1 0:00 sz tape-2.
29
1008001 0 2196 1055 0 3 0 10 2048 61 S ttyp1 0:00 telnetd
1008001 5821 28550 1 0 3 0 20 928 145 S ttyp1 0:05 -csh
When tcpd is failing:
[typhoon:~] 41} ps alx
F UID PID PPID CP PRI NI ADDR SZ RSS STAT TTY TIME COMMAND
1008001 0 1 1 0 1-20 1 992 13 S < ? 0:19 /etc/init
3 0 2 0 0 17 0 0 0 0 R ? 3627:27 null
3 0 3 0 0 1 0 0 0 0 S ? 0:28 purifier
3 0 4 0 0 2 0 0 0 0 S ? 0:04 purifier
3 0 5 0 0 1 0 0 0 0 S ? 17:47 unwired_dx
m
3 0 6 0 0 1 0 0 0 0 S ? 0:00 pinger
3 0 7 0 0 1 0 0 0 0 S ? 52:29 netreceive
3 0 8 0 0 1 0 0 0 0 S ? 59:26 netpaging
3 0 9 0 0 1 0 0 0 0 S ? 2:00 wired_dxm
3 0 10 0 0 1 0 0 0 0 S ? 84:45 netrequest
3 0 15 0 0 1 0 0 0 0 S ? 0:05 netreceive
1008001 0 1046 1 0 5-15 5 1888 71 R < ? 876:30 /etc/tcpd
1008001 0 1055 1 0 3 0 6 1920 60 S ? 0:31 /etc/inetd
1008001 0 1061 1 0 1-10 8 1920 55 S < ? 2:39 /etc/route
d -f
1008001 0 1065 1 0 3 0 9 1952 179 S ? 6:06 /etc/rwhod
1008001 0 1068 1 0 5 0 a 2016 354 S ? 9:17 /etc/timed
-M
1008001 0 1071 1 0 3 0 4 2080 37 S ? 0:02 /etc/ncs/l
lbd
1008001 0 1074 1 0 2 -6 b 6624 1403 S < ? 276:03 /etc/rgyd
1008001 0 1080 1 0 3 0 e 608 129 S ? 2:31 /etc/cron
1008001 0 1092 1 0 3 0 3 2080 27 S ? 0:02 /usr/lib/s
endmail -bd
1008001 14 1095 1 0 3 0 c 768 221 S ? 0:17 /sys/spm/s
pm
1008001 0 1097 1 0 3 0 d 704 599 S ? 0:15 /sys/net/netman
1008001 14 1099 1 0 3 0 11 2208 28 S ? 0:31 /sys/ns/ns
_helper
1008001 14 1102 1 0 3 0 13 736 9 S ? 0:00 /etc/write
d
1008001 14 1106 1 0 3 0 14 928 62 S ? 0:12 /sys/mbx/m
bx_helper
1008001 0 2124 1 0 3 0 15 2048 73 S ? 0:00 /usr/lib/l
pd
1008001 0 29632 1 0 3 0 22 2272 261 S ? 1:43 /etc/syslo
gd
1008001 1132 2542 1095 0 3 0 2 1216 306 S crp00 0:17 /bin/tcsh
1008001 1132 2604 2542 0 3 0 7 1152 215 R crp00 0:00 ps alx
1008001 5821 2195 28550 0 3 0 f 672 39 S ttyp1 0:00 sz tape-2.
29
1008001 0 2196 1055 0 3 0 10 2048 61 S ttyp1 0:00 telnetd
1008001 5821 28550 1 0 3 0 20 928 145 S ttyp1 0:05 -csh
[typhoon:~] 42} netstat -nT
335/336 mbufs in use:
111/112 ( 80-byte) mbufs used/allocated
112/112 (1560-byte) mbufs used/allocated
112/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (99% in use)
1792761 requests for memory denied
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Collis Timer
lo0 9216 127.0.0 127.0.0.1 65 0 65 0 0 0
dr0 1268 128.32.234 128.32.234.1 10407469 0 12150184 458 0 0
eth0 1500 128.32.184 128.32.184.2 11452816 0 9620314 11418 0 0
Host Stat Ref Pct Mapaddr
128.32.234.15 U 0 0 00:01:fe:92:00:00
128.32.234.13 U 0 0 00:01:ff:33:00:00
128.32.234.26 U 0 0 00:02:01:36:00:00
128.32.234.27 U 0 0 00:02:01:32:00:00
128.32.234.24 U 0 0 00:01:ff:22:00:00
128.32.234.30 U 0 0 00:01:fe:97:00:00
128.32.234.31 U 0 0 00:01:ed:17:00:00
128.32.234.28 U 1 0 00:02:02:1d:00:00
128.32.234.29 U 1 0 00:01:f7:c3:00:00
128.32.234.18 U 0 0 00:01:f8:b5:00:00
128.32.234.16 U 0 0 00:01:fb:ca:00:00
128.32.184.1 UG 0 2 00:00:93:e0:20:60
128.32.234.22 U 0 0 00:01:fc:0e:00:00
128.32.234.23 U 0 0 00:02:00:e7:00:00
128.32.234.20 U 0 0 00:02:00:ef:00:00
128.32.234.21 U 0 0 00:02:24:5b:00:00
128.32.234.10 U 0 0 00:02:01:42:00:00
Routing tables
Destination Gateway Flags Hops Ref Use Interface
0.0.0 128.32.184.1 G 1 0 9391104 eth0
routing:
0 bad routing redirects
0 dynamically created routes
0 new gateways due to redirects
0 destinations found unreachable
0 uses of a wildcard route
ip:
191 bad header checksums
0 with size smaller than minimum
0 with data size < data length
0 with header length < data size
0 with data length < header length
icmp:
0 calls to icmp_error
0 errors not generated 'cuz old message too short
0 errors not generated 'cuz old message was icmp
Output histogram:
echo reply: 853896
destination unreachable: 16956
echo: 23074
0 messages with bad code fields
0 messages < minimum length
0 bad checksums
0 messages with bad length
853896 message responses generated
Input histogram:
echo reply: 22998
destination unreachable: 41
source quench: 4
echo: 853896
time stamp reply: 22758
tcp:
1 bad header checksum
0 bad header offset fields
0 incomplete headers
udp:
0 bad header checksums
0 incomplete headers
0 bad data length fields
Active connections (including servers)
PCB Proto Recv-Q Send-Q Local Address Foreign Address Via Gateway
(state)
3b7315ac ip 7884 0 *.1025 *.* *
3b740b90 tcp 276 0 128.32.184.254.23 128.120.57.20.2898 *
CLOSE_WAIT
3b7398ec tcp 0 0 *.515 *.* *
LISTEN
3b743a20 tcp 0 0 128.32.234.14.1723 128.32.234.29.79 128.32.234.2
9 FIN_WAIT_2
3b735df8 tcp 0 0 128.32.234.14.1721 128.32.234.28.79 128.32.234.2
8 FIN_WAIT_2
3b73265c tcp 0 0 *.25 *.* *
LISTEN
3b730910 tcp 0 0 *.79 *.* *
LISTEN
3b7302cc tcp 0 0 *.21 *.* *
LISTEN
3b730110 tcp 0 0 *.512 *.* *
LISTEN
3b727f54 tcp 0 0 *.514 *.* *
LISTEN
3b727d98 tcp 0 0 *.513 *.* *
LISTEN
3b727bc0 tcp 0 0 *.23 *.* *
LISTEN
3b73ee08 udp 0 0 128.32.184.254.2893 128.32.136.22.53 *
3b731d14 udp 1507 0 *.514 *.* *
3b7311b4 udp 0 0 *.1024 *.* *
3b731040 udp 0 0 *.135 *.* *
3b730ee4 udp 0 0 *.525 *.* *
3b730dbc udp 0 0 *.513 *.* *
(netstat freezes after printing that line)
Thank you for your time.
-Ian Barkley
OCF staff
ianb@ocf.berkeley.edu
dbfunk@ICAEN.UIOWA.EDU (David B Funk) (05/23/91)
In posting <IANB.91May22203326@maelstrom.ocf.Berkeley.EDU>, Ian Barkley writes:
>Hello all. We have been having a problem recently with our tcpd: the
>connections to the outside world have developed a bad habit of dying or
>freezing for several minutes at a time. When I looked at the process list of
> [stuff deleted]
>One time I tried taking tcpd's nice number down to -18, so that it couldn't
>get a priority less than 3 (that of most of the other processes on typhoon).
>It worked fine for a while, then BLAM, the priority went to 3 and tcpd started
>eating 100% of the CPU time; the disk was accessible (slowly), but I couldn't
>get on and finally had to white-button the machine.
>
>These freezes seem to happen mostly under heavy loads. They last from a few
>seconds to several minutes. Even when tcpd is working, its PRI tends to
>bounce between 1 and 3, even with a nice of -15.

Yes, we have seen this exact same problem. It is caused by packet buffer
exhaustion (those little buffers work real hard ;). The clue is the output
from a "netstat -m" (or the memory part of the -T output). Referring back to
the output listing from your posting:

> 335/336 mbufs in use:
>   111/112 (  80-byte) mbufs used/allocated
>   112/112 (1560-byte) mbufs used/allocated
>   112/112 (9216-byte) mbufs used/allocated
> 1187 Kbytes allocated to network (99% in use)
> 1792761 requests for memory denied

Note the "99% in use" statistic. All the packet buffers are being used by
somebody, so when a packet comes in from the net it gets dropped on the
floor, as there is no place to put it. When you reach this condition, the
"requests for memory denied" statistic will start going up like crazy. Once
this condition is reached, the tcp daemon (/etc/tcpd) will go off into some
routine that tries to expand the memory buffer pool. (If you do a "tb"
traceback on the running tcpd, you can see this; it should be in
"expand_mbufpool".)
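The exhaustion symptom David describes can be watched for mechanically: flag
the pool whenever any mbuf size class is fully used and note the denied
counter. A minimal sketch, assuming the output format matches the "netstat -m"
listing quoted in this thread; it parses a captured sample so the logic can be
checked, where a live watchdog would pipe `netstat -m` in instead:

```shell
#!/bin/sh
# Captured sample of the memory section of `netstat -m` / `netstat -T`.
netstat_m='335/336 mbufs in use:
        111/112 (  80-byte) mbufs used/allocated
        112/112 (1560-byte) mbufs used/allocated
        112/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (99% in use)
1792761 requests for memory denied'

# Count size classes where used == allocated (split fields on "/" or spaces).
saturated=$(printf '%s\n' "$netstat_m" | awk -F'[/ ]+' '
    /used\/allocated/ && $2 == $3 { n++ }
    END { print n + 0 }')
# Grab the running "requests for memory denied" counter.
denied=$(printf '%s\n' "$netstat_m" | awk '/requests for memory denied/ {print $1}')
echo "saturated size classes: $saturated, denied so far: $denied"
```

A cron job could diff the denied counter between runs and page someone when it
starts climbing, rather than waiting for users to report the freezes.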
For some reason this routine goes into a spin loop and eats up lots of CPU
time without doing much. This is the cause of your node lockup when you
jacked up the tcpd priority. (Don't bother to jack the priority; for some
reason the routine never seems to do anything but eat up CPU time.)

The only cure that I've found for this condition is to find the offending
processes that are holding the buffers and kill them. Once the processes
die, their buffers are released and things usually get back to normal. (In
severe cases, reboot, or kill all tcp-related stuff and restart tcpd from
scratch.) The "active connections" output from netstat can provide a clue as
to who the real culprit is. Look for connections that have a large Recv-Q or
Send-Q value (the values seem to saturate at 9100; above 7500 is possible
trouble). Then comes the fun of trying to match the process to the netstat
connection.

Some of the problem processes that I have seen are things like:
 - A download via a program such as "sz", when the process at the far end
   dies or becomes unreachable due to network problems (causes a large
   Send-Q).
 - A telnet session where the user is running a program producing lots of
   output and then the remote user gives his telnet program a stop fault
   (^Z) (Send-Q).
 - Incoming ftp data, where the disk is full (large Recv-Q).

If it is just occasional peaks in the network load, you may just have to
live with it (or reduce the process load on your gateway). It is probably
being caused by a process executing on that node. Pure gateway operations
(i.e. traffic coming from one remote node and going to some other node)
don't seem to eat up lots of buffers. All the cases that I've seen were due
to some user's program executing on the afflicted node. We have one machine
that is our primary gateway/mail-server/name-server, which handles lots of
traffic, but it doesn't have any users on it. This machine handles millions
of packets a day with no "requests for memory denied" errors.
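The large-queue hunt David suggests is easy to script: filter the connection
list for Recv-Q or Send-Q above the ~7500 trouble threshold he mentions. A
sketch, assuming the column layout shown in the listings above (PCB, Proto,
Recv-Q, Send-Q, Local Address, ...); it runs against a trimmed captured
sample, where on a gateway you would feed it real `netstat -n` output:

```shell
#!/bin/sh
# Trimmed sample of the "active connections" section (state lines omitted).
conns='3b7315ac ip      7884     0  *.1025             *.*                *
3b740b90 tcp      276     0  128.32.184.254.23  128.120.57.20.2898 *
3b731d14 udp     1507     0  *.514              *.*                *'

# Print PCB, protocol, and local address for any connection whose
# Recv-Q ($3) or Send-Q ($4) is past the 7500 warning level.
suspects=$(printf '%s\n' "$conns" | awk '$3 > 7500 || $4 > 7500 {print $1, $2, $5}')
echo "$suspects"
```

From there the matching back to a PID is still manual (the local port number
is usually the best lead), but at least the scan itself is automatic.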
We have another machine that has a popular "MUD" game which can generate a
million "requests for memory denied" errors in a few hours when it has a
heavy load (30+ active users).

Good luck.
Dave Funk
neilb@spectrum.cs.unsw.oz.au (Neil Brown) (05/24/91)
We too have had a number of problems with tcpd spinning.
It invariably involves the mbufs being all used up, as in:
335/336 mbufs in use:
111/112 ( 80-byte) mbufs used/allocated
112/112 (1560-byte) mbufs used/allocated
112/112 (9216-byte) mbufs used/allocated
As David Funk mentioned, this can happen when some process has
a socket open but is not processing incoming packets, in which case
killing the process should free things up.
When it happens here though, the reason that the process isn't processing
incoming packets is that it has exited! For some reason this doesn't always
close the socket properly, so we have the situation where tcpd thinks packets
for some particular port are wanted, but no process actually wants them.
Result: packet buildup and loss of mbufs.
Cure: reboot.
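One way to spot the condition Neil describes before the pool empties: take two
connection snapshots a minute or so apart, and flag any socket whose Recv-Q
grew and never drained, i.e. nobody is reading that port. A sketch against
captured sample snapshots (the assumed columns are PCB, Proto, Recv-Q, Send-Q,
Local Address, as in the listings above); on a live node you would save two
runs of `netstat -n` instead:

```shell
#!/bin/sh
# Two snapshots of the same connections, taken some time apart.
snap1='3b731d14 udp   609  0  *.514   *.*  *
3b740b90 tcp   276  0  *.23    *.*  *'
snap2='3b731d14 udp  1507  0  *.514   *.*  *
3b740b90 tcp   276  0  *.23    *.*  *'

# First pass records each PCB's Recv-Q; second pass reports any that grew.
stuck=$(printf '%s\n' "$snap1" "$snap2" | awk '
    !($1 in seen)     { seen[$1] = $3 + 0; next }  # snapshot 1: remember Recv-Q
    $3 + 0 > seen[$1] { print $1, $5 }             # snapshot 2: queue only grew
')
echo "growing queues: $stuck"
```

If the port with the growing queue has no live process behind it, that is the
orphaned-socket case, and (short of the patched tcpd) a reboot is indeed the
only way out.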
We have told the Australasian Response Center and they have been looking
into the problem. The latest patch tape (May I think) has a new version of
tcpd and installing this has reduced the problem significantly, but it hasn't
gone away completely.
We await further developments.
Neil Brown
neilb@cs.unsw.oz.au

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (05/25/91)
In article <NEILB.91May24130705@chocolate.spectrum.cs.unsw.oz.au> neilb@spectrum.cs.unsw.oz.au (Neil Brown) writes:
>
>We too have had a number of problems with tcpd spinning.
>It invariably involves the mbufs being all used up, as in:
>
>335/336 mbufs in use:
>  111/112 (  80-byte) mbufs used/allocated
>  112/112 (1560-byte) mbufs used/allocated
>  112/112 (9216-byte) mbufs used/allocated

Our DN10000 has always shown

162/336 mbufs in use:
  112/112 (  80-byte) mbufs used/allocated
   50/112 (1560-byte) mbufs used/allocated
    0/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (7% in use)
0 requests for memory denied

or even

257/336 mbufs in use:
  112/112 (  80-byte) mbufs used/allocated
  111/112 (1560-byte) mbufs used/allocated
   34/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (40% in use)
0 requests for memory denied

for the last 2.5 years, which may go part way towards explaining its TCP
hesitations/delays/hangs (the 0 "requests for memory denied" is probably
because it has just been rebooted after the latest hang). We regularly have
16 telnetd/rlogind sessions active at once, plus X Windows, BSD printing,
NFS, rwhod, named, sendmail, NNTP (news), and the Cray satellite station.

We are also getting approximately 5 VME Ethernet card resets a day (we are
running SR10.3.0.6, which has a patch to kick the VME Enet card when it
hangs), which seems to cause TCP sessions to be dropped at random,
especially rlogins from other Apollos.

Mike.
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094   Fax: (416) 978-8775