[comp.sys.apollo] tcpd problems

ianb@ocf.Berkeley.EDU (Ian Barkley) (05/23/91)

Hello all. We have been having a problem recently with our tcpd. Recently, the 
connections to the outside world have been having a bad habit of dying or 
freezing for several minutes at a time. When I looked at the process list of
typhoon (our gateway) during one of these cutoffs, I noted that tcpd's 
PRI was down at 8 or less. I tried running tcpd at lower nice numbers,
down to -15, and while this seems to have speeded up net connections in 
general, it hasn't stopped the not-so-occassional net freeze; and a process
listing of typhoon again reveals that tcpd's PRI is at 5 during these freezes.

One time I tried running tcpd's nice number up to -18, so that it couldn't
get a priority less than 3 (that of most of the other processes on typhoon).
It worked fine for a while, then BLAM, the priority went to 3 and tcpd started
eating 100% of the CPU time; the disk was accessable (slowly), but I couldn't
get on and finally had to white-button the machine. 

These freezes seem to happen mostly under heavy loads. They last from a few
seconds to several minutes. Even when tcpd is working, it's PRI tends to
bounce between 1 and 3, even with a nice of -15. 

Typhoon is a 4500, and our only ethernet gateway. The 16 other machines in
the ring are 3500's; typhoon and 8 others have disks. We are running 10.3
in it's BSD4.3-like mode on all machines. 

Relevant listing:

When tcpd is working:
[typhoon:~] 39} netstat -T
167/336 mbufs in use:
         28/112 (  80-byte) mbufs used/allocated
         63/112 (1560-byte) mbufs used/allocated
         76/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (65% in use)
1757248 requests for memory denied

Name  Mtu   Network     Address      Ipkts   Ierrs Opkts   Oerrs Collis Timer
lo0   9216  127.0.0     localhost    65      0     65      0     0      0
dr0   1268  128.32.234  typhoon.Berk 10405601 0     12147883 458   0      0
eth0  1500  128.32.184  ocf.Berkeley 11450577 0     9618549 11413 0      0

Host            Stat  Ref  Pct  Mapaddr
tornado.Berkele U     0    0    00:01:fe:92:00:00
monsoon.Berkele U     0    0    00:01:ff:33:00:00
cyclone.Berkele U     0    0    00:02:01:36:00:00
whirlwind.Berke U     0    0    00:02:01:32:00:00
hailstorm.Berke U     0    0    00:01:ff:22:00:00
bigbang.Berkele U     0    0    00:01:fe:97:00:00
headcrash.Berke U     0    0    00:01:ed:17:00:00
tempest.Berkele U     1    0    00:02:02:1d:00:00
maelstrom.Berke U     1    0    00:01:f7:c3:00:00
sandstorm.Berke U     0    0    00:01:f8:b5:00:00
earthquake.Berk U     0    0    00:01:fb:ca:00:00
inr-1-ocf.Berke UG    2    0    00:00:93:e0:20:60
avalanche.Berke U     0    0    00:01:fc:0e:00:00
tsunami.Berkele U     0    0    00:02:00:e7:00:00
locusts.Berkele U     0    0    00:02:00:ef:00:00
plague.Berkeley U     0    0    00:02:24:5b:00:00
lightning.Berke U     0    0    00:02:01:42:00:00

Routing tables
Destination     Gateway         Flags    Hops  Ref  Use        Interface
<default>       inr-1-ocf.Berke G        1     2    9389386    eth0

routing:
        0 bad routing redirects
        0 dynamically created routes
        0 new gateways due to redirects
        0 destinations found unreachable
        0 uses of a wildcard route
ip:
        191 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with header length < data size
        0 with data length < header length
icmp:
        0 calls to icmp_error
        0 errors not generated 'cuz old message too short
        0 errors not generated 'cuz old message was icmp
        Output histogram:
                echo reply: 853697
                destination unreachable: 16953
                echo: 23057
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        853697 message responses generated
        Input histogram:
                echo reply: 22981
                destination unreachable: 41
                source quench: 4
                echo: 853697
                time stamp reply: 22758
tcp:
        1 bad header checksum
        0 bad header offset fields
        0 incomplete headers
udp:
        0 bad header checksums
        0 incomplete headers
        0 bad data length fields

Active connections (including servers)
PCB      Proto Recv-Q Send-Q  Local Address      Foreign Address    Via Gateway
 (state)
3b7315ac ip      1044      0  *.1025             *.*                *

3b73ef30 tcp        0      0  ocf.Berkel.2895    violet.Ber.domain  inr-1-ocf.Be
 ESTABLISHED
3b73d914 tcp        0      0  ocf.Berkel.2894    violet.Ber.domain  inr-1-ocf.Be
 TIME_WAIT
3b740b90 tcp      276      0  ocf.Berkel.telnet  iris.eecs..2898    *
 CLOSE_WAIT
3b7398ec tcp        0      0  *.printer          *.*                *
 LISTEN
3b743a20 tcp        0      0  typhoon.Be.1723    maelstrom..finger  maelstrom.Be
 FIN_WAIT_2
3b735df8 tcp        0      0  typhoon.Be.1721    tempest.Be.finger  tempest.Berk
 FIN_WAIT_2
3b73265c tcp        0      0  *.smtp             *.*                *
 LISTEN
3b730910 tcp        0      0  *.finger           *.*                *
 LISTEN
3b7302cc tcp        0      0  *.ftp              *.*                *
 LISTEN
3b730110 tcp        0      0  *.exec             *.*                *
 LISTEN
3b727f54 tcp        0      0  *.shell            *.*                *
 LISTEN
3b727d98 tcp        0      0  *.login            *.*                *
 LISTEN
3b727bc0 tcp        0      0  *.telnet           *.*                *
 LISTEN
3b73ee08 udp        0      0  ocf.Berkel.2893    violet.Ber.domain  *

3b731d14 udp      609      0  *.syslog           *.*                *

3b7311b4 udp        0      0  *.1024             *.*                *

3b731040 udp        0      0  *.ncs              *.*                *

3b730ee4 udp        0      0  *.timed            *.*                *

3b730dbc udp        0      0  *.who              *.*                *

3b730a08 udp        0      0  *.route            *.*                *

3b730564 udp        0      0  *.biff             *.*                *

3b730420 udp        0      0  *.ntalk            *.*                *

[typhoon:~] 40} ps alx
      F   UID   PID  PPID CP PRI NI ADDR   SZ  RSS STAT TTY      TIME COMMAND
1008001     0     1     1  0   1-20    1  992   13 S <  ?        0:19 /etc/init
      3     0     2     0  0  17  0    0    0    0 R    ?      3626:15 null
      3     0     3     0  0   1  0    0    0    0 S    ?        0:28 purifier
      3     0     4     0  0   1  0    0    0    0 S    ?        0:04 purifier
      3     0     5     0  0   1  0    0    0    0 S    ?       17:47 unwired_dx
m
      3     0     6     0  0   1  0    0    0    0 S    ?        0:00 pinger
      3     0     7     0  0   1  0    0    0    0 S    ?       52:29 netreceive
      3     0     8     0  0   1  0    0    0    0 S    ?       59:25 netpaging
      3     0     9     0  0   1  0    0    0    0 S    ?        2:00 wired_dxm
      3     0    10     0  0   1  0    0    0    0 S    ?       84:43 netrequest
      3     0    15     0  0   1  0    0    0    0 S    ?        0:05 netreceive
1008001     0  1046     1  0   1-15    5 1888   93 S <  ?      875:34 /etc/tcpd
1008001     0  1055     1  0   3  0    6 1920   58 S    ?        0:31 /etc/inetd
1008001     0  1061     1  0   1-10    8 1920   55 S <  ?        2:39 /etc/route
d -f
1008001     0  1065     1  0   4  0    9 1952  180 S    ?        6:06 /etc/rwhod
1008001     0  1068     1  0   5  0    a 2016  361 S    ?        9:17 /etc/timed
 -M
1008001     0  1071     1  0   3  0    4 2080   37 S    ?        0:02 /etc/ncs/l
lbd
1008001     0  1074     1  0   1 -6    b 6624 1337 S <  ?      276:02 /etc/rgyd
1008001     0  1080     1  0   3  0    e  608  130 S    ?        2:31 /etc/cron
1008001     0  1092     1  0   3  0    3 2080   27 S    ?        0:02 /usr/lib/s
endmail -bd
1008001    14  1095     1  0   4  0    c  768  224 S    ?        0:17 /sys/spm/s
pm
1008001     0  1097     1  0   3  0    d  704  599 S    ?        0:15 /sys/net/n
etman
1008001    14  1099     1  0   3  0   11 2208   29 S    ?        0:31 /sys/ns/ns
_helper
1008001    14  1102     1  0   3  0   13  736    9 S    ?        0:00 /etc/write
d
1008001    14  1106     1  0   3  0   14  928   62 S    ?        0:12 /sys/mbx/m
bx_helper
1008001     0  2124     1  0   3  0   15 2048   73 S    ?        0:00 /usr/lib/l
pd
1008001     0 29632     1  0   3  0   22 2272  272 S    ?        1:43 /etc/syslo
gd
1008001  1132  2542  1095  0   4  0    2 1216  297 S    crp00    0:17 /bin/tcsh
1008001  1132  2596  2542  0   3  0    7 1152  200 R    crp00    0:00 ps alx
1008001  1132  2592  1095  0   3  0   12 1184  303 S    crp01    0:02 /bin/tcsh
1008001  5821  2195 28550  0   3  0    f  672   39 S    ttyp1    0:00 sz tape-2.
29
1008001     0  2196  1055  0   3  0   10 2048   61 S    ttyp1    0:00 telnetd
1008001  5821 28550     1  0   3  0   20  928  145 S    ttyp1    0:05 -csh



When tcpd is failing:
[typhoon:~] 41} ps alx
      F   UID   PID  PPID CP PRI NI ADDR   SZ  RSS STAT TTY      TIME COMMAND
1008001     0     1     1  0   1-20    1  992   13 S <  ?        0:19 /etc/init
      3     0     2     0  0  17  0    0    0    0 R    ?      3627:27 null
      3     0     3     0  0   1  0    0    0    0 S    ?        0:28 purifier
      3     0     4     0  0   2  0    0    0    0 S    ?        0:04 purifier
      3     0     5     0  0   1  0    0    0    0 S    ?       17:47 unwired_dx
m
      3     0     6     0  0   1  0    0    0    0 S    ?        0:00 pinger
      3     0     7     0  0   1  0    0    0    0 S    ?       52:29 netreceive
      3     0     8     0  0   1  0    0    0    0 S    ?       59:26 netpaging
      3     0     9     0  0   1  0    0    0    0 S    ?        2:00 wired_dxm
      3     0    10     0  0   1  0    0    0    0 S    ?       84:45 netrequest
      3     0    15     0  0   1  0    0    0    0 S    ?        0:05 netreceive
1008001     0  1046     1  0   5-15    5 1888   71 R <  ?      876:30 /etc/tcpd
1008001     0  1055     1  0   3  0    6 1920   60 S    ?        0:31 /etc/inetd
1008001     0  1061     1  0   1-10    8 1920   55 S <  ?        2:39 /etc/route
d -f
1008001     0  1065     1  0   3  0    9 1952  179 S    ?        6:06 /etc/rwhod
1008001     0  1068     1  0   5  0    a 2016  354 S    ?        9:17 /etc/timed
 -M
1008001     0  1071     1  0   3  0    4 2080   37 S    ?        0:02 /etc/ncs/l
lbd
1008001     0  1074     1  0   2 -6    b 6624 1403 S <  ?      276:03 /etc/rgyd
1008001     0  1080     1  0   3  0    e  608  129 S    ?        2:31 /etc/cron
1008001     0  1092     1  0   3  0    3 2080   27 S    ?        0:02 /usr/lib/s
endmail -bd
1008001    14  1095     1  0   3  0    c  768  221 S    ?        0:17 /sys/spm/s
pm
1008001     0  1097     1  0   3  0    d  704  599 S    ?        0:15 /sys/net/netman
1008001    14  1099     1  0   3  0   11 2208   28 S    ?        0:31 /sys/ns/ns
_helper
1008001    14  1102     1  0   3  0   13  736    9 S    ?        0:00 /etc/write
d
1008001    14  1106     1  0   3  0   14  928   62 S    ?        0:12 /sys/mbx/m
bx_helper
1008001     0  2124     1  0   3  0   15 2048   73 S    ?        0:00 /usr/lib/l
pd
1008001     0 29632     1  0   3  0   22 2272  261 S    ?        1:43 /etc/syslo
gd
1008001  1132  2542  1095  0   3  0    2 1216  306 S    crp00    0:17 /bin/tcsh
1008001  1132  2604  2542  0   3  0    7 1152  215 R    crp00    0:00 ps alx
1008001  5821  2195 28550  0   3  0    f  672   39 S    ttyp1    0:00 sz tape-2.
29
1008001     0  2196  1055  0   3  0   10 2048   61 S    ttyp1    0:00 telnetd
1008001  5821 28550     1  0   3  0   20  928  145 S    ttyp1    0:05 -csh
[typhoon:~] 42} netstat -nT
335/336 mbufs in use:
        111/112 (  80-byte) mbufs used/allocated
        112/112 (1560-byte) mbufs used/allocated
        112/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (99% in use)
1792761 requests for memory denied

Name  Mtu   Network     Address      Ipkts   Ierrs Opkts   Oerrs Collis Timer
lo0   9216  127.0.0     127.0.0.1    65      0     65      0     0      0
dr0   1268  128.32.234  128.32.234.1 10407469 0     12150184 458   0      0
eth0  1500  128.32.184  128.32.184.2 11452816 0     9620314 11418 0      0

Host            Stat  Ref  Pct  Mapaddr
128.32.234.15   U     0    0    00:01:fe:92:00:00
128.32.234.13   U     0    0    00:01:ff:33:00:00
128.32.234.26   U     0    0    00:02:01:36:00:00
128.32.234.27   U     0    0    00:02:01:32:00:00
128.32.234.24   U     0    0    00:01:ff:22:00:00
128.32.234.30   U     0    0    00:01:fe:97:00:00
128.32.234.31   U     0    0    00:01:ed:17:00:00
128.32.234.28   U     1    0    00:02:02:1d:00:00
128.32.234.29   U     1    0    00:01:f7:c3:00:00
128.32.234.18   U     0    0    00:01:f8:b5:00:00
128.32.234.16   U     0    0    00:01:fb:ca:00:00
128.32.184.1    UG    0    2    00:00:93:e0:20:60
128.32.234.22   U     0    0    00:01:fc:0e:00:00
128.32.234.23   U     0    0    00:02:00:e7:00:00
128.32.234.20   U     0    0    00:02:00:ef:00:00
128.32.234.21   U     0    0    00:02:24:5b:00:00
128.32.234.10   U     0    0    00:02:01:42:00:00

Routing tables
Destination     Gateway         Flags    Hops  Ref  Use        Interface
0.0.0           128.32.184.1    G        1     0    9391104    eth0

routing:
        0 bad routing redirects
        0 dynamically created routes
        0 new gateways due to redirects
        0 destinations found unreachable
        0 uses of a wildcard route
ip:
        191 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with header length < data size
        0 with data length < header length
icmp:
        0 calls to icmp_error
        0 errors not generated 'cuz old message too short
        0 errors not generated 'cuz old message was icmp
        Output histogram:
                echo reply: 853896
                destination unreachable: 16956
                echo: 23074
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        853896 message responses generated
        Input histogram:
                echo reply: 22998
                destination unreachable: 41
                source quench: 4
                echo: 853896
                time stamp reply: 22758
tcp:
        1 bad header checksum
        0 bad header offset fields
        0 incomplete headers
udp:
        0 bad header checksums
        0 incomplete headers
        0 bad data length fields

Active connections (including servers)
PCB      Proto Recv-Q Send-Q  Local Address      Foreign Address    Via Gateway
 (state)
3b7315ac ip      7884      0  *.1025             *.*                *

3b740b90 tcp      276      0  128.32.184.254.23  128.120.57.20.2898 *
 CLOSE_WAIT
3b7398ec tcp        0      0  *.515              *.*                *
 LISTEN
3b743a20 tcp        0      0  128.32.234.14.1723 128.32.234.29.79   128.32.234.2
9 FIN_WAIT_2
3b735df8 tcp        0      0  128.32.234.14.1721 128.32.234.28.79   128.32.234.2
8 FIN_WAIT_2
3b73265c tcp        0      0  *.25               *.*                *
 LISTEN
3b730910 tcp        0      0  *.79               *.*                *
 LISTEN
3b7302cc tcp        0      0  *.21               *.*                *
 LISTEN
3b730110 tcp        0      0  *.512              *.*                *
 LISTEN
3b727f54 tcp        0      0  *.514              *.*                *
 LISTEN
3b727d98 tcp        0      0  *.513              *.*                *
 LISTEN
3b727bc0 tcp        0      0  *.23               *.*                *
 LISTEN
3b73ee08 udp        0      0  128.32.184.254.2893 128.32.136.22.53   *

3b731d14 udp     1507      0  *.514              *.*                *

3b7311b4 udp        0      0  *.1024             *.*                *

3b731040 udp        0      0  *.135              *.*                *

3b730ee4 udp        0      0  *.525              *.*                *

3b730dbc udp        0      0  *.513              *.*                *

(netstat freezes after printing that line)





Thank you for your time.

-Ian Barkley
 OCF staff
 ianb@ocf.berkeley.edu
 

dbfunk@ICAEN.UIOWA.EDU (David B Funk) (05/23/91)

In posting <IANB.91May22203326@maelstrom.ocf.Berkeley.EDU>, Ian Barkley writes:

>Hello all. We have been having a problem recently with our tcpd. Recently, the 
>connections to the outside world have been having a bad habit of dying or 
>freezing for several minutes at a time. When I looked at the process list of
>
[stuff deleted]
>One time I tried running tcpd's nice number up to -18, so that it couldn't
>get a priority less than 3 (that of most of the other processes on typhoon).
>It worked fine for a while, then BLAM, the priority went to 3 and tcpd started
>eating 100% of the CPU time; the disk was accessable (slowly), but I couldn't
>get on and finally had to white-button the machine. 
>
>These freezes seem to happen mostly under heavy loads. They last from a few
>seconds to several minutes. Even when tcpd is working, it's PRI tends to
>bounce between 1 and 3, even with a nice of -15. 

Yes, we have seen this exact same problem. It is caused by packet buffer exhaustion
(those little buffers work real hard ;).

The clue is the output from a "netstat -m" (or the memory part of the -T output)
Refering back to the output listing from your posting:

> 335/336 mbufs in use:
>         111/112 (  80-byte) mbufs used/allocated
>         112/112 (1560-byte) mbufs used/allocated
>         112/112 (9216-byte) mbufs used/allocated
> 1187 Kbytes allocated to network (99% in use)
> 1792761 requests for memory denied

Note the "99% in use" statistic. All the packet buffers are being used by somebody
so when a packet comes in from the net, it gets dropped on the floor as there is
no place to put it. When you reach this condition, the value of the 
statistic "requests for memory denied" will start going up like crazy.
when this condition is reached, the tcp daemon (/etc/tcpd) will go off into some
routine that tries to expand the memory buffer pool. (If you do a "tb" trace back
on the running tcpd, you can see this; it should be in "expand_mbufpool").
For some reason this routine goes into a spin loop & eats up lots of CPU time
with out doing much. This is the cause of your node lock up when you jacked up the
tcpd priority. (Don't bother to jack the priority, for some reason the routine
never seems to do any thing but eat up CPU time).

THe only cure that I've found for this condition is to find the offending processes
that are holding the buffers & kill them. Once the processes die, their buffers are
released & things usually get back to normal. (In severe cases, reboot or kill all tcp
related stuff & restart tcpd from scratch).

The "active connections list" output from netstat can provide a clue as to who the
real culprit is. Look for connections that have a large Recv-Q or Send-Q value.
(the values seem to saturate at 9100, above 7500 is possible trouble)
(then comes the fun of trying to match the process to the netstat connection).

Some of the problem processes that I have seen are things like:
    A download via a program such as "sz", when the process at the far end
    dies or becomes unreachable due to network problems. (causes a large Send-Q)

    A telnet session where the user is running a program producing lots of output
    and then the remote user gives his telnet program a stop fault (^Z). (Send-Q)

    Incoming ftp data, where the disk is full. (large Recv-Q).

If it is just occasional peaks in the network load, you may just have to live
with it (or reduce the process load on your gateway). It is probably being caused
by a process executing on that node. Pure gateway operations (IE traffic coming from
one remote node and going to some other node) don't seem to eat up lots of buffers.
All the cases that I've seen were due to some user's program executing on the afflicted
node. We have one machine that is our primary gateway/mail-server/name-server, which
handles lots of traffic, but it doesn't have any users on it. This machine handles
millions of packets a day with no "requests for memory denied" errors. We have another
machine that has a popular "MUD" game which can generate a million "requests for memory
denied" errors in a few hours, when it has a heavy load (30+ active users).

Good luck.
Dave Funk

neilb@spectrum.cs.unsw.oz.au (Neil Brown) (05/24/91)

We too have had a number of problems with tcpd spinning.
It invariably involves the mbufs being all used up, as in:

335/336 mbufs in use:
        111/112 (  80-byte) mbufs used/allocated
        112/112 (1560-byte) mbufs used/allocated
        112/112 (9216-byte) mbufs used/allocated

As David Funk mentioned, this can happen when some process has
a socket open but is not processing incoming packets, in which case
killing the process should free things up.

When it happens here though, the reason that the process isn't processing
incoming packets is that it has exited! For some reason this doesn't always
close the socket properly, so we have the situation where tcpd thinks packets
for some particular port are wanted, but no processes actually wants them.
Result: packet buildup and loss of mbufs.
Cure: reboot.

We have told the Australasian Response Center and they have been looking
into the problem. The latest patch tape (May I think) has a new version of
tcpd and installing this has reduced the problem significantly, but it hasn't
gone away completely.

We await further developments.


NeilBrown
neilb@cs.unsw.oz.au

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (05/25/91)

In article <NEILB.91May24130705@chocolate.spectrum.cs.unsw.oz.au> neilb@spectrum.cs.unsw.oz.au (Neil Brown) writes:
>
>We too have had a number of problems with tcpd spinning.
>It invariably involves the mbufs being all used up, as in:
>
>335/336 mbufs in use:
>        111/112 (  80-byte) mbufs used/allocated
>        112/112 (1560-byte) mbufs used/allocated
>        112/112 (9216-byte) mbufs used/allocated

Our DN10000 has always shown

162/336 mbufs in use:
   112/112 (  80-byte) mbufs used/allocated
    50/112 (1560-byte) mbufs used/allocated
     0/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (7% in use)
0 requests for memory denied

or even

257/336 mbufs in use:
        112/112 (  80-byte) mbufs used/allocated
        111/112 (1560-byte) mbufs used/allocated
         34/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (40% in use)
0 requests for memory denied

for the last 2.5 years, which may go part way towards explaining
its TCP hesitations/delays/hangs (the 0 requests for memory denied is
probably because it has just been rebooted after the latest hang).
We regularly have 16 telnetd/rlogind sessions active at once, plus
X Windows, BSD printing, NFS, rwhod, named, sendmail, NNTP (news),
and the Cray satellite station. We are also getting approximately
5 VME Ethernet card resets a day (we are running SR10.3.0.6 which has a
patch to kick the VME Enet card when it hangs) which seems to cause 
TCP sessions to be dropped at random, especially rlogins from other
Apollos.

Mike.
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094                  Fax: (416) 978-8775