[comp.unix.sysv386] ISC TCP/IP 1.2 hangs?

todd@pinhead.pegasus.com (Todd Ogasawara) (06/07/91)

I'm having a problem with ISC's TCP/IP 1.2 and wanted to get some
confirmation before I make a call to ISC and start complaining...

Couple of pieces of information...
1. I'm running ISC TCP/IP 1.2 under ISC UNIX 2.2 on an Everex 386/33
   with 8MB RAM and an 80387. I have a WD8003 ethernet card for use
   with TCP/IP.
2. I've received confirmation from ISC that at least one problem I've
   encountered (telnet sessions remaining hung [as reported by netstat])
   is a bug in TCP/IP 1.2.
3. My problem is that I've noticed that TCP/IP 1.2 is hanging on me
   about once a week now. Sessions will abruptly hang. The processes
   and session are still showing as active as reported by 'ps' and
   'netstat'. However, there is no way to ping the ISC UNIX box from a
   remote station. The ISC UNIX box itself appears to be running ok, except
   for the fact that all network capabilities are lost (it can't ping a
   remote station either).
4. The only solution to restore TCP/IP seems to be to reboot the ISC UNIX
   box. This can get pretty annoying if I have to do it once or twice a
   week.

Has anyone else run into a similar problem (i.e., TCP/IP hanging and
requiring a reboot)? Will TCP/IP 1.3 from ISC fix these problems? And where
is TCP/IP 1.3?

-- 
Todd Ogasawara ::: Hawaii Medical Service Association
Internet       ::: todd@pinhead.pegasus.com
Telephone      ::: (808) 536-9162 ext. 7

brennan@merk.UUCP (Rich Brennan) (06/11/91)

In article <1991Jun06.224153.209@pinhead.pegasus.com> todd@pinhead.pegasus.com (Todd Ogasawara) writes:
.I'm having a problem with ISC's TCP/IP 1.2 and wanted to get some
.confirmation before I make a call to ISC and start complaining...
.
.Couple of pieces of information...
.1. I'm running ISC TCP/IP 1.2 under ISC UNIX 2.2 on an Everex 386/33
.   with 8MB RAM and an 80387. I have a WD8003 ethernet card for use
.   with TCP/IP.

It looks like a few people are taking the pipe on this one. I've got the
same (and other) problems. Wild guess: WD8003 or the WD8003 driver have
problems with "high speed" machines, e.g. race conditions in the driver.

.3. My problem is that I've noticed that TCP/IP 1.2 is hanging on me
.   about once a week now. Sessions will abruptly hang. The processes
.   and session are still showing as active as reported by 'ps' and
.   'netstat'. However, there is no way to ping the ISC UNIX box from a
.   remote station. The ISC UNIX box itself appears to be running ok, except
.   for the fact that all network capabilities are lost (it can't ping a
.   remote station either).

I'm able to induce this easily, too. See if TCP/IP is really hung: ping
"localhost". When my machine hangs, localhost still responds to pings,
meaning the trouble is not in TCP/IP, but in the ethernet controller/driver
subsystem.

Another posting here (a reply to my earlier whining) said ISC is working on
these bugs. In the meantime, I'm going to try to snag a 3c503 card to see
if that driver is more robust.


Rich
-- 
brennan@merk.com	...!uunet!merk!brennan		Rich Brennan

dougm@ico.isc.com (Doug McCallum) (06/11/91)

In article <4104@merk.UUCP> brennan@merk.UUCP (Rich Brennan) writes:
...
>It looks like a few people are taking the pipe on this one. I've got the
>same (and other) problems. Wild guess: WD8003 or the WD8003 driver have
>problems with "high speed" machines, e.g. race conditions in the driver.

There are a number of known problems with TCP/IP 1.2 that are being worked
on for the TCP/IP 1.3 release later this summer.  These include a number of
conditions that cause TCP to get hung connections.  They aren't related to
the WD driver (there could be other problems there but we haven't seen them).

>
>.3. My problem is that I've noticed that TCP/IP 1.2 is hanging on me
>.   about once a week now. Sessions will abruptly hang. The processes
>.   and session are still showing as active as reported by 'ps' and
>.   'netstat'. However, there is no way to ping the ISC UNIX box from a
>.   remote station. The ISC UNIX box itself appears to be running ok, except
>.   for the fact that all network capabilities are lost (it can't ping a
>.   remote station either).
>
>I'm able to induce this easily, too. See if TCP/IP is really hung: ping
>"localhost". When my machine hangs, localhost still responds to pings,
>meaning the trouble is not in TCP/IP, but in the ethernet controller/driver
>subsystem.

One of the bugs is aggravated by using the ping command.  You can easily 
get random hung connections by letting ping just run free.  Packets get lost
(actually they don't ever get sent out on the net) and sometimes it interferes
with TCP.

>
>Another posting here (a reply to my earlier whining) said ISC is working on
>these bugs. In the meantime, I'm going to try to snag a 3c503 card to see
>if that driver is more robust.

I suspect that the WD driver is more robust than the 3C503 driver although both
should be fine.  There may be some systems that have problems with older WD
cards so hardware can't be ruled out but in those cases the problems will
usually appear as the network not working at all and will be hardware related
not software.

Doug McCallum
Interactive Systems Corp
dougm@ico.isc.com

robert@towers.uucp (Robert Hoquim) (06/12/91)

brennan@merk.UUCP (Rich Brennan) writes:

>In article <1991Jun06.224153.209@pinhead.pegasus.com> todd@pinhead.pegasus.com (Todd Ogasawara) writes:
>.I'm having a problem with ISC's TCP/IP 1.2 and wanted to get some
>.confirmation before I make a call to ISC and start complaining...
>.
>.Couple of pieces of information...
>.1. I'm running ISC TCP/IP 1.2 under ISC UNIX 2.2 on an Everex 386/33
>.   with 8MB RAM and an 80387. I have a WD8003 ethernet card for use
>.   with TCP/IP.

>It looks like a few people are taking the pipe on this one. I've got the
>same (and other) problems. Wild guess: WD8003 or the WD8003 driver have
>problems with "high speed" machines, e.g. race conditions in the driver.

I have many machines with WD8003 Ethernet cards in them.  The machines
range from 386-33's to 486-33 EISA systems and have no problem with the
8003.  I would look elsewhere for your problem since I can guarantee you
that this card works fine in over 60 machines that I have put them into.
Most were under ISC.

>.3. My problem is that I've noticed that TCP/IP 1.2 is hanging on me
>.   about once a week now. Sessions will abruptly hang. The processes
>.   and session are still showing as active as reported by 'ps' and
>.   'netstat'. However, there is no way to ping the ISC UNIX box from a
>.   remote station.

I found that ISC TCP/IP seems to get lost from time to time and can no
longer find hosts, even one it is currently talking to.  I run 1 SLIP and
2 Ethernet gates in one machine, and things seem to work better when I use
broadcast and netmask statements in netd.cf instead of what the auto script
generates.  When running a gated system, don't use /etc/gated; it flat has
problems.  Set it up using multiple routed statements for the multiple
sides of the gate.  Even under very heavy load from unix-unix and unix-pci
the network hasn't gone away for a period of months.  I know this is a
backward way around things, but until ISC can fix gated and other things it
is the best that we can do.
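
For anyone writing the broadcast and netmask values by hand, here is a
minimal sketch of how the broadcast address follows from an interface
address and its netmask.  It is in Python purely for illustration; the
helper name broadcast_for and the 192.0.2.10 address are made up for the
example, and the netd.cf syntax itself is not shown because it is not
given in this thread.

```python
import ipaddress


def broadcast_for(address: str, netmask: str) -> str:
    """Return the broadcast address implied by an IPv4 address and netmask.

    Hypothetical helper for illustration; not part of any ISC tool.
    """
    # ip_interface accepts "address/netmask" and derives the network from it.
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    return str(iface.network.broadcast_address)


# Example values are documentation addresses, not taken from the thread.
print(broadcast_for("192.0.2.10", "255.255.255.0"))  # prints 192.0.2.255
```

If the value the auto script generated disagrees with what this kind of
calculation gives for each interface, that would be one thing to check.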

                                                       Bob
--------
 Robert Hoquim - (robert@towers) - voice: 317-255-6807 - fax: 317-259-7289
   Small Systems Specialists - 8500 N. Meridian - Indianapolis, IN 46260
  -- Providing HIGH Performance Unix Systems to YOU is Our ONLY goal! --

brennan@merk.UUCP (Rich Brennan) (06/12/91)

In article <1991Jun11.152602.22404@ico.isc.com> dougm@ico.ISC.COM (Doug McCallum) writes:
>There are a number of known problems with TCP/IP 1.2 that are being worked
>on for the TCP/IP 1.3 release later this summer.  These include a number of
>conditions that cause TCP to get hung connections.  They aren't related to
>the WD driver (there could be other problems there but we haven't seen them).

One of my posted bugs had to do with a panic induced via PC-Interface using
my WD8003. Does PC-Interface use ISC's TCP/IP for communications? If you'd
like, I can try running serial PC-Interface to see if there's some Locus
code panic'ing the system instead of the WD8003 driver. However, note that
I don't get a panic when using TCP/IP - I merely get a hung TCP/IP.

>One of the bugs is aggravated by using the ping command.  You can easily 
>get random hung connections by letting ping just run free.  Packets get lost
>(actually they don't ever get sent out on the net) and sometimes it interferes
>with TCP.

I don't cause the problem with ping, I was using it after the fact to get
some symptoms. I was logged in from my PC using TCP/IP, and after about
30 minutes of doing a "ls -lR" continuously over a rlogin/telnet connection,
the connection simply hung. To see if it was a TCP/IP problem or ethernet,
I did a ping on "localhost". When it answered, I stopped it, and tried to
ping the ethernet interface. There was no answer from that interface.
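
The reasoning in that test can be sketched as a small decision function.
This is only an illustration in Python, assuming three yes/no ping results
as inputs; the function name classify_hang and the wording of the verdicts
are invented for the example, and the mapping is a rule of thumb rather
than a proof of where the fault lies.

```python
def classify_hang(loopback_ok: bool, interface_ok: bool, remote_ok: bool) -> str:
    """Map three ping results to the layer most likely at fault.

    Hypothetical helper: the inputs are whether a ping to localhost, to the
    machine's own Ethernet address, and to a remote host got an answer.
    """
    if not loopback_ok:
        # Even the loopback path is dead, so the protocol stack is suspect.
        return "protocol stack"
    if not interface_ok:
        # Loopback answers but the Ethernet interface does not.
        return "local interface or driver"
    if not remote_ok:
        # The local side answers; look at the wire, the peer, or routing.
        return "cable, remote host, or routing"
    return "no fault seen"


# The case described above: localhost answers, the Ethernet interface does not.
print(classify_hang(loopback_ok=True, interface_ok=False, remote_ok=False))
# prints: local interface or driver
```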

>I suspect that the WD driver is more robust than the 3C503 driver although both
>should be fine.  There may be some systems that have problems with older WD
>cards so hardware can't be ruled out but in those cases the problems will
>usually appear as the network not working at all and will be hardware related
>not software.

I've had a few emails saying that switching from WD8003 to a 3C503 fixed
the problem. One even said going from WD8003 to the WD8013 fixed it.

As far as old hardware, unless my distributor is shipping old stuff, my
controller was purchased about 6 months ago. I am running an 8 MHz bus, so
I shouldn't be beating on the card any worse than a newer '286 AT. The only
difference is that my instructions execute a tad faster, so I could be doing
back-to-back board accesses faster.

By the way, thanks for posting here to keep us up-to-date on tracking this
bug.



Rich
-- 
brennan@merk.com	...!uunet!merk!brennan		Rich Brennan

todd@pinhead.pegasus.com (Todd Ogasawara) (06/18/91)

In article <1991Jun06.224153.209@pinhead.pegasus.com> todd@pinhead.pegasus.com (Todd Ogasawara) writes:
]I'm having a problem with ISC's TCP/IP 1.2 and wanted to get some
]confirmation before I make a call to ISC and start complaining...
]
]Couple of pieces of information...
]3. My problem is that I've noticed that TCP/IP 1.2 is hanging on me
]   about once a week now. Sessions will abruptly hang. The processes
]   and session are still showing as active as reported by 'ps' and
]   'netstat'. However, there is no way to ping the ISC UNIX box from a
]   remote station. The ISC UNIX box itself appears to be running ok, except
]   for the fact that all network capabilities are lost (it can't ping a
]   remote station either).
]4. The only solution to restore TCP/IP seems to be to reboot the ISC UNIX
]   box. This can get pretty annoying if I have to do it once or twice a
]   week.

I received a number of replies (including one from a person at ISC)
confirming that this is a known problem (sigh)... A few other folks
suggested a way to get TCP/IP back without rebooting though. My TCP/IP hung
again this morning (after behaving for 10 days). Here is what I ended up
doing.

	login as root on the console
	issue an 'init 1' to bring the system down to single user mode
	request an 'init 2' from the single user menu
	go to multiuser mode using ^D
	issue an 'init 3' as root after getting back to multi user mode

Everything should be ok at this point. If you require a 'route' and don't
have it in your startup files, you will need to issue the appropriate

	route add xxx.x.x.x yyy.y.y.y 1

command.

Of course, what we really need is a fixed TCP/IP 1.3 from ISC.
-- 
Todd Ogasawara ::: Hawaii Medical Service Association
Internet       ::: todd@pinhead.pegasus.com
Telephone      ::: (808) 536-9162 ext. 7

brennan@merk.UUCP (Rich Brennan) (06/18/91)

In article <4105@merk.UUCP> brennan@merk.UUCP (Rich Brennan) writes:
>In article <1991Jun11.152602.22404@ico.isc.com> dougm@ico.ISC.COM (Doug McCallum) writes:
>>I suspect that the WD driver is more robust than the 3C503 driver although both
>>should be fine.  There may be some systems that have problems with older WD
>>cards so hardware can't be ruled out but in those cases the problems will
>>usually appear as the network not working at all and will be hardware related
>>not software.
>
>I've had a few emails saying that switching from WD8003 to a 3C503 fixed
>the problem. One even said going from WD8003 to the WD8013 fixed it.
[etc.]

Well, here's my posting on my findings. I've emailed it to ISC support, too,
so they don't have to scrounge through c.u.sysv386 to find it:

				----

Here's my update to the PC-Interface panic/TCP hangs with ISC 2.2.1 and
a WD8003.

First, I got a call from ISC support. During the course of conversation
I mentioned that I had installed the "network drivers" update NT00001. He
said I didn't need it, and that I should back it out. I did, and sure
enough "netstat -i" doesn't panic the system anymore.

Next I installed the new 3c503 I purchased. I beat on the system pretty
heavily all weekend, and compared with running with the WD8003:

	1) I never panic'ed when running PC-Interface
	2) VP/IX never died giving me "cannot emulate instruction"
		(or similar) diagnostics
	3) TCP/IP never hung

Granted, I didn't run TCP/IP for a week to see if it hung, so that problem
may still be present (I'll let you know in a week).

Just to sanity check that backing out the NT00001 update didn't fix my
problems, I reinstalled my WD8003 card. I even configured the WD8003 to the
identical I/O and shared memory addresses used by the 3C503.

I booted my system, and within 15 minutes of doing a continuous "ls -lR"
over an rlogin connection, VP/IX trapped out with the above error, and the
rlogin connection hung. Now knowing that "netstat -i" wouldn't panic my
system, I tried it: when netstat tried to retrieve the packet counts from
the WD driver, I received an "ioctl timed out" diagnostic. netstat was
able to get the packet counts from the loopback driver, and I was still
able to rlogin using the loopback driver, i.e. TCP/IP wasn't hung, only
the WD8003 subsystem.

I think I'm going to stand by my guess: there's some race condition in the
optimized WD8003 driver causing it to hang.

I'm willing to try someone's "known good" WD8003 if the consensus is that
my hardware sucks; we can arrange for some security just so you know I'm
not out collecting Ethernet boards :-).



Rich
-- 
brennan@merk.com	...!uunet!merk!brennan		Rich Brennan

news@heitis1.uucp (News Administrator) (06/19/91)

In article <1991Jun11.152602.22404@ico.isc.com> dougm@ico.ISC.COM (Doug McCallum) writes:
>In article <4104@merk.UUCP> brennan@merk.UUCP (Rich Brennan) writes:
>...
>>It looks like a few people are taking the pipe on this one. I've got the
>>same (and other) problems. Wild guess: WD8003 or the WD8003 driver have
>>problems with "high speed" machines, e.g. race conditions in the driver.
>
>There are a number of known problems with TCP/IP 1.2 that are being worked
>on for the TCP/IP 1.3 release later this summer.  These include a number of
>conditions that cause TCP to get hung connections.  They aren't related to
>the WD driver (there could be other problems there but we haven't seen them).
>

If the WD8003 driver for TCP/IP v1.2 is the same as that supplied in the
Network Drivers Supplement available back around March, it has one major
problem.  (According to people at ISC).  The buffer in the WD8003 8-bit card
is something like 128 or 256 bytes.  The buffer for the WD8003 16-bit card
is double whatever the other one is.  The driver was written expecting the
larger buffer, and will sometimes "overflow".  Also, you cannot use an
8-bit card with a 16-bit card as a gateway, apparently it just loses its
mind entirely.
	brian

todd@pinhead.pegasus.com (Todd Ogasawara) (06/19/91)

In article <1991Jun17.201915.12498@pinhead.pegasus.com> todd@pinhead.pegasus.com (Todd Ogasawara) writes:
]In article <1991Jun06.224153.209@pinhead.pegasus.com> todd@pinhead.pegasus.com (Todd Ogasawara) writes:
]]I'm having a problem with ISC's TCP/IP 1.2 and wanted to get some
]]confirmation before I make a call to ISC and start complaining...

]I received a number of replies (including one from a person at ISC)
]confirming that this is a known problem (sigh)... A few other folks
]suggested a way to get TCP/IP back without rebooting though. My TCP/IP hung
]again this morning (after behaving for 10 days). Here is what I ended up
]doing.
]
]	login as root on the console
]	issue an 'init 1' to bring the system down to single user mode
]	request an 'init 2' from the single user menu
]	go to multiuser mode using ^D
]	issue an 'init 3' as root after getting back to multi user mode

Well, I spoke too soon... ISC TCP/IP 1.2 hung twice more today and the
procedure I described above didn't work. I had to run /etc/shutdown and
reboot the machine to get TCP/IP working again.

It also appears that this problem is not uncommon and has been around since
version ISC TCP/IP 1.0. This implies that I should not expect it to work
any better when ISC TCP/IP 1.3 comes out.

I'm not sure I can live with ISC TCP/IP hanging on me even as infrequently
as once a week (let alone twice a day). I've heard that SCO's TCP/IP is
pretty solid. Does anyone know if it will work with ISC UNIX 2.2? If not,
are there any recommendations for another vendor's TCP/IP? Maybe
Wollongong?

-- 
Todd Ogasawara ::: Hawaii Medical Service Association
Internet       ::: todd@pinhead.pegasus.com
Telephone      ::: (808) 536-9162 ext. 7

cpcahil@virtech.uucp (Conor P. Cahill) (06/19/91)

todd@pinhead.pegasus.com (Todd Ogasawara) writes:

>It also appears that this problem is not uncommon and has been around since
>version ISC TCP/IP 1.0. This implies that I should not expect it to work
>any better when ISC TCP/IP 1.3 comes out.

I don't know how common it is supposed to be, but we have ISC running 
TCP/IP with NFS and X to 3 workstations.  We haven't had TCP lock up since
we upgraded to 2.2 last year.   Our systems are usually up for months at
a time.

PS. None of the machines our clients have exhibit that problem.

-- 
Conor P. Cahill            (703)430-9247        Virtual Technologies, Inc.
uunet!virtech!cpcahil                           46030 Manekin Plaza, Suite 160
                                                Sterling, VA 22170 

kdenning@genesis.Naitc.Com (Karl Denninger) (06/19/91)

In article <1991Jun19.013909.690@pinhead.pegasus.com> todd@pinhead.pegasus.com (Todd Ogasawara) writes:
>
>Well, I spoke too soon... ISC TCP/IP 1.2 hung twice more today and the
>procedure I described above didn't work. I had to run /etc/shutdown and
>reboot the machine to get TCP/IP working again.

I've seen this one.

>It also appears that this problem is not uncommon and has been around since
>version ISC TCP/IP 1.0. This implies that I should not expect it to work
>any better when ISC TCP/IP 1.3 comes out.

The worse problem is that any socket which is left in an "accept" state (ie:
waiting for requests) or active for long periods of time will eventually end
up hanging as well -- in a closed, inaccessible state.

Unless you're looking at that particular port, all appears well.  This means
a network monitor which "pings" it once in a while will report all is ok --
but it most certainly is not!
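
A monitor that looks at the particular port rather than pinging the host
might be sketched like this.  It is a minimal Python illustration, not
something run on ISC; the name port_accepts and the default timeout are
assumptions made for the example.  Note that a successful connect only
shows that the system is still accepting connections on that port, not
that the daemon behind it is doing useful work.

```python
import socket


def port_accepts(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    Hypothetical helper for illustration only.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_accepts("printhost", 515) would probe the standard
printer (lpd) port on a machine called printhost, a made-up name here.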

We have the ported LPR/LPD combination, as well as smail3.  Both will hang
up if left in daemon mode.  Smail3 has a solution -- start it from inetd.

LPR/LPD, unfortunately, doesn't -- so you end up with printers that just
"detach" themselves, and printed files pile up on the user's station.

The only way to clear it is to kill all existing processes that have the
port open, and restart the daemon.

Not at all impressive.

This is the only OS (out of 3 other vendors here at NAITC) that I've seen do
this.  ISC hasn't managed to fix it right in three tries.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.

dougm@ico.isc.com (Doug McCallum) (06/20/91)

In article <1991Jun18.184744.357@heitis1.uucp> news@heitis1.uucp (News Administrator) writes:
...
>If the WD8003 driver for TCP/IP v1.2 is the same as that supplied in the
>Network Drivers Supplement available back around March, it has one major
>problem.  (According to people at ISC).  The buffer in the WD8003 8-bit card
>is something like 128 or 256 bytes.  The buffer for the WD8003 16-bit card
>is double whatever the other one is.  The driver was written expecting the
>larger buffer, and will sometimes "overflow".  Also, you cannot use an
>8-bit card with a 16-bit card as a gateway, apparently it just loses its
>mind entirely.

This is nonsense.  The WD8003* boards have an 8K buffer.  The WD8013 cards
may have an 8 or 16K buffer depending on versions, bus, etc.  The driver
was written originally with the ability to use any buffer size.  The very
first version was written using an 8K card (serial number 6 if I remember
correctly) and makes absolutely no assumptions about a minimum or maximum
buffer size.

There is an internal "page" size of 256 bytes that is used by the hardware
but this is not something that affects performance.

An 8K card can overflow if used under heavy load or if packets are sent
at it too fast.  In this case it discards the packet that won't fit in the
receive buffer.  It doesn't lose its mind.  What you were told just sounds 
like someone making up an excuse when they don't understand a problem.
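
The overflow behavior described above can be illustrated with a toy model
in Python.  It assumes, for simplicity, that the whole 8K buffer is used
for receive and that each packet occupies whole 256-byte pages with no
per-packet header; the class name ReceiveRing and its methods are invented
for this sketch and do not come from the driver.

```python
PAGE_SIZE = 256          # hardware page size mentioned above
BUFFER_SIZE = 8 * 1024   # 8K on-board buffer mentioned above


class ReceiveRing:
    """Toy model of a paged receive buffer that drops packets when full."""

    def __init__(self, buffer_size: int = BUFFER_SIZE, page_size: int = PAGE_SIZE):
        self.page_size = page_size
        self.total_pages = buffer_size // page_size
        self.used_pages = 0
        self.queue = []      # page counts of packets waiting for the host
        self.dropped = 0

    def pages_for(self, length: int) -> int:
        # Ceiling division: a packet always occupies whole pages.
        return -(-length // self.page_size)

    def receive(self, length: int) -> bool:
        """Store a packet if it fits; otherwise discard it and count the drop."""
        need = self.pages_for(length)
        if need > self.total_pages - self.used_pages:
            self.dropped += 1
            return False
        self.queue.append(need)
        self.used_pages += need
        return True

    def host_reads_one(self) -> None:
        """The host removes the oldest packet, freeing its pages."""
        if self.queue:
            self.used_pages -= self.queue.pop(0)


ring = ReceiveRing()
# Eight 1500-byte packets arrive before the host reads any of them.
accepted = sum(ring.receive(1500) for _ in range(8))
print(accepted, ring.dropped)  # prints: 5 3
```

In this model the card keeps working after a drop: once the host reads a
packet, the freed pages are available for the next arrival.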

Where you will get real problems is in mixing 16-bit and 8-bit cards
on the same system.  The first Network Drivers Supplement didn't go
through the contortions necessary to work around the wonderful feature
of the AT bus where ALL memory access in the 128K region the board's
RAM appears in must be in 16 bit mode.  This is compensated for in the
second release of the drivers (I don't know if it is shipping yet or not).
To compensate requires a lot of work to avoid what happens if you don't.

If you have the bus in 16 bit mode and try to access 8 bit RAM, you get
trash (well, consistent trash of 0xFF) in the odd bytes.  This is a design
feature of the bus and is the same reason you get problems with 16 bit
VGA boards and 8 bit LAN boards at the same time.
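
The symptom is easy to picture with a small model.  The Python below only
imitates the result described above (0xFF in the odd bytes); it says
nothing about the bus electronics, and the function name and sample bytes
are made up for the illustration.

```python
def read_in_16bit_mode(ram_8bit: bytes) -> bytes:
    """Model a 16-bit access to 8-bit RAM: odd-numbered bytes read as 0xFF."""
    out = bytearray(ram_8bit)
    for i in range(1, len(out), 2):
        out[i] = 0xFF
    return bytes(out)


# Arbitrary sample contents of the board's 8-bit RAM.
data = bytes([0x08, 0x00, 0x2B, 0x12, 0x34, 0x56])
print(read_in_16bit_mode(data).hex())  # prints: 08ff2bff34ff
```

Every second byte of a packet is lost, so anything the driver reads from
the board in that state is unusable.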

Doug McCallum
Interactive Systems Corp
dougm@ico.isc.com