[comp.protocols.tcp-ip] SUMMARY: Host Lockup During Socket Transfers

THIER@ORCAD2.dnet.ge.com (05/14/91)

My original question:

>   In attempting to run TCP socket transfers between two processes
>residing in the same SPARCstation host, I am experiencing system 
>lock-up after some number of transfers have taken place (the number
>of transfers varies; it's usually between 950 and 2850 with 30 ms.
>intermessage spacing). Message sizes are 4496 octets. System 
>configuration is:
>
>                  SPARCstation 1+
>                  SunOS 4.1
>                  GENERIC Kernel.
>                  16 MB memory.
>                  210 MB disk.
>                  56 MB swap.
>                  Send/Receive Queues = 4096
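
For anyone who wants to try a comparable setup, here is a minimal sketch of the kind of test described above. The message size, spacing, and queue size come from the question; everything else is an assumption made for illustration: a single long-lived connection, a receiver thread standing in for the second process, and the hypothetical names `run_transfers`, `receiver`, and `recv_exact`. It is not the original test program.

```python
import socket
import threading
import time

MESSAGE_SIZE = 4496      # octets per message (from the question)
SPACING_SECONDS = 0.030  # 30 ms intermessage spacing (from the question)
QUEUE_SIZE = 4096        # send/receive queue size (from the question)


def recv_exact(conn, size):
    """Read exactly `size` bytes, or fewer if the peer closes first."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = conn.recv(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def receiver(listener, counts):
    """Accept one connection and count the complete messages received."""
    conn, _ = listener.accept()
    with conn:
        while len(recv_exact(conn, MESSAGE_SIZE)) == MESSAGE_SIZE:
            counts.append(1)


def run_transfers(num_messages, spacing=SPACING_SECONDS, host="127.0.0.1"):
    """Send fixed-size messages to a receiver on this host; return how many arrived."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, QUEUE_SIZE)
    listener.bind((host, 0))  # port 0: let the system pick a free port
    listener.listen(1)
    counts = []
    thread = threading.Thread(target=receiver, args=(listener, counts))
    thread.start()

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, QUEUE_SIZE)
    sender.connect(listener.getsockname())
    payload = bytes(MESSAGE_SIZE)
    for _ in range(num_messages):
        sender.sendall(payload)
        time.sleep(spacing)
    sender.close()  # receiver sees end-of-stream and stops counting
    thread.join()
    listener.close()
    return len(counts)


if __name__ == "__main__":
    print(run_transfers(10))
```

Passing the machine's own network address as `host` instead of 127.0.0.1 exercises the case where the loopback address is not used explicitly.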
 

The trophies go to the following, who discerned that I was in need
of the Loopback patch (100159-01):


Hal Stern        stern@sunne.East.Sun.COM
Mark Plotnick    mp@allegra.att.com
Daniel Quinlan   danq%chs@boulder.colorado.edu

I was surprised to find that the loopback driver gets invoked even
without using the loopback address (thanks again to Hal Stern):
 
    "when the IP layer sees dest IP == my IP, it gives the packet
    to the lo device driver instead of the ie or le driver."
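
The decision Hal describes can be pictured with a small sketch. This is only an illustration of the rule in the quote, not SunOS kernel code; the function name `pick_output_driver` and the example addresses are hypothetical.

```python
def pick_output_driver(dest_ip, local_ips, network_driver="le"):
    """Return "lo" when the destination is one of this host's own addresses.

    local_ips is the set of addresses configured on this host, including
    the loopback address itself. Any other destination goes to the
    network driver ("ie" or "le" in the quote).
    """
    if dest_ip in local_ips:
        return "lo"
    return network_driver


# Hypothetical host with one Ethernet address plus the loopback address.
my_addresses = {"127.0.0.1", "192.0.2.10"}

print(pick_output_driver("192.0.2.10", my_addresses))  # lo
print(pick_output_driver("192.0.2.99", my_addresses))  # le
```

So a transfer addressed to the host's own Ethernet address still takes the loopback path, which is why the loopback patch applies.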
 


Other Responses:
---------------
From:  Lixia Zhang  lixia@parc.xerox.com

Don't know if you've already got help from others.  To me the system
resource you ran out of seems to be ports.  I believe you can only
have a limited number of active TCP ports.  When transmission is finished,
the closing connection must wait for a period of 2*T before the port is
freed, where T is the maximum lifetime of a packet in the net (this wait
is for reliability, to make sure all previous packets have died before
the port can be reused).  You may check the appendix of RFC 1185 to find
out how long this T value is (if my memory is not faulty, I remember it
is mentioned there).
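
The arithmetic behind this suggestion can be sketched as follows. It is a deliberately simplified model: every transfer is assumed to open and close its own connection, and each closed connection is assumed to hold one port for 2*T. The function names and the example figures (4000 usable ports, T = 30 seconds) are illustrative assumptions, not values from the post or from SunOS.

```python
def max_new_connections_per_second(usable_ports, t_seconds):
    """Highest steady rate of new connections when each closed
    connection keeps its port busy for 2*T seconds."""
    return usable_ports / (2 * t_seconds)


def seconds_until_ports_exhausted(usable_ports, connections_per_second, t_seconds):
    """Time until every port is held in the 2*T wait, or None if the
    given rate never uses up the ports."""
    wait = 2 * t_seconds
    if connections_per_second * wait <= usable_ports:
        return None  # ports are released at least as fast as they are taken
    return usable_ports / connections_per_second


# Illustrative figures only (assumed, not from the post).
print(seconds_until_ports_exhausted(4000, 100, 30))  # 40.0
print(seconds_until_ports_exhausted(4000, 10, 30))   # None
```

Whether this applies depends on how the test opens connections; with a single long-lived connection the 2*T wait would not come into play during the transfers.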
 


From:  Mike Raffety  mcnc!oddjob.uchicago.edu!oconnor!miker

I don't suppose your transmitting host has been up a long time (e.g.,
100-130 days), has it?  I discovered a bug a year-ish ago in Sun's TCP
code.  It's rather complex, but let me try to explain it ...
 
When you open a TCP connection, a byte counter is assigned to it, which
is simply a copy of a system counter that ROLLS OVER at 2^32, or about
18 weeks.  Once the stream is opened, and the counter for that stream
initialized, that counter is incremented by one for each byte/octet
transmitted.  If your machine is up long enough, and you transmit
enough data, eventually that stream-specific counter rolls over (the
closer to that magic 18 +/- weeks, the less data it takes to get
there).  The RECEIVING TCP side DOESN'T roll over properly, so it fails
to recognize the packet after the rollover occurs, and asks for a
retransmit of the "right" packets.  With backoff algorithms, this
quickly settles down to near-silence.  Once the SYSTEM counter rolls
over, everything works fine again ... until the next rollover
approaches.
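
To make the rollover problem concrete, here is a small sketch of how 32-bit sequence numbers can be compared so that a wrap past 2^32 is handled, next to a naive comparison that breaks. It illustrates the general class of failure Mike describes and is not Sun's code; the function names are hypothetical.

```python
SEQ_MODULUS = 2 ** 32  # the counters are 32 bits wide and wrap at 2^32
HALF_RANGE = 2 ** 31


def seq_add(seq, nbytes):
    """Advance a sequence number by nbytes, wrapping at 2^32."""
    return (seq + nbytes) % SEQ_MODULUS


def seq_lt(a, b):
    """True if a comes before b, treating the numbers modulo 2^32.

    Valid as long as the two values are less than 2^31 apart.
    """
    return (a - b) % SEQ_MODULUS >= HALF_RANGE


def naive_lt(a, b):
    """Plain integer comparison, which goes wrong across a rollover."""
    return a < b


before = 0xFFFFFFF0           # just short of the rollover
after = seq_add(before, 32)   # 32 more octets sent; the counter has wrapped

print(after)                    # 16
print(seq_lt(before, after))    # True
print(naive_lt(before, after))  # False
```

A receiver using something like the naive comparison would treat data sent after the wrap as older than data it already has, which matches the symptom of endless retransmission requests.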


Many thanks to all who responded and to Mike Fischbein at Sun's Albany
office for mailing me the patch.


John Thier
GE Defense Systems Division
Pittsfield MA
thier@orcad2.dnet.ge.com