THIER@ORCAD2.dnet.ge.com (05/14/91)
My original question:

> In attempting to run TCP socket transfers between two processes
> residing in the same SPARCstation host, I am experiencing system
> lock-up after some number of transfers have taken place (the number
> of transfers varies; it's usually between 950 and 2850 with 30 ms
> intermessage spacing). Message sizes are 4496 octets. System
> configuration is:
>
>   SPARCstation 1+
>   SunOS 4.1
>   GENERIC Kernel.
>   16 MB memory.
>   210 MB disk.
>   56 MB swap.
>   Send/Receive Queues = 4096

The trophies go to the following, who discerned that I was in need of
the Loopback patch (100159-01):

    Hal Stern        stern@sunne.East.Sun.COM
    Mark Plotnick    mp@allegra.att.com
    Daniel Quinlan   danq%chs@boulder.colorado.edu

I was surprised to find that the loopback driver gets invoked even
without using the loopback address (thanks again to Hal Stern):

    "when the IP layer sees dest IP == my IP, it gives the packet
    to the lo device driver instead of the ie or le driver."

Other Responses:
---------------

From: Lixia Zhang lixia@parc.xerox.com

Don't know if you've already got help from others. To me the system
resource you ran out of seems to be the ports. I believe you can only
have a limited number of active TCP ports. When transmission is
finished, the closing connection must wait for a 2*T time period before
the port is freed, where T is the max lifetime of a pkt in the net
(this wait period is for reliability reasons, to make sure all previous
pkts have died before the port can be reused). You may check the
appendix of RFC1185 to find out how long this T value is (if my memory
is not faulty, I remember it is mentioned there).

From: Mike Raffety mcnc!oddjob.uchicago.edu!oconnor!miker

I don't suppose your transmitting host has been up a long time (e.g.,
100-130 days), has it? I discovered a bug a year-ish ago in Sun's TCP
code. It's rather complex, but let me try to explain it ...

When you open a TCP connection, a byte counter is assigned to it, which
is simply a copy of a system counter that ROLLS OVER at 2^32, or about
18 weeks. Once the stream is opened, and the counter for that stream
initialized, that counter is incremented by one for each byte/octet
transmitted. If your machine is up long enough, and you transmit enough
data, eventually that stream-specific counter rolls over (the closer to
that magic 18 +/- weeks, the less data it takes to get there).

The RECEIVING TCP side DOESN'T roll over properly, so it fails to
recognize the packet after the rollover occurs, and asks for a
retransmit of the "right" packets. With backoff algorithms, this
quickly settles down to near-silence. Once the SYSTEM counter rolls
over, everything works fine again ... until the next rollover
approaches.

Many thanks to all who responded and to Mike Fischbein at Sun's Albany
office for mailing me the patch.

John Thier
GE Defense Systems Division
Pittsfield MA
thier@orcad2.dnet.ge.com
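
For anyone who wants to see the rollover problem from Mike Raffety's
response concretely, here is a minimal sketch. It is not taken from the
SunOS TCP code, and the function names seq_lt_naive and seq_lt are
hypothetical. It only assumes that the per-connection counter is a
32-bit value that wraps modulo 2^32, and shows how a plain integer
comparison misorders data after the wrap while a modular comparison
does not.

```python
MOD = 1 << 32   # assumption: 32-bit counter that wraps modulo 2**32
HALF = 1 << 31  # half of the sequence space


def seq_lt_naive(a, b):
    """Plain integer comparison; gives the wrong order once b has wrapped."""
    return a < b


def seq_lt(a, b):
    """True if a precedes b in modulo-2**32 sequence space.

    Treats (a - b) mod 2**32 as a signed 32-bit value, which is only
    meaningful while a and b are less than 2**31 apart.
    """
    return (a - b) % MOD >= HALF


# Hypothetical values: one segment sent just before the wrap,
# the next one just after it.
before_wrap = 0xFFFFFF00
after_wrap = 0x00000100

print("naive:  ", seq_lt_naive(before_wrap, after_wrap))
print("modular:", seq_lt(before_wrap, after_wrap))
```

A receiver using something like the naive comparison would see the
post-wrap data as older than what it already has, which would be
consistent with the retransmit requests described above; whether the
actual bug had this form is not something this post establishes.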