netser%limbo.uci.edu@icsg.UCI.EDU (Richard Johnson) (12/18/86)
One of our Computing Support people here (Scott Menter) noticed a strange problem today.  We investigated and we don't know exactly what to make of the situation.  Let me explain:

1) You rlogin from your Sun workstation (a Sun-3/50 in this case) to
   another system on the network.
2) Your Sun workstation crashes.
3) After rebooting, you try to rlogin to the same system again and you
   can't, even after multiple tries.

We investigated and found that Sun seems to always allocate the first unused port number above 1021 for an rlogin connection.  Since the other end of the rlogin will stick around until some I/O forces it to recognize the connection is broken (we just cat'ed to the pty on the remote system and it closed), you'll get the same hosta:porta - hostb:portb pair EVERY time, and that HAS to be rejected by the remote system because of the way TCP connections are defined!

Of course, all you have to do to work around it is just rlogin to some OTHER system and then rlogin to the one you want!

Is this a bug?  Am I missing something?  (By the way, this is Sun 3.0.)

----------------------------------------------------------------------------
Richard Johnson                      netser@ics.uci.edu        (Internet)
UCI ICS Network Services             ...!ucbvax!ucivax!netser  (UUCP)
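The collision described above can be sketched with a toy model (this is not Sun's actual allocator, just an illustration of "first unused port above 1021"): after a reboot wipes the local port table, the very same local port gets handed out again, reproducing the old 4-tuple that the remote's half-open connection still holds.  The host names are hypothetical; 513 is rlogin's well-known server port.

```python
def alloc_port(in_use, base=1021):
    """Toy allocator: return the first free port number above `base`."""
    port = base + 1
    while port in in_use:
        port += 1
    return port

# Before the crash: no local ports in use, so rlogin gets port 1022.
in_use = set()
porta = alloc_port(in_use)
in_use.add(porta)
tuple_before = ("hosta", porta, "hostb", 513)

# Crash and reboot: the in-use table is wiped, but the remote end still
# remembers the old connection.  The allocator hands out the same port.
in_use = set()
tuple_after = ("hosta", alloc_port(in_use), "hostb", 513)

print(tuple_before == tuple_after)   # True: identical 4-tuple every time
```

Since a TCP connection is identified by exactly this (local host, local port, remote host, remote port) tuple, the remote system sees an attempt to open a connection it believes already exists.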
rackow@anl-mcs.arpa (Gene Rackow) (12/18/86)
We have the same problem on our network.  The workaround that I have found to work is:

1. On the Sun, do a "rlogin machine &", and while that is timing out,
2. do another "rlogin machine".  This one will now get into the remote
   machine.
3. Do a "who" on the remote machine to find the ghost user on hostb.
4. Kill the login shell of the ghost user.

From here, until the next crash, rlogins work properly.

I have heard rumors that this problem is corrected in 4.3BSD and/or Sun 3.2.  Can anyone confirm/deny this rumor?

Gene Rackow
rackow@anl-mcs.arpa
312-972-7126
narten@purdue.EDU (Thomas Narten) (12/18/86)
This may be a feature of Sun UNIX, but it is probably not restricted to it.  It is caused by two problems:

1) Unix has a keepalive option on sockets that times out (breaks)
   connections if the peer in the connection goes away.  For TCP, "going
   away" is defined as not having received any packets from the peer in
   X amount of time.  Rlogind uses this option.

2) Sun diskless machines reboot much more quickly than normal Unix
   machines, because they don't have large disks for fsck to churn away
   on.  In particular, they are back up and running before old
   connections have timed out due to (1).

(1) is implemented by running a timer that expires whenever no packets have been exchanged for a certain period of time.  When the timer expires, TCP sends a one-byte data segment that is outside of its send window (i.e. it already has an ACK for that sequence number).  The peer TCP, on receiving the segment, notes that it already has the data and sends back an ACK for the sequence number that it expects to see.  The client TCP gets that ACK and updates its timer, indicating that the connection is still alive.  The connection eventually breaks if no ACKs are received.

This works just fine as long as both TCPs are still there, or if one end of the TCP connection goes away in the sense that the host is unreachable.  On the other hand, if one machine crashes and reboots quickly, the following occurs: the client TCP sends a keepalive packet, which the peer TCP receives.  Now, however, there is no protocol control block for that connection, so the peer TCP sends back a RESET.  The client TCP receives the packet, updates its keepalive timer (hmm... I got a packet, the connection must still be fine), then checks the sequence numbers that were ACKed.  The ACK is outside of its receive window and there was no data sent in the segment, so TCP drops the packet, ignoring the RESET.  (This follows the TCP spec.)
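The self-defeating loop described above can be caricatured in a few lines of Python (a toy model, not BSD code; the probe counter and the `KEEPALIVE_LIMIT` threshold are invented for illustration): the arriving RST refreshes the keepalive timer before it is examined and dropped, so the timer never runs out.

```python
KEEPALIVE_LIMIT = 8   # probes without a valid response before giving up

def probe_half_open(rounds):
    """Each round: the keepalive timer fires and a probe is sent; the
    rebooted peer, having no protocol control block, answers with a RST
    that ACKs a sequence number outside our receive window."""
    idle_timer = 0
    for _ in range(rounds):
        idle_timer += 1                 # timer expired, probe goes out
        rst = {"in_window": False}      # peer's reply: out-of-window RST
        idle_timer = 0                  # ANY arriving segment refreshes
                                        # the keepalive timer...
        if not rst["in_window"]:
            continue                    # ...and the RST is then dropped
        return "CLOSED"                 # never reached in this scenario
    return "ESTABLISHED" if idle_timer < KEEPALIVE_LIMIT else "CLOSED"

print(probe_half_open(1000))   # ESTABLISHED: the connection never breaks
```

However many probes are sent, the connection stays "alive" forever, which matches the observed behavior of rlogind sessions surviving a client crash indefinitely.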
>Since the other end of the rlogin will stick around until some I/O
>forces it to recognize the connection is broken (we just cat'ed to the
>pty on the remote system and it closed),

This results from the RESET being ignored because it is not within the receive window.  If you force the TCP to send real data, the ACK that gets returned will be within the receive window, and the RESET causes the connection to break.

One workaround is to change the line in tcp_input():

	tp->t_timer[TCPT_KEEP] = TCPTV_KEEP;

to something like:

	if ((tiflags & TH_RST) == 0)
		tp->t_timer[TCPT_KEEP] = TCPTV_KEEP;

This will cause the connection to eventually time out.  Both 4.2 and 4.3 BSD suffer from this problem.

>1) You rlogin from your sun workstation (Sun-3/50 in this case) to another
>   system on the network.
>2) Your sun workstation crashes.
>3) After rebooting you try to rlogin to the same other system again and
>   you can't even after multiple tries.

I tried to duplicate your behavior on our Sun machines running NFS 3.2, trying to connect to 4.2, 4.3, and NFS 3.0 machines.  (I don't have a 3.0 machine handy that I can crash at will.)  I would rlogin to host A, reboot the workstation, and rlogin to A again.  Each time, I was able to rlogin successfully.  Each connection used the same port numbers.

Note that under normal conditions, the following packet exchange takes place:

	A                                   B
	                                    (thinks connection is established)
	send SYN, SEQ=n, ACK=0
	                                    gets SYN, sends back ACK=m, SEQ=o
	gets ACK, notices sequence
	number is not what it expects
	& replies with: ACK=0, SEQ=m,
	RESET
	                                    gets RESET, drops connection and
	                                    sends back RESET, ACK=m, SEQ=o

At this point the "old" rlogin has gone away, and the next SYN will cause the connection to become established properly.

I suppose that things could break if the sequence number chosen by A was the same as B was expecting, but that would be an awful coincidence.  It is the case, however, that when a machine reboots, it starts with an initial sequence number of 0.
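The effect of the tcp_input() patch above can be shown with the same toy model as before (again an invented simplification, not the kernel code; `KEEPALIVE_LIMIT` is a made-up threshold): because a segment carrying the RST flag no longer refreshes the keepalive timer, the repeatedly ignored RSTs let the timer run out and the half-open connection finally dies.

```python
KEEPALIVE_LIMIT = 8   # probes without a valid response before giving up

def probe_half_open_patched(rounds):
    """Same scenario as an unpatched client, but with the patched rule:
    refresh the keepalive timer only when (tiflags & TH_RST) == 0."""
    idle_timer = 0
    for _ in range(rounds):
        idle_timer += 1                      # timer expired, probe goes out
        rst = {"RST": True, "in_window": False}
        if not rst["RST"]:                   # patched check: only non-RST
            idle_timer = 0                   # segments refresh the timer
        if idle_timer >= KEEPALIVE_LIMIT:
            return "CLOSED"                  # connection finally times out
        # the out-of-window RST is still dropped, per the spec
    return "ESTABLISHED"

print(probe_half_open_patched(1000))   # CLOSED after KEEPALIVE_LIMIT probes
```

The RST is still discarded exactly as the spec requires; only the timer bookkeeping changes, which is why the connection breaks by timeout rather than by honoring an out-of-window RESET.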
If your machine crashes several times in quick succession, it is possible that the sequence numbers on the peer connection could also be very low.  Still, I find it hard to believe that this is the cause of the problem.  Do you have any way of determining what sequence numbers are involved in the connections, or what sort of packets are floating around for the connection in question?

Thomas Narten
narten@purdue.EDU  or  {ihnp4, allegra}!purdue!narten
thomson@uthub.toronto.edu (Brian Thomson) (12/22/86)
I submitted a fix for this problem to net.bugs.4bsd in July, 1985.  The TCP connection establishment protocol is supposed to recover from these 'half-open' connections, but a problem with the 4.2BSD implementation prevented it from working properly.  4.3 has apparently adopted the same fix I proposed, although because of interaction with other BSD TCP bugs I no longer use it in its original form.  Presumably, later Sun distributions made a similar fix.
-- 
Brian Thomson, CSRI Univ. of Toronto
{linus,ihnp4,uw-beaver,floyd,utzoo}!utcsri!uthub!thomson