[comp.sys.next] buggy tcp interactions

mdixon@thelonius.PARC.xerox.com (Mike Dixon) (08/23/89)

i spend a fair amount of time rlogin'd from my next to a nearby sun,
and occasionally get into problems where the connection seems to go
"out of sync" -- the last packet worth of output isn't displayed until
i type another character or a couple of seconds goes by (making it
feel as if the sun has suddenly become rather slow).

this afternoon when it happened i tcpdumped from another machine, and
looking at the trace it's clear that the next is getting into some
mode where it simply doesn't ack packets.  when this happens, there
doesn't seem to be any reasonable way to fix it (e.g. logging out of
the sun and back in again doesn't help; presumably rebooting the next
would).  on the other hand, it usually fixes itself after five or ten
minutes.

anyone have any idea what's going on here?
------------------------------------------------------------
watcher# tcpdump between sun next
# i type 'ls' followed by cr
17:55:41.81  next.1023 > sun.login: P -1653878447:-1653878446(1) ack 675399367 win 4096
17:55:41.83  sun.login > next.1023: P 1:2(1) ack 1 win 4096
17:55:41.89  next.1023 > sun.login: P 1:2(1) ack 2 win 4096
17:55:41.89  sun.login > next.1023: P 2:3(1) ack 2 win 4096
17:55:42.01  next.1023 > sun.login: P 2:3(1) ack 3 win 4096
# it echoes the cr, then sends the output.  the next displays the
# cr/lf immediately but not the ls output, and acks neither
17:55:42.01  sun.login > next.1023: P 3:5(2) ack 3 win 4096
17:55:42.71  sun.login > next.1023: P 3:597(594) ack 3 win 4096
# the sun times out & resends, this time the next acks
17:55:44.71  sun.login > next.1023: P 3:597(594) ack 3 win 4096
17:55:44.73  next.1023 > sun.login: . ack 597 win 4096
# i just type a cr
17:55:48.61  next.1023 > sun.login: P 3:4(1) ack 597 win 4096
# this packet is displayed immediately, but not acked
17:55:48.61  sun.login > next.1023: P 597:599(2) ack 4 win 4096
# eventually the sun gives up and sends it again, along with the
# prompt
17:55:52.21  sun.login > next.1023: P 597:606(9) ack 4 win 4096
# and the next *still* doesn't echo until i type another cr...
17:55:56.83  next.1023 > sun.login: P 4:5(1) ack 606 win 4096
# more of same
17:55:56.85  sun.login > next.1023: P 606:608(2) ack 5 win 4096
17:56:00.71  sun.login > next.1023: P 606:615(9) ack 5 win 4096
17:56:03.99  next.1023 > sun.login: P 5:6(1) ack 615 win 4096
17:56:04.01  sun.login > next.1023: P 615:617(2) ack 6 win 4096
17:56:08.71  sun.login > next.1023: P 615:624(9) ack 6 win 4096
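
The retransmissions in the trace above can be picked out mechanically. What
follows is a minimal Python sketch, not part of the original post: the names
SEGMENT and find_retransmissions are illustrative, and the regular expression
assumes lines shaped exactly like the tcpdump output shown here. It flags any
data segment whose starting sequence number the same sender has already used.

```python
import re

# Assumed line shape, copied from the trace above, e.g.
# "17:55:42.71  sun.login > next.1023: P 3:597(594) ack 3 win 4096"
SEGMENT = re.compile(
    r"^(\d+):(\d+):(\d+\.\d+)\s+(\S+) > (\S+): \S+ (-?\d+):(-?\d+)\((\d+)\)"
)


def find_retransmissions(lines):
    """Return (sender, first_seq, end_seq, seconds_since_last_send) for every
    data segment whose starting sequence number the sender already used."""
    last_sent = {}  # (sender, first_seq) -> time of the most recent send
    found = []
    for line in lines:
        m = SEGMENT.match(line.strip())
        if m is None:  # comment lines and pure ACKs carry no data range
            continue
        hh, mm, ss, sender, _receiver, first, end, _length = m.groups()
        when = int(hh) * 3600 + int(mm) * 60 + float(ss)
        key = (sender, int(first))
        if key in last_sent:
            gap = round(when - last_sent[key], 2)
            found.append((sender, int(first), int(end), gap))
        last_sent[key] = when
    return found


# A few lines taken verbatim from the trace above.
TRACE = """\
17:55:42.01  sun.login > next.1023: P 3:5(2) ack 3 win 4096
17:55:42.71  sun.login > next.1023: P 3:597(594) ack 3 win 4096
17:55:44.71  sun.login > next.1023: P 3:597(594) ack 3 win 4096
17:55:44.73  next.1023 > sun.login: . ack 597 win 4096
17:55:48.61  next.1023 > sun.login: P 3:4(1) ack 597 win 4096
17:55:48.61  sun.login > next.1023: P 597:599(2) ack 4 win 4096
17:55:52.21  sun.login > next.1023: P 597:606(9) ack 4 win 4096
""".splitlines()

for sender, first, end, gap in find_retransmissions(TRACE):
    print(f"{sender} resent from seq {first} (now {first}:{end}) after {gap:.2f}s")
```

On these lines it reports only the sun resending data (from sequence 3 and
from 597), along with the time since that data was last sent. The next never
appears as a retransmitter, which matches the reading that its ACKs, not its
data, are what go missing.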


--
Mike Dixon                  Xerox PARC              mdixon@arisia.xerox.com

gerrit@nova.cc.purdue.edu (Gerrit) (08/23/89)

In article <MDIXON.89Aug22182134@thelonius.PARC.xerox.com> mdixon@thelonius.PARC.xerox.com (Mike Dixon) writes:
>i spend a fair amount of time rlogin'd from my next to a nearby sun,
>and occasionally get into problems where the connection seems to go
>"out of sync" -- the last packet worth of output isn't displayed until
>i type another character or a couple of seconds goes by (making it
>feel as if the sun has suddenly become rather slow).

This is a problem with the NeXT version of TCP/IP.  We reported the
bug about the time they cut the 0.9 disks, but the bugfix didn't
make it to the 0.9 distribution.  The bug is fixed in 1.0.

The "solutions" (haha) are to either reboot or wait it out.  Think of
it this way - it's incentive to convert to 1.0 as soon as you get
the upgrade kit.  :-)

Gerrit Huizenga
NeXT Workstation Support
Purdue University Computing Center
gerrit@mentor.cc.purdue.edu

abe@mace.cc.purdue.edu (Vic Abell) (08/23/89)

In article <MDIXON.89Aug22182134@thelonius.PARC.xerox.com> mdixon@thelonius.PARC.xerox.com (Mike Dixon) writes:
>i spend a fair amount of time rlogin'd from my next to a nearby sun,
>and occasionally get into problems where the connection seems to go
>"out of sync" -- the last packet worth of output isn't displayed until
>i type another character or a couple of seconds goes by (making it
>feel as if the sun has suddenly become rather slow).

We reported this problem to NeXT some time ago and were told that it is
the fault of a TCP/IP bug in the kernel and will not be fixed until the 
1.0 release.  Our analysis suggests that the NeXT TCP/IP is employing 
delayed ACKs - wherein an ACK is not sent immediately but is queued for
possible piggyback on a following transmission, or after a timer expires
and forces its solo transmission.  Without kernel sources we were unable
to determine the exact nature of the kernel bug.
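
The delayed-ACK behavior described above can be modeled in a few lines. This
is a hedged Python sketch of how a correctly working delayed ACK is expected to
behave, not a reconstruction of the NeXT kernel code, which the posters did not
have. The function name ack_departure and the 0.2 second default delay are
assumptions chosen for illustration.

```python
def ack_departure(data_arrival, outgoing_sends, ack_delay=0.2):
    """Model when a delayed-ACK receiver acknowledges data that arrived at
    `data_arrival` (seconds). `outgoing_sends` lists the times at which the
    receiver has data of its own to send. Returns (time, how).

    ack_delay=0.2 is an assumed timer value, not one taken from the posts.
    """
    deadline = data_arrival + ack_delay
    # The earliest outgoing segment inside the delay window carries the ACK.
    riders = [t for t in outgoing_sends if data_arrival <= t <= deadline]
    if riders:
        return (min(riders), "piggyback")
    # Otherwise the timer expires and the ACK is sent on its own.
    return (deadline, "timer")


# The receiver sends a keystroke shortly after the data arrives.
print(ack_departure(1.00, [1.05]))
# The receiver has nothing of its own to send.
print(ack_departure(1.00, []))
```

In this model an ACK always leaves by the deadline, whether or not the receiver
has anything to send. The "piggyback" branch is the path that typing a
character exercises: the keystroke produces an outgoing packet that the pending
ACK can ride in.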

The current, unpopular workaround most of us use is to type ^V.  That
forces the NeXT to send a packet to the remote peer in which the ACK can
ride.  Often it helps to reboot the NeXT workstation (another, even more
unpopular workaround).