[net.lan] Sun TCP bug

randy@mc0.UUCP (randy nerwick) (03/22/86)

We have a Sun 2/170 running 2.2 equipped with a Sun ethernet board which
seems to have a problem when receiving a stream of packets from another
machine.  The advertised window becomes very large (> 60000) about the
same time a packet is missed.  The window starts out at 2048 and usually
decreases before becoming large.  The Sun also occasionally misses packets
and sets the window to zero well before the window has been filled.  
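
Both symptoms can be picked out of a packet trace by scanning the
advertised windows.  What follows is a minimal sketch, not a diagnosis.
The function name, the trace format and the sample numbers are
assumptions made for illustration; only the 2048-byte initial window
and the "> 60000" value come from the report above.  The last line
shows that a 16-bit window field holds such a value if a small negative
number is stored in it, which is one possible explanation and nothing
more.

```python
INITIAL_WINDOW = 2048  # initial advertised window, from the report above


def flag_window_anomalies(trace, initial_window=INITIAL_WINDOW):
    """Return (index, reason) pairs for suspicious advertised windows.

    trace is a list of (advertised_window, bytes_in_window) tuples, one
    per segment sent by the receiver.  This format is hypothetical.
    """
    anomalies = []
    for index, (window, bytes_in_window) in enumerate(trace):
        if window > initial_window:
            # The receiver never offered this much buffer space.
            anomalies.append((index, "window larger than initial"))
        elif window == 0 and bytes_in_window < initial_window:
            # Window closed although the offered space was not used up.
            anomalies.append((index, "zero window before window filled"))
    return anomalies


# Made-up trace: the window shrinks, jumps above 60000, then closes early.
sample_trace = [(2048, 0), (1920, 128), (1792, 256), (65408, 384), (0, 512)]
anomalies = flag_window_anomalies(sample_trace)
for index, reason in anomalies:
    print(index, reason)

# A 16-bit field holding -128 reads back as a very large window.
wrapped = -128 & 0xFFFF
print(wrapped)
```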

The problem originally showed up as very poor throughput while running
performance tests.  With a Sun 2/170 running 1.4 and a 3Com ethernet board
transmitting a stream of 1000 messages of a fixed size to the Sun running
2.2, many re-transmissions were occurring at small message sizes, particularly
128 byte messages.  The problem also showed up with larger sized messages
which were split into a combination of large and small packets for
transmission.

The problem also shows up with a Vax 750 running VMS 4.1 and an Excelan TCP/IP
package transmitting to the Sun running 2.2, although the throughput is
affected much less than with the two Suns.  We have seen the same problem
with a Sun 2/160 running 2.0 receiving from the Sun running 1.4.

Has anyone else observed these problems?
-- 
Randy Nerwick
Uucp: {sdcrdcf,ihnp4,bellcore}!psivax!mc0!randy

msd@nrcvax.UUCP (Marc Dye) (03/23/86)

<<something tells me I should put some garbage here>>

The referenced article mentioned three pieces of information which
correlate well enough with symptoms we have seen to merit at least a
mention of ours.  First, a noticeable functional difference (at least
from a networking standpoint) between an older Sun configuration and
a newer one.
Second, a difference in Sun OS version, the potentially problematic
one being 2.2.  Third, a difference in Ethernet controller, the
potentially problematic one being Sun's own (vs. the 3Com 3C400).

As a point of reference, NRC sells FUSION, a networking product which
runs on various operating systems, link layer technologies, CPU
configurations, and provides various protocols (including TCP/IP).
The nadir of the hardware venue is a 3Com 3C500/3C501 controller
which has a single receive/xmit buffer; this renders it deaf to
the network for non-trivial amounts of time.  This 'feature' also
exercises some of the finer points of network implementations.
One of the uglier ones is the 4.2BSD dynamic retransmission / RTT
algorithm, but I digress....

Networking implementations based on 4.2BSD are all somewhat different
it seems.  We had (courtesy of Sun via their Catalyst program) a
loaned system; this was a Sun model 150 (one of the old square black
variety with the battleship-type keyboard).  The 150 was Multibus-
based, used the 3Com 3C400 Ethernet controller, and ran a *really old*
version of the Sun 4.2 O/S (1.0 I think).  We recently took delivery of
a new Sun 2/130 which has an integral Sun-designed Ethernet controller.
It came with a distribution of Sun 4.2 O/S version 2.0 and an upgrade
for version 2.2.

For some time we had been getting field reports of customers having
peculiar problems with various incantations of 4.2 networking.  What
was most surprising was that some of these were Sun workstations,
since our software had cut its teeth on that variety.  We could
never reproduce these problems on the Sun 150 we had.  They
all got delivered to us the day we got the new Sun 2/130.

The first problem scenario relates to the Sun getting confused
about what data has been acknowledged from the remote host.  This
correlates to a most peculiar sequence of packets generated by
the Sun, which weren't generated by the older Sun 150.  The
following is an excerpt from an NRC field technical report:

"The most noticeable symptom of this problem is a hung connection.
This problem occurs as a function of the inability of Ethernet
controllers to receive 100% of their network traffic.  In other
words, the dumber the controller, the more this is likely to
happen.  The most severe case I've seen is on an IBM XT
w/ a 3C500 controller; 'telnet' to the Sun and then a 'vi'
of a non-trivial file usually can't get through two full
screen paints before hanging.  At this point, you can still
type the 'telnet' escape character (usu. '^]'), do a 'close'
and the local host seems ok.  Actually, the connection may
still be lying around *forever* consuming a socket.

Analysis of the network traffic shows something like:

0)- login, get your fortune, blah, blah, ...
1)- ask 'vi' to paint a whole screen
2)- PC has an open window of 1024 bytes
3)- Sun (for some strange reason) sends *two* packets: 1023 bytes then 1 byte
4)- PC misses the second (1 byte) packet (and hence fails to acknowledge
	that byte in the stream)
5)- Sun (for some strange reason) presumes that the 1 byte packet *has*
	been acknowledged; this leaves the connection in a permanently
	desynchronized state, since the PC won't let the Sun go ahead
	since (as far as the PC is concerned) Sun hasn't ever sent
	the 1 byte, yet the Sun will never retransmit the 1 byte since
	it thinks it was acknowledged (and has probably destroyed its
	copy of that byte)
6)- 'telnet' close at this point succeeds in the PC->Sun direction
	since that direction is still synched up; usually this
	causes a retransmission of whatever will fit in the now
	empty (1K) window *except that one damn byte*; these
	retransmissions never succeed and eventually the Sun will
	reset the connection thinking the PC dead;  if the PC
	catches the reset, the socket will be liberated, otherwise
	it will live forever
"
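
Steps 2 through 5 of the trace can be replayed with a toy model of the
two endpoints.  This is a sketch under stated assumptions: the class
names, the `buggy` switch and the 2048-byte payload are hypothetical,
and the "buggy" branch only models the behaviour the report infers
from the trace (the sender treating everything it sent as
acknowledged).  It is not taken from any Sun source code.

```python
class Sender:
    """Sending side; 'buggy' models the behaviour inferred in the report."""

    def __init__(self, data):
        self.data = data
        self.snd_una = 0  # oldest byte believed to be unacknowledged
        self.snd_nxt = 0  # next new byte to send

    def send(self, length):
        payload = self.data[self.snd_nxt:self.snd_nxt + length]
        segment = (self.snd_nxt, payload)
        self.snd_nxt += len(payload)
        return segment

    def on_ack(self, ack, buggy=False):
        if buggy:
            # Assumed fault: everything sent is treated as acknowledged.
            self.snd_una = self.snd_nxt
        else:
            # Correct: advance only as far as the peer acknowledged.
            self.snd_una = max(self.snd_una, ack)

    def retransmit(self):
        if self.snd_una == self.snd_nxt:
            return None  # sender believes nothing is outstanding
        return (self.snd_una, self.data[self.snd_una:self.snd_nxt])


class Receiver:
    """Receiving side that accepts only the next in-order segment."""

    def __init__(self):
        self.rcv_nxt = 0

    def receive(self, segment):
        seq, payload = segment
        if seq == self.rcv_nxt:
            self.rcv_nxt += len(payload)
        return self.rcv_nxt  # cumulative acknowledgement


def run(buggy):
    sun = Sender(bytes(2048))
    pc = Receiver()
    first = sun.send(1023)     # step 3: 1023-byte packet
    sun.send(1)                # step 3: 1-byte packet, missed by the PC
    ack = pc.receive(first)    # step 4: PC acknowledges 1023 bytes only
    sun.on_ack(ack, buggy=buggy)
    lost = sun.retransmit()    # step 5: is the missing byte resent?
    if lost is not None:
        pc.receive(lost)
    pc.receive(sun.send(1024))  # further data from the sender
    return pc.rcv_nxt


print("buggy sender, PC has received up to byte", run(buggy=True))
print("correct sender, PC has received up to byte", run(buggy=False))
```

With the assumed fault the receiver stays stuck at byte 1023 because
every later segment starts beyond the byte it is still waiting for;
with the correct update rule the missing byte is retransmitted and the
stream moves on.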

Note also that even though the PC represents the worst case (we have
around), this behavior will eventually occur on all of the systems
we have.  We don't (unfortunately) have two Suns to try it between.
Someone out there who does and is interested can send me some
mail and I will send some test scenarios to try.

The second problem scenario has to do with offering the Sun a
1023 byte window (again from the same NRC report):

"
The <customer X> problem had some similar trappings.  This time it was
FTP which was unable to receive certain files in ASCII mode from
a Sun.  On investigation, it proved to be the case that the
problem was that (under certain data conditions), our FTP was
asking for 1023 bytes of data rather than the usual 1024.  This seemed
to hose the Sun right away as he promptly sent a malformed but
effective reset packet.  ...  Note again that I tried this test with
these same files with the old Sun and its O/S and it did not
fail.  The new Sun does the same as the one at <customer X>.
"

In this case, changing the maximum offered window made the problem
go away (i.e. it's the value 1023, not oddness or something).  In the
first case, varying the window size didn't seem to matter in the long run.

The "malformed but effective reset packet" contained a SYN flag
and a TCP maximum segment size option with a maximum size of
0 bytes, in addition to the RST flag and appropriate sequence
numbers.
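
For anyone who wants to look for such a packet in a trace, here is a
rough sketch that builds and decodes a TCP header with RST and SYN
both set and a maximum segment size option of 0.  The header layout and
the option encoding (kind 2, length 4) follow RFC 793; the function
names, port numbers, sequence numbers and window value are invented for
illustration, and the checksum is left at zero rather than computed.

```python
import struct

FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20
MSS_OPTION_KIND = 2  # maximum segment size option, RFC 793


def build_segment(src_port, dst_port, seq, ack, flags, window, mss=None):
    """Pack a TCP header (checksum and urgent pointer left at zero)."""
    options = b""
    if mss is not None:
        options = struct.pack("!BBH", MSS_OPTION_KIND, 4, mss)
    data_offset = (20 + len(options)) // 4  # header length in 32-bit words
    offset_flags = (data_offset << 12) | flags
    header = struct.pack("!HHIIHHHH", src_port, dst_port, seq, ack,
                         offset_flags, window, 0, 0)
    return header + options


def describe_segment(segment):
    """Return the SYN and RST flags and the MSS option value, if any."""
    fields = struct.unpack("!HHIIHHHH", segment[:20])
    offset_flags = fields[4]
    header_len = (offset_flags >> 12) * 4
    flags = offset_flags & 0x3F
    options = segment[20:header_len]
    mss = None
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:        # end of option list
            break
        if kind == 1:        # no-operation, single byte
            i += 1
            continue
        if i + 1 >= len(options) or options[i + 1] < 2:
            break            # truncated or malformed option
        length = options[i + 1]
        if kind == MSS_OPTION_KIND and length == 4 and i + 4 <= len(options):
            (mss,) = struct.unpack("!H", options[i + 2:i + 4])
        i += length
    return {"syn": bool(flags & SYN), "rst": bool(flags & RST), "mss": mss}


# Hypothetical ports and sequence numbers; flags and MSS as described above.
segment = build_segment(20, 1025, 1000, 2000, SYN | RST, 0, mss=0)
info = describe_segment(segment)
print(len(segment), info)
```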

Neither of these problems existed in the old Sun 150 implementation.

To paraphrase David Plummer: the world is a jungle and networking
contributes many animals.


Marc S. Dye		Vice President, Research and Development
			Network Research Corporation
via Eventual Express -> 923 Executive Park Drive  Suite C
			Salt Lake City, UT  84117  U.S.A.
		  or -> 2380 Rose Avenue
			Oxnard, CA  93030  U.S.A.
via 'N' Bell Systems -> (801) 266-9194   or  (805) 485-2700
          via USENET ->	ihnp4!nrcvax!msd
         		{hplabs,sdcsvax}!sdcrdcf!psivax!nrcvax!msd
			ucbvax!calma!nrcvax!msd
            ARPANET  -> calma!nrcvax!msd@UCBVAX.BERKELEY.EDU
+----------------------------------------------------+
|   *BADGES*?  WE DON'T NEED NO STINKIN' BADGES!!!   |
+----------------------------------------------------+