[comp.protocols.tcp-ip] A 'horror story' for the books

david@ms.uky.edu (David Herron -- One of the vertebrae) (09/04/88)

Ok, maybe this one should have been obvious.  But if it hadn't been
for Doug Kingston giving me a short list of things to check over I
would've been a *lot* longer in finding the problem.  (*Thanks*!)

The short description is that a number of applications (ftp, smtp, etc)
stopped working between some of our machines after switching our
vaxen over to Ultrix v2.2 (from MtXinu 4.3bsd + NFS -- a step backward
if you ask me, but it's a looong story).  A question to the mmdf
mailing list (at the time I knew only that smtp didn't work) elicited
a reply from Doug that he thought it was more likely TCP/IP differences
and to check things like trailers and MSS ...  A check found that sure
'nuff we had some machines with trailers and some without.  Switching
off trailers on the interface made the applications work again.

I've got a couple of questions for the assembled experts:

1. Why did things continue to sort-of work between the conflicting
   machines?  I haven't looked at the code yet, but my understanding
   of the rfc is that ALL packets will be trailer-ified when going
   out a trailer link (or on 4.3bsd, out a trailer link AND when
   the host in question negotiated trailer use).  If ALL the packets
   were trailer-ified then the hosts would be seeing data where they
   were expecting header data and get all confused.
2. Why does Sun not recommend trailers?  Do they use a different
   page size than vaxen?  Or is it that, in general, it's not good to
   use trailers on machines other than vaxen, or that it's not good to
   use trailers in a mixed environment?
3. Is there any financial aid and/or cheaper rates for a student
   who wants to attend Interop '88?

The following is the long version.  It's the report which I wrote up
for all the networking people on campus.



- Date:     Thu, 1 Sep 88 17:11:33 EDT
- From:     David Herron E-Mail Hack <david@ms.uky.edu>
- To:       uk-net-people@ms.uky.edu
- Subject:  trailers
- Message-ID:  <8809011711.aa06222@g.e.ms.uky.edu>
- 
- Oooo boy, the tiny things that'll cause problems ...
- 
- We've had a confusing problem over here since converting to Ultrix:
- some of the programs would sometimes work to/from Ultrix machines and
- other times wouldn't.  Like, an outgoing smtp connection would work fine
- until it sent out that trailing '.' whereupon it would hang.
- 
- Some asking around led to a suggestion to check trailers, MSS (Max
- Segment Size) and a few other options.  Some checking around in the
- code of the affected programs revealed no non-portable code which
- Ultrix broke.  Ultrix was, fortunately, still enough like BSD that
- things worked as they did under BSD, albeit with an older TCP/IP
- implementation.  Eventually I ended up at the trailers suggestion.
- 
- What's a trailer?  Well, all it says in the manual page is some
- mumbling about changing the layout of IP packets to reduce the amount
- of copying that's involved.  They are documented in RFC 893; related
- RFCs are 948 and 894, which cover the details of doing IP across
- Ethernet-like media.
- 
- The trailer idea is to fix the size of the data portion of the IP
- packet at some multiple of the page size of your machine.  Since the
- idea was originally developed at UCB for 4.2, the size is 512 bytes or
- some multiple (the page size on a Vax).  The information which would
- normally be at the head of the packet (IP header information like
- to/from addresses, packet size, etc.) is moved to the end and is now
- called a 'trailer'.  Two other things are also added to the trailer:
- a protocol type field and a trailer length field.
- 
- Unfortunately they didn't originally do anything intelligent like
- negotiating the use of trailers on a per-host basis.  Instead trailers
- are either on or off on a per-interface basis, and the choice is made
- at boot time when ifconfig is run.  UCB's next version did do
- negotiation as part of ARP but in the meantime the 4.2 version
- of TCP/IP became part of many systems, many of which we have
- here on our ethernet.
- 
- Looking at the various manual pages I have access to:
- 
- 	4.3bsd		negotiable per host (default=trailers)
- 	WIN/TCP		non-negotiable (default=trailers)
- 	sun v3.4	non-negotiable.  also 'not recommended'
- 			because it's host dependent. (default=trailers)
- 	ultrix 2.2	non-negotiable (default=trailers)
- 
- Some of our machines had trailers turned off and some had them turned
- on.  Brian had thought it wasn't important because it was negotiated
- and turned them on ... oh well.  
- 
- One thing I'm not sure about is why things sort-of worked ... between
- two non-negotiating hosts which disagreed over the trailer issue there
- shouldn't have been *any* communication, because they disagree over
- where the 'header' information is to be kept.  Probably there is something
- else going on as well, but I'm not sure what.
- 
- For now we've turned off trailers on all of our machines.  Would the rest
- of you look into your configurations and tell me which ones can do trailers
- to begin with, and which ones can negotiate them?  (The negotiation is part
- of the ARP protocol).  This is another of those TCP/IP options which needs
- to be agreed upon across our whole ethernet.  er.. Well ... if someone were
- to have an IP gateway between their net and the campus net, they would be
- able to do what they want on their net.
- 
- Maybe we want to run with trailers on everywhere.  But we need to make
- sure that it makes sense for all the machines...
- --
- <---- David Herron -- The E-Mail guy                         <david@ms.uky.edu>
- <---- ska: David le casse\*'      {rutgers,uunet}!ukma!david, david@UKMA.BITNET
- <---- Problem: how to get people to call ...; Solution: Completely reconfigure 
- <---- your mail system then leave for a weeks vacation when 90% done.

rpw3@amdcad.AMD.COM (Rob Warnock) (09/10/88)

In article <10208@s.ms.uky.edu> david@ms.uky.edu (David Herron) writes:
+---------------
| ...  A check found that sure 'nuff we had some machines with trailers
| and some without.  Switching off trailers on the interface made the
| applications work again.
|
| I've got a couple of questions...
| 1. Why did things continue to sort-of work between the conflicting
|    machines?  I haven't looked at the code yet, but my understanding
|    of the rfc is that ALL packets will be trailer-ified when going
|    out a trailer link...
+---------------

Close. The trailer protocol is only used when the data portion of
the packet is an exact multiple of 512 bytes. The trailer protocol
actually uses a separate Ethernet type field value for each such
multiple of 512.
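
As a rough illustration of that layout, here is a minimal sketch in C of
building and undoing a trailer packet.  It is not the BSD code.  The names
trailer_encode and trailer_decode are hypothetical, the constants
ETHERTYPE_TRAIL (0x1000) and ETHERTYPE_NTRAILER (16) are assumed to match
4.3BSD's if_ether.h, and the length field is assumed to count the two
16-bit trailer fields (type and length) plus the headers that follow them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHERTYPE_IP        0x0800
#define ETHERTYPE_TRAIL     0x1000  /* assumed base type for trailer packets */
#define ETHERTYPE_NTRAILER  16      /* assumed number of trailer types */
#define TRAILER_FIELDS      4       /* 2-byte type + 2-byte length */

/* Store a 16-bit value in network (big-endian) byte order. */
static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* Load a 16-bit value stored in network byte order. */
static uint16_t get16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical encoder.  Writes data first, then the trailer (original
 * type, length, headers) into out, which must hold dlen + 4 + hlen bytes.
 * Returns the Ethernet type to send with, or 0 if dlen is not a nonzero
 * multiple of 512 small enough to have a trailer type.
 */
uint16_t trailer_encode(const uint8_t *hdr, size_t hlen,
                        const uint8_t *data, size_t dlen, uint8_t *out)
{
    if (dlen == 0 || (dlen & 0x1ff) != 0 || (dlen >> 9) >= ETHERTYPE_NTRAILER)
        return 0;
    memcpy(out, data, dlen);
    put16(out + dlen, ETHERTYPE_IP);
    put16(out + dlen + 2, (uint16_t)(hlen + TRAILER_FIELDS));
    memcpy(out + dlen + TRAILER_FIELDS, hdr, hlen);
    return (uint16_t)(ETHERTYPE_TRAIL + (dlen >> 9));
}

/*
 * Hypothetical decoder.  Rebuilds the usual header-first order in pkt and
 * returns its length, or 0 if the type or the trailer fields look wrong.
 */
size_t trailer_decode(uint16_t ether_type, const uint8_t *in, size_t inlen,
                      uint8_t *pkt, uint16_t *orig_type)
{
    size_t off, resid, hlen;

    if (ether_type <= ETHERTYPE_TRAIL ||
        ether_type >= ETHERTYPE_TRAIL + ETHERTYPE_NTRAILER)
        return 0;
    off = (size_t)(ether_type - ETHERTYPE_TRAIL) * 512;  /* data bytes */
    if (off + TRAILER_FIELDS > inlen)
        return 0;
    *orig_type = get16(in + off);
    resid = get16(in + off + 2);
    if (resid < TRAILER_FIELDS || off + resid > inlen)
        return 0;
    hlen = resid - TRAILER_FIELDS;
    memcpy(pkt, in + off + TRAILER_FIELDS, hlen);  /* headers go first */
    memcpy(pkt + hlen, in, off);                   /* then the data */
    return hlen + off;
}
```

A host that does not understand the 0x1000-range types never gets as far
as the decoder; it simply sees an Ethernet type it does not recognize.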

From 4.3's "vaxif/if_il.c" (the comment marked with [!] has a bug:
it says "first packet" when it should say "first mbuf"):

	/*
	 * Ethernet output routine.
	 * Encapsulate a packet of type family for the local net.
	 * Use trailer local net encapsulation if enough data in first
[!]==>	 * packet leaves a multiple of 512 bytes of data in remainder.
	 */
	iloutput(ifp, m0, dst) {
		...
		off = ntohs((u_short)mtod(m, struct ip *)->ip_len) - m->m_len;
		if (usetrailers && off > 0 && (off & 0x1ff) == 0 &&
		    m->m_off >= MMINOFF + 2 * sizeof (u_short)) {
			type = ETHERTYPE_TRAIL + (off>>9);
			...
		}
		...

This counts the "data" part of the packet only because the headers fit
entirely within the first mbuf, which happens (!) to be the case for all
protocols supported by the standard code ({UDP,TCP}/IP & XNS).
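
The test in the excerpt can be restated as a small standalone function.
This is only a sketch.  choose_ethertype is a hypothetical name, the value
of ETHERTYPE_TRAIL is assumed to be 0x1000 as in 4.3BSD's if_ether.h, and
the m_off check (whether there is room in front of the headers for the two
extra 16-bit fields) is left out because it depends on mbuf internals.

```c
#include <stdint.h>

#define ETHERTYPE_IP     0x0800
#define ETHERTYPE_TRAIL  0x1000  /* assumed base type for trailer packets */

/*
 * Hypothetical stand-in for the decision made in iloutput().
 * ip_len is the IP total length; first_mbuf_len is the number of bytes in
 * the first mbuf, assumed here to hold exactly the protocol headers.
 * Returns the Ethernet type the packet would be sent with.
 */
uint16_t choose_ethertype(int usetrailers, uint16_t ip_len,
                          uint16_t first_mbuf_len)
{
    int off = (int)ip_len - (int)first_mbuf_len;  /* bytes past first mbuf */

    if (usetrailers && off > 0 && (off & 0x1ff) == 0)
        return (uint16_t)(ETHERTYPE_TRAIL + (off >> 9));
    return ETHERTYPE_IP;
}
```

With 40 bytes of IP and TCP headers, 1024 bytes of data gives type 0x1002
and 512 bytes gives 0x1001, while a bare ACK or a 1000-byte segment goes
out as ordinary IP (0x0800).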

So... if you are doing something with short or odd-sized packets, like
a line-by-line Telnet or *very* small mail, you can still communicate
between a trailer and non-trailer implementation. Plus, you can always
send data from the non-trailer hosts *to* the trailer host, since <ACK>s
are small and thus never get trailerized.

In fact, you should have been able to watch your SMTP mail on a packet
monitor and seen the entire "HELO", etc., dialog go along just fine up
to the point that the trailer-using host blasted its first full-sized
packet at the non-trailer host... whereupon the trailer'd packet would
be periodically retransmitted until the connection timed out.
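
To make that sequence concrete, here is a sketch in C that walks a made-up
SMTP exchange and finds the first segment a non-trailer host would never
accept.  The segment sizes in smtp_example are invented for illustration
(only "DATA" at 6 bytes and "." at 3 bytes, each with CRLF, are real
sizes), and the names is_trailerized and first_stuck_segment are
hypothetical.

```c
#include <stddef.h>

struct segment {
    const char *what;      /* label, for the reader only */
    unsigned data_len;     /* TCP payload bytes in this segment */
};

/* Made-up trace of what a trailer-using sender might transmit. */
const struct segment smtp_example[] = {
    { "HELO",                      22 },
    { "MAIL FROM",                 30 },
    { "RCPT TO",                   28 },
    { "DATA",                       6 },
    { "message body, full write", 1024 },
    { ".",                          3 },
};

/* Nonzero if a trailer-using sender would trailerize this payload size. */
int is_trailerized(unsigned data_len)
{
    return data_len > 0 && data_len % 512 == 0;
}

/*
 * Index of the first segment that goes out in trailer form, which a
 * non-trailer receiver would not accept, or -1 if every segment gets
 * through as ordinary IP.
 */
int first_stuck_segment(const struct segment *segs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (is_trailerized(segs[i].data_len))
            return (int)i;
    return -1;
}
```

In this trace the command dialog passes and the connection sticks at the
1024-byte body segment.  Since TCP delivers in order, the short "." behind
it is never processed either.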


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

david@ms.uky.edu (David Herron -- One of the vertebrae) (09/13/88)

First, thanks to everyone who responded to my posting.  The consensus
was that trailers, while seeming like a good thing on the surface, are
somewhat 'bad' in practice, and it's not even clear that they actually
help even in their native environment.  Plus there are cases (Suns
especially) where they hurt performance because the idea is too specific
to the Vax, and especially to Vax memory management.

In article <22891@amdcad.AMD.COM> rpw3@amdcad.UUCP (Rob Warnock) writes:
>In fact, you should have been able to watch your SMTP mail on a packet
>monitor and seen the entire "HELO", etc., dialog go along just fine up
>to the point that the trailer-using host blasted its first full-sized
>packet at the non-trailer host... whereupon the trailer'd packet would
>be periodically retransmitted until the connection timed out.

Well, being able to watch my SMTP mail on a packet monitor assumes the
presence of a packet monitor in the first place.  The closest I have
is tcpdump which, as far as I know, does not display the contents of the
packet.  (The joys of living in a poor state at a University which isn't
yet fully up to speed on networking technology & hardware ....)

Anyway.  What I was seeing from the user level was the SMTP conversation
succeeding up to the point where the program had finished sending
all of the DATA section.  Then it went to send the '.' and hung either
in sending the '.' or waiting for the response (depending on the phase
of the moon, I think).  Now possibly the DATA section was being buffered
as much as possible; I don't remember the code that well.  Certainly it
looked to me (at the time) as if the code were hanging because of a
short packet rather than a long one...
-- 
<---- David Herron -- One of the MMDF guys                   <david@ms.uky.edu>
<---- ska: David le casse\*'      {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<---- 				What does the phrase "Don't work too hard" 
<---- have to do with the decline of the american 'work ethic'?