zweig@brutus.cs.uiuc.edu (Johnny Zweig) (11/30/89)
In re-reading the article ``Improving the Efficiency of the OSI Checksum Computation'' in the latest Computer Communication Review (ACM SIGCOMM), I remembered an idea that popped up over lunchtime several months ago. It seems to me that I could add 4 or 5 lines of code to my TCP-checksum computation routine (basically a 16-bit-at-a-time iteration over the appropriate bytes) and have it compute both the regular TCP/IP checksum and a 16-bit Fletcher checksum (by simply using an additional accumulator) at a small additional cost (well, twice as many additions, so it's not really such a small cost, I'm sure it will be said -- see below).

I would like to hear opinions on whether a TCP header option-pair (one to negotiate use of such a Fletcher checksum on connection establishment, and one to carry the actual extra 2 bytes in each segment) would be a reasonable thing to propose. Not to slam Postel and the whole IETF and everyone else who brought us the TCP/IP checksum algorithm, but the fact that it is not able to detect the transposition of N (where N>=2 is an even number) octets in a datagram is of concern in some applications where ultra-reliable communication is essential. Also, it has been pointed out by a number of hardware-types I have talked to that these sorts of errors can occur, for example, when DMAing bytes from an ethernet controller to the main memory of a machine -- i.e. after the ethernet CRC has already been checked. One case is to have a pair of 32-bit words go out over the bus in the wrong order under rare timing glitches....
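The single-pass, dual-accumulator idea might look something like the sketch below. This is purely illustrative -- the function and variable names are invented, not from any actual TCP implementation, and the Fletcher variant shown is the classic two-octet-sum mod-255 form, which may differ in detail from what a real option would specify:

```python
def checksums(data: bytes):
    """One pass over the data computing both the standard Internet
    (one's-complement) checksum and a Fletcher checksum, using an
    extra pair of accumulators as described above.  Illustrative
    sketch only -- not from any real TCP."""
    if len(data) % 2:                 # pad odd-length data with a zero octet
        data = data + b"\x00"

    inet = 0        # running one's-complement sum of 16-bit words
    c0 = c1 = 0     # Fletcher running sums (classic mod-255 variant)

    for i in range(0, len(data), 2):
        word = (data[i] << 8) | data[i + 1]
        inet += word
        inet = (inet & 0xFFFF) + (inet >> 16)   # fold the carry back in
        # Fletcher operates octet by octet: two adds per byte
        c0 = (c0 + data[i]) % 255
        c1 = (c1 + c0) % 255
        c0 = (c0 + data[i + 1]) % 255
        c1 = (c1 + c0) % 255

    return (~inet) & 0xFFFF, (c1 << 8) | c0


# Swapping two 16-bit words leaves the Internet checksum unchanged
# (addition is commutative) but changes the Fletcher value -- exactly
# the class of error at issue:
a = bytes([1, 2, 3, 4])
b = bytes([3, 4, 1, 2])
```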
I know, I know, there is probably a stack of printout somewhere gathering dust (being from 1967) giving good data about the fact that this Almost Never Happens in Real Life -- but it seems that an option most implementations would not support (I think this ought to be an "optional" feature in the hypothetical revision of the Host Requirements RFC that gets put out after the hypothetical RFC describing this option is hypothetically released) would not cost anything to anyone, other than to the implementations that feel it is appropriate to use it.

The added security and assurance of reliable communication seem to me to be (a) an easy option to add to the protocol suite, (b) demonstrably sufficient for ultra-reliable applications (the 16-bit Fletcher algorithm is very good at detecting errors -- see pp. 35-36 of CCR v.19 no.5, etc.), and (c) free for those who don't use it, since existing option-processing would reject any attempt to use the option if unsupported (i.e. doing vanilla TCP would not be changed at all, and for a Fletcher-talking version to realize it can't talk Fletcher to a site would only entail the transmission of a couple of packets; presumably it would be very rare to try to talk Fletcher to someone who isn't known in advance to support it).

So it's cheap, it's simple, it's reliable, it's easy to code. Why shouldn't it be a TCP option?

-Johnny Reliable
haverty@BBN.COM (Jack Haverty) (11/30/89)
>Not to slam Postel and the whole IETF and everyone else who brought us the
>TCP/IP checksum algorithm, but the fact that it is not able to detect
>the transposition of N (where N>=2 is an even number) octets in a datagram
>is of concern in some applications where ultra-reliable communication is
>essential. Also, it has been pointed out by a number of hardware-types I
>have talked to that these sorts of errors can occur, for example, when
>DMAing bytes from an ethernet-controller to the main memory of a machine --
>i.e. after the ethernet CRC has already been checked. One case is to have
>a pair of 32-bit words go out over the bus in the wrong order under rare
>timing glitches....

I thought that the following snatch of history might be of interest. I remember pretty clearly almost exactly this discussion at one of the TCP working group meetings ten years ago. There were two arguments against having a more powerful checksum:

1/ I don't want to devote that much of my CPU to it (remember, CPUs in those days meant big, expensive, not-very-powerful timesharing systems)

2/ if we have to do complicated arithmetic (i.e., algorithms for which the CPUs didn't have real good instructions available), it will make it difficult to get high bandwidth

If I remember correctly, the telling argument was:

3/ People are working on new chips to do checksums in silicon. We don't want to adopt an algorithm that will preclude use of these chips.

The problem was that there was no obvious consensus on which chips would hit the market when, and when computer manufacturers would put them into their products. This led to a basic plan which was: put in a rudimentary checksum now; when CPU cycles become cheaper, chips become available, and/or the drawbacks of the checksum become a problem, select and standardize a new checksum.

If anybody else who was there remembers this differently, chime in! Maybe it's now the time....

Jack
baker@VITAM6.UUCP (Fred Baker) (11/30/89)
That's actually an interesting idea. BUT, why not simply reuse the TCP checksum field? If the use of the new checksum has been negotiated, then the standard checksum is useless except as a component of the Fletcher Checksum, and any implementation which negotiates to use Fletch will by definition understand the change. I disagree, however, that TCP will only use this on folks known to support it. I think it will always ask and sometimes get a positive response. Otherwise, you need some kind of extra-architectural configuration table. Fred Baker baker@vitalink.com
craig@NNSC.NSF.NET (Craig Partridge) (11/30/89)
Johnny:

If you are seriously interested in seeing such an option become available, it is pretty easy to get the process started to standardize it. Given that this is a very simple thing to specify (it is a one-page RFC describing the option format), I suggest you write up the one-page RFC for a TCP checksum-type option sent in the SYN (I'd suggest allowing other checksums besides Fletcher's to be specified). Make sure that when a host doesn't recognize the option and silently ignores it, the TCP connection continues to use the standard TCP checksum.

Then send that draft to me and I'll arrange a meeting at the next Internet Engineering Task Force meeting (in February) to review it. If it's approved at that meeting, there's a good chance we can roll it out as an "if you are going to do this, do it this way" option sometime in mid-1990.

Craig Partridge
IETF Area Director, Host-Based and User Services

PS: IETF meetings are open to anyone who wants to attend. The next IETF meeting is in Tallahassee, Florida in February. To keep track of what's up in IETF, I suggest you join the IETF list (ietf-request@isi.edu).
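A sketch of what such a SYN-time checksum-type option might look like follows. The option kind (99) and the numeric codes for each algorithm are invented here purely for illustration -- a real RFC would assign the actual numbers:

```python
# Hypothetical TLV encoding of a "checksum type" TCP option carried in
# the SYN.  Kind 99 and these type codes are made up for this sketch.
CHECKSUM_TYPES = {"standard": 0, "fletcher8": 1, "fletcher16": 2}

def encode_checksum_option(name: str) -> bytes:
    # TCP options are kind / total-length / data.
    return bytes([99, 3, CHECKSUM_TYPES[name]])

def parse_checksum_option(opt: bytes):
    """Return the negotiated checksum name, or None if the option is
    unrecognized.  A host returning None simply ignores the option,
    and the connection falls back to the standard TCP checksum."""
    if len(opt) == 3 and opt[0] == 99 and opt[1] == 3:
        for name, code in CHECKSUM_TYPES.items():
            if code == opt[2]:
                return name
    return None
```

The fallback behavior in the parser is the key property Craig asks for: an implementation that doesn't know the option loses nothing.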
jas@proteon.com (John A. Shriver) (11/30/89)
Actually, rather than having to negotiate the TCP checksum type, one could just assign a different IP protocol number to "TCP with Fletcher". If you get an ICMP protocol unreachable, then you go back to standard TCP. Of course, then you might open a can of protocol-revision worms, like extending the window size to 32 bits... Of course, this probably wouldn't work so well with the Berkeley sockets programming interface.
dcrocker@DECWRL.DEC.COM (David Crocker) (12/01/89)
Jack,

I have always had the impression that the TCP and IP checksums were intended ONLY as a backstop for the link-level checksum. The theory, as I have understood it, is: if there is any interesting likelihood of data corruption, use strong, hardware-based checksum and retransmission mechanisms. This is best done at the data-link layer. For example, such corruption problems usually involve noise and are localized to a given segment; use mechanisms specific to that segment. Further, you get more immediate response, since you do not incur the end-to-end delay before detecting and then retransmitting.

The TCP- and IP-level mechanism then becomes the safety valve, to catch the errors that slip in between the data-link interface cracks. This, then, justifies the requirements that the TCP and IP checksums be end-to-end and cheap in software.

Hence, the question is: how likely is transposition at these in-between-the-cracks points of exposure?

Dave
CERF@A.ISI.EDU (12/10/89)
Dave,

times are changing. The kinds of corruption we once fought (line noise) are being replaced by packet loss due to congestion or slips, and by peculiarities which David Tennenhouse (LCS/MIT) warns may be visited upon us by improperly implemented or functioning ATM switches: packet-internal reordering at the cell level. It is by no means clear that reordering at the cell level will be detected by the ATM switches or by link-level algorithms sending the packets assembled from cells to the hosts, since the link-level checksums would most likely be recomputed AFTER reassembly.

At any rate, considering the question of integrity checking in the current and anticipated internet environment seems timely.

Vint
goldstein@delni.enet.dec.com (12/12/89)
In article <[A.ISI.EDU].9-Dec-89.11:15:42.CERF>, CERF@A.ISI.EDU writes...
>Dave,
>
>times are changing. The kinds of corruption we once fought: line noise,
>are being replaced by packet loss due to congestion or slips and
>peculiarities which David Tennenhouse (LCS/MIT) warns may be visited
>upon us by improperly implemented or functioning ATM switches: packet
>internal reordering at the cell level.
>It is by no means clear that reordering at the cell level will be
>detected by the ATMs or by links level algorithms sending the
>packets assembled from cells to the hosts since the link level
>checksums would be recomputed AFTER reassembly, most likely.

Indeed, the current proposals for B-ISDN use ATM cell transfer and state, in the service description, that it _will_ preserve order. Now doesn't that make you confident that the implementations will never, ever, ever, ever mis-order a cell?

The ATM cells do not contain any sort of sequence numbers. The proposed "adaptation" protocol, the bottom of Layer 2 (providing the framing and error-detection services, but not retransmission), takes data packets and splits them into cell-sized segments. These are labeled first/middle/last/only, and a total length indicator is stuck on the tail. If each 48-octet cell's 10-bit CRC checks, and the total number received from first to last agrees with the length indicator, then you've got a valid frame. Note that this does detect cell drop and spurious insertion, but not misordering. Remember also that they promise that the network won't misorder, so they don't recommend a cell sequence number...

BTW, I have proposed (and gotten rebuffed by the Powers That Be at T1S1.5, namely the AT&T & Bellcore leadership) an end-to-end datalink protocol that has cell sequence numbers and bulk cell acks, loosely based on NETBLT, with a 2^14 modulus.
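A toy model makes the gap concrete. The field names and the validity check below are illustrative inventions, not the actual B-ISDN adaptation protocol; the point is only that a first/last label plus a tail length count can pass even when the middle cells arrive out of order:

```python
# Toy model of the adaptation-layer check described above: cells carry a
# first/middle/last label, and a length indicator rides on the last cell.
# Per-cell CRCs are assumed to have already checked.

def frame_looks_valid(cells):
    """Accept a frame if it starts with 'first', ends with 'last', and
    the number of cells received matches the tail length indicator."""
    if not cells or cells[0]["label"] != "first" or cells[-1]["label"] != "last":
        return False
    return cells[-1]["length"] == len(cells)

good = [{"label": "first", "length": None},
        {"label": "middle", "payload": "A", "length": None},
        {"label": "middle", "payload": "B", "length": None},
        {"label": "last", "length": 4}]

# Swap the two middle cells: the payload is now misordered, yet every
# check above -- labels, count, (assumed) per-cell CRCs -- still passes.
swapped = [good[0], good[2], good[1], good[3]]
```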
You can always run it or something else you prefer end to end across the ATM and ignore their recommendations for adaptation, if you're not talking to a network-provided service (i.e., the "B-ISDN CLNS", which is an L3 datagramme service running over the adaptation layer). My impetus, though, is to handle cell loss without having to retransmit the whole frame. (Can you spell 'congestion collapse'?) Misordering detection comes along with that.

>At any rate, considering the question of integrity checking
>in the current and anticipated internet environment seems
>timely.

Very true... we now have to worry about detecting cell misordering, etc., if we don't trust the telco networks to be perfect.

fred