cheriton@PESCADERO.STANFORD.EDU (David Cheriton) (12/02/86)
Dear TCP-IPers:
  I have been working on a request-response transport protocol (VMTP) for
possible adoption in the Internet.  This work is being done as part of my
participation in the IAB End-to-end task force, chaired by Bob Braden.
  Bob has asked me to include this mailing list in on some discussion of
design choices for checksums. Here is my side of the story.
                  VMTP Checksum Controversy
There appears to be three basic issues re VMTP Checksum, or more generally
a checksum for the next generation of transport protocols, namely:
1) what is the size of the checksum field.
2) the algorithm should be used.
3) Where in the packet should the checksum be located.
I spent some time talking with John Gill and Marty Hellman  of the Stanford
Information Systems Lab. as well as thinking about it and some experimentation.
(I also got some good feedback from others in my group, including Steve
Deering, Michael Stumm, Ross Finlayson.)
1) Checksum Size
The checksum is a probabilistic device, clearly.  There are an enormous
number of packets that map to the same checksum for either 16 or 32 bits of
checksum.
Let R be the packet rate (on average) in packets per second.
Let P be the probability of a packet corrupted that is undetected
by the lower levels, per packet.
As long as P is non-zero, there is some time range T at which
T*R*P becomes quite large, i.e. we have to expect some packets
that are erroneous being delivered by the IP level. I'll refer to these
as nasty packets.
If we view the corruption as randomly assigning the packet a new checksum
from the space of possible checksums, it is clear that, with a 16-bit
checksum, there is a 1 in 65,536 chance that the corrupted packet will have
the same checksum.  With a 32-bit checksum, it is one in 4 billion.
The key ingredient we expect to dramatically change over the next 5 years
is the packet rate, for channels, switching nodes and gateways.
It is reasonable to expect a significant increase in traffic, number of
gateways as well as packet rate per gateway. Just to play with some numbers,
let us assume by 1991, 500 gateways each handling 200 per sec. with probability
10**-7 of packets with errors undetected by lower levels (i.e. combination
of all hardware and software from one endpoint to another.) means that
there are 36 packets per hour with undetected errors at lower level.
Assuming random mapping on to checksums, with 16 bit checksum, one might
expect a packet whose error is not detected by the transport-level checksum
every 75 days or so.  Maybe this is OK?  However, suppose these numbers are
low?  For instance, we occasionally gets bursts of packets at 2000/sec on
our Ethernet.  The future parameters and use of communication could easily
push the time between undetected errors down by an order of magnitude
(7 days?)
Of course, one could argue that such a packet may still fail the addressing
checks, the sequence number checks, etc. in the protocol.  However,
our measurements of VMTP indicate 1/2 as being minimal size packets and 1/4
as maximal size packets (for the Ethernet at least).  With small packets,
the probability of random mods to data versus header is .5 whereas for
for large packets it is .95 for the data.  Thus, this argument reduces the
risk by far less than a factor of two. Moreover, it is disquieting to fall
back on these random checks to help us when we expect the "real" error
detection mechanisms is going to fail.
One might also argue that my assumed error rate/per packet is too high.
However, the probability of correct operation is the PRODUCT of
the probabilities for the sequence of components involved in a packet
transmission and delivery - so this error rate could get large fast.
Moreover, one requires a more careful analysis to get a tighter bound.
Can anyone offer a significant improvement in bounds without making
some disconcerting assumptions??
In the same breath, one could argue that my analysis is far too optimistic.
For instance, if the checksum algorithm uniformly maps all possible
packets onto all possible checksums, then the fact that normal packets
may be dramatically skewed to particular types, such as ASCII text
packets to and from an important server, may mean that far fewer
corruptions will be detected.  In particular, a systematic hardware error
that corrupts a packet into a nasty packet (with correct checksum) is
far more likely to encounter the same packet (or similar enough)
than if random packets were being generated.
This "analysis" deals with pleasant, sunny day expected behavior.
A more frightening situation is dealing with burst errors from failing
components.  For instance, suppose a gateway or network interface fails
by curdling random portions of packets.  We could see 200 undetected
(below transport level) errors per second. Again, assuming random
mods, with a 16-bit checksum we can expect to get a corrupted packet
that passes the transport-level checksum every 5.5 minutes,
versus every 8.2 days with a 32-bit checksum.
If we consider a military situation where components may get wounded in action,
it is easy to conceive of several things generating garbage at the worst
possible time, given a very high exposure using a 16-bit checksum.
The 32-bit checksum appears to give a needed marginal of safety,
especially a variety of factors could cause one to effectively loss a
factor of 2 or more.  (As John Gill raised, are you sure 32-bits is enough?!)
Overall, after studying this issue in brief and thinking thru the above,
I am even more convinced we need to get to at least a 32-bit checksum.
The increased expected packet rate is the key thing that is making 16-bit
checksums unacceptable.
2) Checksum algorithm
There appears to be a deficiency of theory behind checksums that are suitable
for software implementation.  Calculating 32-bit CRC's is infeasible - either
too slow or requires an enormous lookup table.  The ideal algorithm is:
i) very fast to calculate and check in software.
ii) can be shown to detects common types of packet corruption including,
iii) can be shown to be unupdateable "thru" an encryption scheme,
  I.e. intruder cannot update (encrypted) checksum field to compensate for
   modifications to other encrypted data.
Lots of schemes can be proposed but the more complex, the more difficult it is
to analyze with respect to (ii) and (iii).
The current algorithm, assuming we extend it to a 32-bit version
does not detect appending or inserting 0's in a packet
or the reordering of 16-bit (or 32-bit) units. These seem like not unlikely
failures in a large internetwork with gateways or different byte orders
and flakey hardware, etc.
Another candidate is "add-and-rotate" as used by RDP, plus suggested
(without guarantee) by John Gill. It also has the property that any single
32-bit word mod cannot leave the checksum calculated correctly (also true of
the TCP algorithm). 
The "rotate" provides some detection of zeros and reordering.  However,
clearly there are occasions that it will fail to detect reordering of
32 long word blocks, or insertion of 32 long blocks of zeros.
(It also requires some assembly language in most languages for efficient
implementation.) Other properties are not clear to me.
Another algorithm is the ISO transport algorithm, which is easily extended
to 32-bits or 64-bits. I.e. R = Sum(1..L) Wi and S = Sum(1..L)(L-i+1)Wi
where Wi is the ith (32-bit) word and there are L bits in the packet.
According to Fletcher's analysis (See ref. below.), this algorithm
detects all single bit errors and detects all but .001526 percent of all
possible errors, same as CRC.  It also detects all burst errors of length
C bits, where C is the number of bits in the checksum. Fletcher claims:
"it represents a very reasonable tradeoff between effectiveness and efficiency".
I did some basic performance tests on these algorithms.
The following gives the performance for the three algorithms for checksumming
a VMTP packet with 1024 bytes of data (assumed to be long word aligned).
Algorithm             MC 68010 (10 MHz)    MC 68020 (16 MHz)
add-and-rotate              916 usec                 320 usec.
ISO Transport (32-bit)      914 usec                 336 usec.
ISO (16-bit)               1245 usec                 446 usec.
TCP                         890 usec                 323 usec.
The TCP figure is really not too accurate - it lacks a test for carry on
each 32-bit add. The first ISO version uses 32-bit adds to develop the two sums
using 2's complement arithmetic (and then I just store the sum of the two
sums - a proper scheme here would require 64-bits for the checksum field.)
(Fletcher notes that the 1's complement version has slightly better properties
but is more expensive to compute.)
The second one uses 16-bit adds, which means a register swap is necessary as
well as doubling the number of adds required. The results here appear to
contradict Jon Crowcroft's comments on performance.  Perhaps his factor of
4 comes from a poor implementation??
The ISO 32-bit version and the TCP version are in (portable) non-asm C,
whereas the others are done in assembly language. An assembly version
of ISO 32-bit appears about 80 usecs faster on a 68020.
All these algorithms have the property of requiring only one
memory reference (to get the 32-bit byte out of the packet) per word
in the packet, which is the minimal memory reference cost, given 32-bit
data paths.  The numbers seem to indicate that a small number of instructions
per long word (that do not increase # memory references) make no significant
difference in performance of the algorithm.  This is particularly true on
machines like the 68020, where the inner loop fits into the on-chip
instruction cache, and this is the future. For instance, the ISO and TCP
versions differ by essentially one addition, and thus by 13 usecs
(for 1108 register-register adds!)
Thus, I argue that there is no significant performance penalty to adding a
few non-memory reference instructions for a rotate or extra addition
to get better detection capabilities.
3) Location of the checksum
I propose to place the VMTP checksum into a trailer portion of the packet.
This allows a high-performance network interface, such as the one we are
developing here at Stanford, to add the checksum as the packet is transmitted
to the network, avoiding 2 extra memory references.  It does not appear to
add any additional software overhead without such an interface, unless one
is trying to use a relatively simple DMA net interface and DMA packets
on transmission and reception without copy.  Since this is in general not
feasible without scatter-gather, and with, the trailer presents no problem,
I dont think this is a real disadvantage.
My vision of the future is of network interfaces that implement much or
all of the transport level, providing a service similar to the disk interface.
I.e. one does not generally worry in software about errors between main memory
and the disk.  Note that the error checking is still end-to-end at least
between network interfaces in the end machines.  Also, with request-response
protocols, the response is an application-level end-to-end acknowledgement of
the request.
For these interfaces, performance is determined by the number of memory
references.  The next generation of network interfaces must minimize
the number of memory references to transmit and receive packets or they
will be slow.  In fact, one of the big challenges (the only one?) is to
implement "intelligence" on the network interface without being killed
by the delay of extra copies (i.e. memory references.)
  Clearly, the value of the checksum is not known until the entire packet
has been scanned.  Putting the checksum at the end allows the checksum
to be calculated as part of transmitting the packet from the network
interface to the network.  The only other copy into which it might be
incorporated, between the host and the network interface, has several
problems.  First, the data is not necessarily packetized at that point,
and definitely not in our network interface design.  Second, the data rate
on that path can be so high (and must be in multiprocessors) that fitting
a checksum calculation into the copy is difficult.  Moreover, one may
want to do encryption at the same time.   For instance, we move data
at 320 megabits/sec. in our design for the VMEbus.  Much higher speeds
are planned.
Currently, we are dealing with very "1st" generation network interfaces
that are dealing with multi-megabit data rates.  I think the next generation
can be much more reliable, efficient, and powerful.  We need to design
the transport protocol such that it does not introduce unnecessary
difficulties for such interfaces.
Thus, I feel this minor modification to the protocol supports the right way
of doing network interfaces, and that it aids these interfaces in dealing
with an intrinsic problem, not one that will go away with a few more bells
and whistles on the interfaces.
One issue a trailer checksum raises is alignment relative to encryption.
I have tried to keep the header as a multiple of 64 bits.
With a null data segment and 32-bit checksum, the VMTP packet is still a
multiple of 64 bits.  If we force the datasegment to also be a multiple
of 64 bits, this all remains true still.  This is what I propose to do.
In summary, I recommend we go for a 32-bit checksum and use the ISO
checksum algorithm, modified to use 32-bit units.
The ISO checksum is a standard and has an article analyzing its
properties (to some extent). (If we also start the checksum
using a non-zero initial value, it should do well at detecting a packet
of all zeroes.) Finally, we place the checksum in a single 32-bit trailer
portion of the packet.
Comments?
David Cheriton
Ref>
J.G. Fletcher, An Arithmetic Checksum for Serial Transmission,
IEEE Trans. on Com. COM-30 (1) 247-251, Jan, 1982.