cheriton@PESCADERO.STANFORD.EDU (David Cheriton) (12/02/86)
Dear TCP-IPers:

I have been working on a request-response transport protocol (VMTP) for possible adoption in the Internet. This work is being done as part of my participation in the IAB End-to-end task force, chaired by Bob Braden. Bob has asked me to include this mailing list in some discussion of design choices for checksums. Here is my side of the story.

VMTP Checksum Controversy

There appear to be three basic issues regarding the VMTP checksum, or more generally a checksum for the next generation of transport protocols, namely:
1) What is the size of the checksum field?
2) What algorithm should be used?
3) Where in the packet should the checksum be located?

I spent some time talking with John Gill and Marty Hellman of the Stanford Information Systems Lab, as well as thinking about it and doing some experimentation. (I also got some good feedback from others in my group, including Steve Deering, Michael Stumm, and Ross Finlayson.)

1) Checksum Size

The checksum is a probabilistic device, clearly. There are an enormous number of packets that map to the same checksum, whether the checksum is 16 or 32 bits. Let R be the average packet rate in packets per second, and let P be the per-packet probability of a corruption that goes undetected by the lower levels. As long as P is non-zero, there is some time range T at which T*R*P becomes quite large, i.e. we have to expect some erroneous packets being delivered by the IP level. I'll refer to these as nasty packets. If we view the corruption as randomly assigning the packet a new checksum from the space of possible checksums, it is clear that, with a 16-bit checksum, there is a 1 in 65,536 chance that the corrupted packet will have the same checksum. With a 32-bit checksum, it is one in 4 billion. The key ingredient we expect to change dramatically over the next 5 years is the packet rate, for channels, switching nodes and gateways.
It is reasonable to expect a significant increase in traffic and in the number of gateways, as well as in the packet rate per gateway. Just to play with some numbers, assume that by 1991 we have 500 gateways, each handling 200 packets per sec., with probability 10**-7 of a packet error undetected by lower levels (i.e. by the combination of all hardware and software from one endpoint to another). That means there are 36 packets per hour with errors undetected at the lower levels. Assuming random mapping onto checksums, with a 16-bit checksum one might expect a packet whose error is not detected by the transport-level checksum every 75 days or so. Maybe this is OK? However, suppose these numbers are low? For instance, we occasionally get bursts of packets at 2000/sec on our Ethernet. The future parameters and use of communication could easily push the time between undetected errors down by an order of magnitude (7 days?).

Of course, one could argue that such a packet may still fail the addressing checks, the sequence number checks, etc. in the protocol. However, our measurements of VMTP indicate that 1/2 of packets are minimal size and 1/4 are maximal size (for the Ethernet at least). With small packets, the probability of a random modification hitting the data rather than the header is .5, whereas for large packets it is .95. Thus, this argument reduces the risk by far less than a factor of two. Moreover, it is disquieting to fall back on these random checks to help us exactly when we expect the "real" error detection mechanism to fail.

One might also argue that my assumed per-packet error rate is too high. However, the probability of correct operation is the PRODUCT of the probabilities for the sequence of components involved in a packet transmission and delivery - so this error rate could get large fast. Moreover, one requires a more careful analysis to get a tighter bound. Can anyone offer a significant improvement in bounds without making some disconcerting assumptions?
In the same breath, one could argue that my analysis is far too optimistic. For instance, if the checksum algorithm uniformly maps all possible packets onto all possible checksums, then the fact that normal packets may be dramatically skewed toward particular types, such as ASCII text packets to and from an important server, may mean that far fewer corruptions will be detected. In particular, a systematic hardware error that corrupts a packet into a nasty packet (with correct checksum) is far more likely to arise when the same packet (or one similar enough) recurs than if random packets were being generated.

This "analysis" deals with pleasant, sunny-day expected behavior. A more frightening situation is dealing with burst errors from failing components. For instance, suppose a gateway or network interface fails by curdling random portions of packets. We could see 200 undetected (below transport level) errors per second. Again assuming random modifications, with a 16-bit checksum we can expect a corrupted packet that passes the transport-level checksum every 5.5 minutes, versus roughly every 250 days with a 32-bit checksum (a factor of 2**16 longer). If we consider a military situation where components may get wounded in action, it is easy to conceive of several things generating garbage at the worst possible time, giving very high exposure with a 16-bit checksum. The 32-bit checksum appears to give a needed margin of safety, especially since a variety of factors could effectively cost us a factor of 2 or more. (As John Gill asked, are you sure 32 bits is enough?!)

Overall, after studying this issue in brief and thinking thru the above, I am even more convinced we need to get to at least a 32-bit checksum. The increased expected packet rate is the key thing that is making 16-bit checksums unacceptable.

2) Checksum algorithm

There appears to be a deficiency of theory behind checksums that are suitable for software implementation.
Calculating 32-bit CRCs in software is infeasible - either too slow or requiring an enormous lookup table. The ideal algorithm:
i) is very fast to calculate and check in software;
ii) can be shown to detect common types of packet corruption; and
iii) can be shown to be unupdateable "thru" an encryption scheme, i.e. an intruder cannot update the (encrypted) checksum field to compensate for modifications to other encrypted data.
Lots of schemes can be proposed, but the more complex the scheme, the more difficult it is to analyze with respect to (ii) and (iii).

The current algorithm, assuming we extend it to a 32-bit version, does not detect appending or inserting 0's in a packet, or the reordering of 16-bit (or 32-bit) units. These seem like not unlikely failures in a large internetwork with gateways of different byte orders, flakey hardware, etc.

Another candidate is "add-and-rotate" as used by RDP, also suggested (without guarantee) by John Gill. It has the property that no single 32-bit word modification can leave the checksum calculated correctly (also true of the TCP algorithm). The "rotate" provides some detection of zeros and reordering. However, there are clearly occasions on which it will fail to detect the reordering of 32-long-word blocks, or the insertion of 32-long-word blocks of zeros. (It also requires some assembly language for efficient implementation, since most languages lack a rotate operation.) Its other properties are not clear to me.

Another algorithm is the ISO transport checksum, which is easily extended to 32 bits or 64 bits. I.e. R = Sum(1..L) Wi and S = Sum(1..L) (L-i+1)*Wi, where Wi is the ith (32-bit) word and there are L words in the packet. According to Fletcher's analysis (see ref. below), this algorithm detects all single-bit errors and detects all but .001526 percent of all possible errors, the same as a CRC. It also detects all burst errors of length C bits, where C is the number of bits in the checksum. Fletcher claims: "it represents a very reasonable tradeoff between effectiveness and efficiency".
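For concreteness, here are C sketches of the two software candidates over 32-bit words - add-and-rotate, and the ISO/Fletcher sums as given by the formulas above. These are illustrative implementations of the formulas, not the code I timed:

```c
#include <stddef.h>
#include <stdint.h>

/* Add-and-rotate: rotate the running sum left one bit before each add,
   so reordered words and inserted zero words usually change the result. */
uint32_t add_and_rotate(const uint32_t *w, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum = (sum << 1) | (sum >> 31);  /* rotate left by 1 */
        sum += w[i];
    }
    return sum;
}

/* ISO/Fletcher-style sums over 32-bit words, 2's complement arithmetic:
   R = Sum(1..L) Wi and S = Sum(1..L) (L-i+1)*Wi; note S is just the sum
   of the running values of R, so both fall out of one pass.  As in my
   experiment, the stored checksum is R + S (a proper scheme would carry
   both sums in a 64-bit field). */
uint32_t iso_checksum32(const uint32_t *w, size_t n)
{
    uint32_t r = 0, s = 0;
    for (size_t i = 0; i < n; i++) {
        r += w[i];   /* running sum R */
        s += r;      /* accumulates (L-i+1)-weighted sum S */
    }
    return r + s;
}
```

Note that each loop makes exactly one memory reference per word of packet data, which matters for the performance comparison below.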
I did some basic performance tests on these algorithms. The following gives the performance of each algorithm for checksumming a VMTP packet with 1024 bytes of data (assumed to be long-word aligned):

    Algorithm               MC 68010 (10 MHz)   MC 68020 (16 MHz)
    add-and-rotate           916 usec            320 usec
    ISO Transport (32-bit)   914 usec            336 usec
    ISO (16-bit)            1245 usec            446 usec
    TCP                      890 usec            323 usec

The TCP figure is really not too accurate - it lacks a test for carry on each 32-bit add. The first ISO version uses 32-bit adds to develop the two sums using 2's complement arithmetic (and then I just store the sum of the two sums - a proper scheme here would require 64 bits for the checksum field). (Fletcher notes that the 1's complement version has slightly better properties but is more expensive to compute.) The second ISO version uses 16-bit adds, which means a register swap is necessary, as well as doubling the number of adds required. The results here appear to contradict Jon Crowcroft's comments on performance. Perhaps his factor of 4 comes from a poor implementation? The ISO 32-bit version and the TCP version are in (portable) non-asm C, whereas the others are done in assembly language. An assembly version of ISO 32-bit appears about 80 usec faster on a 68020.

All these algorithms require only one memory reference (to get the 32-bit word out of the packet) per word in the packet, which is the minimal memory reference cost given 32-bit data paths. The numbers seem to indicate that a few extra instructions per long word (ones that do not increase the number of memory references) make no significant difference in the performance of the algorithm. This is particularly true on machines like the 68020, where the inner loop fits into the on-chip instruction cache - and this is the future. For instance, the ISO and TCP versions differ by essentially one addition, and thus by 13 usec (for 1108 register-register adds!).
Thus, I argue that there is no significant performance penalty for adding a few non-memory-reference instructions, such as a rotate or an extra addition, to get better detection capabilities.

3) Location of the checksum

I propose to place the VMTP checksum in a trailer portion of the packet. This allows a high-performance network interface, such as the one we are developing here at Stanford, to add the checksum as the packet is transmitted to the network, avoiding 2 extra memory references. It does not appear to add any software overhead without such an interface, unless one is trying to use a relatively simple DMA network interface and DMA packets on transmission and reception without a copy. Since this is in general not feasible without scatter-gather, and with scatter-gather the trailer presents no problem, I don't think this is a real disadvantage.

My vision of the future is of network interfaces that implement much or all of the transport level, providing a service similar to the disk interface - i.e. one does not generally worry in software about errors between main memory and the disk. Note that the error checking is still end-to-end, at least between the network interfaces in the end machines. Also, with request-response protocols, the response is an application-level end-to-end acknowledgement of the request.

For these interfaces, performance is determined by the number of memory references. The next generation of network interfaces must minimize the number of memory references to transmit and receive packets, or they will be slow. In fact, one of the big challenges (the only one?) is to implement "intelligence" on the network interface without being killed by the delay of extra copies (i.e. memory references). Clearly, the value of the checksum is not known until the entire packet has been scanned. Putting the checksum at the end allows it to be calculated as part of transmitting the packet from the network interface to the network.
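As a software analogue of what such an interface would do in hardware, a trailer checksum can be deposited during the one and only pass over the outgoing packet. The sketch below uses a plain word-sum as a stand-in for whichever checksum algorithm is chosen; the buffer layout and names are hypothetical, not the VMTP wire format:

```c
#include <stddef.h>
#include <stdint.h>

/* Scan an outgoing packet once (as a transmit engine would while
   streaming it to the network) and deposit the checksum in the last
   32-bit word.  The word-sum here is a placeholder for the real
   checksum algorithm; names and layout are illustrative only. */
void append_trailer_checksum(uint32_t *pkt, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < nwords; i++)
        sum += pkt[i];          /* one memory reference per word */
    pkt[nwords - 1] = sum;      /* trailer slot, known only at the end */
}
```

A header-resident checksum, by contrast, would force the sender to scan the packet first and then go back and patch the header - exactly the extra memory references the trailer placement avoids.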
The only other copy into which it might be incorporated, between the host and the network interface, has several problems. First, the data is not necessarily packetized at that point, and definitely not in our network interface design. Second, the data rate on that path can be so high (and must be, in multiprocessors) that fitting a checksum calculation into the copy is difficult. Moreover, one may want to do encryption at the same time. For instance, we move data at 320 megabits/sec. in our design for the VMEbus, and much higher speeds are planned.

Currently, we are dealing with very "1st" generation network interfaces that handle multi-megabit data rates. I think the next generation can be much more reliable, efficient, and powerful. We need to design the transport protocol so that it does not introduce unnecessary difficulties for such interfaces. Thus, I feel this minor modification to the protocol supports the right way of doing network interfaces, and that it aids these interfaces in dealing with an intrinsic problem, not one that will go away with a few more bells and whistles on the interfaces.

One issue a trailer checksum raises is alignment relative to encryption. I have tried to keep the header a multiple of 64 bits. With a null data segment and a 32-bit checksum, the VMTP packet is still a multiple of 64 bits. If we force the data segment to also be a multiple of 64 bits, this all remains true. This is what I propose to do.

In summary, I recommend we go to a 32-bit checksum and use the ISO checksum algorithm, modified to use 32-bit units. The ISO checksum is a standard and has an article analyzing its properties (to some extent). (If we also start the checksum with a non-zero initial value, it should do well at detecting a packet of all zeroes.) Finally, we place the checksum in a single 32-bit trailer portion of the packet.

Comments?

David Cheriton

Ref: J.G. Fletcher, An Arithmetic Checksum for Serial Transmission, IEEE Trans. on Comm., COM-30 (1), 247-251, Jan. 1982.