HORN%HYDRA@sdi.polaroid.COM (12/13/89)
Adding a checksum option with an expanded checksum size (32, 64, ... bits) seems to be a reasonable way to start experimenting with error control algorithms for high speed networks. A similar experimental option for VMTP also seems reasonable.

As transmission speeds increase I think that we will need to look more and more to forward error correction (FEC) techniques to replace ARQ techniques. With TCP at 64 Kbit/s effective bandwidth and 250 msec end-to-end delays, the ARQ impact on response time is modest and the buffering demands are low. At high bandwidth you have many megabytes in flight, with correspondingly major buffering demands. I would not be surprised to see the buffers move to disk, or to re-computation of the data in some cases, with the result that a single ARQ retransmission has a very large performance impact.

FEC techniques are improving steadily and are now in widespread use both within modems (typically convolutional codes) and in media like CDROM (Reed-Solomon codes). For the non-error case, there are table lookup approaches to R-S (and some other) codes that reduce the computation load to 10-20 instructions per byte for a code that can withstand a 0.001 BER. The computation needed to repair an error is much larger; these approaches only minimize the cost of generating and verifying checkwords.

One of the important aspects of using FEC is a good characterization of the error modes of the telemetry system. This may be the most difficult aspect of introducing it into any large network. You do need to understand how errors occur: what the typical error burst sizes are, deletion and insertion modes, etc.

Examination of the system level tradeoffs would be worthwhile. FEC costs somewhat more CPU (which is rapidly getting cheaper) and uses a controllable percentage of the bandwidth (typically a few percent), while drastically reducing the need for buffers at both ends. The extra cost for FEC at CDROM rates (about 1 Mbit/s) is a few hundred dollars at most (retail pricing). FEC can always retreat to ARQ for cases beyond its ability to repair.

R Horn
horn%hydra@polaroid.com
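For concreteness, the bandwidth-delay arithmetic behind "many megabytes in flight" works out as in this small C sketch. Only the 64 Kbit/s rate and the 250 msec delay come from the post above; the T3 and gigabit rates are illustrative assumptions:

    #include <stdio.h>

    int main(void)
    {
        double rtt = 0.250;                 /* 250 msec end-to-end delay */
        double bps[] = { 64e3, 45e6, 1e9 }; /* 64 Kbit/s from the post;
                                               T3 and gigabit are illustrative */
        int i;

        /* bytes in flight = bandwidth * delay / 8 */
        for (i = 0; i < 3; ++i)
            printf("%10.0f bit/s -> %8.0f bytes in flight\n",
                   bps[i], bps[i] * rtt / 8.0);
        return 0;
    }

At 64 Kbit/s this is only 2,000 bytes of retransmission buffer, but at 1 Gbit/s the same delay puts roughly 31 Mbytes in flight, which is where disk-backed buffers or re-computation start to look plausible.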
mckenney@aai8.itstd.sri.com (Paul E. McKenney) (12/15/89)
In article <0CEC4873D71F21B02A@sdi.polaroid.com> HORN%HYDRA@sdi.polaroid.COM writes:
>[ . . . ]
>As transmission speeds increase I think that we will need to look
>more and more to forward error correction (FEC) techniques to
>replace ARQ techniques.

Nachum Shacham and I have recently written a paper on reducing the need for ARQ through use of FEC. The title is ``Packet Recovery in High-Speed Networks Using Coding and Buffer Management''; drop me a note with your USMail address if you would like a draft copy.

Thanx, Paul
rpw3@rigden.wpd.sgi.com (Robert P. Warnock) (12/15/89)
HORN%HYDRA@sdi.polaroid.COM writes:
+---------------
| FEC techniques are improving steadily and are now in widespread use
| both within modems (typically convolutional codes) and in media like
| CDROM (Reed-Solomon codes). For the non-error case, there are table
| lookup approaches to R-S (and some other) codes that reduce the
| computation load to 10-20 instructions per byte for a code that can
| withstand a 0.001 BER. The computation needed to repair an error is
| much larger...
+---------------

Actually, correcting can be pretty simple for certain special cases, such as the modified (255, 252) R-S code that IBM used in the 3370(?) disk. (Sorry, not sure about the model number; my Lin & Costello is at home.) That code computes and transmits the residuals over the various primitive polynomials separately, rather than with a single generator polynomial with the primitives as factors (which is the usual case). It will correct one bad byte in 252 bytes of data. (You can use shorter blocks, but each block needs 3 check bytes.)

In the special case of using a**(-1), a**0, and a**1 ["a" primitive in GF(256)] as the divisors, three checksums (bytes) are generated, and if you have 256-byte lookup tables to do your GF(256) multiplication (o.k., division by multiplying by the inverse), you get code like this (unoptimized!):

    for (i = 0; i < n; ++i) {
        Cm1 = recip_table_Am1[ Cm1 ^ data[i] ];
        C0  = C0 ^ data[i];     /* division by a**0 is simple XOR */
        C1  = recip_table_A1[ C1 ^ data[i] ];
    }

Then at the other end, you would compute rCm1, rC0, and rC1 (for "received C_{-1,0,1}") similarly over the received data (*except* the trailing three bytes, which are Cm1, C0, and C1), and XOR them respectively with Cm1, C0, and C1 as received. If C0 == 0, you have no error. If at most one byte is in error, then C0 is the bits in error and C1/C0 (arithmetic done in GF(256), by table lookup) gives the byte number containing the error, and Cm1 is used (I forgot how, probably by computing the reciprocal position number?) to guard (somewhat) against multiple errors. Oh, you correct the error this way (simple!):

    bad_byte_num = GF_div(C1, C0);  /* GF_div is the table-lookup divide */
    data[bad_byte_num] ^= C0;

Anyway, on the transmit side you have 3 XORs and two table lookups (LOADs) per byte, and on the receive side you have the same plus a couple of XORs to check for errors, and a couple of table lookups and another XOR if you need to correct.

If you also have a stronger checksum somewhere in the data (embedded CRC-32, Fletcher, etc.), you can use that to guard against miscorrection, and drop the Cm1 byte. In that case the R-S code adds just two bytes to each block of 253 bytes or less, and the CPU time is just over half of what was shown above. (Note that your other checksum must be *inside* the block.) After correcting as per the R-S code (which just might correct a byte of your other checksum), you recompute/check your other checksum on the corrected data, and if it's o.k., you probably didn't miscorrect.

(Been thinking about using this code for SLIP over noisy lines. Now if I can just find time to actually write/test/post the C code for the R-S routines...)

-----
Rob Warnock, MS-9U/510      rpw3@wpd.sgi.com    rpw3@pei.com
Silicon Graphics, Inc.      (415)335-1673       Protocol Engines, Inc.
2011 N. Shoreline Blvd.
Mountain View, CA  94039-7311
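As a concrete (and, in the spirit of the closing caveat above, untested) sketch of the two-check-byte variant: the code below builds conventional GF(256) exp/log tables over the primitive polynomial x**8 + x**4 + x**3 + x**2 + 1, computes C0 and C1 in the same Horner form as the loop quoted above, and corrects a single bad data byte. The polynomial choice and all the names (gf_exp, gf_log, rs_encode, rs_correct) are assumptions made for this sketch, not from the post. One detail glossed over above: the quotient C1/C0 is the error *locator* a**k, so one extra log-table lookup is needed to turn it into a byte index.

    #include <stdio.h>

    #define GF_POLY 0x11D   /* x^8 + x^4 + x^3 + x^2 + 1, primitive over GF(2) */

    static unsigned char gf_exp[512];  /* a**i, doubled so log sums need no mod */
    static int           gf_log[256];  /* log base a; gf_log[0] is never used   */

    static void gf_init(void)
    {
        int i, x = 1;
        for (i = 0; i < 255; ++i) {
            gf_exp[i] = gf_exp[i + 255] = (unsigned char)x;
            gf_log[x] = i;
            x <<= 1;
            if (x & 0x100)
                x ^= GF_POLY;          /* reduce modulo the field polynomial */
        }
    }

    static unsigned char gf_mul(unsigned char p, unsigned char q)
    {
        if (p == 0 || q == 0)
            return 0;
        return gf_exp[gf_log[p] + gf_log[q]];
    }

    static unsigned char gf_div(unsigned char p, unsigned char q)  /* q != 0 */
    {
        if (p == 0)
            return 0;
        return gf_exp[gf_log[p] + 255 - gf_log[q]];
    }

    /* Compute the two check bytes over data[0..n-1], n <= 253:
     * c[0] = XOR of all data bytes; c[1] = sum of data[i] * a**(n-i),
     * accumulated in the Horner form of the loop quoted above. */
    static void rs_encode(const unsigned char *data, int n, unsigned char c[2])
    {
        unsigned char c0 = 0, c1 = 0;
        int i;
        for (i = 0; i < n; ++i) {
            c0 ^= data[i];
            c1 = gf_mul((unsigned char)(c1 ^ data[i]), 2);  /* *= a */
        }
        c[0] = c0;
        c[1] = c1;
    }

    /* Recompute the checks over the received data, XOR against the received
     * check bytes to get syndromes, and fix at most one bad data byte in
     * place.  Returns 0 = clean, 1 = corrected, -1 = uncorrectable (more
     * than one error, or the error hit a check byte rather than the data). */
    static int rs_correct(unsigned char *data, int n, const unsigned char c[2])
    {
        unsigned char r[2], s0, s1;
        int pos;

        rs_encode(data, n, r);
        s0 = r[0] ^ c[0];   /* = error value, if one data byte is bad */
        s1 = r[1] ^ c[1];   /* = error value * a**(n - pos)           */
        if (s0 == 0 && s1 == 0)
            return 0;
        if (s0 == 0 || s1 == 0)
            return -1;
        pos = n - gf_log[gf_div(s1, s0)];  /* log of the locator a**(n-pos) */
        if (pos < 0 || pos >= n)
            return -1;
        data[pos] ^= s0;                   /* flip the bad bits back */
        return 1;
    }

    int main(void)
    {
        unsigned char buf[16] = "hello, noisy net";
        unsigned char c[2];

        gf_init();
        rs_encode(buf, 16, c);
        buf[7] ^= 0x5A;                    /* corrupt one byte in transit */
        printf("correct() = %d, byte 7 = 0x%02x\n",
               rs_correct(buf, 16, c), buf[7]);
        return 0;
    }

A production version would replace gf_mul in the encode loop with the 256-byte per-divisor tables described above, getting back to the two-loads-three-XORs-per-byte budget of the original fragment; restoring the Cm1 byte as a third check gives the full (255, 252) layout.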