rhyde@ucrmath.ucr.edu (randy hyde) (04/20/91)
I finally got my act together and I'm submitting some examples of CRC-16/ CCITT algorithms. For a really good discussion of the CRC-16/CCITT algo- rithm, please see "The C Programmer's Guide to Serial Communications" by Campbell. This is a great book and should be in the hands of anyone who needs to deal with any form of serial communications, regardless of the language they happen to program in. A *brief* description of CRC-16/CCITT: There are many CRC (cyclic redundancy check) algorithms laying around. For example, someone posted the CRC algorithm used by MINIX on the net here, it is not the same as the standard CRC-16/CCITT algorithm (I cannot verify this since I do not yet have Minix sources, the post could have been wrong). CRC algorithms are used like checksums and parity-- they're used to verify the correct- ness of a block of data. The CRC-16/CCITT algorithm has many advantages over checksums and parity checks. Such a description is beyond the scope of this article. See the reference above for more details. Also, the Aug/Sep 1990 C User's Journal had several articles about CRC's and other data checking algorithms dealing with serial communications. The CRC-16/CCITT algorithm basically computes the remainder of an n-bit number divided by 0x1021 using 1's complement arithmetic (mod 2). This algorithm was chosen because it is fairly reliable at detecting all sorts of errors *and* (most importantly) it is very easy to implement in hardware. CRC's were originally designed to verify data coming off a disk drive. Since the data comes off the disk at a rapid rate, software (at the time) was unable to keep up with it. Therefore, they needed to compute the CRC using hardware. A 1's complement division by 0x1021 is easy to implement using a couple of shift registers and XOR gates. Therefore, a CRC-16 algorithm is essentially a long division algorithm which uses XOR rather than subtraction at each step in the division. It throws away the quotient and keeps the remainder. There is actually a little more to it than this, but that's close enough for now. To compute the CRC-16/CCITT value (I'll just call it CRC from here on out)` you initialize the crc value to zero (many modern algorithms initialize it to 0xffff instead, for some very good reasons). You then grab each byte from the message and, one bit at a time, xor the H.O. bit of the data with the H.O. bit of the current checksum. If the result is zero, you simply shift the crc (16-bits) and the data (8 bits) one position to the left. If the XOR of the H.O. bits is one, you still shift the data byte one position to the left, the crc value, however, you must shift and then xor with 0x1021 (the CCITT "polynomial"). You repeat this for all eight bits and then repeat the process for the entire byte stream of the message. The following C(++) code does this: #include <stdio.h> int crc(int data, int curcrc) { static int i; for (i=0; i<8; ++i) { if ((data << 8) ^ curcrc) curcrc = (curcrc << 1) ^ 0x1021; else curcrc <<= 1; data <<= 1; } return curcrc; } int main() { int i,j,c; for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) c = crc(j,c); } puts("done"); } BTW, there is a minor bug (affecting portability) in this code. Can you spot it? I left this bug in to make the above code much easier to read. I will present a version of this code in a moment with the bug removed. Running time for this algorithm is 3:39 on a 12Mhz 80286 (I had my laptop with me at Cal Poly yesterday while working on this; I apologize for not doing this on a GS, but my GS isn't exactly portable). Note: I used Borland`s BC (C++) compiler, emitting 8086 code with optimizations turned on (although I doubt optimization would make any difference on such a small, simple, program) Implemented in (8086) assembly language, the above program looks like the following: include stdlib.a ;**************************************************************************** ; ; ; ; ; cseg segment para public 'code' assume cs:cseg, ds:dseg ; ; public PSP PSP dw ? ;Must be in CODE segment! ; ; ; Some useful constants: ; cr equ 13 lf equ 10 eos equ 0 ; true equ 1 false equ 0 ; ; ; Main is the main program. Program execution always begins here. ; Main proc mov cs:PSP, es ;Save pgm seg prefix mov ax, seg dseg ;Set up the segment registers mov ds, ax mov es, ax mov dx, 0 ;Allocate all available RAM. MemInit ; ; mov cx, 10000 crclp1: xor dx, dx ;Set DX to zero. mov ax, dx crclp2: call crc inc al jne crclp2 dec cx jne crclp1 print db "done",0 ; ; ; Quit: mov ah, 4ch int 21h ; Main endp ; ; ; CRC routine: ; ; Entry: ; DX contains current crc value ; AL contains new character ; ; Exit: ; DX contains new crc value ; crc proc near push cx push ax mov cx, 8 ;Repeat once for each bit. BitLoop: mov ah, al ;Get a copy of data value. xor ah, dh ;This clears the carry! jns NoCCITT ;XOR of H.O. bits = 1? xor dx, 0810h ;1021h shr 1 stc ;Underflow from 1021 shr 1 ; ; The following rcl is just a SHL if we branched from above (since the XOR ; instruction clears the carry flag). It is an SHL which shifts in a 1 ; if we fall through from above. Shifting in a one supplies the L.O. bit ; of 1021h which we XOR'd in above. ; NoCCITT: rcl dx, 1 shl al, 1 ;Go to the next bit in data. loop BitLoop pop ax pop cx ret crc endp ; ; cseg ends ; ; ; Allocate a reasonable amount of space for the stack (2k). ; sseg segment para stack 'stack' stk db 256 dup ("stack ") sseg ends ; ; end Main Again, I apologize for 8086 assembly, it's all I had available at the time. Running time for this code on the same machine as above: 1:19. The above are the simple and easy to understand versions of the CRC algorithm. Alas, they are very slow. Keep in mind, the CRC algorithm may need to keep up with a fast serial line (or some other fast device). We certainly don't want to lose characters because of CRC computation. Before jumping into major optimizations, let's consider a very common optimization that just about any assembly language programmer would think of and good "C" programmers would think of: unravelling (unrolling) the loops. The loop in the crc function can be easily unrolled since it executes exactly eight times each time you call it. Since the function is rather simple, why bother with the function call? Why not include the code in-line and dispense with the overhead of the call? The following implement this: #include <stdio.h> #define crc18(data, curcrc) if ((data << 8) ^ curcrc) \ curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff; \ else curcrc = (curcrc << 1) & 0xffff;\ data = (data << 1) & 0xff00; int crc(int data, int curcrc) { static int i; data <<= 8; for (i=0; i<8; ++i) { if (data ^ curcrc) curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff; else curcrc = (curcrc << 1) & 0xffff; data = (data << 1) & 0xff00; } return curcrc; } int main() { int i,j,k,c; for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) c = crc(j,c); } puts("done, press return for macro version"); c = getchar(); for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) { k=j; crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); } } puts("done"); } This code implements *both* the function call and the unrolled version. If you didn't notice the bug I corrected, shorts and ints in C have no guaranteed size other than they are at least 16 bits long (ANSI C). You have no guarantee that shorts are 16 bits long, which is why we need the and with 0xffff above. If your C compiler has 16-bit shorts, I would hope that it would optimize out this operation. Running time for this code: 2:02 (at least for the in-line version). A substantial improvement. Consider the same code in 8086 assembly: include stdlib.a ;**************************************************************************** ; Crc18 macro local NoCCITT mov ah, al xor ah, dh ;This clears the carry! jns NoCCITT xor dx, 0810h ;1021h shr 1 stc ;Underflow from 1021 shr 1 NoCCITT: rcl dx, 1 shl al, 1 endm ; ; ; cseg segment para public 'code' assume cs:cseg, ds:dseg ; ; public PSP PSP dw ? ;Must be in CODE segment! ; ; ; Some useful constants: ; cr equ 13 lf equ 10 eos equ 0 ; true equ 1 false equ 0 ; ; ; Main is the main program. Program execution always begins here. ; Main proc mov cs:PSP, es ;Save pgm seg prefix mov ax, seg dseg ;Set up the segment registers mov ds, ax mov es, ax mov dx, 0 ;Allocate all available RAM. MemInit ; ; mov cx, 10000 crclp1: xor dx, dx ;Set DX to zero. mov ax, dx crclp2: push ax crc18 crc18 crc18 crc18 crc18 crc18 crc18 crc18 pop ax inc al jne crclp2 dec cx jne crclp1 print db "done",0 ; ; ; Quit: mov ah, 4ch int 21h ; Main endp ; ; ; ; cseg ends ; ; ; Allocate a reasonable amount of space for the stack (2k). ; sseg segment para stack 'stack' stk db 256 dup ("stack ") sseg ends ; ; ; end Main Running time for this code: 0:48. Again a substantial improvement. I would like to point out that the above algorithms are about as far as you can take the CRC-16/CCITT algorithm and still have code that is easily recognizable for what it does. Additional optimizations will destroy the readability of the code. Nonetheless, even 48 seconds may be a little too slow for someone wanting to do 2,560,000 CRC computations. Let's explore the next step. Any assembly language programmer worth his salt would immediately suggest "table look-up" as a means to optimize the program. A good C programmer would also make this suggestion (although you see this technique used *far* less often in C than in assembly). The code Doug posted used this technique to speed things up (which is why the C code was very difficult to follow). A modified version of that code (modified so that the calling sequence is similar to the above routines) follows: #include <iostream.h> unsigned crctab[2][16] = { 0x0000, 0xc0c1, 0xc181, 0x0140, 0xc301, 0x03c0, 0x0280, 0xc241, 0xc601, 0x06c0, 0x0780, 0xc741, 0x0500, 0xc5c1, 0xc481, 0x0440, 0x0000, 0xcc01, 0xd801, 0x1400, 0xf001, 0x3c00, 0x2800, 0xe401, 0xa001, 0x6c00, 0x7800, 0xb401, 0x5000, 0x9c01, 0x8801, 0x4400}; int crc(int data, int curcrc) { static int temp; temp = curcrc ^ data; return crctab[0] [temp & 0xf] ^ crctab[1][(temp >>4) & 0xf] ^ (curcrc >> 8); } int main() { int i,j,c; for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) c = crc(j,c); } cout << "done"; } This code is an interesting compromise between space and speed. Rather than use a 256 entry lookup table indexed by each byte of data, this algorithm uses the two nibbles of each data byte as an index into two halves of a 2D matrix (one side for L.O. nibble, one side for H.O. nibble). Someone had to think long and hard to figure this one out, but more on that later. Running time: 0:47 A *very* substantial improvement over the bit at a time algorithm. A couple of comments about the above algorithm- the use of a 2D array is weird. On a compiler which doesn't optimize things too well, this would cost you dearly (since it would require a multiply and add to compute the address of the selected array element). Why the original author didn't use two separate arrays escapes me. This is probably a case of not realizing the underlying data structure (e.g., they should have known more assembly) or they didn't care about performance across architectures and compilers. I cannot imagine an assembly language programmer attacking this problem in this fashion. Therefore, I didn't bother implemented it. Most assembly language programmers would go directly to a 256-element look up table and use the data byte as an index into that table. That's what the next section of code does: If you're willing to live with a 256 element array rather than a 32-element array, the following C code is even better: #include <iostream.h> unsigned crctab[256] = { 0x0000, 0x1021,0x2042,0x3063,0x4084,0x50A5,0x60C6,0x70E7,0x8108, 0x9129,0xA14A,0xB16B,0xC18C,0xD1AD,0xE1CE,0xF1EF,0x1231, 0x0210,0x3273,0x2252,0x52B5,0x4294,0x72F7,0x62D6,0x9339, 0x8318,0xB37B,0xA35A,0xD3BD,0xC39C,0xF3FF,0xE3DE,0x2462, 0x3443,0x0420,0x1401,0x64E6,0x74C7,0x44A4,0x5485,0xA56A, 0xB54B,0x8528,0x9509,0xE5EE,0xF5CF,0xC5AC,0xD58D,0x3653, 0x2672,0x1611,0x0630,0x76D7,0x66F6,0x5695,0x46B4,0xB75B, 0xA77A,0x9719,0x8738,0xF7DF,0xE7FE,0xD79D,0xC7BC,0x48C4, 0x58E5,0x6886,0x78A7,0x0840,0x1861,0x2802,0x3823,0xC9CC, 0xD9ED,0xE98E,0xF9AF,0x8948,0x9969,0xA90A,0xB92B,0x5AF5, 0x4AD4,0x7AB7,0x6A96,0x1A71,0x0A50,0x3A33,0x2A12,0xDBFD, 0xCBDC,0xFBBF,0xEB9E,0x9B79,0x8B58,0xBB3B,0xAB1A,0x6CA6, 0x7C87,0x4CE4,0x5CC5,0x2C22,0x3C03,0x0C60,0x1C41,0xEDAE, 0xFD8F,0xCDEC,0xDDCD,0xAD2A,0xBD0B,0x8D68,0x9D49,0x7E97, 0x6EB6,0x5ED5,0x4EF4,0x3E13,0x2E32,0x1E51,0x0E70,0xFF9F, 0xEFBE,0xDFDD,0xCFFC,0xBF1B,0xAF3A,0x9F59,0x8F78,0x9188, 0x81A9,0xB1CA,0xA1EB,0xD10C,0xC12D,0xF14E,0xE16F,0x1080, 0x00A1,0x30C2,0x20E3,0x5004,0x4025,0x7046,0x6067,0x83B9, 0x9398,0xA3FB,0xB3DA,0xC33D,0xD31C,0xE37F,0xF35E,0x02B1, 0x1290,0x22F3,0x32D2,0x4235,0x5214,0x6277,0x7256,0xB5EA, 0xA5CB,0x95A8,0x8589,0xF56E,0xE54F,0xD52C,0xC50D,0x34E2, 0x24C3,0x14A0,0x0481,0x7466,0x6447,0x5424,0x4405,0xA7DB, 0xB7FA,0x8799,0x97B8,0xE75F,0xF77E,0xC71D,0xD73C,0x26D3, 0x36F2,0x0691,0x16B0,0x6657,0x7676,0x4615,0x5634,0xD94C, 0xC96D,0xF90E,0xE92F,0x99C8,0x89E9,0xB98A,0xA9AB,0x5844, 0x4865,0x7806,0x6827,0x18C0,0x08E1,0x3882,0x28A3,0xCB7D, 0xDB5C,0xEB3F,0xFB1E,0x8BF9,0x9BD8,0xABBB,0xBB9A,0x4A75, 0x5A54,0x6A37,0x7A16,0x0AF1,0x1AD0,0x2AB3,0x3A92,0xFD2E, 0xED0F,0xDD6C,0xCD4D,0xBDAA,0xAD8B,0x9DE8,0x8DC9,0x7C26, 0x6C07,0x5C64,0x4C45,0x3CA2,0x2C83,0x1CE0,0x0CC1,0xEF1F, 0xFF3E,0xCF5D,0xDF7C,0xAF9B,0xBFBA,0x8FD9,0x9FF8,0x6E17, 0x7E36,0x4E55,0x5E74,0x2E93,0x3EB2,0x0ED1,0x1EF0}; int crc(int data, int curcrc) { static int comb_val; comb_val = (curcrc >> 8) ^ data; return (curcrc << 8) ^ crctab[comb_val]; } int main() { int i,j,c; for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) c = crc(j,c); } cout << "done"; } Running time: 0:39; again, a substantial improvement. Now look at the same code in 8086 assembly: include stdlib.a ;**************************************************************************** ; ; ; Global variables go here: ; dseg segment para public 'data' ; .radix 16 crctbl dw 00000 dw 01021,02042,03063,04084,050A5,060C6,070E7,08108 dw 09129,0A14A,0B16B,0C18C,0D1AD,0E1CE,0F1EF,01231 dw 00210,03273,02252,052B5,04294,072F7,062D6,09339 dw 08318,0B37B,0A35A,0D3BD,0C39C,0F3FF,0E3DE,02462 dw 03443,00420,01401,064E6,074C7,044A4,05485,0A56A dw 0B54B,08528,09509,0E5EE,0F5CF,0C5AC,0D58D,03653 dw 02672,01611,00630,076D7,066F6,05695,046B4,0B75B dw 0A77A,09719,08738,0F7DF,0E7FE,0D79D,0C7BC,048C4 dw 058E5,06886,078A7,00840,01861,02802,03823,0C9CC dw 0D9ED,0E98E,0F9AF,08948,09969,0A90A,0B92B,05AF5 dw 04AD4,07AB7,06A96,01A71,00A50,03A33,02A12,0DBFD dw 0CBDC,0FBBF,0EB9E,09B79,08B58,0BB3B,0AB1A,06CA6 dw 07C87,04CE4,05CC5,02C22,03C03,00C60,01C41,0EDAE dw 0FD8F,0CDEC,0DDCD,0AD2A,0BD0B,08D68,09D49,07E97 dw 06EB6,05ED5,04EF4,03E13,02E32,01E51,00E70,0FF9F dw 0EFBE,0DFDD,0CFFC,0BF1B,0AF3A,09F59,08F78,09188 dw 081A9,0B1CA,0A1EB,0D10C,0C12D,0F14E,0E16F,01080 dw 000A1,030C2,020E3,05004,04025,07046,06067,083B9 dw 09398,0A3FB,0B3DA,0C33D,0D31C,0E37F,0F35E,002B1 dw 01290,022F3,032D2,04235,05214,06277,07256,0B5EA dw 0A5CB,095A8,08589,0F56E,0E54F,0D52C,0C50D,034E2 dw 024C3,014A0,00481,07466,06447,05424,04405,0A7DB dw 0B7FA,08799,097B8,0E75F,0F77E,0C71D,0D73C,026D3 dw 036F2,00691,016B0,06657,07676,04615,05634,0D94C dw 0C96D,0F90E,0E92F,099C8,089E9,0B98A,0A9AB,05844 dw 04865,07806,06827,018C0,008E1,03882,028A3,0CB7D dw 0DB5C,0EB3F,0FB1E,08BF9,09BD8,0ABBB,0BB9A,04A75 dw 05A54,06A37,07A16,00AF1,01AD0,02AB3,03A92,0FD2E dw 0ED0F,0DD6C,0CD4D,0BDAA,0AD8B,09DE8,08DC9,07C26 dw 06C07,05C64,04C45,03CA2,02C83,01CE0,00CC1,0EF1F dw 0FF3E,0CF5D,0DF7C,0AF9B,0BFBA,08FD9,09FF8,06E17 dw 07E36,04E55,05E74,02E93,03EB2,00ED1,01EF0; ; .radix 10 dseg ends ; ; ; ; cseg segment para public 'code' assume cs:cseg, ds:dseg ; ; ; public PSP PSP dw ? ;Must be in CODE segment! ; ; ; Main proc mov cs:PSP, es ;Save pgm seg prefix mov ax, seg dseg ;Set up the segment registers mov ds, ax mov es, ax ; mov cx, 10000 crclp1: xor dx, dx ;Set DX to zero. mov ax, dx crclp2: call crc inc al jne crclp2 dec cx jne crclp1 print db "done",0 ; ; Quit: mov ah, 4ch int 21h ; Main endp ; ; ; CRC update routine. ; ; On Entry: ; AL- New data value ; DX- Current CRC value ; ; On Exit: ; DX- New CRC value ; crc proc near mov bl, dh xor bl, al mov bh, 0 mov dh, dl mov dl, bh shl bx, 1 xor dx, crctbl[bx] ret crc endp ; ; ; cseg ends ; ; ; Allocate a reasonable amount of space for the stack (2k). ; sseg segment para stack 'stack' stk db 256 dup ("stack ") sseg ends ; ; end Main Running time: 0:17 I could probably get this down to 0:15 or 0:16 if I encoded the CRC calculation in-line rather than as a procedure/function. Assembly language typically works out a little better than twice as fast. The only surprise here is that it's actually that good. For such simple, short, routines, ASM's advantage over C usually isn't this good. For a CRC algorithm, speed is very important, so ASM is, perhaps, a better choice. But that wasn't the purpose of doing all this. Earlier (on the net), I'd made the statement that for a CRC algorithm, ASM isn't particularly more difficult to understand than C (and this is true, in general, for most bit-pushing operations). Now I'm assuming we're talking about expert C programmers looking at C code and expert assembly language programmers looking at ASM code. It isn't fair to take an expert C programmer who is a mediocre ASM programmer and have that same person evaluate C and ASM code. I claim the ASM code above isn't any more difficult to understand than the C code (assuming, of course, that you know the algorithm that the code is attempting to implement). In Doug's original post, he claimed that the ASM code wasn't any more readable than the C code. The reasons are two-fold: (1) the code used a truly miserable algorithm which was next to impossible to understand in the first place (*NO* language is going to make that algorithm particularly clear), (2) the ASM code looked like the output of a compiler. I would be ashamed of myself if I wrote assembly code that looked like that. Clearly compiler-generated assembly code is *not* going to be clearer to understand than hand written code. My apologizes if I'm offending an actual human author of that code. One final point- Note that my minor optimizations in assembly language` (unrolling the loop) ran at the same speed as the nibble loop-up table optimized code. For the same speed, I claim my assembly code is *much* easier to understand than the optimized C code. From a development point of view, it would take far more time to dream up, implement, and verify the nibble-based C code that Doug posted than it would take to implement the unrolled version of the original CRC algorithm in assembly. I offer this as proof that assembly is often the better choice for a project. Sure, you can find better algorithms in C which run faster than bad algorithms in ASM, but the end result is that you spend *far* more time designing, testing, and debugging the C code. And if you can find a better algorithm that works in C, that same algorithm will work in assembly as well. Keep in mind, that the CRC algorithm is intrinsically O(n). You're not going to do any better than this because it takes n operations to read the data. Therefore, the only thing we can attack is the constant in the complexity of this algorithm. ASM is really good at this. *** Randy Hyde
gwyn@smoke.brl.mil (Doug Gwyn) (04/21/91)
In article <13688@ucrmath.ucr.edu> rhyde@ucrmath.ucr.edu (randy hyde) writes: >Assembly language typically works out a little better than twice as fast. But the comparison was not really fair, since the assembly-language example implementation did not support the same linkage as the C code. For a function that performs so little work, function linkage overhead is an important factor. Also, the example C code was not optimal for the given algorithm. By making a few minor tweaks to it I reduced the C example run time by a factor of 26% on the host on which I read net news. Profiling the code showed that approximately as much time was being spent in the main program's loop overhead as in the crc function itself. For the revised crc function C source code: int crc(data, curcrc) int data; register unsigned int curcrc; { return (curcrc << 8) ^ crctab[(curcrc >> 8) ^ data]; } a fairly ordinary old-technology C compiler on a Gould PowerNode produced the following machine code: _crc: enter 8w,0x5 -- function linkage movw 9w+8w[b2],r1 -- pick up "curcrc" argument movw r1,r0 -- make a second copy of it lslw #8,r0 -- shift one copy left (accumulator) lsrw #8,r1 -- shift the other copy right xorw 8w+8w[b2],r1 -- XOR shifted copy with "data" argument lslw #2,r1 -- shift for indexing crctab[] xorw _crctab[r1],r0 -- XOR table entry with accumulator retn -- function linkage and a similar compiler on a DEC VAX produced the following machine code: _crc: .word 0x800 ; function linkage movl 8(ap),r11 ; pick up "curcrc" argument extzv $8,$24,r11,r0 ; shift one copy left (accumulator) xorl2 4(ap),r0 ; XOR shifted copy with "data" argument ashl $8,r11,r1 ; shift the other copy right xorl3 r1,_crctab[r0],r0 ; XOR table entry with accumulator ret ; function linkage Note that there isn't very much that one could do even by hand-coded assembly language to perform this computation much more efficiently. (Except that the VAX actually has microcode support for a CRC instruction! That is an example of a situation in which use of assembly language could be justified, since it is somewhat rare for a C compiler to exploit such specialized instructions. Another possible example would be in a bottleneck loop when testing the "carry bit" would be much more efficient than trying to do without it; originally I thought that was going to be the justification for coding the CRC algorithm in assembler.) >I claim the ASM code above isn't any more difficult to understand than the >C code (assuming, of course, that you know the algorithm that the code >is attempting to implement). For such a small function, the conceptual burden is indeed manageable. For a larger example involving linked data structures, etc. I would prefer to deal with C code over assembly code. >From a development point of view, it would take far more time to dream >up, implement, and verify the nibble-based C code that Doug posted than >it would take to implement the unrolled version of the original CRC >algorithm in assembly. But that's comparing apples and oranges. For the same algorithm with the same constraints (particularly function linkage), I don't think you've shown that assembly language code is appreciably faster than C code, taking into account my earlier observations. >I offer this as proof that assembly is often the better choice for a >project. Sure, you can find better algorithms in C which run faster than >bad algorithms in ASM, but the end result is that you spend *far* more >time designing, testing, and debugging the C code. And if you can find >a better algorithm that works in C, that same algorithm will work in >assembly as well. The C code, if you do it right, is immediately usable on a variety of machines, and development cost of the algorithm is independent of the implementation language. Implementation of the algorithm in C is no harder than in assembly language, and for complex algorithms it is very nice to be able to rely on the compiler for help with the bookkeeping. Why do you think high-level languages were invented? >Keep in mind, that the CRC algorithm is intrinsically O(n). You're not >going to do any better than this because it takes n operations to read >the data. Therefore, the only thing we can attack is the constant >in the complexity of this algorithm. ASM is really good at this. And if you look at the above C compiler output, you'll see that C is also really good at this, at least on decent machine architectures. As I have previously remarked, there are legitimate uses for assembler, but since C is about as fast, infinitely more portable, and easier to use for large projects, I recommend C over assembler in the majority of situations.
mdavis@pro-sol.cts.com (Morgan Davis) (04/22/91)
In-Reply-To: message from rhyde@ucrmath.ucr.edu Excellent article on CRC's, Randy. This is one to tuck away in the programmer's notes for sure. I went through this kind of wracking to find a more optimal algorithm. The table method is, by far, the best when it comes to speed. Though, sometimes just a quick and dirty (or maybe that should be "small and dirty") variation is to use the following: CalcCRC (ptr, count) char *ptr; word count; { register word x; do { CRC ^= *ptr++ << 8; /* XOR hi-byte with data */ for (x = 8; x; --x) if (CRC & 0x8000) CRC = CRC << 1 ^ 0x1021; else CRC <<= 1; } while (--count); } This function operates on a global named CRC. You call it by pointing to a data buffer, and passing the size of the buffer. After calling CaclCRC() the CRC global integer is updated accordingly. The optimizations here that differ from the original, frequently published version is that it: o Counts from 8 bits down to none, so the loop test simply involves a test for zero instead of comparing to 8. o Bit 15 of the CRC is tested with the bitwise AND operation instead of shifting it around to determine if the CRC needs the XOR treatment. o It operates on a buffer, rather than having to pass it a byte of data to work on, one at a time. This optimization was done with the 65816 in mind. So these kinds of optimizations may not make any difference on other processors, but I doubt if they would hurt. BTW, the 65816 variant on the above is: CalcCRC ldy #0 ;index into buffer newbyte shortm ;go to 8-bits lda [ptr],y ;grab some data iny ;prepare for next byte eor CRC+1 ;fix high-order byte of CRC sta CRC+1 longm ;back to 16-bit ldx #8 ;init shift counter shift asl CRC ;always shift it once bcc next ;if bit 15 clear, skip the XOR lda CRC ;XOR CRC with $1021 eor #$1021 sta CRC next dex bne shift ;do this 8 times dec count ;more bytes to do? bne newbyte ;yes rts --Morgan UUCP: crash!pro-sol!mdavis AOL, BIX: mdavis ARPA: crash!pro-sol!mdavis@nosc.mil GEnie: m.davis42 INET: mdavis@pro-sol.cts.com ProLine: mdavis@pro-sol
toddpw@nntp-server.caltech.edu (Todd P. Whitesel) (04/22/91)
Morgan, if you can live with 512 bytes of CRC tables then you can do block CRC's in something like 25 cycles per byte CRC'd... lda CRC xba sep #$30 crclp eor [data],y tax lda tblL,x xba eor tblH,x iny bne crclp rep #$30 xba sta CRC will do 256 bytes at a time, with a best case of 24 cycles/byte and a worst of 27. (Aligning the DP saves 1, aligning tblH and tblL guarantees 4 cycles each.) I know the table is generated by something simple like (sombody check me): tbl[byte] = updcrc(byte, 0); where updcrc(byte, oldcrc) does not use the table, and tbl is split into high and low byte arrays for the assembly fragment above. Todd Whitesel toddpw @ tybalt.caltech.edu
rhyde@gibson.ucr.edu (randy hyde) (04/23/91)
You keep claiming it isn't fair to compare an assembly language code routine
to a C routine when the assembly language code doesn't support C linkages.
Why would someone writing their code in assembly language use such linkages?
Of course if you were writing a single routine to link into a big C program.
But I'm talking about someone writing the entire program in C.
Clearly a small program is easier to manage and understand, regardless of the
language. But the performance benefits of small programs side with HLLs as
well. In the CRC example I have, the ASM program was only twice as fast as
the C version. My experience on the 8086 is that the larger the program, the
larger the performance difference. This is being reduced on newer processors
as companies like Motorola and Intel optimize their instruction sets for those
30% of the instructions that ***C and Pascal*** compilers use. It's still true
in general though, that as the program gets larger, so does the speed
difference.
So far, my experience with well-written C programs and well-written asm
programs
is that they're both easy for me to understand (regardless of the size). The
only problem is that very few people write well-written C or ASM
programs. Don't
try to tell me it's easier to write well-written C programs. My
experience shows
that either (1) it is not, or (2) C programmers aren't as good as ASM
programmers
because even though it's easier, they don't write better C programs. Most C
programs are a mess. I don't want flames from people saying "*I* write easy to
read C programs." Anyone can post a single example. In general, few people
(including myself) write easy to read and understand C programs all the time.
I can't comment on which is worse, a poorly written C program or a
poorly written assembly language program. They're both really bad.
>> I recommend C... for the majority of projects...
Surprise! So do I. Commercial software products is the one area where
I draw the line though. Most projects are not commercial (by my
definition, mass market consumer) products. This is a situation where
performance is more important than portability and, arguably, more
important than maintainability.
Actually, I should revise the above statement, I don't recommend C for
the majority of projects, actually I recommend an appropriate HLL for
the job; this is not always C. Most projects I recommend languages for
are not in the class of commercial products.
*** Randy Hdye
gwyn@smoke.brl.mil (Doug Gwyn) (04/24/91)
In article <13769@ucrmath.ucr.edu> rhyde@gibson.ucr.edu (randy hyde) writes: >You keep claiming it isn't fair to compare an assembly language code routine >to a C routine when the assembly language code doesn't support C linkages. >Why would someone writing their code in assembly language use such linkages? >Of course if you were writing a single routine to link into a big C program. >But I'm talking about someone writing the entire program in C. The reason is that you need to impose identical constraints ("requirements") on the task if you want to have a fair comparison. The C-generated machine code that I posted shows that at least some C compilers generate fairly decent instruction sequences for the algorithm itself. The timing for the CRC-16 example was dominated by "overhead" time, not by the CRC-16 computation. For a fair comparison of different implementations of the SAME task, the functional interface should be the same. In fact, were I to have to use the example CRC-16 C code in an actual application, it wouldn't take me long to identify the bottleneck caused by making a function call per every two bytes of checksummed data, and change the code to eliminate that overhead; in effect, when writing the assembly code you had already performed a similar linkage optimization by passing data in registers. Almost all the assembly code I can lay my hands on around here is expected to be used from a higher-level language, because we don't agree with the notion that to gain any benefit from use of assembler it is necessary to do EVERYthing in assembler. For large projects, that would be sheer folly. Consequently, it is reasonable to expect that much code in assembler will have to meet the constraints imposed by HLL interprocedural linkage and other run-time conventions. If you were to tackle an entire project in assembler, then you avoid any specific HLL's run-time conventions, but you have to do similar bookkeeping somehow. I can attest from experience that it is harder to get those details right when using assembly language; more of the responsibility for bookkeeping must be assumed by the programmer (speaking generally) when using assembler -- which is one of the big reasons for using a higher-level language. >... as the program gets larger, so does the speed difference. This directly contradicts the observations of professional software engineers dating as far back as the 1960s. Indeed, it is taken as a virtual axiom these days that most of the execution time in spent in a minority of the code. Because most applications have one or more such identifiable bottlenecks in a small portion of the code (often in just a few lines of the HLL source code), even an infinite speedup in the majority of the code would have little effect on the overall application speed. Indeed, use of assembly language to speed up genuine bottlenecks is one of the few uses of assembly language that I would consider legitimate -- but only after the bottleneck has been clearly identified and only if other solutions are insufficient. (Jon Bentley has written some nice books on the subject of improving code efficiency.) >I can't comment on which is worse, a poorly written C program or a >poorly written assembly language program. They're both really bad. That's true, but even well-written assembly code is unlikely to be of much value for porting purposes, while some truly horrible C code has been ported all around the place. Your point about coding quality is well taken -- no matter what programming language is used, it is possible to produce a neat, maintainable job or an ugly unmaintainable mess.
rhyde@dimaggio.ucr.edu (randy hyde) (04/27/91)
>>> Most of the execution time is spent in a small percentage of the code.
Ah, the good old 90-10 rule.
I subscribe to this rule in totality. The only problem is finding that 10%
Most people assume the 10% (responsible for 90% of the execution time) is all
together in one big clump. Extract that and rewrite it in assembly and your
program magically runs almost as fast as a pure assembly language program.
I've fallen into this trap before; several times in fact. That's why I'm so
strong on assembly today.
I truly believe that 10% of the code is responsible for 90% of the
execution time.
It's just that that 10% consists of 0.5% here, 0.2 % there, 1% in that corner,
0.74% somewhere else, and so on. Often the slow code is intermixed with
the fast code in such a way that you wind up rewriting a large part of
the program in assembly anyway. I know, I've done this so many times in
the past on applications that were written in C or Pascal and needed a
considerable speed-up.
Two projects, in particular, come to mind. One was an editor, the other
was a runoff type program. On the runoff program I spent as much time
recoding portions of the program as I did writing it in Pascal in the
first place.
It ran only twice as fast when I was done, which wasn't fast enough. I
re-wrote it in assembly and it ran about ten times faster than the
hybrid version.
I had similar results with the editor. BTW, rewriting it in assembly
took less time than hacking up the Pascal code with assembly.
Now I'm talking about programs on the order of 10,000 to 50,000 lines.
I wouldn't make these same claims for a 1,000,000 line program. OTOH, I
wouldn't work on a 1,000,000 line program, either.
BTW, I have Bentley's "Writing Efficient Programs" and I assign it in a couple
of my classes. Definitely a good book regardless of your choice of language.
psonnek@pro-mansion.cts.com (Patrick Sonnek) (04/27/91)
In-Reply-To: message from gwyn@smoke.brl.mil >That's true, but even well-written assembly code is unlikely to be of >much value for porting purposes, while some truly horrible C code has >been ported all around the place. This is a good thing?!? As has been stated before by someone else (Forgive me, I've forgotten who.) I don't care if it's portable, I want the application to run on MY computer FAST! ---- ProLine: psonnek@pro-mansion Sysop Pro-mansion: 507/726-6181 Internet: psonnek@pro-mansion.cts.com MCImail: psonnek UUCP: crash!pro-mansion!psonnek ARPA: crash!pro-mansion!psonnek@nosc.mil BITNET: psonnek%pro-mansion.cts.com@nosc.mil <<Real programmers don't program in HLL's.>> <<HLL's are for wimpy application coders!!>>
stephens@latcs2.lat.oz.au (Philip J Stephens) (04/28/91)
Randy Hyde writes: >Two projects, in particular, come to mind. One was an editor, the other >was a runoff type program. On the runoff program I spent as much time >recoding portions of the program as I did writing it in Pascal in the >first place. >It ran only twice as fast when I was done, which wasn't fast enough. I >re-wrote it in assembly and it ran about ten times faster than the >hybrid version. >I had similar results with the editor. BTW, rewriting it in assembly >took less time than hacking up the Pascal code with assembly. I would agree with Randy here. People have been saying that the time to write and debug a program in assembly is much more prohibative than writing and debugging the same program in a HLL. I disagree with this point; if you're a bad programmer, then you'll do just as bad a job in a HLL as you would in assembly. But if you're a good programmer, you'll be able to write in assembly just as fast and as well as in a HLL. Of course, I do agree with the statement that you shouldn't choose assembly over an HLL unless it is obvious that you're going to run into speed problems. There's a real art to debugging an assembly language program, one that requires years of experience to develop. Not that it's all that different from debugging a HLL program; just a little trickier. <\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/><\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\> < Philip J. Stephens >< "Many views yield the truth." > < Hons. student, Computer Science >< "Therefore, be not alone." > < La Trobe University, Melbourne >< - Prime Song of the viggies, from > < AUSTRALIA >< THE ENGIMA SCORE by Sheri S Tepper > </\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\></\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/>
gwyn@smoke.brl.mil (Doug Gwyn) (04/30/91)
In article <8866@crash.cts.com> psonnek@pro-mansion.cts.com (Patrick Sonnek) writes:
-In-Reply-To: message from gwyn@smoke.brl.mil
->That's true, but even well-written assembly code is unlikely to be of
->much value for porting purposes, while some truly horrible C code has
->been ported all around the place.
-This is a good thing?!? As has been stated before by someone else (Forgive
-me, I've forgotten who.) I don't care if it's portable, I want the
-application to run on MY computer FAST!
If it's not portable, odds are it won't run on your computer at all!
If an application runs well enough, you should be happy, rather than
pointlessly wish that it ran faster.
bazyar@ernie (Jawaid Bazyar) (05/01/91)
In article <16004@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes: >If it's not portable, odds are it won't run on your computer at all! > >If an application runs well enough, you should be happy, rather than >pointlessly wish that it ran faster. Of course, the IBM solution would be to throw more hardware at the problem. In this case, I'd simply recommend one of Zip's fastastic and cheap accelerators. -- Jawaid Bazyar | "Twenty seven faces- with their eyes turned to Senior/Computer Engineering | the sky. I have got a camera, and an airtight bazyar@cs.uiuc.edu | alibi.." Apple II Forever! | I need a job... Be priviliged to pay me! :-)
psonnek@pro-mansion.cts.com (Patrick Sonnek) (05/03/91)
In-Reply-To: message from gwyn@smoke.brl.mil >If it's not portable, odds are it won't run on your computer at all! If it was written in Assembly for my computer specifically it will! >If an application runs well enough, you should be happy, rather than >pointlessly wish that it ran faster. I don't consider speed to be pointless. My time is valuable. I don't want to have to wait around for inefficient code. I think you will find my opinions about speed to be consistant with the users (as opposed to programmers) out there in the world. It has been show, on mainframes anyways, that users will find response time of 3 seconds or less acceptable, with subsecond being the optimum. Anything over 3 seconds is intolerable to the average user. So speed is very important. ---- ProLine: psonnek@pro-mansion Sysop Pro-mansion: 507/726-6181 Internet: psonnek@pro-mansion.cts.com MCImail: psonnek UUCP: crash!pro-mansion!psonnek ARPA: crash!pro-mansion!psonnek@nosc.mil BITNET: psonnek%pro-mansion.cts.com@nosc.mil <<Real programmers don't program in HLL's.>> <<HLL's are for wimpy application coders!!>>
jerry@polygen.uucp (Jerry Shekhel) (05/03/91)
psonnek@pro-mansion.cts.com (Patrick Sonnek) writes: > >This is a good thing?!? As has been stated before by someone else (Forgive >me, I've forgotten who.) I don't care if it's portable, I want the >application to run on MY computer FAST! > What a joke. If you want the application to run on your computer AT ALL, you'd better care about portability. -- +-------------------+----------------------+---------------------------------+ | JERRY J. SHEKHEL | POLYGEN CORPORATION | When I was young, I had to walk | | Drummers do it... | Waltham, MA USA | to school and back every day -- | | ... In rhythm! | (617) 890-2175 | 20 miles, uphill both ways. | +-------------------+----------------------+---------------------------------+ | ...! [ princeton mit-eddie bu sunne ] !polygen!jerry | | jerry@polygen.com | +----------------------------------------------------------------------------+
philip@utstat.uucp (Philip McDunnough) (05/04/91)
In article <8992@crash.cts.com> psonnek@pro-mansion.cts.com (Patrick Sonnek) writes: [quotes ongoing discussion of HLL's vs assembler and an opinion on the need for an application to run faster if it runs well] >I don't consider speed to be pointless. My time is valuable. I don't want to >have to wait around for inefficient code. I think you will find my opinions >about speed to be consistant with the users (as opposed to programmers) out >there in the world. I consider myself a user, and I suppose my time is valuable. This endless search for more speed certainly doesn't make me want to jump up and down. I'd rather see elegance, beauty, thought,etc...put into problems than lines of programmers/users churning out endless code and speadsheets. I'm not saying that there isn't a time and place for fast computers. I am saying it's relatively unimportant in the larger scheme of things. There are more important issues involved with computing, from a uer's point of view at least, than just getting anything to run faster. Philip McDunnough University of Toronto [my opinions,etc...]
rhyde@ucrmath.ucr.edu (randy hyde) (05/06/91)
>>>>
What a joke. If you want the application to run on your computer AT ALL,
you'd better care about portability.
<<<<
I write lots of device drivers for PCs. I can promise you they are not
portable. They work fine on my computer. I don't care if they're
portable to a Mac or an Apple II.
There is a big problem with portable "Applications." In general, they have
to be written to the lowest common denominator. I'm sorry, but people have
been arguing this once one microcomputers since the TRS-80 vs. Apple II days.
The non-portable products always won out in the marketplace. Even Word Perfect
(who has ported their application to more diverse systems than anyone else
I know of) doesn't write portable software. They've written specific
versions for each platform.
Most people won't spend a lot of money on application which barely run on
a machine (i.e., AT ALL). They want those applications to take advantage of
the features of the machine. Look at how many people around here who are
complaining because software vendors only support generic Apple II computers
and don't support the gee whiz features of the GS. Hey, it's portable if it
runs on the II. That doesn't make GS owners very happy. They'll buy such
products grudingly, if at all.
*** RAndy Hyde
dat33228@uxa.cso.uiuc.edu (Derek A. Taubert) (05/07/91)
In article <1074@stewart.UUCP> jerry@stewart.UUCP (Jerry Shekhel) writes: >psonnek@pro-mansion.cts.com (Patrick Sonnek) writes: >> >>This is a good thing?!? As has been stated before by someone else (Forgive >>me, I've forgotten who.) I don't care if it's portable, I want the >>application to run on MY computer FAST! >> > >What a joke. If you want the application to run on your computer AT ALL, >you'd better care about portability. Excuse me for returning the same arrogance that you just so wonderfully displayed, but do you even know what portability means? -- + Derek Taubert --> derek@mrcnext.cso.uiuc.edu + Author of : GScii+ + + dat33228@uxa.cso.uiuc.edu + and the world's most useless + ++++++++++++++++++++++++++++++++++++++++++++++++ desk accessory -> Amaze me + + There are MOUSE technotes? + *******8-) ++++++++++++++++++++++++++++++++ + Ask me about my GS load meter + ^^^^^^^^^^ Marge Simpson +