rhyde@ucrmath.ucr.edu (randy hyde) (04/20/91)
I finally got my act together and I'm submitting some examples of CRC-16/
CCITT algorithms. For a really good discussion of the CRC-16/CCITT algo-
rithm, please see "The C Programmer's Guide to Serial Communications" by
Campbell. This is a great book and should be in the hands of anyone who
needs to deal with any form of serial communications, regardless of the
language they happen to program in.
A *brief* description of CRC-16/CCITT: There are many CRC (cyclic
redundancy check) algorithms lying around. For example, someone posted
the CRC algorithm used by MINIX on the net here; it is not the same as
the standard CRC-16/CCITT algorithm (I cannot verify this since I do not
yet have Minix sources, so the post could have been wrong). CRC algorithms
are used like checksums and parity-- they verify the correctness of a
block of data. The CRC-16/CCITT algorithm has many advantages over
checksums and parity checks, but a full comparison is beyond the scope
of this article. See the reference above for more details. Also, the
Aug/Sep 1990 C Users Journal had several articles about CRCs and other
data-checking algorithms dealing with serial communications.
The CRC-16/CCITT algorithm basically computes the remainder of an n-bit
number divided by the polynomial value 0x11021 (conventionally written as
0x1021, with the implicit leading one dropped) using modulo-2 arithmetic,
in which addition and subtraction are both carry-free XOR. This
algorithm was chosen because it is fairly reliable at detecting all sorts
of errors *and* (most importantly) it is very easy to implement in
hardware. CRCs were originally designed to verify data coming off a
disk drive. Since the data comes off the disk at a rapid rate, software
(at the time) was unable to keep up with it, so the CRC had to be
computed in hardware. A modulo-2 division by 0x11021 is easy to
implement using a couple of shift registers and XOR gates.
Therefore, a CRC-16 algorithm is essentially a long division algorithm
which uses XOR rather than subtraction at each step in the division. It
throws away the quotient and keeps the remainder. There is actually a
little more to it than this, but that's close enough for now.
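To make the "long division with XOR" idea concrete, here is a small C sketch (mine, not part of the original description) that divides a single byte, shifted up 16 bits, by the 17-bit divisor 0x11021 and keeps the 16-bit remainder; that remainder is exactly the byte's CRC-16/CCITT value starting from a crc of zero:

```c
#include <assert.h>

/* Mod-2 long division: divide (byte << 16) by the 17-bit divisor
 * 0x11021, using XOR in place of subtraction.  The 16-bit remainder
 * is the CRC-16/CCITT of that one byte (initial crc = 0). */
unsigned mod2_remainder(unsigned byte)
{
    unsigned long v = (unsigned long)byte << 16;  /* 24-bit dividend */
    unsigned long divisor = 0x11021UL << 7;       /* align divisor's top
                                                     bit with bit 23   */
    int i;

    for (i = 0; i < 8; ++i) {
        if (v & (0x800000UL >> i))   /* leading dividend bit still set? */
            v ^= divisor;            /* "subtract" the divisor (XOR)    */
        divisor >>= 1;               /* step to the next bit position   */
    }
    return (unsigned)(v & 0xffffUL); /* the remainder is the CRC        */
}
```

For example, mod2_remainder(0x01) yields 0x1021, the polynomial itself, which is the same answer the shift-and-xor loop described next produces.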
To compute the CRC-16/CCITT value (I'll just call it CRC from here on out)
you initialize the crc value to zero (many modern algorithms initialize it
to 0xffff instead, for some very good reasons). You then grab each byte
from the message and, one bit at a time, xor the H.O. bit of the data
with the H.O. bit of the current crc. If the result is zero, you
simply shift the crc (16 bits) and the data (8 bits) one position to the
left. If the XOR of the H.O. bits is one, you still shift the data byte
one position to the left; the crc value, however, you must shift and then
xor with 0x1021 (the CCITT "polynomial"). You repeat this for all eight
bits and then repeat the process for the entire byte stream of the message.
The following C(++) code does this:
#include <stdio.h>
int crc(int data, int curcrc)
{
static int i;
for (i=0; i<8; ++i)
{
if (((data << 8) ^ curcrc) & 0x8000)
curcrc = (curcrc << 1) ^ 0x1021;
else curcrc <<= 1;
data <<= 1;
}
return curcrc;
}
int main()
{
int i,j,c;
for (i=0; i < 10000; ++i)
{
c = 0;
for (j=0; j < 256; ++j)
c = crc(j,c);
}
puts("done");
}
BTW, there is a minor bug (affecting portability) in this code. Can you
spot it? I left this bug in to make the above code much easier to read.
I will present a version of this code in a moment with the bug removed.
Running time for this algorithm is 3:39 on a 12MHz 80286 (I had my
laptop with me at Cal Poly yesterday while working on this; I apologize
for not doing this on a GS, but my GS isn't exactly portable). Note: I
used Borland's BC (C++) compiler, emitting 8086 code with optimizations
turned on (although I doubt optimization would make any difference on
such a small, simple program).
Implemented in (8086) assembly language, the above program looks like the
following:
include stdlib.a
;****************************************************************************
;
;
;
;
;
cseg segment para public 'code'
assume cs:cseg, ds:dseg
;
;
public PSP
PSP dw ? ;Must be in CODE segment!
;
;
; Some useful constants:
;
cr equ 13
lf equ 10
eos equ 0
;
true equ 1
false equ 0
;
;
; Main is the main program. Program execution always begins here.
;
Main proc
mov cs:PSP, es ;Save pgm seg prefix
mov ax, seg dseg ;Set up the segment registers
mov ds, ax
mov es, ax
mov dx, 0 ;Allocate all available RAM.
MemInit
;
;
mov cx, 10000
crclp1: xor dx, dx ;Set DX to zero.
mov ax, dx
crclp2: call crc
inc al
jne crclp2
dec cx
jne crclp1
print
db "done",0
;
;
;
Quit: mov ah, 4ch
int 21h
;
Main endp
;
;
; CRC routine:
;
; Entry:
; DX contains current crc value
; AL contains new character
;
; Exit:
; DX contains new crc value
;
crc proc near
push cx
push ax
mov cx, 8 ;Repeat once for each bit.
BitLoop: mov ah, al ;Get a copy of data value.
xor ah, dh ;This clears the carry!
jns NoCCITT ;Skip ahead if XOR of H.O. bits = 0
xor dx, 0810h ;1021h shr 1
stc ;Underflow from 1021 shr 1
;
; The following rcl is just a SHL if we branched from above (since the XOR
; instruction clears the carry flag). It is an SHL which shifts in a 1
; if we fall through from above. Shifting in a one supplies the L.O. bit
; of 1021h which we XOR'd in above.
;
NoCCITT: rcl dx, 1
shl al, 1 ;Go to the next bit in data.
loop BitLoop
pop ax
pop cx
ret
crc endp
;
;
cseg ends
;
;
; Allocate a reasonable amount of space for the stack (2k).
;
sseg segment para stack 'stack'
stk db 256 dup ("stack ")
sseg ends
;
;
end Main
Again, I apologize for 8086 assembly, it's all I had available at the
time. Running time for this code on the same machine as above: 1:19.
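As a sanity check on the rcl trick described in the comments of that listing, the identity it relies on can be verified exhaustively in C (this checker is mine, not part of the original post):

```c
#include <assert.h>

/* The trick: XOR with 0810h (= 1021h shr 1) *before* the shift, then
 * shift a 1 into the low bit -- versus the straightforward shift
 * followed by an XOR with 1021h. */
unsigned via_trick(unsigned crc)
{
    return (((crc ^ 0x0810) << 1) | 1) & 0xffff;
}

unsigned straightforward(unsigned crc)
{
    return ((crc << 1) ^ 0x1021) & 0xffff;
}
```

The two agree for every 16-bit value, since the XOR commutes with the shift and the shifted-in 1 supplies the low bit of 1021h.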
The above are the simple and easy to understand versions of the CRC
algorithm. Alas, they are very slow. Keep in mind, the CRC algorithm
may need to keep up with a fast serial line (or some other fast device).
We certainly don't want to lose characters because of CRC computation.
Before jumping into major optimizations, let's consider a very common
optimization that just about any assembly language programmer would
think of and good "C" programmers would think of: unravelling (unrolling)
the loops. The loop in the crc function can be easily unrolled since it
executes exactly eight times each time you call it. Since the function
is rather simple, why bother with the function call? Why not include
the code in-line and dispense with the overhead of the call? The following
code implements this:
#include <stdio.h>
#define crc18(data, curcrc) if (((data) ^ (curcrc)) & 0x8000) \
curcrc = (((curcrc) << 1) ^ 0x1021) & 0xffff; \
else curcrc = ((curcrc) << 1) & 0xffff;\
data = ((data) << 1) & 0xff00; /* data stays pre-shifted into the H.O. byte */
int crc(int data, int curcrc)
{
static int i;
data <<= 8;
for (i=0; i<8; ++i)
{
if ((data ^ curcrc) & 0x8000)
curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff;
else curcrc = (curcrc << 1) & 0xffff;
data = (data << 1) & 0xff00;
}
return curcrc;
}
int main()
{
int i,j,k,c;
for (i=0; i < 10000; ++i)
{
c = 0;
for (j=0; j < 256; ++j)
c = crc(j,c);
}
puts("done, press return for macro version");
c = getchar();
for (i=0; i < 10000; ++i)
{
c = 0;
for (j=0; j < 256; ++j)
{
k = j << 8; /* pre-shift the data byte into the H.O. byte for crc18 */
crc18(k,c);
crc18(k,c);
crc18(k,c);
crc18(k,c);
crc18(k,c);
crc18(k,c);
crc18(k,c);
crc18(k,c);
}
}
puts("done");
}
This code implements *both* the function-call version and the unrolled
version. If you didn't notice the bug I corrected: shorts and ints in C
have no guaranteed size other than being at least 16 bits long (ANSI C).
Since you have no guarantee that the type is exactly 16 bits, the crc can
pick up garbage above bit 15, which is why we need the AND with 0xffff
above. If your C compiler has 16-bit ints, I would hope that it would
optimize out this operation.
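For what it's worth, on a compiler that provides exact-width integer types (C99's <stdint.h>; an assumption, since no such header existed when this was posted), the masking problem disappears entirely because the type itself wraps at 16 bits. A sketch:

```c
#include <stdint.h>

/* Bit-at-a-time CRC-16/CCITT update for one byte.  Because uint16_t
 * is exactly 16 bits wide, no & 0xffff masks are needed; the
 * arithmetic cannot pick up garbage above bit 15.  Folding the data
 * byte in first and then testing the crc's H.O. bit is equivalent to
 * XORing the two H.O. bits at each step. */
uint16_t crc16_ccitt_byte(uint16_t crc, unsigned char data)
{
    int i;

    crc ^= (uint16_t)(data << 8);      /* fold data into the H.O. byte */
    for (i = 0; i < 8; ++i) {
        if (crc & 0x8000)              /* H.O. bit set?                */
            crc = (uint16_t)((crc << 1) ^ 0x1021);
        else
            crc = (uint16_t)(crc << 1);
    }
    return crc;
}
```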
Running time for this code: 2:02 (at least for the in-line version).
A substantial improvement.
Consider the same code in 8086 assembly:
include stdlib.a
;****************************************************************************
;
Crc18 macro
local NoCCITT
mov ah, al
xor ah, dh ;This clears the carry!
jns NoCCITT
xor dx, 0810h ;1021h shr 1
stc ;Underflow from 1021 shr 1
NoCCITT: rcl dx, 1
shl al, 1
endm
;
;
;
cseg segment para public 'code'
assume cs:cseg, ds:dseg
;
;
public PSP
PSP dw ? ;Must be in CODE segment!
;
;
; Some useful constants:
;
cr equ 13
lf equ 10
eos equ 0
;
true equ 1
false equ 0
;
;
; Main is the main program. Program execution always begins here.
;
Main proc
mov cs:PSP, es ;Save pgm seg prefix
mov ax, seg dseg ;Set up the segment registers
mov ds, ax
mov es, ax
mov dx, 0 ;Allocate all available RAM.
MemInit
;
;
mov cx, 10000
crclp1: xor dx, dx ;Set DX to zero.
mov ax, dx
crclp2: push ax
crc18
crc18
crc18
crc18
crc18
crc18
crc18
crc18
pop ax
inc al
jne crclp2
dec cx
jne crclp1
print
db "done",0
;
;
;
Quit: mov ah, 4ch
int 21h
;
Main endp
;
;
;
;
cseg ends
;
;
; Allocate a reasonable amount of space for the stack (2k).
;
sseg segment para stack 'stack'
stk db 256 dup ("stack ")
sseg ends
;
;
;
end Main
Running time for this code: 0:48. Again a substantial improvement.
I would like to point out that the above algorithms are about as far
as you can take the CRC-16/CCITT algorithm and still have code that
is easily recognizable for what it does. Additional optimizations will
destroy the readability of the code. Nonetheless, even 48 seconds may
be a little too slow for someone wanting to do 2,560,000 CRC computations.
Let's explore the next step.
Any assembly language programmer worth his salt would immediately suggest
"table look-up" as a means to optimize the program. A good C programmer
would also make this suggestion (although you see this technique used
*far* less often in C than in assembly). The code Doug posted used this
technique to speed things up (which is why the C code was very difficult
to follow). A modified version of that code (modified so that the calling
sequence is similar to the above routines) follows:
#include <iostream.h>
unsigned crctab[2][16] = {
{0x0000, 0xc0c1, 0xc181, 0x0140, 0xc301, 0x03c0, 0x0280, 0xc241,
0xc601, 0x06c0, 0x0780, 0xc741, 0x0500, 0xc5c1, 0xc481, 0x0440},
{0x0000, 0xcc01, 0xd801, 0x1400, 0xf001, 0x3c00, 0x2800, 0xe401,
0xa001, 0x6c00, 0x7800, 0xb401, 0x5000, 0x9c01, 0x8801, 0x4400}};
int crc(int data, int curcrc)
{
static int temp;
temp = curcrc ^ data;
return crctab[0][temp & 0xf] ^ crctab[1][(temp >> 4) & 0xf] ^
(curcrc >> 8);
}
int main()
{
int i,j,c;
for (i=0; i < 10000; ++i)
{
c = 0;
for (j=0; j < 256; ++j)
c = crc(j,c);
}
cout << "done";
}
This code is an interesting compromise between space and speed. Rather
than use a 256 entry lookup table indexed by each byte of data, this
algorithm uses the two nibbles of each data byte as an index into two
halves of a 2D matrix (one side for L.O. nibble, one side for H.O. nibble).
Someone had to think long and hard to figure this one out, but more on
that later.
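Some of that thinking can be reconstructed, though. These particular table values belong to the bit-reversed CRC-16 polynomial 0xA001 (the "CRC-16" used by ARC and friends) rather than to CCITT's 0x1021 -- an observation of mine, not something stated in the post -- and each 16-entry table is just the one-byte CRC of a lone nibble. A sketch of how the tables could be generated (the function names are mine):

```c
#include <assert.h>

/* One-byte update for the reflected CRC-16 polynomial 0xA001 (the
 * polynomial these tables encode, not CCITT's 0x1021). */
unsigned crc16_update(unsigned crc, unsigned byte)
{
    int i;

    crc ^= byte & 0xff;
    for (i = 0; i < 8; ++i)
        crc = (crc & 1) ? ((crc >> 1) ^ 0xA001) : (crc >> 1);
    return crc & 0xffff;
}

/* lo[n] is the CRC of the byte n (low nibble alone); hi[n] is the CRC
 * of the byte n << 4.  Because CRCs are linear under XOR, the CRC of
 * any byte b is lo[b & 0xf] ^ hi[b >> 4]. */
void build_nibble_tables(unsigned lo[16], unsigned hi[16])
{
    int n;

    for (n = 0; n < 16; ++n) {
        lo[n] = crc16_update(0, (unsigned)n);
        hi[n] = crc16_update(0, (unsigned)n << 4);
    }
}
```

The generated lo[] and hi[] rows match crctab[0] and crctab[1] above entry for entry.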
Running time: 0:47. A *very* substantial improvement over the
bit-at-a-time algorithm.
A couple of comments about the above algorithm: the use of a 2D array is
weird. On a compiler which doesn't optimize things too well, this would
cost you dearly (since it would require a multiply and an add to compute
the address of the selected array element). Why the original author didn't
use two separate 1D arrays escapes me. This is probably a case of not
thinking about the underlying data structure (i.e., the author should have
known more assembly), or of not caring about performance across
architectures and compilers.
I cannot imagine an assembly language programmer attacking this problem in
this fashion, so I didn't bother implementing it. Most assembly
language programmers would go directly to a 256-element look-up table and
use the data byte as an index into that table. That's what the next
section of code does.
If you're willing to live with a 256-element array rather than a 32-element
array, the following C code is even better:
#include <iostream.h>
unsigned crctab[256] = {
0x0000,
0x1021,0x2042,0x3063,0x4084,0x50A5,0x60C6,0x70E7,0x8108,
0x9129,0xA14A,0xB16B,0xC18C,0xD1AD,0xE1CE,0xF1EF,0x1231,
0x0210,0x3273,0x2252,0x52B5,0x4294,0x72F7,0x62D6,0x9339,
0x8318,0xB37B,0xA35A,0xD3BD,0xC39C,0xF3FF,0xE3DE,0x2462,
0x3443,0x0420,0x1401,0x64E6,0x74C7,0x44A4,0x5485,0xA56A,
0xB54B,0x8528,0x9509,0xE5EE,0xF5CF,0xC5AC,0xD58D,0x3653,
0x2672,0x1611,0x0630,0x76D7,0x66F6,0x5695,0x46B4,0xB75B,
0xA77A,0x9719,0x8738,0xF7DF,0xE7FE,0xD79D,0xC7BC,0x48C4,
0x58E5,0x6886,0x78A7,0x0840,0x1861,0x2802,0x3823,0xC9CC,
0xD9ED,0xE98E,0xF9AF,0x8948,0x9969,0xA90A,0xB92B,0x5AF5,
0x4AD4,0x7AB7,0x6A96,0x1A71,0x0A50,0x3A33,0x2A12,0xDBFD,
0xCBDC,0xFBBF,0xEB9E,0x9B79,0x8B58,0xBB3B,0xAB1A,0x6CA6,
0x7C87,0x4CE4,0x5CC5,0x2C22,0x3C03,0x0C60,0x1C41,0xEDAE,
0xFD8F,0xCDEC,0xDDCD,0xAD2A,0xBD0B,0x8D68,0x9D49,0x7E97,
0x6EB6,0x5ED5,0x4EF4,0x3E13,0x2E32,0x1E51,0x0E70,0xFF9F,
0xEFBE,0xDFDD,0xCFFC,0xBF1B,0xAF3A,0x9F59,0x8F78,0x9188,
0x81A9,0xB1CA,0xA1EB,0xD10C,0xC12D,0xF14E,0xE16F,0x1080,
0x00A1,0x30C2,0x20E3,0x5004,0x4025,0x7046,0x6067,0x83B9,
0x9398,0xA3FB,0xB3DA,0xC33D,0xD31C,0xE37F,0xF35E,0x02B1,
0x1290,0x22F3,0x32D2,0x4235,0x5214,0x6277,0x7256,0xB5EA,
0xA5CB,0x95A8,0x8589,0xF56E,0xE54F,0xD52C,0xC50D,0x34E2,
0x24C3,0x14A0,0x0481,0x7466,0x6447,0x5424,0x4405,0xA7DB,
0xB7FA,0x8799,0x97B8,0xE75F,0xF77E,0xC71D,0xD73C,0x26D3,
0x36F2,0x0691,0x16B0,0x6657,0x7676,0x4615,0x5634,0xD94C,
0xC96D,0xF90E,0xE92F,0x99C8,0x89E9,0xB98A,0xA9AB,0x5844,
0x4865,0x7806,0x6827,0x18C0,0x08E1,0x3882,0x28A3,0xCB7D,
0xDB5C,0xEB3F,0xFB1E,0x8BF9,0x9BD8,0xABBB,0xBB9A,0x4A75,
0x5A54,0x6A37,0x7A16,0x0AF1,0x1AD0,0x2AB3,0x3A92,0xFD2E,
0xED0F,0xDD6C,0xCD4D,0xBDAA,0xAD8B,0x9DE8,0x8DC9,0x7C26,
0x6C07,0x5C64,0x4C45,0x3CA2,0x2C83,0x1CE0,0x0CC1,0xEF1F,
0xFF3E,0xCF5D,0xDF7C,0xAF9B,0xBFBA,0x8FD9,0x9FF8,0x6E17,
0x7E36,0x4E55,0x5E74,0x2E93,0x3EB2,0x0ED1,0x1EF0};
int crc(int data, int curcrc)
{
static int comb_val;
comb_val = ((curcrc >> 8) ^ data) & 0xff;
return ((curcrc << 8) & 0xffff) ^ crctab[comb_val];
}
int main()
{
int i,j,c;
for (i=0; i < 10000; ++i)
{
c = 0;
for (j=0; j < 256; ++j)
c = crc(j,c);
}
cout << "done";
}
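Incidentally, nobody has to type in (or proofread) those 256 entries; the table can be generated at start-up from the bit-at-a-time routine itself. A sketch (the names here are mine, not from the posted code):

```c
#include <assert.h>

unsigned gen_crctab[256];   /* generated CRC-16/CCITT table */

/* Each entry is the CRC of one byte, computed bit-at-a-time from an
 * initial crc of zero -- exactly the slow algorithm shown earlier. */
void build_crctab(void)
{
    unsigned b, r;
    int i;

    for (b = 0; b < 256; ++b) {
        r = b << 8;                       /* byte in the H.O. position */
        for (i = 0; i < 8; ++i)
            r = (r & 0x8000) ? (((r << 1) ^ 0x1021) & 0xffff)
                             : ((r << 1) & 0xffff);
        gen_crctab[b] = r;
    }
}
```

The generated table matches the listing above (entry 1 is 0x1021, entry 255 is 0x1EF0, and so on).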
Running time: 0:39; again, a substantial improvement. Now look at the
same code in 8086 assembly:
include stdlib.a
;****************************************************************************
;
;
; Global variables go here:
;
dseg segment para public 'data'
;
.radix 16
crctbl dw 00000
dw 01021,02042,03063,04084,050A5,060C6,070E7,08108
dw 09129,0A14A,0B16B,0C18C,0D1AD,0E1CE,0F1EF,01231
dw 00210,03273,02252,052B5,04294,072F7,062D6,09339
dw 08318,0B37B,0A35A,0D3BD,0C39C,0F3FF,0E3DE,02462
dw 03443,00420,01401,064E6,074C7,044A4,05485,0A56A
dw 0B54B,08528,09509,0E5EE,0F5CF,0C5AC,0D58D,03653
dw 02672,01611,00630,076D7,066F6,05695,046B4,0B75B
dw 0A77A,09719,08738,0F7DF,0E7FE,0D79D,0C7BC,048C4
dw 058E5,06886,078A7,00840,01861,02802,03823,0C9CC
dw 0D9ED,0E98E,0F9AF,08948,09969,0A90A,0B92B,05AF5
dw 04AD4,07AB7,06A96,01A71,00A50,03A33,02A12,0DBFD
dw 0CBDC,0FBBF,0EB9E,09B79,08B58,0BB3B,0AB1A,06CA6
dw 07C87,04CE4,05CC5,02C22,03C03,00C60,01C41,0EDAE
dw 0FD8F,0CDEC,0DDCD,0AD2A,0BD0B,08D68,09D49,07E97
dw 06EB6,05ED5,04EF4,03E13,02E32,01E51,00E70,0FF9F
dw 0EFBE,0DFDD,0CFFC,0BF1B,0AF3A,09F59,08F78,09188
dw 081A9,0B1CA,0A1EB,0D10C,0C12D,0F14E,0E16F,01080
dw 000A1,030C2,020E3,05004,04025,07046,06067,083B9
dw 09398,0A3FB,0B3DA,0C33D,0D31C,0E37F,0F35E,002B1
dw 01290,022F3,032D2,04235,05214,06277,07256,0B5EA
dw 0A5CB,095A8,08589,0F56E,0E54F,0D52C,0C50D,034E2
dw 024C3,014A0,00481,07466,06447,05424,04405,0A7DB
dw 0B7FA,08799,097B8,0E75F,0F77E,0C71D,0D73C,026D3
dw 036F2,00691,016B0,06657,07676,04615,05634,0D94C
dw 0C96D,0F90E,0E92F,099C8,089E9,0B98A,0A9AB,05844
dw 04865,07806,06827,018C0,008E1,03882,028A3,0CB7D
dw 0DB5C,0EB3F,0FB1E,08BF9,09BD8,0ABBB,0BB9A,04A75
dw 05A54,06A37,07A16,00AF1,01AD0,02AB3,03A92,0FD2E
dw 0ED0F,0DD6C,0CD4D,0BDAA,0AD8B,09DE8,08DC9,07C26
dw 06C07,05C64,04C45,03CA2,02C83,01CE0,00CC1,0EF1F
dw 0FF3E,0CF5D,0DF7C,0AF9B,0BFBA,08FD9,09FF8,06E17
dw 07E36,04E55,05E74,02E93,03EB2,00ED1,01EF0;
;
.radix 10
dseg ends
;
;
;
;
cseg segment para public 'code'
assume cs:cseg, ds:dseg
;
;
;
public PSP
PSP dw ? ;Must be in CODE segment!
;
;
;
Main proc
mov cs:PSP, es ;Save pgm seg prefix
mov ax, seg dseg ;Set up the segment registers
mov ds, ax
mov es, ax
;
mov cx, 10000
crclp1: xor dx, dx ;Set DX to zero.
mov ax, dx
crclp2: call crc
inc al
jne crclp2
dec cx
jne crclp1
print
db "done",0
;
;
Quit: mov ah, 4ch
int 21h
;
Main endp
;
;
; CRC update routine.
;
; On Entry:
; AL- New data value
; DX- Current CRC value
;
; On Exit:
; DX- New CRC value
;
crc proc near
mov bl, dh
xor bl, al
mov bh, 0
mov dh, dl
mov dl, bh
shl bx, 1
xor dx, crctbl[bx]
ret
crc endp
;
;
;
cseg ends
;
;
; Allocate a reasonable amount of space for the stack (2k).
;
sseg segment para stack 'stack'
stk db 256 dup ("stack ")
sseg ends
;
;
end Main
Running time: 0:17.
I could probably get this down to 0:15 or 0:16 if I coded the CRC
calculation in-line rather than as a procedure/function.
Assembly language typically works out a little better than twice as
fast as C. The only surprise here is that it's actually that good: for
such simple, short routines, ASM's advantage over C usually isn't this
large. For a CRC algorithm, speed is very important, so ASM is, perhaps,
a better choice.
But that wasn't the purpose of doing all this. Earlier (on the net), I'd
made the statement that for a CRC algorithm, ASM isn't particularly more
difficult to understand than C (and this is true, in general, for most
bit-pushing operations). Now I'm assuming we're talking about expert
C programmers looking at C code and expert assembly language programmers
looking at ASM code. It isn't fair to take an expert C programmer who is
a mediocre ASM programmer and have that same person evaluate C and ASM
code.
I claim the ASM code above isn't any more difficult to understand than the
C code (assuming, of course, that you know the algorithm that the code
is attempting to implement).
In Doug's original post, he claimed that the ASM code wasn't any more
readable than the C code. The reasons are two-fold: (1) the code used
a truly miserable algorithm which was next to impossible to understand
in the first place (*NO* language is going to make that algorithm
particularly clear), (2) the ASM code looked like the output of a
compiler. I would be ashamed of myself if I wrote assembly code that
looked like that. Clearly, compiler-generated assembly code is *not*
going to be easier to understand than hand-written code. My apologies
if I'm offending an actual human author of that code.
One final point: note that my minor optimizations in assembly language
(unrolling the loop) ran at the same speed as the nibble look-up-table
optimized code. For the same speed, I claim my assembly code is *much*
easier to understand than the optimized C code. From a development
point of view, it would take far more time to dream up, implement, and
verify the nibble-based C code that Doug posted than it would take to
implement the unrolled version of the original CRC algorithm in assembly.
I offer this as proof that assembly is often the better choice for a
project. Sure, you can find better algorithms in C which run faster than
bad algorithms in ASM, but the end result is that you spend *far* more
time designing, testing, and debugging the C code. And if you can find
a better algorithm that works in C, that same algorithm will work in
assembly as well.
Keep in mind that the CRC algorithm is intrinsically O(n). You're not
going to do better than that, because it takes n operations to read
the data. Therefore, the only thing we can attack is the constant
factor in the algorithm's running time. ASM is really good at this.
*** Randy Hyde
gwyn@smoke.brl.mil (Doug Gwyn) (04/21/91)
In article <13688@ucrmath.ucr.edu> rhyde@ucrmath.ucr.edu (randy hyde) writes:
>Assembly language typically works out a little better than twice as
>fast.
But the comparison was not really fair, since the assembly-language example
implementation did not support the same linkage as the C code. For a
function that performs so little work, function linkage overhead is an
important factor.
Also, the example C code was not optimal for the given algorithm. By
making a few minor tweaks to it I reduced the C example run time by 26%
on the host on which I read net news. Profiling the code showed that
approximately as much time was being spent in the main program's loop
overhead as in the crc function itself. For the revised crc function C
source code:

int crc(data, curcrc)
int data;
register unsigned int curcrc;
{
return (curcrc << 8) ^ crctab[(curcrc >> 8) ^ data];
}

a fairly ordinary old-technology C compiler on a Gould PowerNode produced
the following machine code:

_crc: enter 8w,0x5          -- function linkage
      movw  9w+8w[b2],r1    -- pick up "curcrc" argument
      movw  r1,r0           -- make a second copy of it
      lslw  #8,r0           -- shift one copy left (accumulator)
      lsrw  #8,r1           -- shift the other copy right
      xorw  8w+8w[b2],r1    -- XOR shifted copy with "data" argument
      lslw  #2,r1           -- shift for indexing crctab[]
      xorw  _crctab[r1],r0  -- XOR table entry with accumulator
      retn                  -- function linkage

and a similar compiler on a DEC VAX produced the following machine code:

_crc: .word 0x800           ; function linkage
      movl  8(ap),r11       ; pick up "curcrc" argument
      extzv $8,$24,r11,r0   ; shift one copy right (table index)
      xorl2 4(ap),r0        ; XOR shifted copy with "data" argument
      ashl  $8,r11,r1       ; shift the other copy left (accumulator)
      xorl3 r1,_crctab[r0],r0 ; XOR table entry with accumulator
      ret                   ; function linkage

Note that there isn't very much that one could do even by hand-coded
assembly language to perform this computation much more efficiently.
(Except that the VAX actually has microcode support for a CRC instruction!
That is an example of a situation in which use of assembly language could
be justified, since it is somewhat rare for a C compiler to exploit such
specialized instructions. Another possible example would be in a
bottleneck loop where testing the "carry bit" would be much more efficient
than trying to do without it; originally I thought that was going to be
the justification for coding the CRC algorithm in assembler.)
>I claim the ASM code above isn't any more difficult to understand than the
>C code (assuming, of course, that you know the algorithm that the code
>is attempting to implement).
For such a small function, the conceptual burden is indeed manageable.
For a larger example involving linked data structures, etc., I would
prefer to deal with C code over assembly code.
>From a development point of view, it would take far more time to dream
>up, implement, and verify the nibble-based C code that Doug posted than
>it would take to implement the unrolled version of the original CRC
>algorithm in assembly.
But that's comparing apples and oranges. For the same algorithm with the
same constraints (particularly function linkage), I don't think you've
shown that assembly language code is appreciably faster than C code,
taking into account my earlier observations. In fact, were I to have to
use the example CRC-16 C code in an actual application, it wouldn't take
me long to identify the bottleneck caused by making a function call for
every byte of checksummed data, and change the code to eliminate that
overhead.
>I offer this as proof that assembly is often the better choice for a
>project. Sure, you can find better algorithms in C which run faster than
>bad algorithms in ASM, but the end result is that you spend *far* more
>time designing, testing, and debugging the C code. And if you can find
>a better algorithm that works in C, that same algorithm will work in
>assembly as well.
The C code, if you do it right, is immediately usable on a variety of
machines, and the development cost of the algorithm is independent of the
implementation language. Implementation of the algorithm in C is no
harder than in assembly language, and for complex algorithms it is very
nice to be able to rely on the compiler for help with the bookkeeping.
Why do you think high-level languages were invented?
>Keep in mind, that the CRC algorithm is intrinsically O(n). You're not
>going to do any better than this because it takes n operations to read
>the data. Therefore, the only thing we can attack is the constant
>in the complexity of this algorithm. ASM is really good at this.
And if you look at the above C compiler output, you'll see that C is also
really good at this, at least on decent machine architectures. As I have
previously remarked, there are legitimate uses for assembler, but since C
is about as fast, infinitely more portable, and easier to use for large
projects, I recommend C over assembler in the majority of situations.
mdavis@pro-sol.cts.com (Morgan Davis) (04/22/91)
In-Reply-To: message from rhyde@ucrmath.ucr.edu
Excellent article on CRC's, Randy. This is one to tuck away in the
programmer's notes for sure.
I went through this kind of wracking to find a more optimal algorithm. The
table method is, by far, the best when it comes to speed. Sometimes,
though, just a quick and dirty (or maybe that should be "small and
dirty") variation like the following will do:
typedef unsigned int word;  /* assumed to be 16 bits in the original */
extern word CRC;            /* the global crc accumulator */
CalcCRC (ptr, count)
char *ptr;
word count;
{
register word x;
do {
CRC ^= *ptr++ << 8; /* XOR hi-byte with data */
for (x = 8; x; --x)
if (CRC & 0x8000)
CRC = CRC << 1 ^ 0x1021;
else
CRC <<= 1;
} while (--count);
}
This function operates on a global named CRC. You call it by pointing it
at a data buffer and passing the size of the buffer. After calling CalcCRC(),
the CRC global integer is updated accordingly.
The optimizations here that differ from the original, frequently published
version are that it:
o Counts from 8 bits down to zero, so the loop
test simply involves a test for zero instead of
comparing against 8.
o Bit 15 of the CRC is tested with the bitwise AND
operation instead of shifting it around to determine
if the CRC needs the XOR treatment.
o It operates on a buffer, rather than having to pass
it a byte of data to work on, one at a time.
These optimizations were done with the 65816 in mind, so they may not
make any difference on other processors, but I doubt they would hurt.
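For anyone who wants to drop the routine into a program, here is a self-contained sketch of the same loop (the word typedef, the 0xffff masks for machines with ints wider than 16 bits, and the test values are my additions, not Morgan's):

```c
#include <assert.h>

typedef unsigned int word;

word CRC;                    /* global accumulator, as in the post */

void CalcCRC(char *ptr, word count)
{
    word x;

    do {
        CRC ^= (word)(*ptr++ & 0xff) << 8;  /* XOR hi-byte with data */
        for (x = 8; x; --x)
            if (CRC & 0x8000)               /* bit 15 set?           */
                CRC = ((CRC << 1) ^ 0x1021) & 0xffff;
            else
                CRC = (CRC << 1) & 0xffff;
    } while (--count);
}
```

Seeded with zero and run over the string "123456789", it produces 0x31C3, the standard check value for this polynomial with a zero initial crc.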
BTW, the 65816 variant on the above is:
CalcCRC ldy #0 ;index into buffer
newbyte shortm ;go to 8-bits
lda [ptr],y ;grab some data
iny ;prepare for next byte
eor CRC+1 ;fix high-order byte of CRC
sta CRC+1
longm ;back to 16-bit
ldx #8 ;init shift counter
shift asl CRC ;always shift it once
bcc next ;if bit 15 clear, skip the XOR
lda CRC ;XOR CRC with $1021
eor #$1021
sta CRC
next dex
bne shift ;do this 8 times
dec count ;more bytes to do?
bne newbyte ;yes
rts
--Morgan
UUCP: crash!pro-sol!mdavis AOL, BIX: mdavis
ARPA: crash!pro-sol!mdavis@nosc.mil GEnie: m.davis42
INET: mdavis@pro-sol.cts.com ProLine: mdavis@pro-sol
toddpw@nntp-server.caltech.edu (Todd P. Whitesel) (04/22/91)
Morgan, if you can live with 512 bytes of CRC tables then you can do block
CRC's in something like 25 cycles per byte CRC'd...

      lda CRC
      xba
      sep #$30
crclp eor [data],y
      tax
      lda tblL,x
      xba
      eor tblH,x
      iny
      bne crclp
      rep #$30
      xba
      sta CRC

will do 256 bytes at a time, with a best case of 24 cycles/byte and a
worst of 27. (Aligning the DP saves 1, aligning tblH and tblL guarantees
4 cycles each.)
I know the table is generated by something simple like (somebody check me):

      tbl[byte] = updcrc(byte, 0);

where updcrc(byte, oldcrc) does not use the table, and tbl is split into
high and low byte arrays for the assembly fragment above.
Todd Whitesel
toddpw @ tybalt.caltech.edu
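In C, the split-table scheme Todd describes looks something like this (a sketch; updcrc here is the plain bit-at-a-time CCITT update he refers to, and all the names are mine):

```c
#include <assert.h>

static unsigned char tblH[256], tblL[256];  /* high and low table halves */

/* Bit-at-a-time CRC-16/CCITT update for one byte (no tables). */
static unsigned updcrc(unsigned byte, unsigned oldcrc)
{
    unsigned crc = (oldcrc ^ (byte << 8)) & 0xffff;
    int i;

    for (i = 0; i < 8; ++i)
        crc = (crc & 0x8000) ? (((crc << 1) ^ 0x1021) & 0xffff)
                             : ((crc << 1) & 0xffff);
    return crc;
}

/* tbl[byte] = updcrc(byte, 0), split into high and low byte arrays. */
void build_split_tables(void)
{
    unsigned b, t;

    for (b = 0; b < 256; ++b) {
        t = updcrc(b, 0);
        tblH[b] = (unsigned char)(t >> 8);
        tblL[b] = (unsigned char)(t & 0xff);
    }
}

/* Byte-wise update using the split tables, mirroring the 65816 code:
 * index = old H.O. byte XOR data; new H.O. byte = old L.O. byte XOR
 * tblH[index]; new L.O. byte = tblL[index]. */
unsigned crc_split(unsigned crc, unsigned data)
{
    unsigned x = ((crc >> 8) ^ data) & 0xff;

    return ((((crc & 0xff) ^ tblH[x]) << 8) | tblL[x]) & 0xffff;
}
```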
rhyde@gibson.ucr.edu (randy hyde) (04/23/91)
You keep claiming it isn't fair to compare an assembly language routine
to a C routine when the assembly language code doesn't support C linkages.
Why would someone writing their code in assembly language use such linkages?
Of course you would if you were writing a single routine to link into a big
C program. But I'm talking about someone writing the entire program in C.
Clearly a small program is easier to manage and understand, regardless of the
language. But the performance benefits for small programs side with HLLs as
well. In the CRC example I gave, the ASM program was only twice as fast as
the C version. My experience on the 8086 is that the larger the program, the
larger the performance difference. This gap is shrinking on newer processors
as companies like Motorola and Intel optimize their instruction sets for the
30% of the instructions that ***C and Pascal*** compilers use. It's still
true in general, though, that as the program gets larger, so does the speed
difference.
So far, my experience with well-written C programs and well-written ASM
programs is that they're both easy for me to understand (regardless of
size). The only problem is that very few people write well-written C or
ASM programs. Don't try to tell me it's easier to write well-written C
programs. My experience shows that either (1) it is not, or (2) C
programmers aren't as good as ASM programmers, because even though it's
supposedly easier, they don't write better C programs. Most C programs
are a mess. I don't want flames from people saying "*I* write easy to
read C programs." Anyone can post a single example. In general, few
people (including myself) write easy-to-read, understandable C programs
all the time.
I can't comment on which is worse, a poorly written C program or a
poorly written assembly language program. They're both really bad.
>> I recommend C... for the majority of projects...
Surprise! So do I. Commercial software products are the one area where
I draw the line, though. Most projects are not commercial (by my
definition, mass-market consumer) products. Commercial software is a
situation where performance is more important than portability and,
arguably, more important than maintainability.
Actually, I should revise that statement: I don't recommend C for
the majority of projects; I recommend an appropriate HLL for the job,
and that is not always C. Most projects I recommend languages for
are not in the class of commercial products.
*** Randy Hyde
gwyn@smoke.brl.mil (Doug Gwyn) (04/24/91)
In article <13769@ucrmath.ucr.edu> rhyde@gibson.ucr.edu (randy hyde) writes:
>You keep claiming it isn't fair to compare an assembly language code routine
>to a C routine when the assembly language code doesn't support C linkages.
>Why would someone writing their code in assembly language use such linkages?
>Of course if you were writing a single routine to link into a big C program.
>But I'm talking about someone writing the entire program in C.
The reason is that you need to impose identical constraints ("requirements")
on the task if you want to have a fair comparison. The C-generated machine
code that I posted shows that at least some C compilers generate fairly
decent instruction sequences for the algorithm itself. The timing for the
CRC-16 example was dominated by "overhead" time, not by the CRC-16
computation. For a fair comparison of different implementations of the
SAME task, the functional interface should be the same.
In fact, were I to have to use the example CRC-16 C code in an actual
application, it wouldn't take me long to identify the bottleneck caused by
making a function call for every byte of checksummed data, and change the
code to eliminate that overhead; in effect, when writing the assembly code
you had already performed a similar linkage optimization by passing data
in registers.
Almost all the assembly code I can lay my hands on around here is expected
to be used from a higher-level language, because we don't agree with the
notion that to gain any benefit from use of assembler it is necessary to
do EVERYthing in assembler. For large projects, that would be sheer folly.
Consequently, it is reasonable to expect that much code in assembler will
have to meet the constraints imposed by HLL interprocedural linkage and
other run-time conventions. If you were to tackle an entire project in
assembler, then you avoid any specific HLL's run-time conventions, but you
have to do similar bookkeeping somehow. I can attest from experience that
it is harder to get those details right when using assembly language; more
of the responsibility for bookkeeping must be assumed by the programmer
(speaking generally) when using assembler -- which is one of the big
reasons for using a higher-level language.
>... as the program gets larger, so does the speed difference.
This directly contradicts the observations of professional software
engineers dating as far back as the 1960s. Indeed, it is taken as a
virtual axiom these days that most of the execution time is spent in a
minority of the code. Because most applications have one or more such
identifiable bottlenecks in a small portion of the code (often in just a
few lines of the HLL source code), even an infinite speedup in the
majority of the code would have little effect on the overall application
speed. Indeed, use of assembly language to speed up genuine bottlenecks
is one of the few uses of assembly language that I would consider
legitimate -- but only after the bottleneck has been clearly identified
and only if other solutions are insufficient. (Jon Bentley has written
some nice books on the subject of improving code efficiency.)
>I can't comment on which is worse, a poorly written C program or a
>poorly written assembly language program. They're both really bad.
That's true, but even well-written assembly code is unlikely to be of
much value for porting purposes, while some truly horrible C code has
been ported all around the place. Your point about coding quality is
well taken -- no matter what programming language is used, it is possible
to produce a neat, maintainable job or an ugly unmaintainable mess.
rhyde@dimaggio.ucr.edu (randy hyde) (04/27/91)
>>> Most of the execution time is spent in a small percentage of the code.
Ah, the good old 90-10 rule.
I subscribe to this rule in totality.  The only problem is finding that 10%.
Most people assume the 10% (responsible for 90% of the execution time) is all
together in one big clump: extract that, rewrite it in assembly, and your
program magically runs almost as fast as a pure assembly language program.
I've fallen into this trap before; several times in fact. That's why I'm so
strong on assembly today.
I truly believe that 10% of the code is responsible for 90% of the
execution time.
It's just that that 10% consists of 0.5% here, 0.2% there, 1% in that corner,
0.74% somewhere else, and so on. Often the slow code is intermixed with
the fast code in such a way that you wind up rewriting a large part of
the program in assembly anyway. I know, I've done this so many times in
the past on applications that were written in C or Pascal and needed a
considerable speed-up.
Two projects, in particular, come to mind. One was an editor, the other
was a runoff type program. On the runoff program I spent as much time
recoding portions of the program as I did writing it in Pascal in the
first place.
It ran only twice as fast when I was done, which wasn't fast enough. I
re-wrote it in assembly and it ran about ten times faster than the
hybrid version.
I had similar results with the editor. BTW, rewriting it in assembly
took less time than hacking up the Pascal code with assembly.
Now I'm talking about programs on the order of 10,000 to 50,000 lines.
I wouldn't make these same claims for a 1,000,000 line program. OTOH, I
wouldn't work on a 1,000,000 line program, either.
BTW, I have Bentley's "Writing Efficient Programs" and I assign it in a couple
of my classes.  Definitely a good book regardless of your choice of language.
psonnek@pro-mansion.cts.com (Patrick Sonnek) (04/27/91)
In-Reply-To: message from gwyn@smoke.brl.mil
>That's true, but even well-written assembly code is unlikely to be of
>much value for porting purposes, while some truly horrible C code has
>been ported all around the place.

This is a good thing?!?  As has been stated before by someone else (forgive
me, I've forgotten who): I don't care if it's portable, I want the
application to run on MY computer FAST!
----
ProLine:  psonnek@pro-mansion          Sysop Pro-mansion: 507/726-6181
Internet: psonnek@pro-mansion.cts.com  MCImail: psonnek
UUCP:     crash!pro-mansion!psonnek    ARPA: crash!pro-mansion!psonnek@nosc.mil
BITNET:   psonnek%pro-mansion.cts.com@nosc.mil
<<Real programmers don't program in HLL's.>>
<<HLL's are for wimpy application coders!!>>
stephens@latcs2.lat.oz.au (Philip J Stephens) (04/28/91)
Randy Hyde writes:
>Two projects, in particular, come to mind.  One was an editor, the other
>was a runoff type program.  On the runoff program I spent as much time
>recoding portions of the program as I did writing it in Pascal in the
>first place.
>It ran only twice as fast when I was done, which wasn't fast enough.  I
>re-wrote it in assembly and it ran about ten times faster than the
>hybrid version.
>I had similar results with the editor.  BTW, rewriting it in assembly
>took less time than hacking up the Pascal code with assembly.

I would agree with Randy here.  People have been saying that the time to
write and debug a program in assembly is much more prohibitive than writing
and debugging the same program in a HLL.  I disagree with this point; if
you're a bad programmer, then you'll do just as bad a job in a HLL as you
would in assembly.  But if you're a good programmer, you'll be able to write
in assembly just as fast and as well as in a HLL.  Of course, I do agree
with the statement that you shouldn't choose assembly over a HLL unless it
is obvious that you're going to run into speed problems.

There's a real art to debugging an assembly language program, one that
requires years of experience to develop.  Not that it's all that different
from debugging a HLL program; just a little trickier.

<\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/><\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\>
<  Philip J. Stephens               ><  "Many views yield the truth."          >
<  Hons. student, Computer Science  ><  "Therefore, be not alone."             >
<  La Trobe University, Melbourne   ><  - Prime Song of the viggies, from      >
<  AUSTRALIA                        ><    THE ENGIMA SCORE by Sheri S Tepper   >
</\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\></\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/>
gwyn@smoke.brl.mil (Doug Gwyn) (04/30/91)
In article <8866@crash.cts.com> psonnek@pro-mansion.cts.com (Patrick Sonnek) writes:
-In-Reply-To: message from gwyn@smoke.brl.mil
->That's true, but even well-written assembly code is unlikely to be of
->much value for porting purposes, while some truly horrible C code has
->been ported all around the place.
-This is a good thing?!? As has been stated before by someone else (Forgive
-me, I've forgotten who.) I don't care if it's portable, I want the
-application to run on MY computer FAST!
If it's not portable, odds are it won't run on your computer at all!
If an application runs well enough, you should be happy, rather than
pointlessly wish that it ran faster.
bazyar@ernie (Jawaid Bazyar) (05/01/91)
In article <16004@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>If it's not portable, odds are it won't run on your computer at all!
>
>If an application runs well enough, you should be happy, rather than
>pointlessly wish that it ran faster.

Of course, the IBM solution would be to throw more hardware at the problem.
In this case, I'd simply recommend one of Zip's fantastic and cheap
accelerators.
--
Jawaid Bazyar               | "Twenty seven faces- with their eyes turned to
Senior/Computer Engineering |  the sky.  I have got a camera, and an airtight
bazyar@cs.uiuc.edu          |  alibi.."
Apple II Forever!           |  I need a job... Be privileged to pay me! :-)
psonnek@pro-mansion.cts.com (Patrick Sonnek) (05/03/91)
In-Reply-To: message from gwyn@smoke.brl.mil
>If it's not portable, odds are it won't run on your computer at all!

If it was written in assembly for my computer specifically, it will!

>If an application runs well enough, you should be happy, rather than
>pointlessly wish that it ran faster.

I don't consider speed to be pointless.  My time is valuable, and I don't
want to have to wait around for inefficient code.  I think you will find my
opinions about speed to be consistent with those of the users (as opposed to
the programmers) out there in the world.  It has been shown, on mainframes
anyway, that users find a response time of 3 seconds or less acceptable,
with subsecond response being the optimum; anything over 3 seconds is
intolerable to the average user.  So speed is very important.
----
ProLine:  psonnek@pro-mansion          Sysop Pro-mansion: 507/726-6181
Internet: psonnek@pro-mansion.cts.com  MCImail: psonnek
UUCP:     crash!pro-mansion!psonnek    ARPA: crash!pro-mansion!psonnek@nosc.mil
BITNET:   psonnek%pro-mansion.cts.com@nosc.mil
<<Real programmers don't program in HLL's.>>
<<HLL's are for wimpy application coders!!>>
jerry@polygen.uucp (Jerry Shekhel) (05/03/91)
psonnek@pro-mansion.cts.com (Patrick Sonnek) writes:
>
>This is a good thing?!?  As has been stated before by someone else (Forgive
>me, I've forgotten who.)  I don't care if it's portable, I want the
>application to run on MY computer FAST!
>

What a joke.  If you want the application to run on your computer AT ALL,
you'd better care about portability.
--
+-------------------+----------------------+---------------------------------+
| JERRY J. SHEKHEL  | POLYGEN CORPORATION  | When I was young, I had to walk |
| Drummers do it... | Waltham, MA USA      | to school and back every day -- |
| ... In rhythm!    | (617) 890-2175       | 20 miles, uphill both ways.     |
+-------------------+----------------------+---------------------------------+
|                ...! [ princeton mit-eddie bu sunne ] !polygen!jerry        |
|                jerry@polygen.com                                           |
+----------------------------------------------------------------------------+
philip@utstat.uucp (Philip McDunnough) (05/04/91)
In article <8992@crash.cts.com> psonnek@pro-mansion.cts.com (Patrick Sonnek)
writes:
[quotes ongoing discussion of HLLs vs. assembler, and an opinion on the need
for an application to run faster if it already runs well]
>I don't consider speed to be pointless.  My time is valuable.  I don't want
>to have to wait around for inefficient code.  I think you will find my
>opinions about speed to be consistent with the users (as opposed to
>programmers) out there in the world.

I consider myself a user, and I suppose my time is valuable.  This endless
search for more speed certainly doesn't make me want to jump up and down.
I'd rather see elegance, beauty, thought, etc. put into problems than lines
of programmers/users churning out endless code and spreadsheets.  I'm not
saying that there isn't a time and place for fast computers; I am saying
it's relatively unimportant in the larger scheme of things.  There are more
important issues involved with computing, from a user's point of view at
least, than just getting anything to run faster.

Philip McDunnough
University of Toronto
[my opinions, etc.]
rhyde@ucrmath.ucr.edu (randy hyde) (05/06/91)
>>>>
What a joke. If you want the application to run on your computer AT ALL,
you'd better care about portability.
<<<<
I write lots of device drivers for PCs. I can promise you they are not
portable. They work fine on my computer. I don't care if they're
portable to a Mac or an Apple II.
There is a big problem with portable "Applications." In general, they have
to be written to the lowest common denominator. I'm sorry, but people have
been arguing this one on microcomputers since the TRS-80 vs. Apple II days.
The non-portable products always won out in the marketplace. Even Word Perfect
(who has ported their application to more diverse systems than anyone else
I know of) doesn't write portable software. They've written specific
versions for each platform.
Most people won't spend a lot of money on applications that barely run on
a machine (i.e., AT ALL). They want those applications to take advantage of
the features of the machine.  Look at how many people around here are
complaining because software vendors only support generic Apple II computers
and don't support the gee whiz features of the GS. Hey, it's portable if it
runs on the II. That doesn't make GS owners very happy. They'll buy such
products grudgingly, if at all.
*** Randy Hyde
dat33228@uxa.cso.uiuc.edu (Derek A. Taubert) (05/07/91)
In article <1074@stewart.UUCP> jerry@stewart.UUCP (Jerry Shekhel) writes:
>psonnek@pro-mansion.cts.com (Patrick Sonnek) writes:
>>
>>This is a good thing?!?  As has been stated before by someone else (Forgive
>>me, I've forgotten who.)  I don't care if it's portable, I want the
>>application to run on MY computer FAST!
>>
>
>What a joke.  If you want the application to run on your computer AT ALL,
>you'd better care about portability.

Excuse me for returning the same arrogance that you just so wonderfully
displayed, but do you even know what portability means?
--
+ Derek Taubert --> derek@mrcnext.cso.uiuc.edu + Author of : GScii+          +
+ dat33228@uxa.cso.uiuc.edu                    + and the world's most useless+
++++++++++++++++++++++++++++++++++++++++++++++++ desk accessory -> Amaze me +
+ There are MOUSE technotes? +    *******8-)   ++++++++++++++++++++++++++++++
+ Ask me about my GS load meter + ^^^^^^^^^^ Marge Simpson                   +