[comp.sys.apple2] CRCs

rhyde@ucrmath.ucr.edu (randy hyde) (04/20/91)

I finally got my act together and I'm submitting some examples of CRC-16/
CCITT algorithms.  For a really good discussion of the CRC-16/CCITT algo-
rithm, please see "The C Programmer's Guide to Serial Communications" by
Campbell.  This is a great book and should be in the hands of anyone who
needs to deal with any form of serial communications, regardless of the
language they happen to program in.

A *brief* description of CRC-16/CCITT: There are many CRC (cyclic
redundancy check) algorithms lying around.  For example, someone posted
the CRC algorithm used by MINIX on the net here; it is not the same as
the standard CRC-16/CCITT algorithm (I cannot verify this since I do not
yet have the Minix sources, so the post could have been wrong).  CRC
algorithms are used like checksums and parity-- they verify the correct-
ness of a block of data.  The CRC-16/CCITT algorithm has many advantages
over checksums and parity checks, but a full discussion of those
advantages is beyond the scope of this article.  See the reference above
for more details.  Also, the Aug/Sep 1990 C User's Journal had several
articles about CRCs and other data-checking algorithms dealing with
serial communications.

The CRC-16/CCITT algorithm basically computes the remainder when the
message, treated as one long binary number, is divided by the polynomial
0x1021 using modulo-2 (carryless) arithmetic, in which subtraction is
simply XOR.  This algorithm was chosen because it is fairly reliable at
detecting all sorts of errors *and* (most importantly) it is very easy
to implement in hardware.  CRCs were originally designed to verify data
coming off a disk drive.  Since the data comes off the disk at a rapid
rate, software (at the time) was unable to keep up with it.  Therefore,
the CRC had to be computed in hardware.  A modulo-2 division by 0x1021
is easy to implement using a couple of shift registers and XOR gates.

Therefore, a CRC-16 algorithm is essentially a long division algorithm
which uses XOR rather than subtraction at each step in the division.  It
throws away the quotient and keeps the remainder.  There is actually a
little more to it than this, but that's close enough for now.
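To make the long-division picture concrete, here is a small sketch (the
function names are mine, not from the book or this thread): it divides one
message byte, padded with 16 zero bits, by the 17-bit CCITT polynomial
0x11021, using XOR in place of subtraction, and it can be checked against
the shift-and-XOR formulation described next.

```c
#include <assert.h>

/* Mod-2 long division, spelled out.  Divide one message byte (padded
 * with 16 zero bits) by the 17-bit CCITT polynomial 0x11021.  Because
 * subtraction mod 2 is XOR, each division step just XORs the divisor
 * under the current leading 1 bit.  The remainder is the CRC. */
unsigned long crc16_by_division(unsigned data)
{
    unsigned long dividend = (unsigned long)(data & 0xff) << 16;
    int bit;

    for (bit = 23; bit >= 16; --bit)             /* one step per message bit */
        if (dividend & (1UL << bit))
            dividend ^= 0x11021UL << (bit - 16); /* "subtract" the divisor */
    return dividend;                             /* 16-bit remainder */
}

/* The equivalent shift-and-XOR (shift register) formulation. */
unsigned crc16_shift(unsigned data, unsigned curcrc)
{
    int i;

    curcrc ^= (data & 0xff) << 8;
    for (i = 0; i < 8; ++i)
        curcrc = (curcrc & 0x8000) ? ((curcrc << 1) ^ 0x1021) & 0xffff
                                   : (curcrc << 1) & 0xffff;
    return curcrc;
}
```

Both formulations agree for every byte value; for instance, dividing the
single byte 0x01 leaves a remainder of 0x1021, the polynomial itself.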

To compute the CRC-16/CCITT value (I'll just call it CRC from here on out)
you initialize the crc value to zero (many modern algorithms initialize it
to 0xffff instead, for some very good reasons).  You then grab each byte
from the message and, one bit at a time, xor the H.O. bit of the data
with the H.O. bit of the current checksum.  If the result is zero, you
simply shift the crc (16-bits) and the data (8 bits) one position to the
left.  If the XOR of the H.O. bits is one, you still shift the data byte
one position to the left, the crc value, however, you must shift and then
xor with 0x1021 (the CCITT "polynomial").  You repeat this for all eight
bits and then repeat the process for the entire byte stream of the message.
The following C(++) code does this:

#include <stdio.h>

int crc(int data, int curcrc)
{
	static int i;

	data <<= 8;
	for (i=0; i<8; ++i)
	{
		if ((data ^ curcrc) & 0x8000)
			curcrc = (curcrc << 1) ^ 0x1021;
		else	curcrc <<= 1;
		data <<= 1;
	}
	return curcrc;
}

int main()
{
	int i,j,c;

	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
			c = crc(j,c);
	}
	puts("done");
}

BTW, there is a minor bug (affecting portability) in this code. Can you
spot it?  I left this bug in to make the above code much easier to read.
I will present a version of this code in a moment with the bug removed.

Running time for this algorithm is 3:39 on a 12MHz 80286 (I had my
laptop with me at Cal Poly yesterday while working on this;  I apologize
for not doing this on a GS, but my GS isn't exactly portable).  Note: I
used Borland's BC (C++) compiler, emitting 8086 code with optimizations
turned on (although I doubt optimization makes any difference on such a
small, simple program).

Implemented in (8086) assembly language, the above program looks like the
following:

		include	stdlib.a
;****************************************************************************
;
;
;
;
;
cseg		segment	para public 'code'
		assume	cs:cseg, ds:dseg
;
;
		public	PSP
PSP		dw	?			;Must be in CODE segment!
;
;
; Some useful constants:
;
cr		equ	13
lf		equ	10
eos		equ	0
;
true		equ	1
false		equ	0
;
;
; Main is the main program.  Program execution always begins here.
;
Main		proc
		mov	cs:PSP, es		;Save pgm seg prefix
		mov	ax, seg dseg		;Set up the segment registers
		mov	ds, ax
		mov	es, ax
		mov	dx, 0			;Allocate all available RAM.
		MemInit
;
;
		mov	cx, 10000
crclp1:		xor	dx, dx			;Set DX to zero.
		mov	ax, dx
crclp2:		call	crc
		inc	al
		jne	crclp2
		dec	cx
		jne	crclp1
		print
		db	"done",0
;
;
;
Quit:		mov     ah, 4ch
		int     21h
;
Main		endp
;
;
; CRC routine:
;
; Entry:
; DX contains current crc value
; AL contains new character
;
; Exit:
; DX contains new crc value
;
crc		proc	near
		push	cx
		push	ax
		mov	cx, 8			;Repeat once for each bit.
BitLoop:	mov	ah, al			;Get a copy of data value.
		xor	ah, dh			;This clears the carry!
		jns	NoCCITT			;XOR of H.O. bits = 1?
		xor	dx, 0810h		;1021h shr 1
		stc				;Underflow from 1021 shr 1
;
; The following rcl is just a SHL if we branched from above (since the XOR
; instruction clears the carry flag).  It is an SHL which shifts in a 1
; if we fall through from above.  Shifting in a one supplies the L.O. bit
; of 1021h which we XOR'd in above.
;
NoCCITT:	rcl	dx, 1
		shl	al, 1			;Go to the next bit in data.
		loop	BitLoop
		pop	ax
		pop	cx
		ret
crc		endp
;
;
cseg            ends
;
;
; Allocate a reasonable amount of space for the stack (2k).
;
sseg		segment	para stack 'stack'
stk		db	256 dup ("stack   ")
sseg		ends
;
;
		end	Main


Again, I apologize for the 8086 assembly; it's all I had available at the
time.  Running time for this code on the same machine as above: 1:19.

The above are the simple and easy to understand versions of the CRC
algorithm.  Alas, they are very slow.  Keep in mind, the CRC algorithm
may need to keep up with a fast serial line (or some other fast device).
We certainly don't want to lose characters because of CRC computation.

Before jumping into major optimizations, let's consider a very common
optimization that just about any assembly language programmer would
think of and good "C" programmers would think of: unravelling (unrolling)
the loops.  The loop in the crc function can be easily unrolled since it
executes exactly eight times each time you call it.  Since the function
is rather simple, why bother with the function call?  Why not include
the code in-line and dispense with the overhead of the call?  The following
code implements both approaches:


#include <stdio.h>

#define crc18(data, curcrc)	if (((data << 8) ^ curcrc) & 0x8000) \
					curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff; \
				else curcrc = (curcrc << 1) & 0xffff;\
				data = (data << 1) & 0xff;

int crc(int data, int curcrc)
{
	static int i;

	data <<= 8;
	for (i=0; i<8; ++i)
	{
		if ((data ^ curcrc) & 0x8000)
			curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff;
		else	curcrc = (curcrc << 1) & 0xffff;
		data = (data << 1) & 0xff00;
	}
	return curcrc;
}

int main()
{
	int i,j,k,c;

	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
			c = crc(j,c);
	}
	puts("done, press return for macro version");
	c = getchar();
	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
		{
			k=j;
			crc18(k,c);
			crc18(k,c);
			crc18(k,c);
			crc18(k,c);
			crc18(k,c);
			crc18(k,c);
			crc18(k,c);
			crc18(k,c);
		}
	}
	puts("done");
}

This code implements *both* the function call and the unrolled version.
If you didn't notice the bug I corrected: ANSI C guarantees only that
shorts and ints are *at least* 16 bits long--they may well be longer.
Since you cannot count on a 16-bit int, the shift results must be ANDed
with 0xffff as shown above.  If your C compiler does have 16-bit ints,
I would hope that it would optimize out this operation.

Running time for this code:  2:02 (at least for the in-line version).
A substantial improvement.
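One way to convince yourself that the masked version really is independent
of the int width is to run it against a known answer.  The sketch below is
mine, not part of the original post: with the zero initial value used in
this thread, the CRC of the ASCII string "123456789" comes out to 0x31C3
(the widely published XMODEM check value).

```c
#include <assert.h>

/* Bitwise CRC-16/CCITT with a zero initial value, masked so it gives
 * the same answer whether int is 16, 32, or 64 bits wide. */
unsigned crc16(unsigned data, unsigned curcrc)
{
    int i;

    data = (data & 0xff) << 8;
    for (i = 0; i < 8; ++i) {
        if ((data ^ curcrc) & 0x8000)        /* compare the H.O. bits */
            curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff;
        else
            curcrc = (curcrc << 1) & 0xffff;
        data = (data << 1) & 0xffff;
    }
    return curcrc;
}

/* Run a whole NUL-terminated string through the CRC. */
unsigned crc16_string(const char *s)
{
    unsigned c = 0;

    while (*s)
        c = crc16((unsigned char)*s++, c);
    return c;
}
```

crc16_string("123456789") should return 0x31C3 on any host, 16-bit or not.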

Consider the same code in 8086 assembly:

		include	stdlib.a
;****************************************************************************
;
Crc18		macro
		local	NoCCITT
		mov	ah, al
		xor	ah, dh			;This clears the carry!
		jns	NoCCITT
		xor	dx, 0810h		;1021h shr 1
		stc				;Underflow from 1021 shr 1
NoCCITT:	rcl	dx, 1
		shl	al, 1
		endm
;
;
;
cseg		segment	para public 'code'
		assume	cs:cseg, ds:dseg
;
;
		public	PSP
PSP		dw	?			;Must be in CODE segment!
;
;
; Some useful constants:
;
cr		equ	13
lf		equ	10
eos		equ	0
;
true		equ	1
false		equ	0
;
;
; Main is the main program.  Program execution always begins here.
;
Main		proc
		mov	cs:PSP, es		;Save pgm seg prefix
		mov	ax, seg dseg		;Set up the segment registers
		mov	ds, ax
		mov	es, ax
		mov	dx, 0			;Allocate all available RAM.
		MemInit
;
;
		mov	cx, 10000
crclp1:		xor	dx, dx			;Set DX to zero.
		mov	ax, dx
crclp2:		push	ax
		crc18
		crc18
		crc18
		crc18
		crc18
		crc18
		crc18
		crc18
		pop	ax
		inc	al
		jne	crclp2
		dec	cx
		jne	crclp1
		print
		db	"done",0
;
;
;
Quit:		mov     ah, 4ch
		int     21h
;
Main		endp
;
;
;
;
cseg            ends
;
;
; Allocate a reasonable amount of space for the stack (2k).
;
sseg		segment	para stack 'stack'
stk		db	256 dup ("stack   ")
sseg		ends
;
;
;
		end	Main


Running time for this code:  0:48.  Again a substantial improvement.

I would like to point out that the above algorithms are about as far
as you can take the CRC-16/CCITT algorithm and still have code that
is easily recognizable for what it does.  Additional optimizations will
destroy the readability of the code.  Nonetheless, even 48 seconds may
be a little too slow for someone wanting to do 2,560,000 CRC computations.
Let's explore the next step.

Any assembly language programmer worth his salt would immediately suggest
"table look-up" as a means to optimize the program.  A good C programmer
would also make this suggestion (although you see this technique used
*far* less often in C than in assembly).  The code Doug posted used this
technique to speed things up (which is why the C code was very difficult
to follow).  A modified version of that code (modified so that the calling
sequence is similar to the above routines) follows:

#include <iostream.h>

unsigned crctab[2][16] = {
	{0x0000, 0xc0c1, 0xc181, 0x0140, 0xc301, 0x03c0, 0x0280, 0xc241,
	 0xc601, 0x06c0, 0x0780, 0xc741, 0x0500, 0xc5c1, 0xc481, 0x0440},

	{0x0000, 0xcc01, 0xd801, 0x1400, 0xf001, 0x3c00, 0x2800, 0xe401,
	 0xa001, 0x6c00, 0x7800, 0xb401, 0x5000, 0x9c01, 0x8801, 0x4400}};

int crc(int data, int curcrc)
{
	static int temp;
	temp = curcrc ^ data;
	return crctab[0][temp & 0xf] ^ crctab[1][(temp >> 4) & 0xf] ^
		(curcrc >> 8);
}

int main()
{
	int i,j,c;

	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
			c = crc(j,c);
	}
	cout << "done";
}

This code is an interesting compromise between space and speed.  Rather
than use a 256 entry lookup table indexed by each byte of data, this
algorithm uses the two nibbles of each data byte as an index into two
halves of a 2D matrix (one side for L.O. nibble, one side for H.O. nibble).
Someone had to think long and hard to figure this one out, but more on
that later.

Running time: 0:47.  A *very* substantial improvement over the
bit-at-a-time algorithm.

A couple of comments about the above algorithm: the use of a 2D array is
odd.  On a compiler which doesn't optimize well, this would cost you
dearly, since it would require a multiply and an add to compute the
address of the selected array element.  Why the original author didn't
use two separate one-dimensional arrays escapes me.  Probably the author
either didn't think about the underlying data structure (knowing more
assembly would have helped here) or didn't care about performance across
architectures and compilers.
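Incidentally, the two 16-entry tables need not be typed in by hand.  Note
that the values in them (0xC0C1, 0xCC01, ...) are generated by the
*reflected* polynomial 0xA001--the classic "CRC-16"--rather than CCITT's
0x1021, so this routine computes a different (but equally valid) CRC.  A
sketch of the table generation (function names are mine):

```c
#include <assert.h>

/* The nibble tables above are just CRCs of single nibbles.  These
 * particular values come from the reflected polynomial 0xA001 (the
 * classic "CRC-16", a different polynomial than CCITT's 0x1021).
 * crc16_arc_byte() is the plain bit-at-a-time routine for it. */
unsigned crc16_arc_byte(unsigned n)
{
    int bit;
    unsigned t = n & 0xff;

    for (bit = 0; bit < 8; ++bit)
        t = (t & 1) ? (t >> 1) ^ 0xA001 : t >> 1;
    return t;
}

/* lo[] covers the low nibble of a data byte, hi[] the high nibble. */
void build_nibble_tables(unsigned lo[16], unsigned hi[16])
{
    int i;

    for (i = 0; i < 16; ++i) {
        lo[i] = crc16_arc_byte(i);
        hi[i] = crc16_arc_byte(i << 4);
    }
}
```

The first few generated entries (0x0000, 0xC0C1, 0xC181, 0x0140, ...)
match the table in the listing above.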

I cannot imagine an assembly language programmer attacking this problem in
this fashion, so I didn't bother implementing it.  Most assembly
language programmers would go directly to a 256-element lookup table and
use the data byte as an index into that table.  That's what the next
section of code does:

If you're willing to live with a 256 element array rather than a 32-element
array, the following C code is even better:





#include <iostream.h>

unsigned crctab[256] = {
0x0000,
0x1021,0x2042,0x3063,0x4084,0x50A5,0x60C6,0x70E7,0x8108,
0x9129,0xA14A,0xB16B,0xC18C,0xD1AD,0xE1CE,0xF1EF,0x1231,
0x0210,0x3273,0x2252,0x52B5,0x4294,0x72F7,0x62D6,0x9339,
0x8318,0xB37B,0xA35A,0xD3BD,0xC39C,0xF3FF,0xE3DE,0x2462,
0x3443,0x0420,0x1401,0x64E6,0x74C7,0x44A4,0x5485,0xA56A,
0xB54B,0x8528,0x9509,0xE5EE,0xF5CF,0xC5AC,0xD58D,0x3653,
0x2672,0x1611,0x0630,0x76D7,0x66F6,0x5695,0x46B4,0xB75B,
0xA77A,0x9719,0x8738,0xF7DF,0xE7FE,0xD79D,0xC7BC,0x48C4,
0x58E5,0x6886,0x78A7,0x0840,0x1861,0x2802,0x3823,0xC9CC,
0xD9ED,0xE98E,0xF9AF,0x8948,0x9969,0xA90A,0xB92B,0x5AF5,
0x4AD4,0x7AB7,0x6A96,0x1A71,0x0A50,0x3A33,0x2A12,0xDBFD,
0xCBDC,0xFBBF,0xEB9E,0x9B79,0x8B58,0xBB3B,0xAB1A,0x6CA6,
0x7C87,0x4CE4,0x5CC5,0x2C22,0x3C03,0x0C60,0x1C41,0xEDAE,
0xFD8F,0xCDEC,0xDDCD,0xAD2A,0xBD0B,0x8D68,0x9D49,0x7E97,
0x6EB6,0x5ED5,0x4EF4,0x3E13,0x2E32,0x1E51,0x0E70,0xFF9F,
0xEFBE,0xDFDD,0xCFFC,0xBF1B,0xAF3A,0x9F59,0x8F78,0x9188,
0x81A9,0xB1CA,0xA1EB,0xD10C,0xC12D,0xF14E,0xE16F,0x1080,
0x00A1,0x30C2,0x20E3,0x5004,0x4025,0x7046,0x6067,0x83B9,
0x9398,0xA3FB,0xB3DA,0xC33D,0xD31C,0xE37F,0xF35E,0x02B1,
0x1290,0x22F3,0x32D2,0x4235,0x5214,0x6277,0x7256,0xB5EA,
0xA5CB,0x95A8,0x8589,0xF56E,0xE54F,0xD52C,0xC50D,0x34E2,
0x24C3,0x14A0,0x0481,0x7466,0x6447,0x5424,0x4405,0xA7DB,
0xB7FA,0x8799,0x97B8,0xE75F,0xF77E,0xC71D,0xD73C,0x26D3,
0x36F2,0x0691,0x16B0,0x6657,0x7676,0x4615,0x5634,0xD94C,
0xC96D,0xF90E,0xE92F,0x99C8,0x89E9,0xB98A,0xA9AB,0x5844,
0x4865,0x7806,0x6827,0x18C0,0x08E1,0x3882,0x28A3,0xCB7D,
0xDB5C,0xEB3F,0xFB1E,0x8BF9,0x9BD8,0xABBB,0xBB9A,0x4A75,
0x5A54,0x6A37,0x7A16,0x0AF1,0x1AD0,0x2AB3,0x3A92,0xFD2E,
0xED0F,0xDD6C,0xCD4D,0xBDAA,0xAD8B,0x9DE8,0x8DC9,0x7C26,
0x6C07,0x5C64,0x4C45,0x3CA2,0x2C83,0x1CE0,0x0CC1,0xEF1F,
0xFF3E,0xCF5D,0xDF7C,0xAF9B,0xBFBA,0x8FD9,0x9FF8,0x6E17,
0x7E36,0x4E55,0x5E74,0x2E93,0x3EB2,0x0ED1,0x1EF0};

int crc(int data, int curcrc)
{
	static int comb_val;
	comb_val = ((curcrc >> 8) ^ data) & 0xff;
	return ((curcrc << 8) ^ crctab[comb_val]) & 0xffff;
}

int main()
{
	int i,j,c;

	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
			c = crc(j,c);
	}
	cout << "done";
}

Running time: 0:39;  again, a substantial improvement.  Now look at the
same code in 8086 assembly:

		include	stdlib.a
;****************************************************************************
;
;
; Global variables go here:
;
dseg		segment	para public 'data'
;
		.radix	16
crctbl		dw	00000
 dw 01021,02042,03063,04084,050A5,060C6,070E7,08108
 dw 09129,0A14A,0B16B,0C18C,0D1AD,0E1CE,0F1EF,01231
 dw 00210,03273,02252,052B5,04294,072F7,062D6,09339
 dw 08318,0B37B,0A35A,0D3BD,0C39C,0F3FF,0E3DE,02462
 dw 03443,00420,01401,064E6,074C7,044A4,05485,0A56A
 dw 0B54B,08528,09509,0E5EE,0F5CF,0C5AC,0D58D,03653
 dw 02672,01611,00630,076D7,066F6,05695,046B4,0B75B
 dw 0A77A,09719,08738,0F7DF,0E7FE,0D79D,0C7BC,048C4
 dw 058E5,06886,078A7,00840,01861,02802,03823,0C9CC
 dw 0D9ED,0E98E,0F9AF,08948,09969,0A90A,0B92B,05AF5
 dw 04AD4,07AB7,06A96,01A71,00A50,03A33,02A12,0DBFD
 dw 0CBDC,0FBBF,0EB9E,09B79,08B58,0BB3B,0AB1A,06CA6
 dw 07C87,04CE4,05CC5,02C22,03C03,00C60,01C41,0EDAE
 dw 0FD8F,0CDEC,0DDCD,0AD2A,0BD0B,08D68,09D49,07E97
 dw 06EB6,05ED5,04EF4,03E13,02E32,01E51,00E70,0FF9F
 dw 0EFBE,0DFDD,0CFFC,0BF1B,0AF3A,09F59,08F78,09188
 dw 081A9,0B1CA,0A1EB,0D10C,0C12D,0F14E,0E16F,01080
 dw 000A1,030C2,020E3,05004,04025,07046,06067,083B9
 dw 09398,0A3FB,0B3DA,0C33D,0D31C,0E37F,0F35E,002B1
 dw 01290,022F3,032D2,04235,05214,06277,07256,0B5EA
 dw 0A5CB,095A8,08589,0F56E,0E54F,0D52C,0C50D,034E2
 dw 024C3,014A0,00481,07466,06447,05424,04405,0A7DB
 dw 0B7FA,08799,097B8,0E75F,0F77E,0C71D,0D73C,026D3
 dw 036F2,00691,016B0,06657,07676,04615,05634,0D94C
 dw 0C96D,0F90E,0E92F,099C8,089E9,0B98A,0A9AB,05844
 dw 04865,07806,06827,018C0,008E1,03882,028A3,0CB7D
 dw 0DB5C,0EB3F,0FB1E,08BF9,09BD8,0ABBB,0BB9A,04A75
 dw 05A54,06A37,07A16,00AF1,01AD0,02AB3,03A92,0FD2E
 dw 0ED0F,0DD6C,0CD4D,0BDAA,0AD8B,09DE8,08DC9,07C26
 dw 06C07,05C64,04C45,03CA2,02C83,01CE0,00CC1,0EF1F
 dw 0FF3E,0CF5D,0DF7C,0AF9B,0BFBA,08FD9,09FF8,06E17
 dw 07E36,04E55,05E74,02E93,03EB2,00ED1,01EF0
;
		.radix	10
dseg		ends
;
;
;
;
cseg		segment	para public 'code'
		assume	cs:cseg, ds:dseg
;
;

;
		public	PSP
PSP		dw	?			;Must be in CODE segment!
;
;
;
Main		proc
		mov	cs:PSP, es		;Save pgm seg prefix
		mov	ax, seg dseg		;Set up the segment registers
		mov	ds, ax
		mov	es, ax
;
		mov	cx, 10000
crclp1:		xor	dx, dx			;Set DX to zero.
		mov	ax, dx
crclp2:		call	crc
		inc	al
		jne	crclp2
		dec	cx
		jne	crclp1
		print
		db	"done",0
;
;
Quit:		mov     ah, 4ch
		int     21h
;
Main		endp
;
;
; CRC update routine.
;
; On Entry:
; AL- New data value
; DX- Current CRC value
;
; On Exit:
; DX- New CRC value
;
crc		proc	near
		mov	bl, dh
		xor	bl, al
		mov	bh, 0
		mov	dh, dl
		mov	dl, bh
		shl	bx, 1
		xor	dx, crctbl[bx]
		ret
crc		endp
;
;
;
cseg            ends
;
;
; Allocate a reasonable amount of space for the stack (2k).
;
sseg		segment	para stack 'stack'
stk		db	256 dup ("stack   ")
sseg		ends
;
;
		end	Main


Running time:  0:17

I could probably get this down to 0:15 or 0:16 if I encoded the CRC
calculation in-line rather than as a procedure/function.
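Whether in C or assembly, by the way, the 256-entry table needn't be keyed
in by hand either--every entry is just the CCITT CRC of its own index.  A
sketch of the generation plus the table-driven update (names are mine):

```c
#include <assert.h>

/* Each table entry is simply the CCITT CRC of its index byte,
 * computed with the slow shift-and-XOR method. */
unsigned crc16_ccitt_byte(unsigned n)
{
    int bit;
    unsigned t = (n & 0xff) << 8;

    for (bit = 0; bit < 8; ++bit)
        t = (t & 0x8000) ? ((t << 1) ^ 0x1021) & 0xffff
                         : (t << 1) & 0xffff;
    return t;
}

unsigned crctab[256];

void build_crctab(void)
{
    int i;

    for (i = 0; i < 256; ++i)
        crctab[i] = crc16_ccitt_byte(i);
}

/* Table-driven update, masked for portability to wide ints. */
unsigned crc16_table(unsigned data, unsigned curcrc)
{
    return ((curcrc << 8) ^ crctab[((curcrc >> 8) ^ data) & 0xff]) & 0xffff;
}
```

With the table built, feeding the bytes of "123456789" through
crc16_table (starting from zero) produces 0x31C3, agreeing with the
bit-at-a-time versions.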

Assembly language typically works out a little better than twice as
fast.  The only surprise here is that the advantage holds up even for
such simple, short routines, where ASM's edge over C usually isn't this
large.  For a CRC algorithm, speed is very important, so ASM is,
perhaps, a better choice.

But that wasn't the purpose of doing all this.  Earlier (on the net), I'd
made the statement that for a CRC algorithm, ASM isn't particularly more
difficult to understand than C (and this is true, in general, for most
bit-pushing operations).  Now I'm assuming we're talking about expert
C programmers looking at C code and expert assembly language programmers
looking at ASM code.  It isn't fair to take an expert C programmer who is
a mediocre ASM programmer and have that same person evaluate C and ASM
code.

I claim the ASM code above isn't any more difficult to understand than the
C code (assuming, of course, that you know the algorithm that the code
is attempting to implement).

In Doug's original post, he claimed that the ASM code wasn't any more
readable than the C code.  The reasons are two-fold: (1) the code used
a truly miserable algorithm which was next to impossible to understand
in the first place (*NO* language is going to make that algorithm
particularly clear), (2) the ASM code looked like the output of a
compiler.  I would be ashamed of myself if I wrote assembly code that
looked like that.  Clearly, compiler-generated assembly code is *not*
going to be easier to understand than hand-written code.  My apologies
if I'm offending an actual human author of that code.

One final point: note that my minor optimization in assembly language
(unrolling the loop) ran at the same speed as the nibble look-up-table
optimized C code.  For the same speed, I claim my assembly code is *much*
easier to understand than the optimized C code.  From a development
point of view, it would take far more time to dream up, implement, and
verify the nibble-based C code that Doug posted than it would take to
implement the unrolled version of the original CRC algorithm in assembly.
I offer this as proof that assembly is often the better choice for a
project.  Sure, you can find better algorithms in C which run faster than
bad algorithms in ASM, but the end result is that you spend *far* more
time designing, testing, and debugging the C code.  And if you can find
a better algorithm that works in C, that same algorithm will work in
assembly as well.

Keep in mind that the CRC algorithm is intrinsically O(n).  You're not
going to do any better than this because it takes n operations to read
the data.  Therefore, the only thing we can attack is the constant
in the complexity of this algorithm.  ASM is really good at this.

*** Randy Hyde

gwyn@smoke.brl.mil (Doug Gwyn) (04/21/91)

In article <13688@ucrmath.ucr.edu> rhyde@ucrmath.ucr.edu (randy hyde) writes:
>Assembly language typically works out a little better than twice as fast.

But the comparison was not really fair, since the assembly-language
example implementation did not support the same linkage as the C code.
For a function that performs so little work, function linkage overhead
is an important factor.

Also, the example C code was not optimal for the given algorithm.
By making a few minor tweaks to it I reduced the C example's run time
by 26% on the host on which I read net news.  Profiling the code showed
that approximately as much time was being spent in the main program's
loop overhead as in the crc function itself.

For the revised crc function C source code:

int crc(data, curcrc)
	int data;
	register unsigned int curcrc;
{
	return (curcrc << 8) ^ crctab[(curcrc >> 8) ^ data];
}

a fairly ordinary old-technology C compiler on a Gould PowerNode
produced the following machine code:

_crc:	enter	8w,0x5		-- function linkage
	movw	9w+8w[b2],r1	-- pick up "curcrc" argument
	movw	r1,r0		-- make a second copy of it
	lslw	#8,r0		-- shift one copy left (accumulator)
	lsrw	#8,r1		-- shift the other copy right
	xorw	8w+8w[b2],r1	-- XOR shifted copy with "data" argument
	lslw	#2,r1		-- shift for indexing crctab[]
	xorw	_crctab[r1],r0	-- XOR table entry with accumulator
	retn			-- function linkage

and a similar compiler on a DEC VAX produced the following machine code:

_crc:	.word	0x800		; function linkage
	movl	8(ap),r11	; pick up "curcrc" argument
	extzv	$8,$24,r11,r0	; shift one copy right (for indexing)
	xorl2	4(ap),r0	; XOR shifted copy with "data" argument
	ashl	$8,r11,r1	; shift the other copy left (accumulator)
	xorl3	r1,_crctab[r0],r0 ; XOR table entry with accumulator
	ret			; function linkage

Note that there isn't very much that one could do even by hand-coded
assembly language to perform this computation much more efficiently.
(Except that the VAX actually has microcode support for a CRC
instruction!  That is an example of a situation in which use of
assembly language could be justified, since it is somewhat rare
for a C compiler to exploit such specialized instructions.  Another
possible example would be in a bottleneck loop when testing the
"carry bit" would be much more efficient than trying to do without
it; originally I thought that was going to be the justification for
coding the CRC algorithm in assembler.)

>I claim the ASM code above isn't any more difficult to understand than the
>C code (assuming, of course, that you know the algorithm that the code
>is attempting to implement).

For such a small function, the conceptual burden is indeed manageable.
For a larger example involving linked data structures, etc. I would
prefer to deal with C code over assembly code.

>From a development point of view, it would take far more time to dream
>up, implement, and verify the nibble-based C code that Doug posted than
>it would take to implement the unrolled version of the original CRC
>algorithm in assembly.

But that's comparing apples and oranges.  For the same algorithm with
the same constraints (particularly function linkage), I don't think
you've shown that assembly language code is appreciably faster than C
code, taking into account my earlier observations.

>I offer this as proof that assembly is often the better choice for a
>project.  Sure, you can find better algorithms in C which run faster than
>bad algorithms in ASM, but the end result is that you spend *far* more
>time designing, testing, and debugging the C code.  And if you can find
>a better algorithm that works in C, that same algorithm will work in
>assembly as well.

The C code, if you do it right, is immediately usable on a variety of
machines, and development cost of the algorithm is independent of the
implementation language.  Implementation of the algorithm in C is no
harder than in assembly language, and for complex algorithms it is
very nice to be able to rely on the compiler for help with the
bookkeeping.  Why do you think high-level languages were invented?

>Keep in mind that the CRC algorithm is intrinsically O(n).  You're not
>going to do any better than this because it takes n operations to read
>the data.  Therefore, the only thing we can attack is the constant
>in the complexity of this algorithm.  ASM is really good at this.

And if you look at the above C compiler output, you'll see that C is
also really good at this, at least on decent machine architectures.

As I have previously remarked, there are legitimate uses for assembler,
but since C is about as fast, infinitely more portable, and easier to
use for large projects, I recommend C over assembler in the majority of
situations.

mdavis@pro-sol.cts.com (Morgan Davis) (04/22/91)

In-Reply-To: message from rhyde@ucrmath.ucr.edu

Excellent article on CRC's, Randy.  This is one to tuck away in the
programmer's notes for sure.

I went through this kind of wracking to find a more optimal algorithm.  The
table method is, by far, the best when it comes to speed.  Sometimes,
though, a quick and dirty (or maybe that should be "small and
dirty") variation like the following is all you need:

CalcCRC (ptr, count)
        char *ptr;
        word count;             /* word: an unsigned 16-bit type */
{
        register word x;

        do {
                CRC ^= (*ptr++ & 0xff) << 8;    /* XOR hi-byte with data */
                for (x = 8; x; --x)
                        if (CRC & 0x8000)
                                CRC = CRC << 1 ^ 0x1021;
                        else
                                CRC <<= 1;
        } while (--count);
}

This function operates on a global named CRC.  You call it by pointing to a
data buffer and passing the size of the buffer.  After calling CalcCRC(),
the CRC global integer is updated accordingly.

The optimizations here that differ from the original, frequently published
version are that it:

o       Counts from 8 bits down to none, so the loop
        test simply involves a test for zero instead of
        comparing to 8.

o       Bit 15 of the CRC is tested with the bitwise AND
        operation instead of shifting it around to determine
        if the CRC needs the XOR treatment.

o       It operates on a buffer, rather than having to pass
        it a byte of data to work on, one at a time.

This optimization was done with the 65816 in mind.  So these kinds of
optimizations may not make any difference on other processors, but I
doubt if they would hurt.
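For anyone wanting to try this on a desktop machine first, here is a
C rendition of the routine above (the 'word' typedef, the portability
masks, and the test vector are my additions, not part of the original):

```c
#include <assert.h>

typedef unsigned int word;   /* assumption: any unsigned type >= 16 bits */
word CRC;                    /* the global the routine updates */

/* Morgan's buffer-at-a-time loop, with two small portability guards:
 * the data byte is masked in case char is signed, and the CRC is
 * masked to 16 bits in case word is wider (it is 16 bits on a 65816). */
void CalcCRC(const char *ptr, word count)
{
    register word x;

    do {
        CRC ^= (word)(*ptr++ & 0xff) << 8;  /* XOR hi-byte with data */
        for (x = 8; x; --x)
            if (CRC & 0x8000)
                CRC = (CRC << 1 ^ 0x1021) & 0xffff;
            else
                CRC = (CRC << 1) & 0xffff;
    } while (--count);
}
```

Starting from CRC = 0, running the nine ASCII digits "123456789" through
it should leave CRC == 0x31C3, matching the bitwise routines earlier in
the thread.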

BTW, the 65816 variant on the above is:

CalcCRC         ldy     #0              ;index into buffer
newbyte         shortm                  ;go to 8-bits
                lda     [ptr],y         ;grab some data
                iny                     ;prepare for next byte
                eor     CRC+1           ;fix high-order byte of CRC
                sta     CRC+1
                longm                   ;back to 16-bit

                ldx     #8              ;init shift counter
shift           asl     CRC             ;always shift it once
                bcc     next            ;if bit 15 clear, skip the XOR

                lda     CRC             ;XOR CRC with $1021
                eor     #$1021
                sta     CRC
next            dex
                bne     shift           ;do this 8 times

                dec     count           ;more bytes to do?
                bne     newbyte         ;yes

                rts

--Morgan

UUCP: crash!pro-sol!mdavis           AOL, BIX: mdavis  
ARPA: crash!pro-sol!mdavis@nosc.mil  GEnie:    m.davis42
INET: mdavis@pro-sol.cts.com         ProLine:  mdavis@pro-sol

toddpw@nntp-server.caltech.edu (Todd P. Whitesel) (04/22/91)

Morgan, if you can live with 512 bytes of CRC tables then you can do
block CRC's in something like 25 cycles per byte CRC'd...

	lda	CRC
	xba
	sep	#$30

crclp	eor	[data],y
	tax
	lda	tblL,x
	xba
	eor	tblH,x
	iny
	bne	crclp
	
	rep	#$30
	xba
	sta	CRC

will do 256 bytes at a time, with a best case of 24 cycles/byte and a worst
of 27.  (Aligning the DP saves 1, and aligning tblH and tblL guarantees 4
cycles each.)  I know the table is generated by something simple like
(somebody check me):
	tbl[byte] = updcrc(byte, 0);
where updcrc(byte, oldcrc) does not use the table, and tbl is split into high
and low byte arrays for the assembly fragment above.
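That recipe can be sanity-checked in C.  The sketch below (names are mine,
following the tblL/tblH split above) builds the tables exactly as described
and verifies that one split-table step matches the plain updcrc:

```c
#include <assert.h>

/* Plain bit-at-a-time CCITT update -- the updcrc() referred to above. */
unsigned updcrc(unsigned byte, unsigned oldcrc)
{
    int i;

    oldcrc ^= (byte & 0xff) << 8;
    for (i = 0; i < 8; ++i)
        oldcrc = (oldcrc & 0x8000) ? ((oldcrc << 1) ^ 0x1021) & 0xffff
                                   : (oldcrc << 1) & 0xffff;
    return oldcrc;
}

unsigned char tblL[256], tblH[256];

/* tbl[byte] = updcrc(byte, 0), split into low/high byte halves. */
void build_split_tables(void)
{
    int i;

    for (i = 0; i < 256; ++i) {
        unsigned t = updcrc(i, 0);
        tblL[i] = t & 0xff;
        tblH[i] = t >> 8;
    }
}

/* One step of the split-table update, mirroring the 65816 loop:
 * index with (crc_hi ^ data); the new low byte comes straight from
 * tblL, and the new high byte is the old low byte XOR tblH. */
unsigned split_update(unsigned crc, unsigned data)
{
    unsigned i = ((crc >> 8) ^ data) & 0xff;

    return (((crc & 0xff) ^ tblH[i]) << 8) | tblL[i];
}
```

split_update(c, d) agrees with updcrc(d, c) for every byte/CRC pair, so
the 25-cycle inner loop computes the same CRC as the slow routine.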

Todd Whitesel
toddpw @ tybalt.caltech.edu

rhyde@gibson.ucr.edu (randy hyde) (04/23/91)

You keep claiming it isn't fair to compare an assembly language routine
to a C routine when the assembly language code doesn't support C linkages.
Why would someone writing their code in assembly language use such linkages?
Of course you would if you were writing a single routine to link into a big
C program.  But I'm talking about someone writing the entire program in
assembly.

Clearly a small program is easier to manage and understand, regardless of
the language.  But the performance comparison also favors HLLs most on
small programs: in the CRC example above, the ASM program was only twice
as fast as the C version.  My experience on the 8086 is that the larger
the program, the larger the performance difference.  This gap is shrinking
on newer processors as companies like Motorola and Intel optimize their
instruction sets for the 30% of the instructions that ***C and Pascal***
compilers actually use.  It's still true in general, though, that as the
program gets larger, so does the speed difference.

So far, my experience with well-written C programs and well-written ASM
programs is that they're both easy for me to understand, regardless of
size.  The only problem is that very few people write well-written C or
ASM programs.  Don't try to tell me it's easier to write well-written C
programs.  My experience shows that either (1) it is not, or (2) C
programmers aren't as good as ASM programmers, because even though it's
supposedly easier, they don't write better C programs.  Most C programs
are a mess.  I don't want flames from people saying "*I* write easy to
read C programs."  Anyone can post a single example.  In general, few
people (including myself) write easy to read and understand C programs
all the time.

I can't comment on which is worse, a poorly written C program or a
poorly written assembly language program.  They're both really bad.  

>> I recommend C... for the majority of projects...

Surprise!  So do I.  Commercial software products are the one area where
I draw the line, though.  Most projects are not commercial (by my
definition, mass-market consumer) products.  Commercial software is a
situation where performance is more important than portability and,
arguably, more important than maintainability.

Actually, I should revise the above statement: I don't recommend C for
the majority of projects; rather, I recommend an appropriate HLL for the
job, and this is not always C.  Most projects I recommend languages for
are not in the class of commercial products.
*** Randy Hyde

gwyn@smoke.brl.mil (Doug Gwyn) (04/24/91)

In article <13769@ucrmath.ucr.edu> rhyde@gibson.ucr.edu (randy hyde) writes:
>You keep claiming it isn't fair to compare an assembly language code routine
>to a C routine when the assembly language code doesn't support C linkages.
>Why would someone writing their code in assembly language use such linkages?
>Of course if you were writing a single routine to link into a big C program.
>But I'm talking about someone writing the entire program in C.

The reason is that you need to impose identical constraints ("requirements")
on the task if you want to have a fair comparison.  The C-generated machine
code that I posted shows that at least some C compilers generate fairly
decent instruction sequences for the algorithm itself.  The timing for the
CRC-16 example was dominated by "overhead" time, not by the CRC-16
computation.  For a fair comparison of different implementations of the
SAME task, the functional interface should be the same.  In fact, were I to
have to use the example CRC-16 C code in an actual application, it wouldn't
take me long to identify the bottleneck caused by making a function call
for every two bytes of checksummed data, and change the code to eliminate
that overhead; in effect, when writing the assembly code you had already
performed a similar linkage optimization by passing data in registers.
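The linkage optimization Gwyn describes, passing a whole buffer so the call overhead is amortized over the data, might look like the following C sketch. This is not code from the thread: the function name and the 0xFFFF preset are my assumptions for illustration; the 0x1021 polynomial is the CCITT generator discussed at the start of this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-16/CCITT over an entire buffer.  One call per
 * buffer instead of one call per byte or two, so function-call
 * overhead no longer dominates the timing.  The name and the 0xFFFF
 * preset are illustrative assumptions; 0x1021 is the CCITT
 * generator polynomial. */
uint16_t crc16_ccitt(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0xFFFF;

    while (len--) {
        crc ^= (uint16_t)(*buf++) << 8;     /* fold next byte into the top */
        for (int i = 0; i < 8; i++)         /* one shift/XOR step per bit  */
            crc = (crc & 0x8000)
                ? (uint16_t)((crc << 1) ^ 0x1021)
                : (uint16_t)(crc << 1);
    }
    return crc;
}
```

With the 0xFFFF preset this yields 0x29B1 for the nine ASCII bytes "123456789", the customary check value for this CRC variant.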

Almost all the assembly code I can lay my hands on around here is expected
to be used from a higher-level language, because we don't agree with the
notion that to gain any benefit from use of assembler it is necessary to
do EVERYthing in assembler.  For large projects, that would be sheer folly.
Consequently, it is reasonable to expect that much code in assembler will
have to meet the constraints imposed by HLL interprocedural linkage and
other run-time conventions.  If you were to tackle an entire project in
assembler, then you avoid any specific HLL's run-time conventions, but you
have to do similar bookkeeping somehow.  I can attest from experience that
it is harder to get those details right when using assembly language; more
of the responsibility for bookkeeping must be assumed by the programmer
(speaking generally) when using assembler -- which is one of the big
reasons for using a higher-level language.

>... as the program gets larger, so does the speed difference.

This directly contradicts the observations of professional software
engineers dating as far back as the 1960s.  Indeed, it is taken as a
virtual axiom these days that most of the execution time is spent in
a minority of the code.  Because most applications have one or more
such identifiable bottlenecks in a small portion of the code (often
in just a few lines of the HLL source code), even an infinite speedup
in the majority of the code would have little effect on the overall
application speed.  Indeed, use of assembly language to speed up
genuine bottlenecks is one of the few uses of assembly language that
I would consider legitimate -- but only after the bottleneck has been
clearly identified and only if other solutions are insufficient.  (Jon
Bentley has written some nice books on the subject of improving code
efficiency.)
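The concentration of execution time that Gwyn describes can be made visible even without a profiler, simply by counting how often each region of code executes. The sketch below is entirely invented for illustration (the names, constants, and program shape are not from the thread); in practice one would reach for a real profiler such as prof or gprof.

```c
#include <stdint.h>

/* A toy "application" with the shape the 90-10 rule describes: a
 * little per-record bookkeeping wrapped around a tight inner loop.
 * Counting region executions (a poor man's profiler) shows the inner
 * loop runs 1000 times as often as the setup code, so that is where
 * any assembly-language speedup effort belongs.  Everything here is
 * an invented illustration, not code from the thread. */
static unsigned long setup_count, inner_count;

uint32_t run_app(uint32_t seed, int records)
{
    uint32_t h = seed;

    for (int r = 0; r < records; r++) {
        setup_count++;                    /* cold region: once per record  */
        for (int i = 0; i < 1000; i++) {
            inner_count++;                /* hot region: 1000x per record  */
            h = h * 2654435761u + (uint32_t)i;
        }
    }
    return h;
}
```

The counters make the 90-10 distribution concrete and checkable: after any run, the hot region has executed 1000 times as often as the cold one.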

>I can't comment on which is worse, a poorly written C program or a
>poorly written assembly language program.  They're both really bad.  

That's true, but even well-written assembly code is unlikely to be of
much value for porting purposes, while some truly horrible C code has
been ported all around the place.

Your point about coding quality is well taken -- no matter what
programming language is used, it is possible to produce a neat,
maintainable job or an ugly unmaintainable mess.

rhyde@dimaggio.ucr.edu (randy hyde) (04/27/91)

>>> Most of the execution time is spent in a small percentage of the code.

Ah, the good old 90-10 rule.
I subscribe to this rule in totality.  The only problem is finding that 10%.
Most people assume the 10% (responsible for 90% of the execution time) is all
together in one big clump.  Extract that and rewrite it in assembly and your
program magically runs almost as fast as a pure assembly language program.
I've fallen into this trap before; several times in fact.  That's why I'm so
strong on assembly today.

I truly believe that 10% of the code is responsible for 90% of the
execution time.
It's just that that 10% consists of 0.5% here, 0.2% there, 1% in that corner,
0.74% somewhere else, and so on.  Often the slow code is intermixed with
the fast code in such a way that you wind up rewriting a large part of
the program in assembly anyway.  I know, I've done this so many times in
the past on applications that were written in C or Pascal and needed a
considerable speed-up.
Two projects, in particular, come to mind.  One was an editor, the other
was a runoff type program.  On the runoff program I spent as much time
recoding portions of the program as I did writing it in Pascal in the
first place.
It ran only twice as fast when I was done, which wasn't fast enough.  I
re-wrote it in assembly and it ran about ten times faster than the
hybrid version.  
I had similar results with the editor.  BTW, rewriting it in assembly
took less time than hacking up the Pascal code with assembly.

Now I'm talking about programs on the order of 10,000 to 50,000 lines. 
I wouldn't make these same claims for a 1,000,000 line program.  OTOH, I
wouldn't work on a 1,000,000 line program, either.


BTW, I have Bentley's "Writing Efficient Programs" and I assign it in a couple
of my classes.  Definitely a good book regardless of your choice of language.

psonnek@pro-mansion.cts.com (Patrick Sonnek) (04/27/91)

In-Reply-To: message from gwyn@smoke.brl.mil

>That's true, but even well-written assembly code is unlikely to be of
>much value for porting purposes, while some truly horrible C code has
>been ported all around the place.

This is a good thing?!?  As has been stated before by someone else (Forgive
me, I've forgotten who.)  I don't care if it's portable, I want the
application to run on MY computer FAST!

----
ProLine:  psonnek@pro-mansion    Sysop Pro-mansion: 507/726-6181
Internet: psonnek@pro-mansion.cts.com  MCImail:     psonnek
UUCP:     crash!pro-mansion!psonnek  ARPA: crash!pro-mansion!psonnek@nosc.mil
BITNET:   psonnek%pro-mansion.cts.com@nosc.mil
               <<Real programmers don't program in HLL's.>>
               <<HLL's are for wimpy application coders!!>>

stephens@latcs2.lat.oz.au (Philip J Stephens) (04/28/91)

Randy Hyde writes:
>Two projects, in particular, come to mind.  One was an editor, the other
>was a runoff type program.  On the runoff program I spent as much time
>recoding portions of the program as I did writing it in Pascal in the
>first place.
>It ran only twice as fast when I was done, which wasn't fast enough.  I
>re-wrote it in assembly and it ran about ten times faster than the
>hybrid version.  
>I had similar results with the editor.  BTW, rewriting it in assembly
>took less time than hacking up the Pascal code with assembly.

  I would agree with Randy here.  People have been saying that the
time to write and debug a program in assembly is prohibitive compared
to writing and debugging the same program in a HLL.  I disagree
with this point; if you're a bad programmer, then you'll do just as
bad a job in a HLL as you would in assembly.  But if you're a good
programmer, you'll be able to write in assembly just as fast and as
well as in a HLL.
  Of course, I do agree with the statement that you shouldn't
choose assembly over an HLL unless it is obvious that you're going to
run into speed problems.  There's a real art to debugging an assembly
language program, one that requires years of experience to develop.
Not that it's all that different from debugging a HLL program; just a
little trickier.

<\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/><\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\>
<  Philip J. Stephens                ><   "Many views yield the truth."       >
<  Hons. student, Computer Science   ><   "Therefore, be not alone."          >
<  La Trobe University, Melbourne    ><   - Prime Song of the viggies, from   >
<  AUSTRALIA                         ><   THE ENIGMA SCORE by Sheri S Tepper  >
</\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\></\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/>

gwyn@smoke.brl.mil (Doug Gwyn) (04/30/91)

In article <8866@crash.cts.com> psonnek@pro-mansion.cts.com (Patrick Sonnek) writes:
-In-Reply-To: message from gwyn@smoke.brl.mil
->That's true, but even well-written assembly code is unlikely to be of
->much value for porting purposes, while some truly horrible C code has
->been ported all around the place.
-This is a good thing?!?  As has been stated before by someone else (Forgive
-me, I've forgotten who.)  I don't care if it's portable, I want the
-application to run on MY computer FAST!

If it's not portable, odds are it won't run on your computer at all!

If an application runs well enough, you should be happy, rather than
pointlessly wish that it ran faster.

bazyar@ernie (Jawaid Bazyar) (05/01/91)

In article <16004@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>If it's not portable, odds are it won't run on your computer at all!
>
>If an application runs well enough, you should be happy, rather than
>pointlessly wish that it ran faster.

  Of course, the IBM solution would be to throw more hardware at the problem.
In this case, I'd simply recommend one of Zip's fantastic and cheap
accelerators.

--
Jawaid Bazyar               |  "Twenty seven faces- with their eyes turned to
Senior/Computer Engineering |    the sky. I have got a camera, and an airtight
bazyar@cs.uiuc.edu          |     alibi.."
   Apple II Forever!        |  I need a job... Be privileged to pay me! :-)

psonnek@pro-mansion.cts.com (Patrick Sonnek) (05/03/91)

In-Reply-To: message from gwyn@smoke.brl.mil

>If it's not portable, odds are it won't run on your computer at all!

If it was written in assembly for my computer specifically, it will!


>If an application runs well enough, you should be happy, rather than
>pointlessly wish that it ran faster.

I don't consider speed to be pointless.  My time is valuable. I don't want to
have to wait around for inefficient code.  I think you will find my opinions
about speed to be consistent with the users (as opposed to programmers) out
there in the world.  It has been shown, on mainframes anyway, that users will
find a response time of 3 seconds or less acceptable, with subsecond being the
optimum.  Anything over 3 seconds is intolerable to the average user.  So
speed is very important.  


----
ProLine:  psonnek@pro-mansion    Sysop Pro-mansion: 507/726-6181
Internet: psonnek@pro-mansion.cts.com  MCImail:     psonnek
UUCP:     crash!pro-mansion!psonnek  ARPA: crash!pro-mansion!psonnek@nosc.mil
BITNET:   psonnek%pro-mansion.cts.com@nosc.mil
               <<Real programmers don't program in HLL's.>>
               <<HLL's are for wimpy application coders!!>>

jerry@polygen.uucp (Jerry Shekhel) (05/03/91)

psonnek@pro-mansion.cts.com (Patrick Sonnek) writes:
>
>This is a good thing?!?  As has been stated before by someone else (Forgive
>me, I've forgotten who.)  I don't care if it's portable, I want the
>application to run on MY computer FAST!
>

What a joke.  If you want the application to run on your computer AT ALL,
you'd better care about portability.
--
+-------------------+----------------------+---------------------------------+
| JERRY J. SHEKHEL  | POLYGEN CORPORATION  | When I was young, I had to walk |
| Drummers do it... | Waltham, MA USA      | to school and back every day -- |
|    ... In rhythm! | (617) 890-2175       | 20 miles, uphill both ways.     |
+-------------------+----------------------+---------------------------------+
|           ...! [ princeton mit-eddie bu sunne ] !polygen!jerry             |
|                            jerry@polygen.com                               |
+----------------------------------------------------------------------------+

philip@utstat.uucp (Philip McDunnough) (05/04/91)

In article <8992@crash.cts.com> psonnek@pro-mansion.cts.com (Patrick Sonnek) writes:

[quotes ongoing discussion of HLL's vs assembler and an opinion on the
 need for an application to run faster if it runs well]

>I don't consider speed to be pointless.  My time is valuable. I don't want to
>have to wait around for inefficient code.  I think you will find my opinions
>about speed to be consistent with the users (as opposed to programmers) out
>there in the world.

I consider myself a user, and I suppose my time is valuable. This endless
search for more speed certainly doesn't make me want to jump up and down.
I'd rather see elegance, beauty, thought, etc. put into problems than
lines of programmers/users churning out endless code and spreadsheets.

I'm not saying that there isn't a time and place for fast computers. I am
saying it's relatively unimportant in the larger scheme of things. There
are more important issues involved with computing, from a user's point of
view at least, than just getting anything to run faster.

Philip McDunnough
University of Toronto
[my opinions,etc...]

rhyde@ucrmath.ucr.edu (randy hyde) (05/06/91)

>>>>
What a joke.  If you want the application to run on your computer AT ALL,
you'd better care about portability.
<<<<

I write lots of device drivers for PCs.  I can promise you they are not
portable.  They work fine on my computer.  I don't care if they're
portable to a Mac or an Apple II.

There is a big problem with portable "Applications."  In general, they have
to be written to the lowest common denominator.  I'm sorry, but people have
been arguing this one on microcomputers since the TRS-80 vs. Apple II days.
The non-portable products always won out in the marketplace.  Even Word Perfect
(who has ported their application to more diverse systems than anyone else
I know of) doesn't write portable software.  They've written specific
versions for each platform.

Most people won't spend a lot of money on applications which barely run on
a machine (i.e., AT ALL).  They want those applications to take advantage of
the features of the machine.  Look at how many people around here are
complaining because software vendors only support generic Apple II computers
and don't support the gee-whiz features of the GS.  Hey, it's portable if it
runs on the II.  That doesn't make GS owners very happy.  They'll buy such
products grudgingly, if at all.
*** Randy Hyde

dat33228@uxa.cso.uiuc.edu (Derek A. Taubert) (05/07/91)

In article <1074@stewart.UUCP> jerry@stewart.UUCP (Jerry Shekhel) writes:
>psonnek@pro-mansion.cts.com (Patrick Sonnek) writes:
>>
>>This is a good thing?!?  As has been stated before by someone else (Forgive
>>me, I've forgotten who.)  I don't care if it's portable, I want the
>>application to run on MY computer FAST!
>>
>
>What a joke.  If you want the application to run on your computer AT ALL,
>you'd better care about portability.

Excuse me for returning the same arrogance that you just so wonderfully
displayed, but do you even know what portability means?

--
+ Derek Taubert --> derek@mrcnext.cso.uiuc.edu + Author of : GScii+	      +
+		    dat33228@uxa.cso.uiuc.edu  + and the world's most useless +
++++++++++++++++++++++++++++++++++++++++++++++++ desk accessory -> Amaze me   +
+ There are MOUSE technotes?    +  *******8-)  ++++++++++++++++++++++++++++++++
+ Ask me about my GS load meter +  ^^^^^^^^^^ Marge Simpson                   +