[comp.misc] Anybody have a checksum algorithm that detects byte-swap?

friedl@vsi.UUCP (Stephen J. Friedl) (06/27/88)

Hi,

     I am writing some sort programs on two different machines
and really don't want to move megabyte files around to see if the
output from identically-run programs is the same.  I figure that
I can be moderately safe by using a 32-bit checksum, but the UNIX
sum(1) is inadequate because it appears to be a simple additive
checksum; it won't detect bytes moving around in a file ("abc"
and "cba" both have the same checksum).

     I have a naive algorithm of multiplying the byte just read
with the byte number:

		while (c = getchar(), c != EOF)
			sum += (c * ++count);

     It strikes me that this is an area where a lack of rigor
could be troublesome: Any of you ECC folks got anything better?
I don't mind doing a little bit more work to make it sensitive to
"common" things that can go wrong with a file.  Would a CRC be
overkill?  Any suggestions?

     Please email, I'll summarize, etc.

     Steve

-- 
Steve Friedl     V-Systems, Inc. (714) 545-6442     3B2-kind-of-guy
friedl@vsi.com     {backbones}!vsi.com!friedl    attmail!vsi!friedl

Nancy Reagan on the Free Software Foundation : "Just say GNU"

jgk@speech2.cs.cmu.edu (Joe Keane) (06/28/88)

I think you'll be fine with a double checksum:

	while ((c = getchar ()) != EOF)
		sum2 += sum += c;

This is very fast, you can add more sums if you want.  If you're
paranoid you can use a CRC code; there are probably many PD
implementations.

--Joe

jgk@speech2.cs.cmu.edu (Joe Keane) (06/28/88)

Sorry guys, i meant to mail that.

--Joe

oster@dewey.soe.berkeley.edu (David Phillip Oster) (06/28/88)

Just rotate the sum one bit left after each add.

	while(count--){
		sum += *src++;
		highbit = sum & 0x8000;	/* for a 16 bit sum */
		sum  <<= 1;
		if(highbit){	/* move the high bit around to the low */
			sum |= 1;
		}
	}

example:
  1, 0 gives sum=2.
  0, 1 gives sum=1.


--- David Phillip Oster            --When you asked me to live in sin with you
Arpa: oster@dewey.soe.berkeley.edu --I didn't know you meant sloth.
Uucp: {uwvax,decvax,ihnp4}!ucbvax!oster%dewey.soe.berkeley.edu

jfh@rpp386.UUCP (John F. Haugh II) (06/29/88)

In article <735@vsi.UUCP> friedl@vsi.UUCP (Stephen J. Friedl) writes:
>     I am writing some sort programs on two different machines
>and really don't want to move megabyte files around to see if the
>output from identically-run programs is the same.

>     I have a naive algorithm of multiplying the byte just read
>with the byte number:
>
>		while (c = getchar(), c != EOF)
>			sum += (c * ++count);

naive, i'll say ;-)

i doubt you'll ever overflow a 32bitter, but a 16 bit machine will
overflow after (possibly) 256 characters, assuming a 16 bit sum.
[ unless you go checking 24MB files ;-) ]

i suggest trying something more random -

		long sum;

		while (c = getchar (), c != EOF)
			sum = (((sum <<  1) & 0xfffffffe) |
			       ((sum >> 31) & 0x00000001)) ^ c;

(in other words, a rotate left one followed by xor-ing in the
character).  this should be as fast or faster than yours (no multiply),
and it shouldn't ever overflow.  it's also not as complex as a full
blown CRC16.  i've used similiar code for hashing functions with
nice results.

- john.
-- 
John F. Haugh II                 +--------- Cute Chocolate Quote ---------
HASA, "S" Division               | "USENET should not be confused with
UUCP:   killer!rpp386!jfh        |  something that matters, like CHOCOLATE"
DOMAIN: jfh@rpp386.uucp          |             -- with my apologizes

rhorn@infinet.UUCP (Rob Horn) (06/29/88)

In article <735@vsi.UUCP> friedl@vsi.UUCP (Stephen J. Friedl) writes:
> [wants a checksum-like algorithm that detects transpositions]
>
>     I have a naive algorithm of multiplying the byte just read
>with the byte number:
>		while (c = getchar(), c != EOF)
>			sum += (c * ++count);

Use CRC's.  They are faster.  A table-driven CRC uses only seven simple
instructions per byte, versus three for the basic checksum.  For a few
more instructions you can avoid the table.  The core loop looks like

  while (...) {
    crcptr = CRC_TABLE( c ^ crc[low_byte]);
    crc[low_byte] = crcptr[low_byte] ^ crc[high_byte]
    crc[high] = crcptr[high_byte]
  }

which can of course be optimized further depending on how the rest of
the program is structured.  Building CRC_TABLE is more work.

For a table free but slightly slower version, see Frank DaCruz's book
on Kermit.  It will compute CRC's about as fast as the above multiply
code.
-- 
				Rob  Horn
	UUCP:	...harvard!adelie!infinet!rhorn
		...ulowell!infinet!rhorn, ..decvax!infinet!rhorn
	Snail:	Infinet,  40 High St., North Andover, MA