friedl@vsi.UUCP (Stephen J. Friedl) (06/27/88)
Hi, I am writing some sort programs on two different machines and really don't want to move megabyte files around to see if the output from identically-run programs is the same. I figure that I can be moderately safe by using a 32-bit checksum, but the UNIX sum(1) is inadequate because it appears to be a simple additive checksum; it won't detect bytes moving around in a file ("abc" and "cba" both have the same checksum). I have a naive algorithm of multiplying the byte just read with the byte number: while (c = getchar(), c != EOF) sum += (c * ++count); It strikes me that this is an area where a lack of rigor could be troublesome: Any of you ECC folks got anything better? I don't mind doing a little bit more work to make it sensitive to "common" things that can go wrong with a file. Would a CRC be overkill? Any suggestions? Please email, I'll summarize, etc. Steve -- Steve Friedl V-Systems, Inc. (714) 545-6442 3B2-kind-of-guy friedl@vsi.com {backbones}!vsi.com!friedl attmail!vsi!friedl Nancy Reagan on the Free Software Foundation : "Just say GNU"
jgk@speech2.cs.cmu.edu (Joe Keane) (06/28/88)
I think you'll be fine with a double checksum: while ((c = getchar ()) != EOF) sum2 += sum += c; This is very fast, you can add more sums if you want. If you're paranoid you can use a CRC code; there are probably many PD implementations. --Joe
jgk@speech2.cs.cmu.edu (Joe Keane) (06/28/88)
Sorry guys, i meant to mail that. --Joe
oster@dewey.soe.berkeley.edu (David Phillip Oster) (06/28/88)
Just rotate the sum one bit left after each add. while(count--){ sum += *src++; highbit = sum & 0x8000; /* for a 16 bit sum */ sum <<= 1; if(highbit){ /* move the high bit around to the low */ sum |= 1; } } example: 1, 0 gives sum=2. 0, 1 gives sum=1. --- David Phillip Oster --When you asked me to live in sin with you Arpa: oster@dewey.soe.berkeley.edu --I didn't know you meant sloth. Uucp: {uwvax,decvax,ihnp4}!ucbvax!oster%dewey.soe.berkeley.edu
jfh@rpp386.UUCP (John F. Haugh II) (06/29/88)
In article <735@vsi.UUCP> friedl@vsi.UUCP (Stephen J. Friedl) writes: > I am writing some sort programs on two different machines >and really don't want to move megabyte files around to see if the >output from identically-run programs is the same. > I have a naive algorithm of multiplying the byte just read >with the byte number: > > while (c = getchar(), c != EOF) > sum += (c * ++count); naive, i'll say ;-) i doubt you'll ever overflow a 32bitter, but a 16 bit machine will overflow after (possibly) 256 characters, assuming a 16 bit sum. [ unless you go checking 24MB files ;-) ] i suggest trying something more random - long sum; while (c = getchar (), c != EOF) sum = (((sum << 1) & 0xfffffffe) | ((sum >> 31) & 0x00000001)) ^ c; (in other words, a rotate left one followed by xor-ing in the character). this should be as fast or faster than yours (no multiply), and it shouldn't ever overflow. it's also not as complex as a full blown CRC16. i've used similiar code for hashing functions with nice results. - john. -- John F. Haugh II +--------- Cute Chocolate Quote --------- HASA, "S" Division | "USENET should not be confused with UUCP: killer!rpp386!jfh | something that matters, like CHOCOLATE" DOMAIN: jfh@rpp386.uucp | -- with my apologizes
rhorn@infinet.UUCP (Rob Horn) (06/29/88)
In article <735@vsi.UUCP> friedl@vsi.UUCP (Stephen J. Friedl) writes: > [wants a checksum-like algorithm that detects transpositions] > > I have a naive algorithm of multiplying the byte just read >with the byte number: > while (c = getchar(), c != EOF) > sum += (c * ++count); Use CRC's. They are faster. A table-driven CRC uses only seven simple instructions per byte, versus three for the basic checksum. For a few more instructions you can avoid the table. The core loop looks like while (...) { crcptr = CRC_TABLE( c ^ crc[low_byte]); crc[low_byte] = crcptr[low_byte] ^ crc[high_byte] crc[high] = crcptr[high_byte] } which can of course be optimized further depending on how the rest of the program is structured. Building CRC_TABLE is more work. For a table free but slightly slower version, see Frank DaCruz's book on Kermit. It will compute CRC's about as fast as the above multiply code. -- Rob Horn UUCP: ...harvard!adelie!infinet!rhorn ...ulowell!infinet!rhorn, ..decvax!infinet!rhorn Snail: Infinet, 40 High St., North Andover, MA