brinkema@fjc.GOV (John R. Brinkema) (06/19/91)
I have developed a program to do a 'quick' copy of a large database (~1Gb) from one machine to another. The idea is to update an existing old copy of the database on the second machine by copying only the changes made on the first and applying them to the second. These changes are detected by comparing before-and-after CRC-32 (CCITT) checksums on (currently) 4Kbyte chunks of the databases. If the CRCs differ, I copy and update; otherwise I assume that the chunks are the same.

In the environment where the program runs, this is a reasonable idea: the first database file is the master; the second is a slave of the first, and we can tolerate some (small, I believe) probability that the second database is different (heck, let's use the C word: corrupted). In practice no problems have been encountered, but we do a full copy of the master to the slave database periodically. This copy has saved us a tremendous amount of time and our operations staff loves it. On the other hand, we *do* rely on the database, and I am sure that if I messed it up I would have some very unhappy users (including my boss's boss).

In a fit of optimism that I now regret, I recently suggested that *backups* on the first machine could be sped up by doing incremental backups using the same algorithm. That is, bet it all on this algorithm identifying changed blocks with a probability of failure less than or equal to the probability that media or operations or god (who was recently given a pair of dice by a downstairs neighbor) does something horrible and unrecoverable anyway.

The database changes are random ASCII and integer fields in a packed record placed randomly within the chunk. I have no specific information on the pattern of bit changes between CRC'ed chunks; I assume they are essentially random rather than bursty or periodic or something nicely mathematical. I am willing to tolerate false differences (other than all chunks -- that's not fair).

So, finally, the questions: how bad is this algorithm? Is there some other checksum that will give better results -- that is, a better probability that different chunks checksum differently? What *is* the probability that different chunks checksum the same (as a function of something I can control, such as the size of the chunk)? Or, in other words, what is the probability that I will be fired?

jb.
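[A minimal sketch of the chunk-compare copy described above -- not the poster's actual program. It assumes 4Kbyte chunks, a saved table of per-chunk checksums from the previous run, and made-up file names; all three files are assumed to exist already.]

    /*
     * Sketch of the chunk-compare copy: recompute each chunk's CRC-32,
     * compare it with the checksum saved on the previous run, and copy
     * only the chunks whose checksum changed.  File names and the layout
     * of the checksum table are assumptions for illustration.
     */
    #include <stdio.h>

    #define CHUNK 4096

    /* Bitwise CRC-32 (reflected polynomial 0xEDB88320), one pass over buf. */
    static unsigned long crc32_buf(const unsigned char *buf, size_t len)
    {
        unsigned long crc = 0xFFFFFFFFUL;
        size_t i;
        int j;

        for (i = 0; i < len; i++) {
            crc ^= buf[i];
            for (j = 0; j < 8; j++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320UL : crc >> 1;
        }
        return crc ^ 0xFFFFFFFFUL;
    }

    int main(void)
    {
        FILE *master = fopen("master.db", "rb");    /* hypothetical names */
        FILE *slave  = fopen("slave.db", "r+b");
        FILE *sums   = fopen("chunks.crc", "r+b");  /* CRCs from last run */
        unsigned char buf[CHUNK];
        unsigned long oldcrc, newcrc;
        size_t n, chunkno = 0, copied = 0;

        if (master == NULL || slave == NULL || sums == NULL) {
            perror("open");
            return 1;
        }
        while ((n = fread(buf, 1, CHUNK, master)) > 0) {
            newcrc = crc32_buf(buf, n);
            if (fread(&oldcrc, sizeof oldcrc, 1, sums) != 1)
                oldcrc = ~newcrc;              /* no saved sum: force a copy */
            if (newcrc != oldcrc) {
                /* chunk looks changed: overwrite it on the slave copy and
                   record the new checksum in the same slot */
                fseek(slave, (long)(chunkno * CHUNK), SEEK_SET);
                fwrite(buf, 1, n, slave);
                fseek(sums, (long)(chunkno * sizeof newcrc), SEEK_SET);
                fwrite(&newcrc, sizeof newcrc, 1, sums);
                fseek(sums, 0L, SEEK_CUR);     /* switch back to reading */
                copied++;
            }
            chunkno++;
        }
        fprintf(stderr, "%lu of %lu chunks copied\n",
                (unsigned long)copied, (unsigned long)chunkno);
        fclose(master); fclose(slave); fclose(sums);
        return 0;
    }

[On the probability question, under the idealized assumption that a change to a chunk produces an effectively random new CRC value: a single changed chunk slips through with probability about 2^-32, and a run that changes k chunks misses at least one with probability roughly k * 2^-32. A 1Gb database is only 2^18 4Kbyte chunks, so each run's exposure is tiny, but it does accumulate over many runs -- and the assumption of random-looking changes is exactly what the reply below questions.]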
jgk@osc.COM (Joe Keane) (06/22/91)
My rule of thumb is that you can run out of 32 bits, but you won't run out of 64 bits. In other words, many situations can have 2^32 things, but very few have 2^64 things. For example, a fairly large disk contains 2^32 bits, and you can execute 2^32 instructions in a few minutes. So if you have an event which occurs with probability 2^-32, it will probably happen eventually, while one with probability 2^-64 is likely never to happen.

Note that you have to double the number of bits for `birthday' problems, where any matched pair is the relevant event. For example, suppose you're making Ethernet addresses at random, and you don't want any two to be the same. Using 64 bits is risky, because when you have a billion machines, there's a good chance that two will have the same address. On the other hand, with 128 bits I'd be confident that this will never happen. Of course, this assumes that the addresses are really random, and there's no hidden pattern.

For the application described, I think 32 bits is not enough. It's likely that eventually some block will mistakenly not be backed up. I'd say 64 bits is pretty safe. However, since the application does a lot of disk work anyway, the hashing probably isn't a critical part, so I'd just use 128 bits.

Also, I wouldn't particularly trust CRC for this application. There are some patterns of changes that `fall through'. CRC is great for dealing with random bit errors, since it's extremely unlikely for them to hit such a pattern. But I wouldn't use it for comparing structured data, since some of that data may happen to hit the pattern, and then the probability of making a mistake becomes much higher.

Making good hashing functions is something you could write a book about; it's closely related to making good encryption functions. I'll give a quick summary. First, don't make it affine, like CRC: the bits should get mixed up well. Second, make sure you don't lose information; in other words, your state transition function should be bijective as a function of the old state. Given those two, I think you can do what you want and have it come out OK.

--
Joe Keane, amateur mathematician
jgk@osc.com (...!uunet!stratus!osc!jgk)
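[To make those last two properties concrete, here is an illustrative 64-bit hash in C whose per-byte state transition is a bijection and is not affine over GF(2). The constants are deliberately modern choices -- the well-known FNV-1a 64-bit basis/prime and the splitmix64 finalizer -- and are not from either post; this is a sketch of the idea, not a recommendation of a specific function.]

    /*
     * Illustration of "mix the bits well" and "don't lose information":
     * every step below is invertible mod 2^64, and the integer multiplies
     * (with their carries) keep the whole map non-affine over GF(2),
     * unlike a CRC.
     */
    #include <stdint.h>
    #include <stddef.h>

    uint64_t hash64(const unsigned char *buf, size_t len)
    {
        uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a 64-bit offset basis */
        size_t i;

        for (i = 0; i < len; i++) {
            /* XOR with a constant and multiplication by an odd constant
               are both bijections on 64-bit words, so no state is lost. */
            h ^= buf[i];
            h *= 0x100000001b3ULL;              /* FNV-1a 64-bit prime */
        }

        /* Final mixing (splitmix64 finalizer): each xor-shift and each odd
           multiply is itself a bijection, and together they spread every
           input bit across the whole word. */
        h ^= h >> 30;  h *= 0xbf58476d1ce4e5b9ULL;
        h ^= h >> 27;  h *= 0x94d049bb133111ebULL;
        h ^= h >> 31;
        return h;
    }

[For the 128 bits suggested above, one ready-made option of the period would be a 128-bit message digest such as Rivest's MD4; concatenating two such 64-bit hashes with different constants is a cruder alternative.]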