brinkema@fjc.GOV (John R. Brinkema) (06/19/91)
I have developed a program to do a 'quick' copy of a large database (~1Gb) from one machine to another. The idea is to update an existing old copy of the database on the second machine by copying only the changes made on the first and applying them to the second. These changes are detected by comparing before-and-after CRC-32 (CCITT) checksums on (currently) 4Kbyte chunks of the databases. If the CRCs differ, I copy and update; otherwise I assume that the chunks are the same.

In the environment where the program runs, this is a reasonable idea: the first database file is the master; the second is a slave of the first, and we can tolerate some (small, I believe) probability that the second database is different (heck, let's use the C word: corrupted). In practice no problems have been encountered, but we do a full copy of the master to the slave database periodically. This copy has saved us a tremendous amount of time and our operations staff loves it. On the other hand, we *do* rely on the database, and I am sure that if I messed it up I would have some very unhappy users (including my boss's boss).

In a fit of optimism that I now regret, I recently suggested that *backups* on the first machine could be sped up by doing incremental backups using the same algorithm. That is, bet it all on this algorithm identifying changed blocks with a probability of failure less than or equal to the probability that media or operations or god (who was recently given a pair of dice by a downstairs neighbor) does something horrible and unrecoverable anyway.

The database changes are random ASCII and integer fields in a packed record placed randomly within the chunk. I have no specific information on the pattern of bit changes between CRC'ed chunks; I assume they are essentially random rather than bursty or periodic or something nicely mathematical. I am willing to tolerate false differences (other than all chunks -- that's not fair).

So, finally, the questions: how bad is this algorithm? Is there some other checksum that will give better results -- that is, a better probability that different chunks checksum differently? What *is* the probability that different chunks checksum the same (as a function of something I can control, such as the size of the chunk)? Or, in other words, what is the probability that I will be fired?

jb.
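[A minimal sketch of the chunk-compare copy described above -- not the poster's actual program. It assumes 4Kbyte chunks, a saved table of per-chunk checksums from the previous run, and made-up file names; all three files are assumed to exist already.]

    /*
     * Sketch of the chunk-compare copy: recompute each chunk's CRC-32,
     * compare it with the checksum saved on the previous run, and copy
     * only the chunks whose checksum changed.  File names and the layout
     * of the checksum table are assumptions for illustration.
     */
    #include <stdio.h>

    #define CHUNK 4096

    /* Bitwise CRC-32 (reflected polynomial 0xEDB88320), one pass over buf. */
    static unsigned long crc32_buf(const unsigned char *buf, size_t len)
    {
        unsigned long crc = 0xFFFFFFFFUL;
        size_t i;
        int j;

        for (i = 0; i < len; i++) {
            crc ^= buf[i];
            for (j = 0; j < 8; j++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320UL : crc >> 1;
        }
        return crc ^ 0xFFFFFFFFUL;
    }

    int main(void)
    {
        FILE *master = fopen("master.db", "rb");    /* hypothetical names */
        FILE *slave  = fopen("slave.db", "r+b");
        FILE *sums   = fopen("chunks.crc", "r+b");  /* CRCs from last run */
        unsigned char buf[CHUNK];
        unsigned long oldcrc, newcrc;
        size_t n, chunkno = 0, copied = 0;

        if (master == NULL || slave == NULL || sums == NULL) {
            perror("open");
            return 1;
        }
        while ((n = fread(buf, 1, CHUNK, master)) > 0) {
            newcrc = crc32_buf(buf, n);
            if (fread(&oldcrc, sizeof oldcrc, 1, sums) != 1)
                oldcrc = ~newcrc;              /* no saved sum: force a copy */
            if (newcrc != oldcrc) {
                /* chunk looks changed: overwrite it on the slave copy and
                   record the new checksum in the same slot */
                fseek(slave, (long)(chunkno * CHUNK), SEEK_SET);
                fwrite(buf, 1, n, slave);
                fseek(sums, (long)(chunkno * sizeof newcrc), SEEK_SET);
                fwrite(&newcrc, sizeof newcrc, 1, sums);
                fseek(sums, 0L, SEEK_CUR);     /* switch back to reading */
                copied++;
            }
            chunkno++;
        }
        fprintf(stderr, "%lu of %lu chunks copied\n",
                (unsigned long)copied, (unsigned long)chunkno);
        fclose(master); fclose(slave); fclose(sums);
        return 0;
    }

[On the probability question, under the idealized assumption that a change to a chunk produces an effectively random new CRC value: a single changed chunk slips through with probability about 2^-32, and a run that changes k chunks misses at least one with probability roughly k * 2^-32. A 1Gb database is only 2^18 4Kbyte chunks, so each run's exposure is tiny, but it does accumulate over many runs -- and the assumption of random-looking changes is exactly what the reply below questions.]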
jgk@osc.COM (Joe Keane) (06/22/91)
My rule of thumb is that you can run out of 32 bits, but you won't run out of 64 bits. In other words, many situations can have 2^32 things, but very few have 2^64 things. For example, a fairly large disk contains 2^32 bits, and you can execute 2^32 instructions in a few minutes. So if you have an event which occurs with probability 2^-32, it will probably happen eventually, while one with probability 2^-64 is likely never to happen.

Note that you have to double the number of bits for `birthday' problems, where any matched pair is the relevant event. For example, suppose you're making Ethernet addresses at random, and you don't want any two to be the same. Using 64 bits is risky, because when you have a billion machines, there's a good chance that two will have the same address. On the other hand, with 128 bits I'd be confident that this will never happen. Of course, this assumes that the addresses are really random, and there's no hidden pattern.

For the application described, I think 32 bits is not enough. It's likely that eventually some block will mistakenly not be backed up. I'd say 64 bits is pretty safe. However, since the application does a lot of disk work anyway, the hashing probably isn't a critical part, so I'd just use 128 bits.

Also, I wouldn't particularly trust CRC for this application. There are some patterns of changes that `fall through'. CRC is great for dealing with random bit errors, since it's extremely unlikely for them to hit such a pattern. But I wouldn't use it for comparing structured data, since some of that data may happen to hit the pattern, and then the probability of making a mistake becomes much higher.

Making good hashing functions is something you could write a book about; it's closely related to making good encryption functions. I'll give a quick summary. First, don't make it affine, like CRC: the bits should get mixed up well. Second, make sure you don't lose information; in other words, your state transition function should be bijective as a function of the old state. Given those two, I think you can do what you want and have it come out OK.

--
Joe Keane, amateur mathematician
jgk@osc.com (...!uunet!stratus!osc!jgk)
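[To make those last two properties concrete, here is an illustrative 64-bit hash in C whose per-byte state transition is a bijection and is not affine over GF(2). The constants are deliberately modern choices -- the well-known FNV-1a 64-bit basis/prime and the splitmix64 finalizer -- and are not from either post; this is a sketch of the idea, not a recommendation of a specific function.]

    /*
     * Illustration of "mix the bits well" and "don't lose information":
     * every step below is invertible mod 2^64, and the integer multiplies
     * (with their carries) keep the whole map non-affine over GF(2),
     * unlike a CRC.
     */
    #include <stdint.h>
    #include <stddef.h>

    uint64_t hash64(const unsigned char *buf, size_t len)
    {
        uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a 64-bit offset basis */
        size_t i;

        for (i = 0; i < len; i++) {
            /* XOR with a constant and multiplication by an odd constant
               are both bijections on 64-bit words, so no state is lost. */
            h ^= buf[i];
            h *= 0x100000001b3ULL;              /* FNV-1a 64-bit prime */
        }

        /* Final mixing (splitmix64 finalizer): each xor-shift and each odd
           multiply is itself a bijection, and together they spread every
           input bit across the whole word. */
        h ^= h >> 30;  h *= 0xbf58476d1ce4e5b9ULL;
        h ^= h >> 27;  h *= 0x94d049bb133111ebULL;
        h ^= h >> 31;
        return h;
    }

[For the 128 bits suggested above, one ready-made option of the period would be a 128-bit message digest such as Rivest's MD4; concatenating two such 64-bit hashes with different constants is a cruder alternative.]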