bamford@randvax.ARPA (Cliff Bamford) (02/12/84)
In response to the person who wanted to know why prime numbers are good choices for hash table sizes, the following is almost verbatim from Knuth, ACP vol 3, 2nd printing, pp 508f: There are three basic hash methods: modulo, multiplicative, and polynomial. Modulo is the most popular since it can be both fast and easy to implement. Given a key, K, the hash function h(K) = K mod M insures that 0 <= h(K) < M, placing a convienient upper bound on h. Some values of M are much better than others. For example, if M is even, h(K) will be even when K is, and odd otherwise. This will lead to a substantial bias (clustering) in many files. It would be even worse to let M be a power of the radix of the computer, since K mod M would then be simply the least significant bits of K, which loses the chance to let the highorder of K participate in the collision-avoidance process. Similarly, M probably shoudn't be a multiple of 3, since, if the keys are alphabetic, two keys which differ only by a permutation of letters would then differ in value by a multiple of 3, since 10**n mod 3 = 4**n mod 3 = 1). In general, we want to avoid values of M which divide r**k +/- a, where k and a are small numbers and r is the radix of the alphabetic character set (typically 64, 128, 256 or 100), since a remainder modulo such a value of M tends to be largely a simple superposition of the key digits. BY INDUCTION, THE 4TH PARAGRAPH ABOVE SUGGESTS THAT M BE PRIME, SINCE CORRESPONDING ARGUMENTS CAN BE CONTRUCTED OTHERWISE. AND THE 5TH PARAGRAPH COMPELS A CHOICE OF M TO BE SUCH THAT r**k = +/- a mod M FOR SMALL k AND a [PARAPHRASE BY CLIFF BAMFORD]. Hope this helps. Knuth devotes a whole section (6.4) to hashing, if you want more details. Cliff Bamford Nonresident consultant at Rand, 2902 Pine St, SFO, 94115.