[net.followup] Hash size selection

bamford@randvax.ARPA (Cliff Bamford) (02/12/84)
In response to the person who wanted to know why prime numbers
are good choices for hash table sizes, the following is almost
verbatim from Knuth, ACP vol 3, 2nd printing, pp 508f:

There are three basic hash methods: modulo, multiplicative, and
polynomial. Modulo is the most popular since it can be both fast and easy
to implement. Given a key, K, the hash function h(K) = K mod M insures
that 0 <= h(K) < M, placing a convienient upper bound on h.

Some values of M are much better than others. For example, if M is even,
h(K) will be even when K is, and odd otherwise. This will lead to a
substantial bias (clustering) in many files. It would be even worse to
let M be a power of the radix of the computer, since K mod M would then
be simply the least significant bits of K, which loses the chance to
let the highorder of K participate in the collision-avoidance process.

Similarly, M probably shoudn't be a multiple of 3, since, if the keys
are alphabetic, two keys which differ only by a permutation of letters
would then differ in value by a multiple of 3, since 10**n mod 3 =
4**n mod 3 = 1).

In general, we want to avoid values of M which divide r**k +/- a, where
k and a are small numbers and r is the radix of the alphabetic character
set (typically 64, 128, 256 or 100), since a remainder modulo such a value
of M tends to be largely a simple superposition of the key digits.

BY INDUCTION, THE 4TH PARAGRAPH ABOVE SUGGESTS THAT M BE PRIME, SINCE
CORRESPONDING ARGUMENTS CAN BE CONTRUCTED OTHERWISE. AND THE 5TH PARAGRAPH
COMPELS A CHOICE OF M TO BE SUCH THAT r**k = +/- a mod M FOR SMALL k AND
a [PARAPHRASE BY CLIFF BAMFORD].

Hope this helps. Knuth devotes a whole section (6.4) to hashing, if
you want more details.

Cliff Bamford
Nonresident consultant at Rand,
2902 Pine St, SFO, 94115.