[comp.theory] hashing function for strings

xanthian@saturn.ADS.COM (Metafont Consultant Account) (02/18/90)

In article <7416@ogicse.ogc.edu> daniels@ogicse.ogc.edu (Scott David Daniels) writes:

>the hardware word size was thirty-two bits, and (since % produced inconsistent
>results on our two target machines for full-sized words),

Yep.  It is utterly amazing how many modulus inplementations exist
where (-1) mod m isn't m-1.  This makes cellular automaton programs
add half a dozen extra instructions per loop to treat a rectanglar
array as a toroid.  I don't know who originated this bogosity, but it
would be nice if hardware and compiler folks would extirpate it once
and for all from all machines and all languages, so programs using mod
could port without stumbling over this implementation error.
--
xanthian@ads.com xanthian@well.sf.ca.us (Kent Paul Dolan)
Again, my opinions, not the account furnishers'.

dietz@cs.rochester.edu (Paul Dietz) (02/21/90)

In article <1990Feb15.211210.22950@max.sunysb.edu> rosalia@max.sunysb.edu (Mark Galassi) writes:

>I would like to know which is a good hash function for strings.
>I was going to compute the index into the hash table with
>	index = hashval(s) % HASH_LEN;
>where s is a string, and HASH_LEN is a prime number and is also
>the size of the hash table.
>
>I have seen two schemes:
>
>1. hashval(s) is the sum of the ascii values of each char in the string.
>
>2. hashval(s) is the sum of the ascii values*256^i where i is the
>   position of the char in the string.
>
>Which of these is better, or is there any other?  Knuth seems to not
>go into detail on hashval() for strings.

If you want to be rigorous, you should talk about good *classes*
of hash functions, since, for any particular hash function, there
are inputs on which is behaves terribly.

What you might want are universal classes of hash functions (strictly
speaking, universal-2 classes).  A class of hash functions U is
universal if, for any two distinct keys x and y, |U|/HASH_LEN
functions in U hash x and y to the same value.  So, if you pick a
random element of U, the chance x and y will collide is 1/HASH_LEN.
See, for example, Cormen, Leiserson and Rivest (Introduction to
Algorithms, MIT Press, Spring 1990) or Brassard and Bratley
(Algorithmic: Theory and Practice, Prentice Hall, 1988).

Here's a universal class of hash functions for strings, if HASH_LEN is
a prime number.  Let a[0], a[1], ... be random numbers from the set
{0,...,HASH_LEN-1}.  Then, the hash function is

	h(s) = (a[0]*s[0] + a[1]*s[1] + ... ) mod HASH_LEN

This does one multiplication per character, but you could save on
that by packing characters into longer words (but the resulting
numbers must still be < HASH_LEN) using these words instead of
the s[i] in the definition.

Here's another possibility, if HASH_LEN is a power of two.  For each
ascii character c and each position in the string i, pick a random
number a[i][c] from {0,...,HASH_LEN-1}.  Then, a universal class
of hash functions is defined by

	h(s) = a[0][s[0]] XOR a[1][s[1]] XOR ...

This is particularly easy to compute, involving no multiplication
(save any that might be required for the array accesses) or modulus
operations, but it does require one additional memory reference per
character.

	Paul F. Dietz
	dietz@cs.rochester.edu