xanthian@saturn.ADS.COM (Metafont Consultant Account) (02/18/90)
In article <7416@ogicse.ogc.edu> daniels@ogicse.ogc.edu (Scott David Daniels) writes: >the hardware word size was thirty-two bits, and (since % produced inconsistent >results on our two target machines for full-sized words), Yep. It is utterly amazing how many modulus inplementations exist where (-1) mod m isn't m-1. This makes cellular automaton programs add half a dozen extra instructions per loop to treat a rectanglar array as a toroid. I don't know who originated this bogosity, but it would be nice if hardware and compiler folks would extirpate it once and for all from all machines and all languages, so programs using mod could port without stumbling over this implementation error. -- xanthian@ads.com xanthian@well.sf.ca.us (Kent Paul Dolan) Again, my opinions, not the account furnishers'.
dietz@cs.rochester.edu (Paul Dietz) (02/21/90)
In article <1990Feb15.211210.22950@max.sunysb.edu> rosalia@max.sunysb.edu (Mark Galassi) writes: >I would like to know which is a good hash function for strings. >I was going to compute the index into the hash table with > index = hashval(s) % HASH_LEN; >where s is a string, and HASH_LEN is a prime number and is also >the size of the hash table. > >I have seen two schemes: > >1. hashval(s) is the sum of the ascii values of each char in the string. > >2. hashval(s) is the sum of the ascii values*256^i where i is the > position of the char in the string. > >Which of these is better, or is there any other? Knuth seems to not >go into detail on hashval() for strings. If you want to be rigorous, you should talk about good *classes* of hash functions, since, for any particular hash function, there are inputs on which is behaves terribly. What you might want are universal classes of hash functions (strictly speaking, universal-2 classes). A class of hash functions U is universal if, for any two distinct keys x and y, |U|/HASH_LEN functions in U hash x and y to the same value. So, if you pick a random element of U, the chance x and y will collide is 1/HASH_LEN. See, for example, Cormen, Leiserson and Rivest (Introduction to Algorithms, MIT Press, Spring 1990) or Brassard and Bratley (Algorithmic: Theory and Practice, Prentice Hall, 1988). Here's a universal class of hash functions for strings, if HASH_LEN is a prime number. Let a[0], a[1], ... be random numbers from the set {0,...,HASH_LEN-1}. Then, the hash function is h(s) = (a[0]*s[0] + a[1]*s[1] + ... ) mod HASH_LEN This does one multiplication per character, but you could save on that by packing characters into longer words (but the resulting numbers must still be < HASH_LEN) using these words instead of the s[i] in the definition. Here's another possibility, if HASH_LEN is a power of two. For each ascii character c and each position in the string i, pick a random number a[i][c] from {0,...,HASH_LEN-1}. Then, a universal class of hash functions is defined by h(s) = a[0][s[0]] XOR a[1][s[1]] XOR ... This is particularly easy to compute, involving no multiplication (save any that might be required for the array accesses) or modulus operations, but it does require one additional memory reference per character. Paul F. Dietz dietz@cs.rochester.edu