[comp.lang.c] Favorite hashing algorithms

noren@dinl.uucp (Charles Noren) (06/13/90)

I am interested in hashing algorithms that work effectively
on strings that have sizes up to 32, 64 or more characters.
Some examples of strings include:

 "create", "create:with:", "doThis:to:with:", 
 "_how1AboutThis:with2:and3:"

(look familiar Smalltalk fans?).  The strings can contain
embedded digits, ':', '_', '-', maybe all the "printable"
ASCII character set, but with an emphasis on upper and
lower case alphabetic characters with :'s.

One function I recall from XLISP, if my memory serves me
correctly, is:
/*
 * hash
 * Return an index to a hash table.
 *
 * SYNOPSIS
 *     int hash(char *str, int hashTableLength);
 *
 * DESCRIPTION
 *     Return a hash "index" or value of the null-terminated
 *     string pointed to by str.  The value may range from
 *     0 to (hashTableLength - 1) inclusive.
 *
 */

int hash(char *str, int hashTableLength)
{
    int hashNumber;

    /*
     * For each character of the entire string
     * (until the null character is reached)
     * left shift the previous iteration result
     * by 2 and XOR it with the current character.
     */
    for(hashNumber = 0; *str; )
       hashNumber = (hashNumber << 2) ^ *str++;

    /*
     * Get the remainder of accumulated hash value
     * divided by hashTableLength.  Make sure the
     * result is non-negative.
     */
    hashNumber %= hashTableLength;
    return(hashNumber < 0? -hashNumber : hashNumber);
}


Limited running of this seemed to produce very few collisions.

Some questions:
 1.  What limitations does this hash function have.
     Are there some hashTableLength values that are
     better than others (e.g., should it be prime?).
 2.  Are there better and faster hashing functions?
 3.  What are some other variant hashing functions
     in case I choose a "hash on collision" approach
     to storing keys.
 4.  Knuth (The Art of Computer Programming, vol ??)
     has division and multiplication algorithms in
     his hashing section.  Can that be conveniently
     applied in a fast hashing function for strings
     of arbitrary size?

While I've cross-posted this to comp.lang.c, sci.math,
and comp.theory, I've set follow-ups to comp.theory.

I'll be happy to summarize e-mail responses in comp.theory
if there is sufficient interest.

Thanks.

-- 
Chuck Noren
NET:     ncar!dinl!noren
US-MAIL: Martin Marietta I&CS, MS XL8058, P.O. Box 1260,
         Denver, CO 80201-1260
Phone:   (303) 971-7930