[comp.lang.c] general hash functions for text: SUMMARY!

cpshelley@watmsg.waterloo.edu (cameron shelley) (10/19/90)

Hello!

  Well, once again I'd like to thank all who sent me suggestions.
I'll now post the results of my tests so far.  I have only tested
those functions which were easily adapted to my existing problem,
I'll be getting to the more specialized ones later.  Btw, my
problem involves storing (and later indexing) the vocabulary read
from a database of the Montreal Gazette, an average (:>) Montreal
daily.  I have tested about eight of the functions suggested on
roughly 10Mb of the database - about 30,000 distinct words.  The
size of ROOTLEN (the array of pointers to the structs I'm using)
is either 64997 (a prime!) for algorithms that require primes for
a modulus, or 65536 for routines needing a power of two.  I think
these are close enough to give a fair test.

Thank you to:

From: Larry Watanabe <watanabe@cs.uiuc.edu>
From: raymond@math.berkeley.edu (Raymond Chen)
From: dmorin@wpi.wpi.edu (Duane D Morin)
From: Mark Pledger <cti1!mpledger@uunet.UU.NET>
From: drh@cs.duke.edu (D. Richard Hipp)
Message-Id: <656143040.AA19603@blekko.commodore.com>    (sorry! :)
From: mccaugh@suna0.cs.uiuc.edu
From: jeff@kfw.com (Jeff Henkels)
From: afoiani@NMSU.Edu
From: skrenta@cbmvax.cbm.commodore.com (Rich Skrenta)
From: mike@cosmos.acs.calpoly.edu (Mike Maughmer)
From: brnstnd@KRAMDEN.ACF.NYU.EDU (Dan Bernstein)
From: Paul John Falstad <pfalstad@phoenix.Princeton.EDU>
From: lfd@lcuxlq.att.com (Leland F Derbenwick)
From: Alan J Rosenthal <flaps@dgp.toronto.edu>
From: ath@prosys.se (Anders Thulin)
From: gaynor@paul.rutgers.edu (Silver)
From: chris@mimsy.umd.edu (Chris Torek)
From: macphee@convex.COM (Scott C. Mac Phee)


The functions (as adapted to my needs) are: 

/**** from _Aho, Sethi, Ullman_'s dragon book ****/

unsigned long hash(char *word) {

  unsigned long h, g;

  for (h = 0; *word; word++) {
    h = (h << 4) + (*word);
      if (g = (h & 0xf0000000)) {
	h ^= (g >> 24);
	h ^= g;
	}
    }
  return h % ROOTLEN;

} /* end of hash */



/**** from Sedgewick's _Algorithms_ ****/

unsigned long hash(char *word) {

  unsigned long h;

    h = *word++;
    while (*word)
      h = (h * 128 + *word++) % ROOTLEN;
    return h;

} /* end of hash */



/**** from Gosling's emaccs ****/

unsigned long hash(char *word) {

  unsigned long h = 0;

  while (*word)
    h = (h << 5) - h + *word++;
  
  return h & ROOTLEN;           /* ROOTLEN must be a power of 2, ie. 2^16 */

} /* end of hash */


The function from Gosling's emaccs nosed the others out by maybe one
percentage point on the ratio of collisions/probes.  Note that these
are general hashing functions for TEXT, more specialized ones (I'm
told) will likely have a better performance.  Also remember that I 
tested these on full newspaper articles, which may be different from
what you need...  I have also modified what I got for clarity, so no
doubt you can compress the way these functions are represented.

Anyway, thanks again, and I hope someone can make use of this summary!

				Cam

--
      Cameron Shelley        | "Armor, n.  The kind of clothing worn by a man
cpshelley@violet.waterloo.edu|  whose tailor is a blacksmith."
    Davis Centre Rm 2136     |
 Phone (519) 885-1211 x3390  |				Ambrose Bierce

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/20/90)

In article <1990Oct19.040234.20794@watmath.waterloo.edu> cpshelley@watmsg.waterloo.edu (cameron shelley) writes:
>   return h & ROOTLEN;           /* ROOTLEN must be a power of 2, ie. 2^16 */

Just to make sure it gets said: This should probably be h % ROOTLEN, or
h & (ROOTLEN - 1). The latter is faster under most computers and
compilers.

---Dan

djones@megatest.UUCP (Dave Jones) (10/25/90)

I just now popped in to comp.lang.c for a peek. See that string-hashing
is the hot topic. For years I've been using the following hash-function
for tables whose sizes are a power of two. (Dividing by primes is
*slowwww*. Never did understand what it gains.) This is the best I could come
up with for mc68000's. I don't expect that SPARCification will cause me
to change it.

int
String_hash(str)
  	register char *str;
{
  register int retval = 0;
  while(*str)
     retval += retval + (unsigned char)(*str++);
  return retval;
}


I've got an original scheme for rehashing that I just love to show off.
Later maybe.