[comp.lang.c] hash algorithm

raab@novavax.UUCP (Moshe Raab) (08/17/88)

could anyone recommend a hashing algorithm to store a list of 10
digit phone numbers (area code plus 7 digits).
It should have little or no overflow and no collisions (as few as
possible). The list will contain more than one area code but
about 1000 numbers per area code (ie not a totasly random sample
but one which has a relatively common prefix)
thank you very much.

decot@hpisod2.HP.COM (Dave Decot) (08/17/88)

> could anyone recommend a hashing algorithm to store a list of 10
> digit phone numbers (area code plus 7 digits).
> It should have little or no overflow and no collisions (as few as
> possible). The list will contain more than one area code but
> about 1000 numbers per area code (ie not a totasly random sample
> but one which has a relatively common prefix)
> thank you very much.

Treat the ten digits as an integer, n, and compute the index
as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.

Store the number, and the record for that number, in the first empty
hash table location at or after the index you've computed.

HASH_SIZE should be about four times the maximum number of phone numbers
you expect to store, to reduce collisions.

What data is associated with each phone number?  What do you intend
to do with the list?  Do you want to look up things by phone number,
or by some other key?

Dave

djones@megatest.UUCP (Dave Jones) (08/18/88)

From article <2550078@hpisod2.HP.COM>, by decot@hpisod2.HP.COM (Dave Decot):
>> could anyone recommend a hashing algorithm to store a list of 10
>> digit phone numbers (area code plus 7 digits).
>> It should have little or no overflow and no collisions (as few as
>> possible). The list will contain more than one area code but
>> about 1000 numbers per area code (ie not a totasly random sample
>> but one which has a relatively common prefix)
>> thank you very much.
> 
> Treat the ten digits as an integer, n, and compute the index
> as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.
> 
> Store the number, and the record for that number, in the first empty
> hash table location at or after the index you've computed.

...


There are a couple of rather serious problems with the algorithm suggested.
One is that the high order bits in the phone numbers will not affect
the hash.  Say for example you store all of radio station KGO's talk 
numbers, (415) 263-TALK, (408) 469-TALK, etc. (or whatever they are).
They all map to the same slot.

Secondly, the "first-empty-slot-after" algorithm tends to build up clumps
which the subsequent linear searches have to trace through, adding to
the size of the clump, and thus the likelyhood that it will get hit
next time.

I have devised a method for hashing which has not been published previously.
It is very very fast. Here is the WORLD PREMIERE!!

First let's settle the business of how to hash integers into the
range 0..M-1.  I suggest that you use hash(I) = (I*32821) % M

The number 32821 has some magic properties that spatter things about nicely.
See "Algorithms" by Sedgewick.

My hashing technique uses a table-size M which is always a power of two.
This makes the modulo-function (%) very fast on a two's compliment machine,
if you are willing to accept a tiny bit of machine-dependance. (I never
expect to use a machine that is not two's compliment, so I'm not too 
scared.)

You code the mod function as

	H & mask

rather than 

        H % table_size

where mask is table_size-1.

For a "rehash" function, rather that looking at the next slot, I use the
function defined by the following recursive definition:

         rehash(0,slot) = slot
         rehash(i+1, slot) = ((rehash(i,slot)+1)*3) mod M

When M is a power of two, this fuction cycles after precisely M/2 steps.
I have proved as much, but my proof is tedious. The range of the
sequence is the set of numbers in 0..M-1 which are congruent to
slot or slot+3 modulo 4.  Don't ask me how I discovered this; I think
it came to me in a vision.

When the table gets half full, I double its size and hash everything
into the new table.

The routines allow for removing entries, as well as adding them
and looking them up, but I have omited the removal routine, which is
quite tricky.

Here is an excerpt from the actual code: (For purposes of the telephone
number problem, we may assume that "entry" is a record which contains
a telephone number and some other stuff, and that obj->eq is a pointer
to a function that returns true when two entries have the same tele-number
in them and false otherwise.  obj->hash is a pointer to a function that 
multiplies the tele number in an entry by 32821.)

The following provably terminates because the rehash sequence hits
exactly half the slots, and the table is kept less than half full.

/* CAVEAT: The following assumes that integers are two's complement. */
#define HASH(cont) (((*(obj->hash))(cont)) & obj->mask )
#define REHASH(num)  (((((num)+1)*3) & obj->mask) )


Ptr
Hash_put(obj, entry )
  register Hash* obj;
  Ptr entry;
{

  register int bucket_number;
  register Ptr* bucket;

  bucket_number = HASH(entry);

  while(1)
    {
      bucket = obj->hash_table + bucket_number;

      if ( *bucket == (Ptr)0 )
	{ 
	  *bucket = entry;
	  obj->num_entries++;
	  if ( obj->num_entries > obj->max_entries )
	    Hash_overflow(obj);  /* double size of table */
	  return (Ptr)0;  /* <======== added new entry */
	}
      
      if ( !obj->eq( entry, *bucket ) )
	{ 
	  bucket_number = REHASH(bucket_number);
	  continue; /* <====== search some more (collision) */
	}

      /* Found old Ptr. Replace. */
      { 
	Ptr old = *bucket;
	*bucket = entry;
	return old; /* <============== replaced old entry */
      }

    }
  
}


Ptr
Hash_get(obj, entry )
  register Hash* obj;
  Ptr entry;
{

  register int bucket_number;
  register Ptr* bucket;

  bucket_number = HASH(entry);

  while(1)
    {
      bucket = obj->hash_table + bucket_number;

      if ( *bucket == (Ptr)0 )
	{ 
	  return (Ptr)0; /* <====== entry not found */
	}
      
      if ( !obj->eq( entry, *bucket) )
	{ 
	  bucket_number = REHASH(bucket_number);
	  continue; /* <====== search some more (collision) */
	}

        return *bucket; /* <====== found entry */
    }

}

zeeff@b-tech.UUCP (Jon Zeeff) (08/19/88)

In article <2550078@hpisod2.HP.COM> decot@hpisod2.HP.COM (Dave Decot) writes:
>> could anyone recommend a hashing algorithm to store a list of 10
>> digit phone numbers (area code plus 7 digits).
>Treat the ten digits as an integer, n, and compute the index
>as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.
>

I don't second Dave's recommendation.  What seems like a fairly random 
hash function usually doesn't work very well when you actually measure 
it's performance.  Here is one stolen from pathalias that works well.  

/* This is a simplified version of the pathalias hashing function.
 * Thanks to Steve Belovin and Peter Honeyman
 *
 * hash a string into a long int.  31 bit crc (from andrew appel).
 * the crc table is computed at run time by crcinit() -- we could
 * precompute, but it takes 1 clock tick on a 750.
 *
 * This fast table calculation works only if POLY is a prime polynomial
 * in the field of integers modulo 2.  Since the coefficients of a
 * 32-bit polynomial won't fit in a 32-bit word, the high-order bit is
 * implicit.  IT MUST ALSO BE THE CASE that the coefficients of orders
 * 31 down to 25 are zero.  Happily, we have candidates, from
 * E. J.  Watson, "Primitive Polynomials (Mod 2)", Math. Comp. 16 (1962):
 *	x^32 + x^7 + x^5 + x^3 + x^2 + x^1 + x^0
 *	x^31 + x^3 + x^0
 *
 * We reverse the bits to get:
 *	111101010000000000000000000000001 but drop the last 1
 *         f   5   0   0   0   0   0   0
 *	010010000000000000000000000000001 ditto, for 31-bit crc
 *	   4   8   0   0   0   0   0   0
 */

#define POLY 0x48000000L	/* 31-bit polynomial (avoids sign problems) */

static long CrcTable[128];

static void
crcinit()
{	register int i, j;
	register long sum;

	for (i = 0; i < 128; ++i) {
		sum = 0L;
		for (j = 7 - 1; j >= 0; --j)
			if (i & (1 << j))
				sum ^= POLY >> j;
		CrcTable[i] = sum;
	}
	DEBUG("crcinit: done\n");
}

static long
hash(name, size)
register char *name;
register int size;
{
	register long sum = 0L;

	while (size--) {
		sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];
	}
	DEBUG("hash: returns (%ld)\n", sum % INDEX_SIZE);
	return(sum % INDEX_SIZE);
}

-- 
Jon Zeeff           		Branch Technology,
uunet!umix!b-tech!zeeff  	zeeff%b-tech.uucp@umix.cc.umich.edu

bill@proxftl.UUCP (T. William Wells) (08/19/88)

In article <654@novavax.UUCP> raab@novavax.UUCP (Moshe Raab) writes:
: could anyone recommend a hashing algorithm to store a list of 10
: digit phone numbers (area code plus 7 digits).
: It should have little or no overflow and no collisions (as few as
: possible). The list will contain more than one area code but
: about 1000 numbers per area code (ie not a totasly random sample
: but one which has a relatively common prefix)
: thank you very much.

You might want to try a hash trie (sic).  The Programming Pearls
column in the June 1986 Communications of the ACM describes one
way to do this.  I have not tried this so I can give you no
estimates on the space needed to do it, but access to a hash trie
is very fast though perhaps not as fast as a very sparse hash
table.

rlb@polari.UUCP (rlb) (08/20/88)

In article <723@goofy.megatest.UUCP>, djones@megatest.UUCP (Dave Jones) writes:
> From article <2550078@hpisod2.HP.COM>, by decot@hpisod2.HP.COM (Dave Decot):
> >> could anyone recommend a hashing algorithm to store a list of 10
> >> digit phone numbers (area code plus 7 digits).
> >        ...
> > Treat the ten digits as an integer, n, and compute the index
> > as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.
> > 
At this point, no one has pointed out that modulo is a better hash
function if the table size is a prime number.
> 
> ...
> 
> I have devised a method for hashing which has not been published previously.
> It is very very fast. Here is the WORLD PREMIERE!!
> 
> First let's settle the business of how to hash integers into the
> range 0..M-1.  I suggest that you use hash(I) = (I*32821) % M
> 
> My hashing technique uses a table-size M which is always a power of two.
> This makes the modulo-function (%) very fast on a two's compliment machine,

Well, yes, but you're not counting the cost of multiplying by 32821 elsewhere,
so, in reality, your way seems to be a wee bit slower than (key % prime).

>     ...   long discussion/demonstration of rehashing technique ...
>
To summarize:  You've created a hashing scheme.  Forming the hash itself
isn't faster than the simple (key % prime) algorithm.  Re-hashing and
deletions require more code than the simple "put it in the next empty slot"
method.  It requires more storage than the simple algorithm (I'm assuming
it "breaks" if you let the table get more than half full, I didn't actually
figure it out).

The principal advantage seems to be that access time is flatter (not, flat,
but flatter, and probably not much flatter) than the simple "put it in the
next empty slot" method.  If you're not going to let the table get more than
half full, then (assuming a fairly successful hash) there's only about a 50%
chance that you'll have to rehash.  So, even with a bit of clumping, doesn't
it seem likely that the simplest rehashing scheme compares quite favorably
with the one you've created?
Agree?  Disagree?

daveb@llama.rtech.UUCP (Dave Brower) (08/23/88)

In article <4712@b-tech.UUCP> zeeff@b-tech.UUCP (Jon Zeeff) writes:
>In article <2550078@hpisod2.HP.COM> decot@hpisod2.HP.COM (Dave Decot) writes:
>>> could anyone recommend a hashing algorithm to store a list of 10
>>> digit phone numbers (area code plus 7 digits).
>>Treat the ten digits as an integer, n, and compute the index
>>as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.
>
>I don't second Dave's recommendation.  What seems like a fairly random
>hash function usually doesn't work very well when you actually measure
>it's performance.  Here is one stolen from pathalias that works well.
>
...
>static long
>hash(name, size)
>register char *name;
>register int size;
>{
>	register long sum = 0L;
>
>	while (size--) {
>		sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];
>	}
>	DEBUG("hash: returns (%ld)\n", sum % INDEX_SIZE);
>	return(sum % INDEX_SIZE);
>}
>

The konstant factors add up.  It is often the case that a better hash
isn't worth the effort.  The expression:

	sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];

take many more instructions to figure than the naive:

	sum += *name++;

If the collision chains generated by the naive function are short, it
may be cheaper to search the chain than use a smarter hash.

What I found in some real application, with about 3000 entries in the
table, was that naive beat smart in real execution time.   The naive
collision chains were typically 4-7 entries long.  I could not reduce
the time spent in the smart hash function enough to make it worth using.

The wall was high enough to get me to try to use splay trees instead of
hashing.  When it was all over, my actual cpu figures were the same for
tree and hash, giving ordering in the tree for free.

Moral:  Try several ways and *measure*.  The results may be surprising.

I posted the tree library in comp.source.unix Archive-name v15i005. 
Thanks to Doug Jones@uiowa for his original Pascal implementation.

-dB
"Ready when you are Raoul!"
{amdahl, cpsc6a, mtxinu, sun, hoptoad}!rtech!daveb daveb@rtech.com <- FINALLY!

daveb@llama.rtech.UUCP (Dave Brower) (08/23/88)

In article <2392@rtech.rtech.com> daveb@llama.UUCP I say:
>
>I posted the tree library in comp.source.unix Archive-name v15i005.
>Thanks to Doug Jones@uiowa for his original Pascal implementation.

Lies!  It is really:

	Subject: v14i087:  Splay tree library
	Posting-number: Volume 14, Issue 87
	Archive-name: splay-tree

-dB




"Ready when you are Raoul!"
{amdahl, cpsc6a, mtxinu, sun, hoptoad}!rtech!daveb daveb@rtech.com <- FINALLY!

djones@megatest.UUCP (Dave Jones) (08/23/88)

I recently posted an original hash-table routine here. It is the result 
of a great deal of research and experimentation.  I never expected anyone 
to say, "Thanks," and sure enough, no one has. That's okay.

This note is just to indicate that I do not intend to respond to postings
discussing the relative values of hashing methods, although I would
be happy to see other algorithms, especially compilable C code.
It's a fascinating subject, which has been the topic of extensive research
over the years, but I just don't have the time or inclination right
now.

When I did the comparisons between various methods, I did not save the 
statistics, and I have no interest in repeating them.  Nor would they 
necessarily be applicable to all machines, if I still had them.  The
constraints that I was under were roughly these:

   0. Multiple hash-table "objects" was essential.

   1. Multiplication by a variable was slow.

   2. Division by a variable was about four times as slow as 
      multiplication by a variable.

   3. Multiplication by constant 3 was implemented by the compiler as
      a shift and add and was therefore fast.

   4. Malloc() was too slow to consider.

   5. The machine was, and was likely to remain forever, a two's
      complement machine.

pardo@june.cs.washington.edu (David Keppel) (08/24/88)

djones@megatest.UUCP (Dave Jones) writes:
>[ posted hash algorithm, no thanks from anybody ]

Personally, I don't have any use for hashing right now (but I'll
probably be back to you in a few weeks :-), I *do* appreciate various
source postings to various groups.  They encourage me to expand my
freeware libraries and help me write better code, faster.

I hope I speak for the net when I say "thanks for being thoughtful
enough to share your code".

	;-D on  ( Freeware is free where the free wear )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

henry@utzoo.uucp (Henry Spencer) (08/24/88)

In article <2392@rtech.rtech.com> daveb@llama.UUCP (Dave Brower) writes:
>If the collision chains generated by the naive function are short, it
>may be cheaper to search the chain than use a smarter hash.
>
>What I found in some real application, with about 3000 entries in the
>table, was that naive beat smart in real execution time.   The naive
>collision chains were typically 4-7 entries long.  I could not reduce
>the time spent in the smart hash function enough to make it worth using.

I second this.  Those of you who have looked over that masterpiece of
brilliant coding :-), the C News expire, will have noticed that it hashes
newsgroups simply by the length of the name.  Crude and inelegant... but
the longest collision chain is only a handful of entries long, and none
of the fancier hash functions did significantly better.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (08/25/88)

In article <4712@b-tech.UUCP>, zeeff@b-tech.UUCP (Jon Zeeff) writes:
> /* This is a simplified version of the pathalias hashing function.
>  * Thanks to Steve Belovin and Peter Honeyman

> static long
> hash(name, size)
> register char *name;
> register int size;
> {
> 	register long sum = 0L;
> 
> 	while (size--) {
> 		sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];
> 	}
> 	DEBUG("hash: returns (%ld)\n", sum % INDEX_SIZE);
> 	return(sum % INDEX_SIZE);
> }

If you change the routine to this:

static long
hash(name, size, sum)
register char *name;
register int size, sum;
{
	while (size--) {
		sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];
	}
	DEBUG("hash: returns (%ld)\n", sum % INDEX_SIZE);
	return(sum % INDEX_SIZE);
}

where the initial value of sum is 0.  With the change you can
use hash() with data objects that won't fit in memory.  That is:

	int sum = 0; char iobuf[BUFSUZ];
	while ((nbytes = read(fd, iobuf, sizeof iobuf)) > 0)
		sum = hash(iobuf, nbytes, sum);

After the last call to hash, the value of sum is the sum for the
whole file.  This is not an original idea of mine.  I saw it in
a piece of 10 year old code.

-- 
Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc lvc@cbnews.ATT.COM

ray@micomvax.UUCP (Ray Dunn) (09/01/88)

In article <2392@rtech.rtech.com> daveb@llama.UUCP (Dave Brower) writes:
 > ...
 > If the collision chains generated by the naive function are short, it
 > may be cheaper to search the chain than use a smarter hash.
 > ...

Definitely!

 > Moral:  Try several ways and *measure*.  The results may be surprising.

Better Moral: Whenever possible analyze the properties and distributions of
the data you want to handle and *then* design the hashing algorithm.  At
that point it is then reasonable to measure and tinker.

This discussion is getting very like those on sorting - please realize there
is no *general* best method,  there are only best methods in specific
application environments.

The goal should not normally be to minimize the search time for an
individual item, but to minimize the total time spent searching in a given
run session or series of runs, of the application - a fact often ignored in
discussions of minimizing chain lengths etc..

If the tokens "(", ")", ";" and "," each occur (i.e.  require to be
"looked-up") approximately n times more frequently than the tokens "while",
"for", "switch" and "do",  then wouldn't it be reasonable to design your
algorithm so that the former items were located approximately n times faster
than the latter?

This can be achieved in a variety of ways, including, let us not forget, ad
hoc tests for specific high frequency items prior to applying a generalized
routine.

In the case of a fixed data-base, for example the reserved words in a
compiler, it does not take much effort to analyze a filesystem or three for
token frequencies and then structure and *order* the dictionary accordingly.

-- 
Ray Dunn.                      |   UUCP: ..!philabs!micomvax!ray
Philips Electronics Ltd.       |   TEL : (514) 744-8200   Ext: 2347
600 Dr Frederik Philips Blvd   |   FAX : (514) 744-6455
St Laurent. Quebec.  H4M 2S9   |   TLX : 05-824090

gregs@hpmtlx.HP.COM ($Greg Stander) (09/10/88)

/ hpmtlx:comp.lang.c / zeeff@b-tech.UUCP (Jon Zeeff) /  6:00 pm  Aug 18, 1988 /
In article <2550078@hpisod2.HP.COM> decot@hpisod2.HP.COM (Dave Decot) writes:
>> could anyone recommend a hashing algorithm to store a list of 10
>> digit phone numbers (area code plus 7 digits).
>Treat the ten digits as an integer, n, and compute the index
>as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.
>

I don't second Dave's recommendation.  What seems like a fairly random 
hash function usually doesn't work very well when you actually measure 
it's performance.  Here is one stolen from pathalias that works well.  

/* This is a simplified version of the pathalias hashing function.
 * Thanks to Steve Belovin and Peter Honeyman
 *
 * hash a string into a long int.  31 bit crc (from andrew appel).
 * the crc table is computed at run time by crcinit() -- we could
 * precompute, but it takes 1 clock tick on a 750.
 *
 * This fast table calculation works only if POLY is a prime polynomial
 * in the field of integers modulo 2.  Since the coefficients of a
 * 32-bit polynomial won't fit in a 32-bit word, the high-order bit is
 * implicit.  IT MUST ALSO BE THE CASE that the coefficients of orders
 * 31 down to 25 are zero.  Happily, we have candidates, from
 * E. J.  Watson, "Primitive Polynomials (Mod 2)", Math. Comp. 16 (1962):
 *	x^32 + x^7 + x^5 + x^3 + x^2 + x^1 + x^0
 *	x^31 + x^3 + x^0
 *
 * We reverse the bits to get:
 *	111101010000000000000000000000001 but drop the last 1
 *         f   5   0   0   0   0   0   0
 *	010010000000000000000000000000001 ditto, for 31-bit crc
 *	   4   8   0   0   0   0   0   0
 */

#define POLY 0x48000000L	/* 31-bit polynomial (avoids sign problems) */

static long CrcTable[128];

static void
crcinit()
{	register int i, j;
	register long sum;

	for (i = 0; i < 128; ++i) {
		sum = 0L;
		for (j = 7 - 1; j >= 0; --j)
			if (i & (1 << j))
				sum ^= POLY >> j;
		CrcTable[i] = sum;
	}
	DEBUG("crcinit: done\n");
}

static long
hash(name, size)
register char *name;
register int size;
{
	register long sum = 0L;

	while (size--) {
		sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];
	}
	DEBUG("hash: returns (%ld)\n", sum % INDEX_SIZE);
	return(sum % INDEX_SIZE);
}

-- 
Jon Zeeff           		Branch Technology,
uunet!umix!b-tech!zeeff  	zeeff%b-tech.uucp@umix.cc.umich.edu
----------