[comp.lang.c] hash function for Mac needed

bkottmann@falcon.aamrl.wpafb.af.mil (Brett Kottmann) (06/11/91)

	Are there C implementations of hash functions lying around out there? 
I need a hash function that will take a string (up to 255 chars + null) and
make a unique key for it.  It should hash to the same key for the same string
:).
	Also, I'm using a Mac, so no UNIX/DOS library pointers, please!

	Thanx,

Brett
=============================OFFICIAL=DISCLAIMER================================
The opinions and views expressed here are strictly my own and do not 
necessarily reflect the official position of either the U.S. Air Force 
or its contractors.
=====================DO=NOT=REMOVE=TAG=UNDER=PENALTY=OF=LAW===:)================

Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) (06/16/91)

 >From: bkottmann@falcon.aamrl.wpafb.af.mil (Brett Kottmann)
 >Date: 10 Jun 91 21:08:47 GMT
 >Organization: Logicon Technical Services, Inc.
 >Message-ID: <1991Jun10.160847.134@falcon.aamrl.wpafb.af.mil>
 >Newsgroups: comp.lang.c


 >        Are there C implementations of hash functions lying around out
 >there?  I need a hash function that will take a string (up to 255 chars
 >+ null) and make a unique key for it.  It should hash to the same key
 >for the same string :).

When somebody compresses 255 bytes down to 2 or 4, let me know :).  "Unique" 
isn't exactly the word you want, I don't think.  The simple thing to do is to 
add all the characters up, then mod by the size of your hash table, and use 
that number to index into an array.  If the spot is used, either chain a 
linked list off the array location or skip circularly through the array until 
you hit the next unused location.  You don't necessarily need to add all 255 
characters together.  Every 5th, 10th or whatever may work just as well, 
depending on the data.  Simply keep a counter of the number of seeks required 
to find your hash locations so you can determine which hash scheme works with 
the fewest seeks for your particular data.
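
For illustration, a minimal sketch of that scheme in C, using chaining for
collisions (the names, table size, and insertion code are mine, not from any
particular library):

	#include <stdlib.h>
	#include <string.h>

	#define TABSIZE 256	/* any size works when reducing with mod */

	struct entry {
		struct entry *next;	/* chain hanging off this bucket */
		char *key;
	};

	static struct entry *table[TABSIZE];

	/* Add the characters up, then mod by the table size. */
	static unsigned hash(const char *s)
	{
		unsigned h = 0;

		while (*s != '\0')
			h += (unsigned char)*s++;
		return h % TABSIZE;
	}

	/* Return the entry for key, chaining in a new one if absent
	   (NULL only on malloc failure). */
	static struct entry *lookup(const char *key)
	{
		struct entry *e;
		unsigned i = hash(key);

		for (e = table[i]; e != NULL; e = e->next)
			if (strcmp(e->key, key) == 0)
				return e;
		if ((e = malloc(sizeof *e)) == NULL)
			return NULL;
		if ((e->key = malloc(strlen(key) + 1)) == NULL) {
			free(e);
			return NULL;
		}
		strcpy(e->key, key);
		e->next = table[i];
		table[i] = e;
		return e;
	}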

To anybody else: if I have n data elements and a hash table of size x, what 
is considered a good number of searches, as a function of n and x (e.g. 
ln(x/n))?  Thanks.

Dave Harris.

--  
Uucp: ...{gatech,ames,rutgers}!ncar!asuvax!stjhmc!15!14!Dave.Harris
Internet: Dave.Harris@f14.n15.z1.fidonet.org

torek@elf.ee.lbl.gov (Chris Torek) (06/18/91)

In article <14207.285B768A@stjhmc.fidonet.org>
Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes:
>... The simple thing to do is to add all the characters up then mod by
>the size of your hash, then use that number to index into an array. ...
>You don't necessarly need to add all 255 characters together.  

Indeed: as it turns out, while many people have spent much time looking
for `good' hashing functions, fewer seem to have noted that many
hashing systems spend more time computing hashes than searching.  I
find that this is usually the case, and that a fast hash function is
better than a slower-but-better-distribution function.  One good hash
function, useful in many applications, is this:

	hash = 0;
	for (each character)
		hash = (hash * 33) + (the character);
	hash_index = hash & ((some power of two) - 1);

In C, this might be written as

	struct hashent *
	lookup(table, string)
		register struct hashtab *table;
		char *string;
	{
		register unsigned hash = 0;
		register char *p, c;
		register struct hashent *hp;

		/* hash = hash * 33 + c, done as shift-and-add */
		for (p = string; (c = *p) != '\0'; p++)
			hash = (hash << 5) + hash + c;
		p = string;
		/* walk the chain; compare saved hashes before strcmp */
		for (hp = table->ht_entries[hash & table->ht_mask];
		     hp != NULL; hp = hp->h_next)
			if (hp->h_hash == hash && strcmp(hp->h_key, p) == 0)
				return (hp);
		return (NULL);
	}
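
For reference, the function assumes declarations along these lines; the field
names come from the code above, but the exact layout is a reconstruction:

	struct hashent {
		struct hashent *h_next;		/* next entry on the same chain */
		unsigned h_hash;		/* full hash, saved to cheapen compares */
		char *h_key;			/* the string key */
		/* caller's datum fields go here */
	};

	struct hashtab {
		unsigned ht_mask;		/* table size minus 1; size is a power of 2 */
		struct hashent **ht_entries;	/* one chain head per slot */
	};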

>To anybody else.  If I have n data elements, with a hash of size x, what is 
>considered a good number of searches based on n and x (i.e. ln(x/n)).

If you have a perfectly uniform distribution, each chain or hash
sequence will have n/x elements; assuming the probability of any
element lookup is identical, this is the best that can be done.  I do
not know what others consider `good', but a factor of 1.3 or so over
that seems reasonable to me (I cannot quite think why).

Typically neither assumption holds true.  There are too many ways to
take advantage of this to characterize here.  One thing that can be
done, however, is to use the above hash table scheme with expanding
tables.  As the number of entries increases, the hash table size can be
doubled, and table->ht_mask adds one bit:

	table->ht_mask = (table->ht_mask << 1) | 1;

Since the hash values of each <key,datum> pair are maintained, it is
easy to move the entries from their old hash slots to their new ones.
Doubling occurs O(log2 n) times if the doubling criterion is total
number of table entries; other doubling criteria, such as maximum chain
length, are possible but rarely worthwhile---one ends up spending more
time optimizing hashes than searching, just as with more complicated
hash functions.  Trading space for time this way seems to work well,
and you can select the amount of space by selecting the doubling factor.
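
A sketch of that doubling step under the declarations sketched earlier (my
code, assuming <stdlib.h>); because each entry carries its h_hash, moving it
is a relink, not a re-hash of the string:

	static int
	grow(struct hashtab *table)
	{
		unsigned oldsize = table->ht_mask + 1;
		unsigned newmask = (table->ht_mask << 1) | 1;
		struct hashent **slots, *hp, *next;
		unsigned i;

		slots = calloc(newmask + 1, sizeof *slots);
		if (slots == NULL)
			return -1;	/* leave the old table intact */
		for (i = 0; i < oldsize; i++) {
			for (hp = table->ht_entries[i]; hp != NULL; hp = next) {
				/* each entry lands in slot i or i + oldsize */
				next = hp->h_next;
				hp->h_next = slots[hp->h_hash & newmask];
				slots[hp->h_hash & newmask] = hp;
			}
		}
		free(table->ht_entries);
		table->ht_entries = slots;
		table->ht_mask = newmask;
		return 0;
	}
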
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

thomasw@hpcupt1.cup.hp.com (Thomas Wang) (06/19/91)

Read CACM June 1990, page 677: Peter K. Pearson, "Fast Hashing of
Variable-Length Text Strings".
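
The method described there is now known as Pearson hashing: each byte is
XORed into a running index into a 256-entry permutation table.  A minimal
sketch of the idea, not the paper's code:

	/* T must be filled with some shuffled permutation of 0..255
	   before use; a zeroed table hashes everything to 0. */
	static unsigned char T[256];

	static unsigned char
	pearson(const char *s)
	{
		unsigned char h = 0;

		while (*s != '\0')
			h = T[h ^ (unsigned char)*s++];
		return h;	/* 8 bits; the paper shows how to widen it */
	}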


 -Thomas Wang
              (Everything is an object.)
                                                     wang@hpdmsjlm.cup.hp.com
                                                     thomasw@hpcupt1.cup.hp.com

rh@smds.UUCP (Richard Harter) (06/19/91)

In article <14396@dog.ee.lbl.gov>, torek@elf.ee.lbl.gov (Chris Torek) writes:

> Indeed: as it turns out, while many people have spent much time looking
> for `good' hashing functions, fewer seem to have noted that many
> hashing systems spend more time computing hashes than searching.  I
> find that this is usually the case, and that a fast hash function is
> better than a slower-but-better-distribution function.  One good hash
> function, useful in many applications, is this:

> 	hash = 0;
> 	for (each character)
> 		hash = (hash * 33) + (the character);
> 	hash_index = hash & ((some power of two) - 1);

And it may well pay to be even more radical in trading speed for 'perfection'.
In one of my programs I found it quite profitable to look at only the last
ten characters of a string, summing eight of them and doing a shift-and-add
on the other two.  E.g., if n is the string length:

	hash = 0;
	cc = string + n;
	/* cases fall through: sum up to the last ten characters;
	   the final two read (cases 2 and 1) are weighted by shifts */
	switch (n > 10 ? 10 : n) {
		case 10: hash += (int)*--cc;
		case  9: hash += (int)*--cc;
		case  8: hash += (int)*--cc;
		case  7: hash += (int)*--cc;
		case  6: hash += (int)*--cc;
		case  5: hash += (int)*--cc;
		case  4: hash += (int)*--cc;
		case  3: hash += (int)*--cc;
		case  2: hash += ((int)*--cc) << 1;
		case  1: hash += ((int)*--cc) << 3;
		default: break;
	}
	hash &= 1023;	/* table of 1024 slots */

Admittedly this appears crude, but in practice the last ten bytes were
all that were really needed to distinguish most strings (in the application
of interest, of course).  Nor was it really necessary to do anything more
than summing bytes.  I should note that the string length was stored in
the table; in effect it served as a secondary key for the bucket-search
loop.  The speedup was quite sharp.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

henry@zoo.toronto.edu (Henry Spencer) (06/19/91)

In article <567@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>> ... a fast hash function is
>> better than a slower-but-better-distribution function...
>In one of my programs I found it quite profitable to only look at the last
>ten characters of a string...
>Admittedly this appears crude, but in practice the last ten bytes were
>all that were really needed...

In fact, when I originally did the hashed newsgroup-lookup code in C News
expire, a quick investigation did not reveal any hashing function which
had overall performance much better than simply using the length of the
string!  (The newsgroup list has grown so much that this decision is in
need of revision, mind you.)
-- 
"We're thinking about upgrading from    | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 to SunOS 3.5."              |  henry@zoo.toronto.edu  utzoo!henry

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/21/91)

In article <14396@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
> 		hash = (hash * 33) + (the character);

I see I've converted you from hash * 31 to hash * 33. :-)

  [ on the typical distribution of hash values with chaining ]

It's worth remembering the power series for exp.  If x is the load factor---
the number of stored elements divided by the number M of hash values---then
there will be about exp(-x)M zero-element chains, exp(-x)Mx one-element
chains, exp(-x)Mx^2/2! two-element chains, exp(-x)Mx^3/3! three-element
chains, and so on (each chain length is approximately Poisson distributed).
This holds up quite well in practice, and lets you judge accurately how often
a branch will be taken or how far it's worth unrolling a loop when you're
working with hash tables.
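
A quick way to check that forecast against a live table is to compute the
series directly; a minimal sketch (function name and sizes are mine, for
illustration):

	#include <math.h>
	#include <stdio.h>

	/* Print the expected number of k-element chains when n items
	   are spread over m hash values (Poisson model). */
	static void
	forecast(unsigned n, unsigned m)
	{
		double x = (double)n / m;	/* load factor */
		double term = m * exp(-x);	/* k = 0 term: M * e^-x */
		unsigned k;

		for (k = 0; k <= 4; k++) {
			printf("chains of length %u: about %.1f\n", k, term);
			term *= x / (k + 1);	/* next term in the series */
		}
	}

	int main(void) { forecast(1000, 512); return 0; }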

---Dan

Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) (06/21/91)

In a message of <Jun 18 20:29>, Chris Torek (1:114/15) writes: 

 >Indeed: as it turns out, while many people have spent much time looking
 >for `good' hashing functions, fewer seem to have noted that many
 >hashing systems spend more time computing hashes than searching.  I
 >find that this is usually the case, and that a fast hash function is
 >better than a slower-but-better-distribution function.  One good hash
 >function, useful in many applications, is this:

 >        hash = 0;
 >        for (each character)
 >                hash = (hash * 33) + (the character);
 >        hash_index = hash & ((some power of two) - 1);

Is the *33 used to keep short strings' hash keys more random, or does it
really help the distribution for long strings too?

--  
Uucp: ...{gatech,ames,rutgers}!ncar!asuvax!stjhmc!15!14!Dave.Harris
Internet: Dave.Harris@f14.n15.z1.fidonet.org

rsalz@bbn.com (Rich Salz) (06/22/91)

In <1991Jun19.161959.13677@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
|In fact, when I originally did the hashed newsgroup-lookup code in C News
|expire, a quick investigation did not reveal any hashing function which
|had overall performance much better than simply using the length of the
|string!  (The newsgroup list has grown so much that this decision is in
|need of revision, mind you.)

I think you'll want to re-check.  The distributions are really uneven -- there
are more than 40 newsgroups of length 10, for example, so you have to resolve
lots of conflicts, and if you don't use a secondary hash, that means lots of
strcmp's.  Of course, the design is very dependent on the newsgroups a site
receives and the distribution of articles within those groups.

I use Chris Torek's hash function (amazingly good distribution and amazingly
cheap to calculate) with a table size of 128 and sort the conflicts based on
the highest article number.  It seems to give the biggest win for the biggest
number of systems.
	/r$
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

henry@zoo.toronto.edu (Henry Spencer) (06/25/91)

In article <3634@litchi.bbn.com> rsalz@bbn.com (Rich Salz) writes:
>|In fact, when I originally did the hashed newsgroup-lookup code in C News
>|expire, a quick investigation did not reveal any hashing function which
>|had overall performance much better than simply using the length of the
>|string! ...
>
>I think you'll want to re-check.  The distributions are real uneven -- there
>are more than 40 newsgroups of length 10, for example, so you have to resolve
>lots of conflicts, and if you don't put a secondary hash, that means lots of
>strcmp's...

Strcmps cost much less if you prefix them with an inline check of the
first character.  This strategy isn't as effective with newsgroups (which
tend to have common prefixes) as with random strings, but it still helps
quite a bit.  And *as I mentioned*, the above-mentioned results are out
of date; the hash chains were typically 2-3 long back then.
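
The inline check itself is a one-liner; a typical wrapping is a macro along
these lines (my sketch of the idea):

	/* Cheap equality test: reject on the first character before
	   paying for the strcmp call. */
	#define STREQ(a, b)	(*(a) == *(b) && strcmp((a), (b)) == 0)
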
-- 
"We're thinking about upgrading from    | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 to SunOS 3.5."              |  henry@zoo.toronto.edu  utzoo!henry

rsalz@bbn.com (Rich Salz) (06/29/91)

I do test the first character, Henry -- I'm pretty good at taking code and ideas
from experts.  The Winter 87 paper "News Need Not Be Slow" that you and Geoff
wrote was the first reference I've seen to how effective that can be.

I apologize if my "you'll want to recheck" made it sound like a smack on
the head.  Of course your article said your data was known to be old.  I
just meant to add a data point of my own (for what my points are worth)
agreeing that things are very different now.

We now return you to your standard "what does a||b do" discussion...
	/r$
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.