[comp.lang.c] hash function

torek@elf.ee.lbl.gov (Chris Torek) (06/22/91)

In an article whose referent was deleted by bogus fidonet software
(and whose tabs were changed to spaces), I provided this example
hashing function:
>>	hash = 0;
>>	for (each character)
>>		hash = (hash * 33) + (the character);
>>	hash_index = hash & ((some power of two) - 1);

In article <14491.28619071@stjhmc.fidonet.org>
Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes:
>Is the *33 used to keep short sequences' hash keys more random, or does it
>really help in the distribution of large strings too?

I am not sure what you are asking here.  There are basically three
questions you can ask about a hash function:

 a) Is it fast enough?  (complicated functions often take longer than
    the searches they replace)

 b) Does it produce a good distribution?  (the hash function
    h(input)=0 is very fast but produces horrible distributions)

 c) If (b) is true, why does it work?

The first two can really only be answered in some specific context.
In all the contexts I have tried, this one answers `yes' to both
(which is why I recommend it).

The third question is the hardest.  What *does* that 33 do?  I have no
idea.  All I know is that I have experimented with `easy' functions
(functions for which `fast enough' should generally be true) and this
one also produces a good distribution.  I doubt I still have the mail,
but when Margo Seltzer was writing the database code to replace ndbm,
she tried a bunch of different functions on a bunch of different data.
This one usually worked well---it did not always give the best
distributions, but it never performed poorly; some functions that gave
better distributions on some data gave much worse on others.

Maybe there are some theory people out there who can explain it.
Probably some staring at Knuth vol. 3 would help.
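
[Editorial sketch, not part of the original post: the `33 hash' above,
written out as a complete C function.  The names (hash33, HASHSIZE) are
illustrative only.]

	#define HASHSIZE 1024			/* some power of two */

	unsigned long
	hash33(const char *s)
	{
		unsigned long hash = 0;

		while (*s != '\0')
			hash = hash * 33 + (unsigned char)*s++;
		return hash & (HASHSIZE - 1);	/* hash_index */
	}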
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

kpv@ulysses.att.com (Phong Vo[drew]) (06/24/91)

Somewhere in this thread, Chris Torek proposed as a hash function:
> >>	hash = 0;
> >>	for (each character)
> >>		hash = (hash * 33) + (the character);
> >>	hash_index = hash & ((some power of two) - 1);
> 
Chris then asked,
> ...  What *does* that 33 do?
> 
> Maybe there are some theory people out there who can explain it.
> Probably some staring at Knuth vol. 3 would help.

A good property for a hash function is if you keep feeding it
a constant character, it performs as if it's a random number generator.
The hash function f(h,c) = 33*h + c is derived from a class of random
number generators called linear congruential. At the start of Knuth Vol 2,
there is a good discussion of these generators. Let's apply some of the
theorems in that section and see what we get. Keep in mind that we are working
with random numbers mod 2^x where x is usually 16 or 32. Assuming that c is some
fixed odd value, Theorem A guarantees that 33 is good because (33-1)%4 == 0.
By the same reasoning, 31 is not good, since (31-1)%4 == 2 != 0.
Here `good' means that if you keep feeding c, you'll cycle thru the entire
set of values [0,2^x).  On the other hand, 33 is not a primitive element mod 2^x
when x > 3 (Theorem C), so it isn't a good multiplier if you keep feeding in 0's.
For example, 37 is a better value from that point of view.  To summarize,
a good multiplier is any number with 101 as its low-order 3 bits.
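
[Editorial sketch: a small program to check the full-period claim
empirically, mod 2^16.  Since h -> (a*h + c) mod 2^16 is a permutation
for odd a, starting at 0 and counting steps until 0 reappears gives the
length of that cycle; c = 7 is an arbitrary fixed odd value.]

	#include <stdio.h>

	static unsigned long
	cycle_length(unsigned long a, unsigned long c)
	{
		unsigned long h = 0, n = 0;

		do {
			h = (a * h + c) & 0xffff;	/* mod 2^16 */
			n++;
		} while (h != 0);
		return n;
	}

	int
	main(void)
	{
		/* 33 == 1 (mod 4): full period 65536; 31 == 3 (mod 4): shorter */
		printf("a=33: cycle length %lu\n", cycle_length(33, 7));
		printf("a=31: cycle length %lu\n", cycle_length(31, 7));
		return 0;
	}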

At this point, we depart from theory.  We typically hash byte strings, and we want
the bits to mix in the 16-bit or 32-bit registers relatively fast.  This suggests
that some value larger than 2^8 (but not too large) may do better as a multiplier.
After a little experimentation, here's a pretty decent hash function:
	hash = 0;
	for(each byte c)
		hash = (hash<<8) + (hash<<2) + hash + c;
This corresponds to the multiplier 261, whose bit pattern is 100000101.
Try it, you'll like it.
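
[Editorial sketch: the 261 hash above as a complete C function; the
name and the final masking convention are illustrative.]

	unsigned long
	hash261(const char *s)
	{
		unsigned long hash = 0;
		const unsigned char *p = (const unsigned char *)s;

		while (*p != '\0')
			hash = (hash << 8) + (hash << 2) + hash + *p++;
		return hash;	/* mask with (table size - 1) to index */
	}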

torek@elf.ee.lbl.gov (Chris Torek) (06/24/91)

In an article whose referent is long gone, I described this hash function:
	for (each character)
		hash = (hash * 33) + (the character);
	hash_index = hash & ((some power of two) - 1);

and noted:
>>Maybe there are some theory people out there who can explain [why 33
>>works well].

In article <15027@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes:
>A good property for a hash function is if you keep feeding it
>a constant character, it performs as if it's a random number generator.
>The hash function f(h,c) = 33*h + c is derived from a class of random
>number generators called linear congruential. At the start of Knuth Vol 2,
>there is a good discussion of these generators.

... and indeed, if you turn to Knuth vol. 2, chapter 3, you will find all
sorts of interesting stuff, such as a proof that the low-order n bits of
the usual rand() repeat with period $2^n$.  (rand() uses a modulus of
either $2^{15}$ or $2^{31}$.)

>... Assuming that c is some fixed odd value, Theorem A guarantees that
>33 is good because (33-1)%4 == 0. ... Here good means that if you keep
>feeding c, you'll cycle thru the entire set of values [0,2^x).

This is approaching what I was getting at, but is not quite there.
Basically it says that 33 is `good' because it is `not bad'.  An LCRNG
is defined as

	$h_{n+1} = (a h_n + c) \bmod m$

	(where $h_n$ is each value of the variable `hash' in the for loop)

(in quote `>' above, $m = 2^x$).  With a `bad' multiplier for $a$---and
what is `bad' depends on $c$ and $m$---the generator will not produce all
$m$ possible values; with a `good' one it will.  Intuitively, then, the
`bad' one will miss some of the hash buckets entirely; the `good' one
will not.

But *why* is it true that `A good property is [that] it performs as if
it's a [good] random number generator'?  This is now `obvious' (from the
above result) *if* we feed in constant values of $c$, i.e., if all the
strings are made of the same characters, but are different lengths.  But
we rarely do that---the strings we hash are full of weird junk.  How do
LCRNGs get in there?  Why do the varying values of $c$ not mess things
up?  (Of course, they do, but not enough to matter.  Why not?)

>At this point, we depart from theory. We typically hash byte strings ...

Actually, I typically hash ASCII text.  33 is the `easiest' `good'
multiplier (as defined above) that is >= 26.  I have no idea whether
this is significant, or merely coincidental.

>... This suggests that some value larger than 2^8 (but not too large)
>may do better as a multiplicator. ...
>	hash = 0;
>	for(each byte c)
>		hash = (hash<<8) + (hash<<2) + hash + c;

I will see if I can get this one run on the same data as the `33 hash'.
This is more work than

		hash = (hash << 5) + hash + c;

and distribution improvements may be outweighed by additional hash time
(or may not: who knows?).
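
[Editorial sketch, not from the original post: one way to time the two
inner loops against each other.  The key, the repeat count, and the use
of clock() are all illustrative; the volatile sink keeps a compiler
from discarding the loops.]

	#include <stdio.h>
	#include <time.h>

	static unsigned long
	hash_mul33(const unsigned char *s)
	{
		unsigned long h = 0;

		while (*s != '\0')
			h = (h << 5) + h + *s++;		/* h * 33 + c */
		return h;
	}

	static unsigned long
	hash_mul261(const unsigned char *s)
	{
		unsigned long h = 0;

		while (*s != '\0')
			h = (h << 8) + (h << 2) + h + *s++;	/* h * 261 + c */
		return h;
	}

	int
	main(void)
	{
		static const unsigned char key[] = "a representative hash key";
		volatile unsigned long sink = 0;
		clock_t t0;
		long i;

		t0 = clock();
		for (i = 0; i < 10000000L; i++)
			sink += hash_mul33(key);
		printf("33:  %.2f sec\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

		t0 = clock();
		for (i = 0; i < 10000000L; i++)
			sink += hash_mul261(key);
		printf("261: %.2f sec\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
		return 0;
	}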
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/25/91)

In article <15027@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes:
> A good property for a hash function is if you keep feeding it
> a constant character, it performs as if it's a random number generator.

No, I don't think so. This is commonly true of good hash functions and
commonly false of bad ones, but keep in mind that pseudo-random
sequences have to keep working 100, 1000, 1000000 elements later, while
simple string hash functions are rarely applied to strings with more
than 80 characters.

  [ as a generator mod 2^n, 37 is better than 33 is better than 31 ]

Again, I don't think this is accurate. Each number has a sufficiently
long period mod 2^n that you don't have to worry about repeats.

> We typically hash byte strings and we want
> the bits to mix in the 16-bit or 32-bit registers relatively fast.
  [ so use a multiplier of 261 ]

Again, practically any good multiplier works. I think you're worrying
about the fact that 31c + d doesn't cover any reasonable range of hash
values if c and d are between 0 and 255. That's why, when I discovered
the 33 hash function and started using it in my compressors, I started
with a hash value of 5381. I think you'll find that this does just as
well as a 261 multiplier.
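
[Editorial sketch: Dan's variant written out; start at 5381 instead of
0, then multiply by 33 and add each character as before.  The function
name is illustrative.]

	unsigned long
	hash33_5381(const char *s)
	{
		unsigned long hash = 5381;	/* Dan's odd starting value */
		const unsigned char *p = (const unsigned char *)s;

		while (*p != '\0')
			hash = hash * 33 + *p++;
		return hash;		/* mask to the table size as before */
	}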

---Dan

kpv@ulysses.att.com (Phong Vo[drew]) (06/26/91)

> Again, practically any good multiplier works. I think you're worrying
> about the fact that 31c + d doesn't cover any reasonable range of hash
> values if c and d are between 0 and 255. That's why, when I discovered
> the 33 hash function and started using it in my compressors, I started
> with a hash value of 5381. I think you'll find that this does just as
> well as a 261 multiplier.
> 
> ---Dan

261 is just a number that was picked out of a hat that satisfies
the conditions outlined in the theorems in Knuth. These conditions are
pretty minimal, so yes, there are lots and lots of numbers that would do
just as well.  For compressors, if you work with binary files, it is not
uncommon to have strings of 0's.  In such a case, you'll want the multiplier
to have the nice properties of a RNG.  For this reason, numbers like 261
may be a little better than 33 (in either case, don't forget to start the
hash sum at some odd value!).

For everyone's entertainment, I did a simple (and admittedly contrived) test
of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
This is done by treating each number, which is stored in a word, as a
string of 4 bytes.  With 261, 927300 of these numbers are
in their own buckets, the rest are in 36350 buckets each with 2 elements.
This is almost as good as a perfect hash function. The result for 33 is
much worse. The largest bucket has 64 elements in it and there are 4050
such buckets.  The key difference is the 2^8 bit in 261, which causes the
mixing to go a little faster.
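
[Editorial sketch of the test as described; since the original byte
order is not specified, this feeds each 32-bit value high byte first,
so exact counts may differ from those above.]

	#include <stdio.h>
	#include <stdlib.h>

	#define NBUCKETS 1000000L

	int
	main(void)
	{
		long *count = calloc(NBUCKETS, sizeof *count);
		long i, max = 0;
		unsigned long n, h;

		if (count == NULL)
			return 1;
		for (n = 1000000UL; n < 2000000UL; n++) {
			h = 0;
			h = h * 261 + ((n >> 24) & 0xff);	/* the number as a */
			h = h * 261 + ((n >> 16) & 0xff);	/* 4-byte `string', */
			h = h * 261 + ((n >>  8) & 0xff);	/* high byte first */
			h = h * 261 + (n & 0xff);
			h &= 0xffffffffUL;	/* keep 32-bit arithmetic */
			count[h % NBUCKETS]++;
		}
		for (i = 0; i < NBUCKETS; i++)
			if (count[i] > max)
				max = count[i];
		printf("largest bucket has %ld elements\n", max);
		free(count);
		return 0;
	}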

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/26/91)

In article <15047@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes:
> For everyone's entertainment, I did a simple (and admittedly contrived) test
> of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
> This is done by treating each number which is stored in a word as a
> string of 4 bytes.
  [ results supposedly showing 261 much better than 33 ]

Sorry, but the mod has to be a power of 2. Also, your test is remarkably
contrived: we're talking about string hash functions, and typical
strings (a) are concentrated on a few character values, not spread
evenly over the entire character set; (b) do not all have the same high
byte; (c) do not all have the same length.

Surely you noticed that a multiplier of 256 will produce perfect hashing
in this case. Does that mean you're recommending a multiplier of 256?

The profiling statistics I've saved from various compressors form a much
more realistic sample. They show 33 as slightly better than random
hashing on typical files, 37 as slightly worse. I've never tried 261 or
31, but I'll bet 261 does worse than 33 on text files.

---Dan

evans@syd.dit.CSIRO.AU (Bruce.Evans) (06/26/91)

In article <14583@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>>>	hash = 0;
>>>	for (each character)
>>>		hash = (hash * 33) + (the character);
>>>	hash_index = hash & ((some power of two) - 1);
>
>In article <14491.28619071@stjhmc.fidonet.org>
>Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes:
>>Is the *33 used to keep short sequences' hash keys more random, or does it
>>really help in the distribution of large strings too?
>
>The third question is the hardest.  What *does* that 33 do?  I have no
>idea.  All I know is that I have experimented with `easy' functions

I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.
The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all
mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to
avoid throwing away bits. Rotating the bits instead of shifting left would
probably be a better way to avoid this.
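
[Editorial sketch of the rotate idea: rotate the running hash left so
high-order bits wrap around instead of falling off; the rotate count of
5 and the 32-bit masking are illustrative choices.]

	unsigned long
	hash_rotate(const char *s)
	{
		unsigned long hash = 0;
		const unsigned char *p = (const unsigned char *)s;

		while (*p != '\0') {
			/* rotate left 5 within 32 bits, then mix in the char */
			hash = ((hash << 5) | (hash >> 27)) & 0xffffffffUL;
			hash ^= *p++;
		}
		return hash;
	}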

The following hash function worked well for me in an assembler. It implements
the ideas that hashing all the chars in the string is unnecessary (and maybe
even bad if too many bits get shifted out) and that the middle char is more
random than leading or trailing chars.

In the program this was extracted from, the string length was already known.
Avoiding looking at all the chars in the string is not such a good idea when
strlen has to do it anyway. Also, this probably does best with a lot of
short strings.
---
extern unsigned strlen();	/* #include <string.h> */

#define SPTSIZE	1024

extern struct sym_s *spt[SPTSIZE];

#define hconv(ch) ((unsigned char) (ch) - 0x41)	/* better form for hashing */

struct sym_s **gethashptr(str)
register char *str;
{
    register unsigned hashval;
    register unsigned length;

    /* Hash function is a weighted xor of 1 to 4 chars in the string.
     * This seems to work better than looking at all the chars.
     * It is important that the function be fast.
     * The multiplication by MULTIPLIER should compile as a shift.
     */

#define MULTIPLIER (SPTSIZE / (1 << USEFUL_BITS_IN_ASCII))
#define USEFUL_BITS_IN_ASCII 6

    length = strlen(str);
    if (length <= 3)
    {
	if (length <= 2)
	    hashval  = hconv(str[-1]) * MULTIPLIER;
	else
	    hashval  = hconv(str[-2]) * MULTIPLIER,
	    hashval ^= hconv(str[-1]);
    }
    else
	hashval  = hconv(str[-(length / 2)]) * MULTIPLIER,
	hashval ^= hconv(str[-2]) << 2,
	hashval ^= hconv(str[-1]);
    return spt + (hashval ^ (hconv(str[0]) << 1)) % SPTSIZE;
}
---
-- 
Bruce Evans		evans@syd.dit.csiro.au

dik@cwi.nl (Dik T. Winter) (06/26/91)

In article <17918.Jun2608.17.5091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
 > In article <15047@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes:
 > > For everyone's entertainment, I did a simple (and admittedly contrived) test
 > > of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
 > > This is done by treating each number which is stored in a word as a
 > > string of 4 bytes.
 > 
 > Sorry, but the mod has to be a power of 2.
Why?  (If you are telling me for speed I'll scream.)
 > 
 > Surely you noticed that a multiplier of 256 will produce perfect hashing
 > in this case. Does that mean you're recommending a multiplier of 256?
Only on big-endian machines of course :-)
 > 
 > The profiling statistics I've saved from various compressors form a much
 > more realistic sample. They show 33 as slightly better than random
 > hashing on typical files, 37 as slightly worse. I've never tried 261 or
 > 31, but I'll bet 261 does worse than 33 on text files.
 > 
I did some simple tests, hashing all entries of /usr/dict/words, counting
how many entries fall in each bucket.  I did this with the four multipliers
Dan just mentioned.  I did this for 16 buckets to 8192 buckets (only powers
of two done).  I also calculated the standard deviations involved.  I will
not bore you with all those digits; the result can be stated simply: it does
not matter very much what multiplier you use.  For each multiplier there
is a number of buckets where that multiplier has the smallest standard
deviation.  So use whatever you like.
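
[Editorial sketch of such a test: hash each line of /usr/dict/words into
a power-of-two number of buckets and report the standard deviation of
the bucket counts.  The bucket count and multiplier are set at the top.]

	#include <stdio.h>
	#include <math.h>

	#define NBUCKETS 1024		/* one of the power-of-two sizes */
	#define MULT	 33		/* or 31, 37, 261 */

	int
	main(void)
	{
		static long count[NBUCKETS];
		char word[256];
		unsigned long h, total = 0;
		double mean, var = 0.0;
		long i;
		FILE *fp = fopen("/usr/dict/words", "r");

		if (fp == NULL)
			return 1;
		while (fgets(word, sizeof word, fp) != NULL) {
			char *p;

			for (h = 0, p = word; *p != '\0' && *p != '\n'; p++)
				h = h * MULT + (unsigned char)*p;
			count[h & (NBUCKETS - 1)]++;
			total++;
		}
		fclose(fp);
		mean = (double)total / NBUCKETS;
		for (i = 0; i < NBUCKETS; i++)
			var += (count[i] - mean) * (count[i] - mean);
		printf("mult %d: std dev %.3f (mean %.3f)\n",
		    MULT, sqrt(var / NBUCKETS), mean);
		return 0;
	}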
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

martin@adpplz.UUCP (Martin Golding) (06/27/91)

In <15047@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes:

>For everyone's entertainment, I did a simple (and admittedly contrived) test
>of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
>This is done by treating each number which is stored in a word as a
>string of 4 bytes. With 261, 927300 of these numbers are
>in their own buckets, the rest are in 36350 buckets each with 2 elements.
>This is almost as good as a perfect hash function. The result for 33 is
>much worse. The largest bucket has 64 elements in it and there are 4050
>such buckets. The key difference is the 8th bit in 261 which causes the
>mixing to go a little faster.

Your test is probably irrelevant, since it uses contiguous sequential numbers.
The best hash in this case is

	for (hash = i = 0; i < strlen(numstr); hash = hash*10 + numstr[i++])
		;
	hash = hash % numberofbuckets;

This will yield the best possible distribution for any number of hash 
buckets. The proof is left to the reader.

The point is not that there is a better hash for sequential numbers, but
that the performance of _any_ hash function is highly dependent on your keys;
if you have any idea of the values or kinds of values to expect, you can choose
a better function.  If you can _control_ the values of your keys, you can
sometimes get a _much_ better function; viz. the numbers above.

If all this is old and boring stuff, I apologise for interrupting you all
just to look like an idiot (flame me by email), and I now return you to
your regularly scheduled newsgroup.


Martin Golding    | sync, sync, sync, sank ... sunk:
Dod #0236         |  He who steals my code steals trash.
A poor old decrepit Pick programmer. Sympathize at:
{mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp

martin@adpplz.UUCP (Martin Golding) (06/27/91)

In <843@adpplz.UUCP> martin@adpplz.UUCP (Martin Golding) writes:

A bunch of nonsense, unless you realize I misread the previous post,
and was thinking of a string of ascii digits. Then it all becomes
clear, and I stand by everything I said, all of which was irrelevant.

Sorry.

Martin Golding    | sync, sync, sync, sank ... sunk:
Dod #0236         |  He who steals my code steals trash.
A poor old decrepit Pick programmer. Sympathize at:
{mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp

torek@elf.ee.lbl.gov (Chris Torek) (06/27/91)

>In article <14583@dog.ee.lbl.gov> I wrote:
>>The [`why'] question is the hardest.  What *does* that 33 do?  I have no
>>idea. ...

In article <1991Jun26.112724.20539@syd.dit.CSIRO.AU>
evans@syd.dit.CSIRO.AU (Bruce.Evans) writes:
>I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.
>The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all
>mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to
>avoid throwing away bits.

This answer just does not hold up: if this were the case, 65, 129,
etc., should work as well as 33.  They do not.  Why not?

>The following hash function worked well for me in an assembler. ...
>   length = strlen(str); ... hashval = hconv(str[-1]) * MULTIPLIER;

At this point, str[-1] is unlikely to exist.  The function seems to
expect a pointer to the last character, rather than the first.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov