[comp.lang.c] Hash functions in C

djones@megatest.UUCP (Dave Jones) (11/07/90)

I did a little bit of experimentation with the hash-functions which
have been posted here. Nothing scientific, mind you, but very interesting.

I wrote a program which takes a set of strings and stores each
in a hash-table. (Lookups are equivalent to stores, for our purposes,
because both operations hash a key and then look for it using rehashing.)
I made two versions, each based on my standard library hash-package, one
using the simple little add-and-shift function that I posted here, the other
using the CRC-16 algorithm. I then ran each on various data, including
source files and the UNIX spelling-checker dictionary "/usr/dict/words".
The result was that the simpler hash function consistently gave much better
store/lookup times than the CRC version did. Not even close.
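
For those who missed the earlier posting, the add-and-shift function has
this general shape (a sketch only, not character-for-character what I
posted):

	/* General shape of the add-and-shift hash (sketch only; see the
	 * earlier posting for the real thing). Each character is folded
	 * into the accumulator with a shift and an add.
	 */
	unsigned
	shift_add_hash (s, len)
	     register unsigned char *s;
	     register int len;
	{
	  register unsigned h = 0;

	  while (len--)
	    h = (h << 1) + *s++;	/* shift, then add the next char */

	  return h;			/* caller reduces mod table size */
	}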

Interestingly, the simpler hash-function was not only much quicker to
calculate, but it also gave far fewer collisions. Using the /usr/dict/words
data, each hash-function was called 24473 times -- once for each
word in the file. The shift-and-add version called the string-comparison
routine 39655 times, a collision rate of about 0.6 collisions per
lookup, which is right around the theoretical rate for this particular
algorithm. But the CRC-16 function called the string-comparison routine
no fewer than 270,662 times, for a collision rate of about 10.0
collisions (and re-collisions) per lookup!
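
(To spell out the arithmetic: each of the 24473 lookups accounts for one
comparison against the key it finds, so the comparisons beyond 24473 are
the collisions:

	shift-and-add:  (39655  - 24473) / 24473  ~=  0.62
	CRC-16:         (270662 - 24473) / 24473  ~= 10.06  )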

I think I can rationalize that. The CRC-16 function does indeed
spread the set of all strings uniformly over the interval 0 .. 2**16 - 1.
But for n < 16 it does not uniformly spread short strings over the interval
0 .. 2**n - 1 in the bottom n bits. Also consider that in ASCII (w/o parity)
we are really only dealing with bit-streams in which every eighth bit
is a zero -- a very regular pattern for a 16-bit checksum. That may
have something to do with it, I dunno. A little experimenting suggests
that it may be possible to devise better polynomials than the CRC-16.
I got one that seems to work pretty well. But I don't think it can
distribute the keys well enough to compensate for the extra time it takes
to calculate a check-sum. (Besides that, any hash-function which does not
at least take into account the range of the printable characters will
probably distribute typical text-strings less than optimally, I suspect.)
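
For reference, this is the general shape of a bit-at-a-time CRC-16 used
as a string hash. (A sketch only: I'm showing the common reflected
polynomial 0xA001, i.e. x^16 + x^15 + x^2 + 1; substitute whatever
polynomial you want to experiment with.)

	/* Sketch of a bit-at-a-time CRC-16 used as a string hash.
	 * The reflected polynomial 0xA001 (x^16 + x^15 + x^2 + 1) is
	 * the common choice; any 16-bit polynomial can be dropped in.
	 */
	unsigned short
	crc16_hash (s, len)
	     register unsigned char *s;
	     register int len;
	{
	  register unsigned short crc = 0;
	  register int i;

	  while (len--) {
	    crc ^= *s++;			/* fold in the next byte */
	    for (i = 0; i < 8; i++)		/* then clock out 8 bits */
	      crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : (crc >> 1);
	  }
	  return crc;			/* caller reduces mod table size */
	}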

djones@megatest.UUCP (Dave Jones) (11/08/90)

I just now tested some hash-functions for an application which is
probably more typical, namely a compiler. The simplest function is
still a big winner on all counts.

I ran tests on a standard Pascal header-file containing 6290 identifiers,
including many duplicates (such as "procedure"). I got 370 collisions
using my hash-function, for a rate of only 0.06 collisions per lookup!
That was with a real big table -- lots of empty slots. When I reduced the
table size so that only about half the slots were empty, I got a little
under one collision per lookup, which matches the theory: with half the
slots empty, each probe collides with probability 1/2, so the expected
number of collisions is 1/2 + (1/2)(1/2) + ... = 1. Under those
circumstances, anything around one should generally be considered pretty
good. Surprisingly, the CRC-16 hash-function was not too bad in this test,
giving a collision rate of 1.24. Why it did so poorly on the
dictionary, I don't know. And of course the relative slowness of
calculating it also matters, so I don't think it has anything at
all going for it as far as hashing char-strings is concerned. I'm wondering
why it was even suggested.

oz@yunexus.yorku.ca (Ozan Yigit) (11/13/90)

In article <14461@goofy.megatest.UUCP> djones@megatest.UUCP 
[Dave Jones] writes:

>I just now tested some hash-functions for an application which is
>probably more typical, namely a compiler. The simplest function is
>still a big winner on all counts.

Dave, you may also wish to take a look at the following.

%A B. J. McKenzie
%A R. Harries
%A T. Bell
%T Selecting a Hashing Algorithm
%J Software - Practice and Experience
%V 20
%N 2
%D Feb 1990
%P 209-224

This interesting article evaluates some of the more commonly known hash
functions from a few sources: the Amsterdam Compiler Kit, the ETH Modula-2
cross-compiler, gnu-cpp, gnu-cc1, pcc (4.3BSD), AT&T C++, and Icon.

Their conclusions suggest that the algorithms of the style

	h(0) = 0; h(i) = k * h(i-1) + ch(i)	[for 1 <= i <= n]
	hash = h(n) MOD N

perform well provided k (a constant multiplier) and N (the table size) are
chosen well. They feel that most implementations of hash functions contain
far-from-optimal choices [so, what else is new?]. They also state:

	The experiments have shown that very small variations in N can
	produce large variations in the efficiency of the hash-table
	look-up, and that the popular view, that choice of a prime
	number will automatically ensure a good result, is not well
	founded.
								[p. 224]

They found that one of the simplest hashing functions (gnu-cpp's) performed
well, both in speed and number of probes [included verbatim from gnu-cpp;
k = 4, N = 1403, odd but not prime]:
	
	typedef unsigned char U_CHAR;	/* not in the excerpt; needed to compile */

	#define HASHSIZE 1403
	#define HASHSTEP(old, c) ((old << 2) + c)
	#define MAKE_POS(v) (v & ~0x80000000) /* make number positive */
	
	hashf (name, len, hashsize)
	     register U_CHAR *name;
	     register int len;
	     int hashsize;
	{
	  register int r = 0;
	
	  while (len--)
	    r = HASHSTEP (r, *name++);
	
	  return MAKE_POS (r) % hashsize;
	}
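
For what it's worth, a call looks like this (my example, not from the
paper):

	slot = hashf ((U_CHAR *) "procedure", 9, HASHSIZE);  /* 0 .. 1402 */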


[ps: I have not yet tested this function, so I do not know if their results
are easily reproducible. For example, would making r unsigned and avoiding
the MAKE_POS crock slow this function down?]
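
For concreteness, that variant would look like this (equally untested; a
sketch only). With r unsigned, the modulus is computed in unsigned
arithmetic, so the result cannot be negative and MAKE_POS drops out:

	unsigned
	hashf2 (name, len, hashsize)
	     register U_CHAR *name;
	     register int len;
	     unsigned hashsize;
	{
	  register unsigned r = 0;	/* unsigned accumulator */

	  while (len--)
	    r = HASHSTEP (r, *name++);

	  return r % hashsize;		/* already non-negative */
	}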

enjoy..		oz
---
First learn your horn and all the theory.	Internet: oz@nexus.yorku.ca
Next develop a style. Then forget all that	uucp: utzoo/utai!yunexus!oz
and just play.		Charlie Parker [?]	York U. CCS: (416) 736 5257