[net.lang] hash function generator

leichter@yale-com.UUCP (Jerry Leichter) (10/04/83)

There is work of two sorts done in this direction.  First, there are some
articles on generating PERFECT hash functions for given sets of inputs.
A perfect hash function has no collisions at all.  You would use something
like this to produce a very quick hash-table lookup for keywords in a compiler.
You can find these articles in the CACM within the last 5 years or so.

Second, there is a general approach, due to Carter and Wegman.  They
introduce "universal classes of hash functions".  Such a class has the
property that a randomly chosen member is highly likely to give a very good
hash ("highly likely" and "very good" having specific meanings in this context)
for ANY particular set of inputs to be hashed.  The only place I know of in
which this has been written up is in Proceedings of the 9th SIGACT (1977),
page 106.  They prove some practical classes of functions - essentially of
the form Ax+B mod C, where C is fixed for the class and A and B vary to give
all class members - are universal.

Beyond this, getting any sort of guarantees of the quality of a hash function
for a given, random set of inputs is extremely difficult.
							-- Jerry
					decvax!yale-comix!leichter leichter@yale

condict@csd1.UUCP (Michael Condict) (10/05/83)

Yes, there is are several ways to generate such hash functions.  One such
method was reported in Communications a few years back (1979?) in an article by
my old friend Richard Cichelli (sorry, Richard, you're not that old).  If he is
on the net I am sure he will be happy to respond to your request.  Meanwhile
I'll tell you what I can remember.

The assumption is slightly different from what you stated in that we allow
the hash function to produce any output it wants (between 1 and n), with the
objective, of course, of minimizing collisions and maximizing density of the
hash table.  So the input to the generator is a list of character strings
and an allowed range 1..n, rather than a list of string/number pairs.

A hash function is said to be perfect if it produces no collisions over the
entire set of allowed inputs.  It is said to be minimal if it produces a
completely dense hash table, that is, if its set of outputs form a contiguous
run of numbers.  As I recall, Cichelli's method would produce a minimal
perfect hash function in a quite reasonable amount of time (a few seconds
or minutes to get one for the 32 Pascal reserved words, using a PDP/11).

The form of the hash function was very simple.  It consisted of

	h(str) = code1[str[1]] + code2[str[strleng]] + strleng;

where str is the array of characters that is the input to the hash function,
strleng is its length, and code1, code2 are two arrays that are produced
by the generator.  They are of type: array [char] of integer;  Their purpose
is to convert the first and last characters of the string into numbers that,
when added with the string length will produce a unique number for each
string.  Of course one could use the 2nd, 3rd or other characters in str as
well, but remember that only the first and last are guaranteed to exist.
The problem, clearly, is that for some sets of inputs the combination of
first char, last char and length is not unique, in which case the method will
fail to produce a perfect hash function.

Sorry I cannot remember the algorithm at all, but this should tell you if it
is what you are looking for, and it shouldn't be hard to find in CACM.

Michael Condict   ...!cmcl2!csd1!condict
New York U.

charlie@genrad.UUCP (Charlie Havener) (10/05/83)

I have implemented the Cichelli Algorithm for finding Minimal Perfect
Hashing Functions in a C program under UNIX 4.1BSD. Is there enough
interest for me to post it to net.sources? Send me mail
...!decvax!genrad!charlie