[mod.compilers] Looking for Minimal Perfect Hash Functions

johnl@ima.UUCP (01/25/87)

Hi,

Has anyone implemented a program for finding minimal perfect hashing
functions?  The best reference that I have for them is CACM May '85 "A
Polynomial Time Generator for Minimal Perfect Hash Functions".  The
algorithm is somewhat lengthy (and messy) so I would really not like
to have to re-invent the wheel.  If you have such a beastie, could you
please send me a copy?  Thanks.  And maybe mod.sources would like a
copy too.

Thanks in advance,
Tony ;-)
[I haven't seen anything since then, but as always encourage submissions
from readers. -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (01/28/87)

Original-sender:

I am curious to know if any really uses this technique (of minimal
perfect hashing).  The reason I ask is that it involves polynomials
which are rather time consuming to compute.  Or am I wrong about that?
I always use a binary search method since that comsumes less than 10
compares of strings.  Usually the compares fail on the first
character.  So in about 10 add's and divide by two (hopefully done
with a shift) followed by a compare, you are done.  That's pretty
quick.  Also, the keyword list can change easily without much rewrite
of the code.

What does the typical "real" polynomial come out to be?  Does it
really take less time than the binary search method.  (I would judge
this based upon a miss of the table since most identifiers in a
program are not keywords.)
-- 
Perry Smith
pedz@bobkat
{ti-csl,infotel}!pollux!bobkat!pedz
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (02/03/87)

In response to the article suggesting that binary search might be more
efficient than a minimal perfect hashing function for keyword search:

You may be right about the superiority of binary search over the use of
hashing in a keyword table -- it depends crucially on several details
which most people tend to overlook when they describe their favorite
keyword-determination algorithm.  For instance, when you do your "up to
10 string compares", do you call a function each time?  If so, it almost
certainly loses to a good hashing scheme.  (Actually, 10 compares seems
too high, since the size of the average keyword table is probably
about 50 items, but even 6 compares might be slower than hashing.)

As for hashing schemes, the efficiency issues include: time to calculate
the hash function; number of keywords that collide ("perfectness");
expected ratio of non-keywords to keywords; density of keywords in the
table ("minimalness").  Although it is obviously an advantage to have a
perfect hashing function, because there need be no mechanism to resolve
collisions, it actually WASTES time to have a minimal sized table,
because every non-keyword will then collide with a keyword when hashed,
requiring a string compare to verify that it is not a keyword.  Unless
you are cramped for space, or most of your identifiers are expected to
be keywords, the best density for your hash table is less than 1/5 full,
so that 4 out of 5 non-keyword identifiers will hash to an empty location,
which can be instantly recognized as such without a string compare.

Given that you don't want a MINIMAL hash function, the construction
of a PERFECT, very fast one is not hard at all.  This is how I did it
for two different languages:  With a list of keywords in front of
you, find two or three trivally computable attributes of an identifier that
distinguish all the keywords from one another.  These attributes can almost
always be chosen from the following: first, second or third letters;
identifier length; last or second last letters.  This should take you
about half an hour at most.  Now, let's say you decide to use first letter,
third letter and identifier length.  The next decision is to choose a
table size large enough to make most identifiers hash to empty positions,
say 257 (a prime number is best).  The only remaining task is to write a
program that finds values for the variables (i, j, k) in the function below,
such that there are no collisions:

	len = strlen(id);
	hash = ( i * id[0] + j * id[len>=3 ? 2 : len] + k * len ) % 257;

The program can start with, say, (i,j,k) = (1,10,40) and iterate through
increasing values in any order you choose.  Make sure, however, that i,j
and k are large enough to allow at least 257 different values to be pro-
duced for the mod (%) operation.  You will find that, with less than 50
keywords and a density of less than 1/5, you don't have to try many
combinations of i,j and k to hit a perfect hash.  My program ran for a
couple of minutes in both cases.  The above hash function takes three
multiply's, two adds, a mod, and a check on the length of the id.  Unlike
the binary search method, no additional string compare is necessary for 4
out of 5 non-keyword identifiers, although every keyword requires a string
compare.  (I assume that you would already have in hand the length, "len",
of the identifier, if you care at all about efficiency, so I do not include
the time to calculate strlen as part of the hash.  In the typical lexer that
reads identifiers into a token buffer, it can be computed in one instruction
by subtracting the token buffer address from the pointer to the end of the
token in the buffer.)

My program to find a perfect hash function as described above is all of 93
lines of C, including reading the list of keywords from a file into an array,
and printing out the values of i, j, k and the hashed values for each
keyword.  If anyone thinks it worthwhile, I'd be happy to mail it to them.

Michael Condict		{ihnp4|vax135}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request