[comp.compilers] Hash specifics

degroot@cpsc.ucalgary.ca (Adriaan Degroot) (12/12/90)

How does one build a hash function with minimal table size and minimal
collisions for a known set of hash strings?

i.e. I want to hash the hundred-plus opcodes for the Intel 386 processor
into the smallest possible table with the fewest collisions.  Is there any
way of quickly finding a function to do that, or am I stuck with endless
tweaking of the hash parameters?  If anyone has done this before, please
let me know by email to degroot@cpsc.ucalgary.ca before Dec 20th or so
(after that, mail will bounce).

This is NOT homework.

[See the next message. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

johnl@iecc.cambridge.ma.us (John R. Levine) (12/12/90)

In article <9012111913.AA07310@cs-sun-fsa.cpsc.ucalgary.ca> you write:
>How does one build a hash function with minimal table size and minimal
>collisions for a known set of hash strings?

The problem of designing a hash function that maps N identifiers into
unique slots in an N entry table is known as minimal perfect hashing
(merely "perfect" if the table may be larger than N).  It has been studied
sporadically over the years.  There are algorithms known, but they are all
very slow.  Here are some references:

R. J. Cichelli, "Minimal perfect hash functions made simple," CACM 23:1,
Jan 1980, pp 17-19.

G. Jaeschke and G. Osterburg, "On Cichelli's minimal perfect hash functions
method" (comment on the article above), CACM 23:12, Dec 1980, pp. 728-729.

G. Jaeschke, "Reciprocal hashing: a method for generating minimal perfect
hash functions," CACM 24:12, Dec 1981, pp. 829-833.

R. Sprugnoli, "Perfect hashing functions: a single probe retrieving method
for static sets," CACM 20:11, Nov 1977, pp. 841-850.

A brief visit to my 40 shelf-feet of computer journals didn't turn up
anything more recent except for some data base applications.  If someone
has a breakthrough in perfect hashing to report, I'd love to hear about it.
-- 
Regards,
John Levine, johnl@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

chased@Eng.Sun.COM (David Chase) (12/12/90)

> If someone has a breakthrough in perfect hashing to report, I'd love to
> hear about it.

CACM 33:6 (June 1990), "Fast Hashing of Variable-Length Text Strings,"
by Peter K. Pearson.

He discusses use of hash functions of the form:

  h = 0;
  n = length(string);

  for i = 1 through n do
    h = Table [ h XOR string [i]];
  enddo;

  return h;

where Table is a 256-byte array.  Each character in the string hashed
costs an XOR operation and an indexed read from memory (above the costs
of reading the character).  This function can "sometimes be modified to
produce a minimal perfect hashing function over a modest list of
words."

It's worth a look; I found it to be pleasant reading.

David Chase
Sun Microsystems
[It's a cute hack.  I note that he references the 1977 and 1980 articles
listed in my previous article, nothing else since then. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

pardo@cs.washington.edu (David Keppel) (12/12/90)

>In article <9012111913.AA07310@cs-sun-fsa.cpsc.ucalgary.ca>:
>>[Perfect hash functions?]

johnl@iecc.cambridge.ma.us (John R. Levine) writes:
>[References.]

See also `gperf', a part of the `libg++' distribution from the Free
Software Foundation.  The g++ libraries are available via anonymous
ftp from `prep.ai.mit.edu'.  (Please ftp before/after hours if you
can!)

	;-D oN  ( Perfect Chowhouse Hash )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

henry@zoo.toronto.edu (Henry Spencer) (12/13/90)

>>How does one build a hash function with minimal table size and minimal
>>collisions for a known set of hash strings?
>
>The problem of designing a hash function that maps N identifiers into
>unique slots in an N entry table is known as minimal perfect hashing...

Note that he didn't ask for zero wasted space or zero collisions, just
minimal amounts of both.  Interpreting "minimal" not too strictly, this
doesn't require perfect hashing.  I don't have any specific algorithms to
suggest for imperfect hashing, however -- it's never seemed difficult to get
a table with only one or two collisions and not much extra space.

Actually, I have never figured out why people seem so obsessed with perfect
hashing, since the hash functions typically seem noticeably expensive, and
it has always seemed to me that a fast hashing function plus fast resolution
of occasional collisions would be easier and just as good.
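
What Henry describes might look like this (a sketch: an arbitrary cheap
hash over a somewhat roomy power-of-two table, with linear probing to
resolve the occasional collision):

#include <string.h>

#define TABLE_SIZE 256                  /* power of two, a bit roomy */

static const char *table[TABLE_SIZE];   /* NULL means empty */

static unsigned cheap_hash (const char *s)
{
  unsigned h = 0;

  while (*s)
    h = h * 31 + (unsigned char) *s++;
  return h & (TABLE_SIZE - 1);
}

void insert (const char *key)           /* table is kept far from full */
{
  unsigned i = cheap_hash (key);

  while (table[i] != NULL)
    i = (i + 1) & (TABLE_SIZE - 1);
  table[i] = key;
}

int lookup (const char *key)            /* slot index, or -1 if absent */
{
  unsigned i = cheap_hash (key);

  while (table[i] != NULL)
    {
      if (strcmp (table[i], key) == 0)
        return (int) i;
      i = (i + 1) & (TABLE_SIZE - 1);
    }
  return -1;                            /* hit an empty slot */
}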
-- 
Henry Spencer at U of Toronto Zoology, henry@zoo.toronto.edu   utzoo!henry
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

baxter@zola.ICS.UCI.EDU (Ira Baxter) (12/13/90)

In <1990Dec12.221425.569@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>Actually, I have never figured out why people seem so obsessed with perfect
>hashing, since the hash functions typically seem noticeably expensive, and
>it has always seemed to me that a fast hashing function plus fast resolution
>of occasional collisions would be easier and just as good.

As far as I can tell, perfect hashing costs extra.

If you have a lexer independent of the parsing mechanism, so that it
hasn't got any idea whether the next string is a keyword or not
(assemblers scanning the opcode field violate this requirement), then
you'll have to do a symbol/string table probe anyway, which can't be done
using a perfect hash.  So why not put the keywords into the string table
with a tag bit noting that they are keywords?  The lexer now collects a
word-atom and looks it up; if tagged, it reports the corresponding keyword;
if not, a user symbol.  This scheme saves the *additional* cost of doing a
perfect hash.
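
A sketch of the scheme (names are illustrative):

enum { TOK_IDENT, TOK_IF, TOK_WHILE /* ... one code per keyword ... */ };

struct symbol
{
  const char *name;
  int token;            /* a keyword's code, or TOK_IDENT for user ids */
};

/* At startup, enter each keyword into the ordinary symbol table with its
   token code; symbols created later default to TOK_IDENT.  After that
   the lexer needs nothing beyond the lookup it does anyway: */
int classify (const struct symbol *sym)
{
  return sym->token;
}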

--
Ira Baxter

-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

rfg@ncd.com (Ron Guilmette) (12/13/90)

In article <9012111913.AA07310@cs-sun-fsa.cpsc.ucalgary.ca> degroot@cpsc.ucalgary.ca (Adriaan Degroot) writes:
>How does one build a hash function with minimal table size and minimal
>collisions for a known set of hash strings?

As others have already noted, Doug Schmidt wrote a cute program called
gperf (in C++) which is useful for finding perfect hash functions.

Among the possible command line arguments for Doug's program are a set of
options which allow you to specify (for instance) the particular character
positions from the input (search key) strings which you want to
participate in the hashing.  So (for instance) you can tell the program
only to hash characters 1 thru 3 and the last 2 characters of each string.
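
With today's GNU gperf, that choice is spelled with the -k option (I'm
assuming the 1990 version was similar); for example, to hash characters
1 through 3 plus the last character of each key:

	gperf -k'1-3,$' opcodes.gperf > opcode-hash.c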

The unfortunate part (and this is something that I ragged on Doug about at
length) is that the program is not set up to automatically try various
possibilities for the `hash bytes subset' option(s).  In other words, you
may tell the program to go and look for a perfect hash function for a
certain input (key) set, a certain table size, and a certain `hash bytes
subset', and it may simply come back and tell you that it ain't possible.

Also, the program only uses one particular (simple & fast) hash function.
I can't recall exactly what it does, but that doesn't matter for the point
I'm making, which is this: say the hash function simply XORs the bytes
together and then masks off the low bits to get the hash key.  gperf will
never tell you that you might have gotten a faster hash function (or even
one that actually works given your stated table size constraints) if, for
example, the function had simply *ADDED* the byte values together rather
than XOR'ing them.  In other words, gperf explores only a small area within
the (multidimensional?) space of all possible (or all useful) hash
functions.

That's not to say that it isn't a nice and/or useful tool.  It is both.

-- 

// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

rfg@uunet.UU.NET (Ron Guilmette) (12/13/90)

In article <9012112133.AA00452@rbbb.Eng.Sun.COM> chased@Eng.Sun.COM (David Chase) writes:
+CACM 33:6 (June 1990), "Fast Hashing of Variable-Length Text Strings,"
+by Peter K. Pearson.
+
+He discusses use of hash functions of the form:
+
+  h = 0;
+  n = length(string);
+
+  for i = 1 through n do
+    h = Table [ h XOR string [i]];
+  enddo;
+
+  return h;

Some time back, I wrote a program which used the following hash function
to try (for a given set of input strings) to find an "efficient" hash
function for that set (and some particular table size).

I defined "efficiency" somewhat arbitrarily as the number of members of the
input (key) set divided by the number of "probes" (or "key comparisons")
which would be needed to find each and every member of the input set in the
hash table exactly once.  For example, 100 keys found with 120 comparisons
in total rate 100/120 = 0.83; a perfect hash rates exactly 1.0.  (Note that
I was assuming that the hash table in question would be the kind used in
compilers & assemblers; i.e. one which would be fully "precomputed" ahead
of time and used only for lookups, and NEVER for ADDS or DELETES.)

static unsigned hash (const char *s, unsigned prime)
{
  unsigned ret_val = 0;

  /* Fold each character in, then scramble with the prime multiplier. */
  for (; *s; s++)
    ret_val = (*s ^ ret_val) * prime;
  /* hash_mask (2**N - 1) reduces the result to the table size. */
  return ret_val & hash_mask;
}

This hash function takes a second parameter `prime' which is used to help
produce `well mangled' output keys from the input keys (i.e. strings).
Knuth documents the fact that using prime numbers in hash functions tends
to make them more "efficient" (i.e. it tends to make the output keys more
evenly distributed over the range).

(Note also that this function assumes that `hash_mask' is some constant
defined earlier on.  Typically, it has a value like 2**N-1, where N is
some integer.  The `primary' hash table size is 2**N.)

The program I wrote tried various values for `prime' up to some fixed
limit (e.g. 8K).  I ran the program for various sets of input strings
(both large and small) and for various table sizes.
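
A minimal reconstruction of that search (the key list and table size are
placeholders, and the loop steps through odd candidates where the original
restricted itself to primes):

#include <stdio.h>

#define TABLE_BITS 8
#define TABLE_SIZE (1u << TABLE_BITS)

static const unsigned hash_mask = TABLE_SIZE - 1;

static unsigned hash (const char *s, unsigned prime)
{
  unsigned ret_val = 0;

  for (; *s; s++)
    ret_val = (*s ^ ret_val) * prime;
  return ret_val & hash_mask;
}

/* Efficiency as defined above: N keys divided by the total number of key
   comparisons needed to find each key once, collisions being chained. */
static double efficiency (const char **keys, int n, unsigned prime)
{
  int depth[TABLE_SIZE] = {0};
  long probes = 0;
  int i;

  for (i = 0; i < n; i++)
    {
      unsigned h = hash (keys[i], prime);
      depth[h]++;           /* this key lands at depth[h] in its chain, */
      probes += depth[h];   /* so finding it costs that many compares   */
    }
  return (double) n / (double) probes;
}

int main (void)
{
  static const char *keys[] = { "mov", "add", "sub", "jmp", "call", "ret" };
  int n = sizeof keys / sizeof keys[0];
  unsigned p;

  for (p = 3; p < 8192; p += 2)
    printf ("%4u: %.3f\n", p, efficiency (keys, n, p));
  return 0;
}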

The program reported the "efficiency" rating of the primes tested.
Curiously, the number 47 always rated in the top 10 regardless of the set
of input keys used or the primary table size used.  I have no explanation
for this.

Moral of the story: If you need a reasonably efficient hashing function
for some arbitrary set of strings (e.g. identifiers or keywords), and if
you need it fast, use the function shown above and substitute `47' for
`prime'.

P.S.  The above code is not copyrighted, but its `look & feel' may be. :-)

-- 

// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

rfg@ncd.com (Ron Guilmette) (12/13/90)

In article <1990Dec12.221425.569@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>
>Actually, I have never figured out why people seem so obsessed with perfect
>hashing, since the hash functions typically seem noticeably expensive, and
>it has always seemed to me that a fast hashing function plus fast resolution
>of occasional collisions would be easier and just as good.

Henry's point is well taken; however, there are *some* places where I think
it is probably worth the effort to find a *perfect* hash function.  Keyword
tables in compilers and opcode tables in assemblers come to mind.

So perfect hash functions may be useful sometimes.

Note however that *minimal* perfect hash functions are not only never
absolutely *required*, but (as Doug Schmidt pointed out to me) they may
actually be *inefficient*!!!

The reason is that with a not-quite-minimal (but still perfect) hash
function, you will never have any "collisions" (so you don't have to waste
time dealing with such cases).  Also, if you happen to look up an input key
which does not even have a corresponding entry in your hash table, it may
well hash to a hash key whose corresponding primary table entry is empty!
In such cases, you actually SAVE TIME (relative to the case where you have
a *minimal* perfect hash function) because you won't have to do even a
single comparison of complete keys.
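
In code, the saving on a miss looks like this (a sketch; `table' and
`perfect_hash' are hypothetical):

#include <string.h>

struct entry { const char *name; int value; };

extern struct entry table[];                /* size > N: some slots empty */
extern unsigned perfect_hash (const char *key);

const struct entry *lookup (const char *key)
{
  const struct entry *e = &table[perfect_hash (key)];

  if (e->name == NULL)          /* empty slot: key absent, zero compares */
    return NULL;
  return strcmp (e->name, key) == 0 ? e : NULL;
}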

[In a later message, Ron also points out:]

See also "The Art of Computer Programming" by Donald Knuth, book 3,
section 6.4.

-- 

// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

bart@cs.uoregon.edu (Barton Christopher Massey) (12/14/90)

In article <9012122107.aa06454@ICS.UCI.EDU> Ira Baxter <baxter@zola.ICS.UCI.EDU> writes:
> If you have a lexer independent of the parsing mechanism, so that it
> hasn't got any idea whether the next string is a keyword or not
> (assemblers scanning the opcode field violate this requirement), then
> you'll have to do a symbol/string table probe anyway

You can very efficiently decide whether the next input belongs to a given
set of keywords or is an identifier as you read it in, using a DFA or some
similar recognizer.  Given the wide availability of FLEX, this is the
approach I almost always use -- it dodges both problems at once: efficient
keyword hashing and separating identifiers from keywords.

I tend to agree with Henry Spencer that on modern machines, "imperfect
hashing" is likely to be good enough -- after all, how many keywords are we
talking about, anyway?  A couple hundred, max?  Let's just increase the
table size until some really cheap hash function doesn't generate any
collisions.  How big a table will we end up with?  A couple thousand
entries?  That's "only" 8K of pointers on a 32-bit box...  An interesting
exercise would be a program which, for a given set of keywords, and a given
table size limit, tried more and more expensive hash functions until the
keywords hashed without collisions into the table and then emitted the
function.  Kind of like gperf with less lofty ambitions :-).

					Bart Massey
					bart@cs.uoregon.edu
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

roth@resi.rice.edu (Jerry Roth) (12/15/90)

In article <1990Dec13.201211.6483@cs.uoregon.edu> bart@cs.uoregon.edu (Barton Christopher Massey) writes:
>You can very efficiently decide whether the next input belongs to a given
>set of keywords or is an identifier as you read it in, using a DFA or some
>similar recognizer.  Given the wide availability of FLEX, this is the
>approach I almost always use -- it dodges both problems at once: efficient
>keyword hashing and separating identifiers from keywords.

I have always wondered which method is truly more efficient. The two
methods are:

(a) Preload all keywords into the hash table and have the lexer distinguish
identifiers from keywords via lookup. Keywords and identifiers are then
treated the same until after the lookup.

(b) Add a regular expression for each keyword to the lexer followed by a
regular expression to catch identifiers. Keywords are then handled directly
while only the identifiers get hashed.

It seems method (a) has the advantage of keeping the lexer simple, in that
it has fewer expressions to attempt to match, while method (b) avoids
hashing keywords.
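
For concreteness, method (b) as a flex fragment might look like this (token
codes invented; link with -lfl to get the default main and yywrap):

%{
enum { TOK_IF = 258, TOK_WHILE, TOK_IDENT };
%}
%%
if                      { return TOK_IF; }
while                   { return TOK_WHILE; }
[A-Za-z_][A-Za-z0-9_]*  { return TOK_IDENT; /* only these get hashed */ }
%%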

Can anyone tell me if adding more expressions to my lexer (one per keyword)
causes it to execute less efficiently, assuming that I am using LEX/FLEX to
create it?  If so, is it less efficient than hashing each keyword when
encountered?

Thanks,
Jerry Roth
[It is my impression that the speed at which a flex scanner does what it does
is unrelated to the complexity of the state machine, although its size goes
up, but I'd be glad to hear from someone who knows definitively. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

oz@nexus.yorku.ca (12/15/90)

In article <2977@lupine.NCD.COM> rfg@ncd.com (Ron Guilmette) writes:
>
>As others have already noted, Doug Schmidt wrote a cute program called
>gperf (in C++) which is useful for finding perfect hash functions.

A PD almost-perfect hash generator based on Cichelli's work has been around
since the early 80s.  I include it below.  I have used it successfully in a
few places.  Here is the diag output of the program for C keywords [note
that in this case it generates a table of 51 entries for 36 keywords]:

| Perfect hash function finder, CDH Ver. 2.9
| Start time = Fri Dec 14 12:11:42 1990
| 36 keywords, 21 distinct letters.
| The associated char value limit is 21
| The table size limit is 100
| The search ordering is
|  else double case continue float typeof typedef default
|  const short struct sizeof signed static extern inline int
|  if for enum volatile void while char register do return
|  unsigned __alignof switch auto goto union asm long break
| 
| Success! Associated Char Values Follow:
|  _ =18, a =11, b = 9, c = 1, d = 0, e = 0, f = 3, g = 9, h =15,
|  i = 2, k =21, l =20, m =18, n =12, o =21, r =21, s =10, t = 5, u =21,
|  v =20, w =20,
| Hash min = 4, max = 50, spread = 47
|  else 4, case 5, double 6, if 7, inline 8,
|  continue 9, int 10, const 11, default 12, float 13,
|  typeof 14, typedef 15, signed 16, static 17, extern 18,
|  sizeof 19, short 20, struct 21, enum 22, do 23,
|  void 24, while 25, char 26, for 27, volatile 28,
|  unsigned 29, __alignof 30, switch 31, asm 32, long 33,
|  goto 34, break 35, auto 36, union 38, return 39,
|  register 50,
| 
| Total search invocations = 13727, max depth = 35
|  Started Fri Dec 14 12:11:42 1990
| Finished Fri Dec 14 12:11:47 1990


enjoy...	oz

ps: If you hack this program to look & work nicer, plz drop me a copy.
---
Internet: oz@nexus.yorku.ca, UUCP: utzoo/utai!yunexus!oz
[I have added the perfect hash program to the comp.compilers library,
since it's kind of big to put in a message.  Drop me a line if you want
a copy. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

evan@Sun.COM (Evan Bigall) (12/16/90)

In article <2980@lupine.NCD.COM> rfg@ncd.com (Ron Guilmette) writes:
   Henry's point is well taken; however, there are *some* places where I
   think it is probably worth the effort to find a *perfect* hash function.
   Keyword tables in compilers and opcode tables in assemblers come to mind.

The time I always find myself looking for a perfect hash function is when
I am parsing long keyword options (i.e. -ansi, etc.) or some other case
where I have a small set of possibilities but want to avoid writing:

	if (!strcmp("hello", arg))
	  {}
	else if (!strcmp("there", arg))
	  {}
	else if (!strcmp("world", arg))
	  {}
        else 
	  error;

Maybe I'm just being silly, but the above code, and the tiny hand-crafted
lexers I usually end up with, both make me gag.

Because these situations often come up several times within one program, I'd
much rather put in a quick perfect hash than have to put in the code to deal
with collisions.
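
For a set this small, a perfect hash can often be found by inspection.  A
toy example for the three strings above (the fifth character alone happens
to separate them):

#include <string.h>

static const char *words[3] = { "hello", "world", "there" };

/* Returns 0, 1, or 2 for "hello", "world", "there"; otherwise -1. */
int option_index (const char *arg)
{
  unsigned h;

  if (strlen (arg) != 5)
    return -1;
  h = (unsigned char) arg[4] % 3;       /* 'o'->0, 'd'->1, 'e'->2 */
  return strcmp (words[h], arg) == 0 ? (int) h : -1;
}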

I have one program with 15 or so different instances of this situation.  I
ended up giving them start conditions and just calling the (already terribly
complicated) flex lexer, but that seems like overkill for everyday use.

/Evan
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

vern@horse.ee.lbl.gov (Vern Paxson) (12/17/90)

In article <1990Dec14.164415.326@rice.edu> roth@resi.rice.edu (Jerry Roth) writes:
> Can anyone tell me if adding more expressions to my lexer (one per keyword)
> causes it to execute less efficiently, assuming that I am using LEX/FLEX to
> create it? ...
> ...
> [It is my impression that the speed at which a flex scanner does what it does
> is unrelated to the complexity of the state machine, although its size goes
> up, but I'd be glad to hear from someone who knows definitively. -John]

Right, a finite automaton scanner like flex's looks at each character just
once, regardless of the complexity of the patterns it is matching against (*).
Adding more regular expressions to the rule set makes the scanner tables
bigger but does not change the speed of scanning (except for second-order
effects like page faults).  So if space is not a concern, it's easiest and
fastest to match keywords using a separate rule for each one.

		Vern

	Vern Paxson				vern@helios.ee.lbl.gov
	Real Time Systems			ucbvax!helios.ee.lbl.gov!vern
	Lawrence Berkeley Laboratory		(415) 486-7504

(*) There are some cases when the scanner must rescan text, though they
    don't arise when using individual rules to match keywords plus a
    generic "identifier" rule to match all non-keyword identifiers.
    flex's -b option helps identify and eliminate these cases.
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

schmidt@oberkampf.ICS.UCI.EDU (Doug Schmidt) (12/17/90)

In article <14101@june.cs.washington.edu> pardo@cs.washington.edu (David Keppel) writes:
++ See also `gperf', a part of the `libg++' distribution from the Free
++ Software Foundation.  The g++ libraries are available via anonymous
++ ftp from `prep.ai.mit.edu'.  (Please ftp before/after hours if you
++ can!)
++
++ 	;-D oN  ( Perfect Chowhouse Hash )  Pardo

My paper describing gperf is available in the proceedings of the
USENIX C++ Workshop held in April 1990.  The paper explains the
algorithm used to generate perfect hash functions efficiently.

        Doug
--
schmidt@ics.uci.edu (ARPA)
office: (714) 856-4043
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

preston@ariel.rice.edu (Preston Briggs) (12/17/90)

plx!plxsun!evan@Sun.COM (Evan Bigall) writes:

>The time I always find myself looking for a perfect hash function is when
>I am parsing long keyword options (i.e. -ansi, etc.) or some other case
>where I have a small set of possibilities but want to avoid writing:
>
>	if (!strcmp("hello", arg))
>	  {}
>	else if (!strcmp("there", arg))
>	  {}
>	else if (!strcmp("world", arg))

Another poster (Ira Baxter?) mentioned how to handle these efficiently
in a typical compiler-like case.  I'll expand a little.

Since we're building a symbol table, we're going to need to hash every
identifier anyway.  I use one hash table for keywords, identifiers, and
strings as part of my scanner.  Each time a new id or string is encountered,
I allocate a new integer for it.  If the integer is < some small number, the
id is actually a keyword.

The rest of the front-end simply looks at integers which serve as indices
into a string table.  Makes it easy to switch on "strings" when all are
reduced to integers.
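
A minimal sketch of the trick (names invented; linear search stands in for
the real hash table to keep it short):

#include <string.h>

#define N_KEYWORDS 3            /* keywords are interned first */

static const char *pool[1024];
static int n_interned = 0;

int intern (const char *name)
{
  int i;

  for (i = 0; i < n_interned; i++)
    if (strcmp (pool[i], name) == 0)
      return i;                 /* seen before: same integer */
  pool[n_interned] = name;      /* real code would copy the string */
  return n_interned++;
}

void init_keywords (void)
{
  intern ("if"); intern ("else"); intern ("while");   /* get 0, 1, 2 */
}

int is_keyword (int id) { return id < N_KEYWORDS; }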

This works especially well with Knuth's WEB system.  WEB has a notion of
preprocessed-strings.  I.e.
	"a string"

When Tangling, a string table is maintained and all the strings are replaced
in the output text with integers.  Further, the string table is written to a
pool file, suitable for initializing your compiler's hash table (or TeX or
whatever).

Pre-processed strings are also a nice way to handle error messages and so
forth, potentially saving much storage.

LISPers will wonder what the fuss is about, as they use symbols as a matter
of course.

Preston Briggs
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

brain@eos.ncsu.edu (Marshall Brain) (12/19/90)

Cichelli's work in Perfect Hashing was extended in the paper:

	"Near-Perfect Hashing of Large Word Sets", by Brain and Tharp,
	Software - Practice and Experience, Vol. 19, No. 10 (October 1989)
	pp. 967-978.

This algorithm could handle around 1000 words.  An extremely simple
algorithm, also based on Cichelli's work and able to handle up to 5000 or
so words, appeared in the paper:

	"Perfect Hashing Using Sparse Matrix Packing", by Brain and Tharp,
	Information Systems, Vol. 15, No. 3 (1990), pp. 281-290.
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

rfpfeifl@watcgl.uwaterloo.ca (Ron Pfeifle) (12/22/90)

[Here's another reference.]

>From: "Gordon V. Cormack" <gvcormack@watdragon>

Cormack G.V., Horspool R.N., and Kaiserswerth M.
Practical Perfect Hashing,
Computer Journal 28:1 (1985), 54-58.
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.