[gnu.g++] perfect hash function for g++

schmidt%siam.ics.uci.edu@ORION.CF.UCI.EDU ("Douglas C. Schmidt") (09/22/88)
[sorry if this gets posted twice, our mailer is going berzerk lately.]

In article <323@igor.UUCP> dsb@Rational.COM (David S. Bakin) writes:
>I don't want to make a big fuss, especially as it is someone else who is
>being so good about writing code and posting it, but ...

And I don't like to flame critics, unless they didn't read and/or
understand the code in the first place ;-).

>A perfect hash function isn't the best for the purpose ... if any effort
>is made for it to be a minimal perfect hash function.  I notice that the
>g++ hash function just posted isn't minimal but at the same time some
>table entries are duplicated instead of being left NULL.

Sorry, but you've *completely* missed the point.  If you carefully
examine the hashing scheme then you'll see that those positions *are*
left ``NULL.''  In fact, when the hash function returns the value of
an ``empty'' place in the table there is simply no string comparison
at all.  I don't think you realize that the entries in the resword
table are *not* strings, they are #defines representing offsets into
an array of characters (called word_table).  Hence the declaration:

struct resword { short offset; short token; enum rid rid;};

                        ^^^  ce n'est pas une (char *), mais oui?

So each array element [i] in reswords is a record with 3 fields, an
offset into this word_table plus 2 other fields used by the compiler
to distinguish attributes of the reserved words.  However, what you
*really* missed is this: that the length of an identifier [i] in the
word_table is implicitly being stored in by the offset of the [i + 1]
entry.  That's the meaning of this cryptic piece of code:

 {	register int str_start = reswords[key].offset;   
   register int str_end   = reswords[key+1].offset;

Here we are calculating the length of the ``keyth'' entry in the
word_table starting at offset str_start.  The next block:

             while (str_start < str_end) 
               {
                 if (*str++ != word_table[str_start++]) 
                   {
                     return (NULL); /* bail if no match */
                   }
               }

only performs a char-by-char comparison IF THE LENGTH IS GREATER THAN
0.  That's why there are those ``repeated words'' in the resword table
(which aren't words at all, they're just offset numbers, and for
non-existent locations they are the same, so the offset of [i] ==
offset of [i + 1], which prevents the loop from executing even ONCE).
The only time an actual string comparison is performed is if the
implicit length between two adjacent entries in the table is greater
than 0.  Using this technique, i.e., a word_table and a resword array,
saves space, compared with storing an array of char pointers to
strings (since you don't need the NULL terminator, and a short is
usually half the size of a char *).

>then it is a good idea to have a perfect hash function that hashes randomly
>into a large table -- say one with at least 50% or even 75% null entries,
>so that only 1-in-2 or 1-in-4 probes results in a string comparison while
>the rest terminate immediately.

That is *precisely* what is going on here, there is *exactly* 1
comparison for each actual word in the table, and if the hash function
returns the index of an ``empty'' location there is NO string
comparison (in fact, even the comparison is INLINE, so you can't
complain about function call overhead).

In the future, you should be more careful about your data structure
analysis and code reading ability before you attempt to criticize
another person's work.  If you can provide a faster/simpler/smaller
method than this O(1) scheme, please feel free to share it with us.

Doug Schmidt