[comp.compilers] hash specifics: GOOD NEWS

te@prosun.first.gmd.de (Thilo Ernst) (12/17/90)

Hello hash-fellows,

In a 1985 paper in Comm. of the ACM, T.J. Sager of the University of Missouri
presented the so-called mincycle algorithm for computing perfect (collision-
free) hash functions.
Although the problem is nasty (NP-complete) in general, the algorithm
was shown to work in polynomial time for virtually all cases.
Sager's hash functions have the general form

    H(w) = ( h0(w) + g (h1 (w)) + g (h2 (w)) ) mod M

which doesn't seem very efficient on first glance, but have a closer look:

h0, h1, h2: 
    three functions which map the string into a triple of (bounded) 
    integers - usually  ord (w[j]) mod k, with k some power of 2. 
    The only restriction on their simplicity is that there be no
    fatal conflicts, i.e. two keywords with the same (h0, h1, h2)
    triple. 
M:  equals N, the number of keywords to hash, or slightly greater 
    (may be forced to be a power of 2, too)
g:  a stored  mapping, i.e. an integer array.
    This is the main output of the mincycle algorithm. Its size will
    be around N/3.

As you see, all operations occuring in the definition can be implemented very 
efficiently on machine level. h0, h1, h2 may be simple bitwise AND's on 
single bytes of the string to hash; for small keyword sets, one of them 
can be set to zero, that is, omitted in practice.

M is an output of Sager's algorithm, with the property M>=N. However (good
news continued), in all his (and our) tests, M was equal to N. That means,
there are no gaps in the hashing table; Sager's hash functions are *minimal*
(although he had no proof). BTW, Sager computed MPHF's for keyword sets up to
a size of 512.

I don't know if this is the best result about hash functions, but in my
opinion it is a significant breakthrough compared to the work listed in the
posting of J. R. Levine, which falls in two categories: bad or unknown
algorithm complexity (restricting the keyword set to small sizes) and bad or
inefficient resulting hash functions (non-perfect/non-minimal HF's,
large-number divisions required, etc.).

And now for the best: we have done it, Adriaan. We included an implementation
of the mincycle algorithm (with some improvements and optimizations) in our
compiler compiler to speed up the lexical analyzers generated and were very
satisfied with the results. Our largest example had 350 keywords and took
about ten minutes to compute a MPHF on a PC/AT.

The implementation was done in Modula-2, and I don't know if it would be of
use for you. I'd rather suggest you to e-mail me your keyword set(s) and to
look forward to my reply, including MPHF, in January.

(Sorry for errors/inexact information - unfortunately, I don't have Sager's 
paper here; neither my implementation and the tech. rep. about it.
However, to meet Adriaan Degroot's mail deadline, I had to write here & now.
Also, sorry for abuse of the English language, if any.) 


  Regards,  Thilo.

Thilo Ernst  ( te@gmdtub.uucp, te@prosun.first.gmd.de )
Institute of Informatics and Computing Techniques Berlin
           Rudower Chaussee 5, D-O-1199 Berlin, Germany
Phone:   (+37-2-) 6 74 51 52,   (+49-30-) 25 49 91 16
Fax:     (+37-2-) 6 76 22 00
[Thanks for pointing this one out, I'd completely missed it.  The reference is
Sager, Thomas J., "A Polynomial Time Generator for Minimal Perfect Hash
Functions," CACM 28:5, May 1985, pp. 523-532.  The article includes a listing
of a Pascal version of his algorithm. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

jeffj@cs.umr.edu (Jeff Jenness) (12/20/90)

In article <9012171443.AA18297@prosun.first.gmd.de> Thilo Ernst <te@prosun.first.gmd.de> writes:
>In a 1985 paper in Comm. of the ACM, T.J. Sager of the University of Missouri
>presented the so-called mincycle algorithm for computing perfect (collision-
>free) hash functions.
>Although the problem is nasty (NP-complete) in general, the algorithm
>was shown to work in polynomial time for virtually all cases.

Dr. Sager is working on his algorithm and feels that he has some better
results.  In the paper in the CACM he claimed that the algorithm ran in
polynomial time where the degree was the 6th power.  He feels that the
algorithm is actually O(n^3).  He has some software running on an IBM PC
in Turbo Pascal that he is making available(he asks $10 for the cost of
the software, if you provide a disk and mailer).  If you wish to find out
how to get it then write me email and I will respond individually.  There
has also been some work done elsewhere to improve upon the results of Dr.
Sager by providing a better set of "pseudo random" hash function.  If
time permits, I will construct a bibliography on the papers that I have
on MPHF's.

Jeff Jenness
University of Missouri - Rolla
jeffj@cs.umr.edu
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.