te@prosun.first.gmd.de (Thilo Ernst) (12/17/90)
Hello hash-fellows, In a 1985 paper in Comm. of the ACM, T.J. Sager of the University of Missouri presented the so-called mincycle algorithm for computing perfect (collision- free) hash functions. Although the problem is nasty (NP-complete) in general, the algorithm was shown to work in polynomial time for virtually all cases. Sager's hash functions have the general form H(w) = ( h0(w) + g (h1 (w)) + g (h2 (w)) ) mod M which doesn't seem very efficient on first glance, but have a closer look: h0, h1, h2: three functions which map the string into a triple of (bounded) integers - usually ord (w[j]) mod k, with k some power of 2. The only restriction on their simplicity is that there be no fatal conflicts, i.e. two keywords with the same (h0, h1, h2) triple. M: equals N, the number of keywords to hash, or slightly greater (may be forced to be a power of 2, too) g: a stored mapping, i.e. an integer array. This is the main output of the mincycle algorithm. Its size will be around N/3. As you see, all operations occuring in the definition can be implemented very efficiently on machine level. h0, h1, h2 may be simple bitwise AND's on single bytes of the string to hash; for small keyword sets, one of them can be set to zero, that is, omitted in practice. M is an output of Sager's algorithm, with the property M>=N. However (good news continued), in all his (and our) tests, M was equal to N. That means, there are no gaps in the hashing table; Sager's hash functions are *minimal* (although he had no proof). BTW, Sager computed MPHF's for keyword sets up to a size of 512. I don't know if this is the best result about hash functions, but in my opinion it is a significant breakthrough compared to the work listed in the posting of J. R. Levine, which falls in two categories: bad or unknown algorithm complexity (restricting the keyword set to small sizes) and bad or inefficient resulting hash functions (non-perfect/non-minimal HF's, large-number divisions required, etc.). And now for the best: we have done it, Adriaan. We included an implementation of the mincycle algorithm (with some improvements and optimizations) in our compiler compiler to speed up the lexical analyzers generated and were very satisfied with the results. Our largest example had 350 keywords and took about ten minutes to compute a MPHF on a PC/AT. The implementation was done in Modula-2, and I don't know if it would be of use for you. I'd rather suggest you to e-mail me your keyword set(s) and to look forward to my reply, including MPHF, in January. (Sorry for errors/inexact information - unfortunately, I don't have Sager's paper here; neither my implementation and the tech. rep. about it. However, to meet Adriaan Degroot's mail deadline, I had to write here & now. Also, sorry for abuse of the English language, if any.) Regards, Thilo. Thilo Ernst ( te@gmdtub.uucp, te@prosun.first.gmd.de ) Institute of Informatics and Computing Techniques Berlin Rudower Chaussee 5, D-O-1199 Berlin, Germany Phone: (+37-2-) 6 74 51 52, (+49-30-) 25 49 91 16 Fax: (+37-2-) 6 76 22 00 [Thanks for pointing this one out, I'd completely missed it. The reference is Sager, Thomas J., "A Polynomial Time Generator for Minimal Perfect Hash Functions," CACM 28:5, May 1985, pp. 523-532. The article includes a listing of a Pascal version of his algorithm. -John] -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
jeffj@cs.umr.edu (Jeff Jenness) (12/20/90)
In article <9012171443.AA18297@prosun.first.gmd.de> Thilo Ernst <te@prosun.first.gmd.de> writes: >In a 1985 paper in Comm. of the ACM, T.J. Sager of the University of Missouri >presented the so-called mincycle algorithm for computing perfect (collision- >free) hash functions. >Although the problem is nasty (NP-complete) in general, the algorithm >was shown to work in polynomial time for virtually all cases. Dr. Sager is working on his algorithm and feels that he has some better results. In the paper in the CACM he claimed that the algorithm ran in polynomial time where the degree was the 6th power. He feels that the algorithm is actually O(n^3). He has some software running on an IBM PC in Turbo Pascal that he is making available(he asks $10 for the cost of the software, if you provide a disk and mailer). If you wish to find out how to get it then write me email and I will respond individually. There has also been some work done elsewhere to improve upon the results of Dr. Sager by providing a better set of "pseudo random" hash function. If time permits, I will construct a bibliography on the papers that I have on MPHF's. Jeff Jenness University of Missouri - Rolla jeffj@cs.umr.edu -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.