degroot@cpsc.ucalgary.ca (Adriaan Degroot) (12/12/90)
How does one build a hash function with minimal table size and minimal
collisions for a known set of hash strings?  I.e., I want to hash the one
hundred-plus opcodes for the Intel 386 processor into the smallest
possible table with the fewest collisions.  Is there any way of quickly
finding a function to do that, or am I stuck with endless tweaking of the
hash parameters?  If anyone has done this before, please let me know by
email to degroot@cpsc.ucalgary.ca before Dec 20th or so (after that, mail
will bounce).  This is NOT homework.
[See the next message. -John]
johnl@iecc.cambridge.ma.us (John R. Levine) (12/12/90)
In article <9012111913.AA07310@cs-sun-fsa.cpsc.ucalgary.ca> you write:
>How does one build a hash function with minimal table size and minimal
>collisions for a known set of hash strings?

The problem of designing a hash algorithm that maps N identifiers into
unique slots in an N entry table is known as perfect hashing.  It has
been studied sporadically over the years.  There are algorithms known,
but they are all very slow.  Here are some references:

R. J. Cichelli, "Minimal perfect hash functions made simple," CACM 23:1,
Jan 1980, pp. 17-19.

G. Jaeschke et al., comment on the article above, CACM 23:12, Dec 1980,
pp. 728-729.

G. Jaeschke, "Reciprocal hashing: a method for generating minimal perfect
hash functions," CACM 24:12, Dec 1981, pp. 829-833.

R. Sprugnoli, "Perfect hashing functions: a single probe retrieving
method for static sets," CACM 20:11, Nov 1977, pp. 841-850.

A brief visit to my 40 shelf-feet of computer journals didn't turn up
anything more recent except for some data base applications.  If someone
has a breakthrough in perfect hashing to report, I'd love to hear about
it.
--
Regards,
John Levine, johnl@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl
chased@Eng.Sun.COM (David Chase) (12/12/90)
> If someone has a breakthrough in perfect hashing to report, I'd love to
> hear about it.

CACM 33:6 (June 1990), "Fast Hashing of Variable Length Strings," by
Peter K. Pearson.  He discusses use of hash functions of the form:

    h = 0;
    n = length(string);

    for i = 1 through n do
        h = Table [ h XOR string [i] ];
    enddo;

    return h;

where Table is a 256-byte array.  Each character in the string hashed
costs an XOR operation and an indexed read from memory (above the costs
of reading the character).  This function can "sometimes be modified to
produce a minimal perfect hashing function over a modest list of words."
It's worth a look; I found it to be pleasant reading.

David Chase
Sun Microsystems
[It's a cute hack.  I note that he references the 1977 and 1980 articles
listed in my previous article, nothing else since then. -John]
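For concreteness, here is a minimal C rendering of the function above.
The table construction (any permutation of 0..255 will do) and the demo
seed are illustrative assumptions, not from Pearson's paper, which
discusses how to pick or search for good tables:

    #include <stdio.h>

    static unsigned char Table[256];    /* a permutation of 0..255 */

    /* Build a permutation with a crude linear-congruential shuffle;
     * the seed and generator constants are arbitrary demo choices. */
    static void init_table(void)
    {
        unsigned long seed = 12345;
        int i, j;
        unsigned char t;

        for (i = 0; i < 256; i++)
            Table[i] = (unsigned char) i;
        for (i = 255; i > 0; i--) {
            seed = seed * 1103515245UL + 12345UL;
            j = (int) ((seed >> 16) % (unsigned long) (i + 1));
            t = Table[i]; Table[i] = Table[j]; Table[j] = t;
        }
    }

    /* The function from the article: one XOR and one table read per
     * character hashed. */
    static unsigned char pearson_hash(const char *s)
    {
        unsigned char h = 0;

        for (; *s; s++)
            h = Table[h ^ (unsigned char) *s];
        return h;
    }

    int main(void)
    {
        init_table();
        printf("%u %u\n", pearson_hash("mov"), pearson_hash("add"));
        return 0;
    }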
pardo@cs.washington.edu (David Keppel) (12/12/90)
>In article <9012111913.AA07310@cs-sun-fsa.cpsc.ucalgary.ca>:
>>[Perfect hash functions?]
johnl@iecc.cambridge.ma.us (John R. Levine) writes:
>[References.]

See also `gperf', a part of the `libg++' distribution from the Free
Software Foundation.  The g++ libraries are available via anonymous ftp
from `prep.ai.mit.edu'.  (Please ftp before/after hours if you can!)

    ;-D oN  ( Perfect Chowhouse Hash )  Pardo
--
pardo@cs.washington.edu
{rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
henry@zoo.toronto.edu (Henry Spencer) (12/13/90)
>>How does one build a hash function with minimal table size and minimal
>>collisions for a known set of hash strings?
>
>The problem of designing a hash algorithm that maps N identifiers into
>unique slots in an N entry table is known as perfect hashing...

Note that he didn't ask for zero wasted space or zero collisions, just
for both to be minimal.  Interpreting "minimal" not too strictly, this
doesn't require perfect hashing.  I don't have any specific algorithms to
suggest for imperfect hashing, however -- it's never seemed difficult to
get a table with only one or two collisions and not much extra space.

Actually, I have never figured out why people seem so obsessed with
perfect hashing, since the hash functions typically seem noticeably
expensive, and it has always seemed to me that a fast hashing function
plus fast resolution of occasional collisions would be easier and just as
good.
--
Henry Spencer at U of Toronto Zoology, henry@zoo.toronto.edu utzoo!henry
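To make the suggestion concrete, here is a minimal sketch of the kind of
"imperfect" scheme described above: a cheap hash over a static keyword
table, with occasional collisions resolved by linear probing.  The table
size and hash constants are arbitrary choices for the example, not
anything from the post:

    #include <string.h>
    #include <stdio.h>

    #define TABLE_SIZE 64   /* power of two, well above the key count */

    static const char *table[TABLE_SIZE];   /* NULL means empty slot */

    /* A cheap multiplicative hash; 31 is a conventional multiplier. */
    static unsigned hash(const char *s)
    {
        unsigned h = 0;

        for (; *s; s++)
            h = h * 31 + (unsigned char) *s;
        return h & (TABLE_SIZE - 1);
    }

    static void insert(const char *key)
    {
        unsigned i = hash(key);

        while (table[i] != NULL)            /* probe past collisions */
            i = (i + 1) & (TABLE_SIZE - 1);
        table[i] = key;
    }

    static int is_keyword(const char *s)
    {
        unsigned i = hash(s);

        while (table[i] != NULL) {
            if (strcmp(table[i], s) == 0)
                return 1;
            i = (i + 1) & (TABLE_SIZE - 1);
        }
        return 0;               /* hit an empty slot: not present */
    }

    int main(void)
    {
        static const char *kw[] = { "if", "else", "while", "return" };
        size_t k;

        for (k = 0; k < sizeof kw / sizeof kw[0]; k++)
            insert(kw[k]);
        printf("%d %d\n", is_keyword("while"), is_keyword("foo"));
        return 0;
    }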
baxter@zola.ICS.UCI.EDU (Ira Baxter) (12/13/90)
In <1990Dec12.221425.569@zoo.toronto.edu> henry@zoo.toronto.edu (Henry
Spencer) writes:
>Actually, I have never figured out why people seem so obsessed with perfect
>hashing, since the hash functions typically seem noticeably expensive, and
>it has always seemed to me that a fast hashing function plus fast resolution
>of occasional collisions would be easier and just as good.

As far as I can tell, perfect hashing costs extra.  If you have a lexer
independent of the parsing mechanism, so that it hasn't got any idea
whether the next string is a keyword or not (assemblers scanning the
opcode field violate this requirement), then you'll have to do a
symbol/string table probe anyway, and that probe can't be done using a
perfect hash.  So why not put the keywords into the string table with a
tag bit noting that they are keywords?  The lexer now collects a
word-atom and looks it up; if the entry is tagged, it reports the
corresponding keyword, and if not, it reports a user symbol.  This scheme
saves the *additional* cost of doing a perfect hash.
--
Ira Baxter
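A minimal sketch of the tagged-entry idea, with data structures invented
for illustration (the post itself gives no code):

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    /* One entry in the combined string/symbol table.  `is_keyword' is
     * the tag bit described above: set iff the name is reserved. */
    struct sym {
        char *name;
        int is_keyword;
        struct sym *next;       /* chaining resolves collisions */
    };

    #define NBUCKETS 211
    static struct sym *bucket[NBUCKETS];

    static unsigned hash(const char *s)
    {
        unsigned h = 0;

        for (; *s; s++)
            h = h * 31 + (unsigned char) *s;
        return h % NBUCKETS;
    }

    /* The single probe used for keywords and identifiers alike; new
     * names are entered as ordinary user symbols. */
    static struct sym *intern(const char *name)
    {
        unsigned h = hash(name);
        struct sym *p;

        for (p = bucket[h]; p != NULL; p = p->next)
            if (strcmp(p->name, name) == 0)
                return p;
        p = malloc(sizeof *p);
        p->name = strdup(name);
        p->is_keyword = 0;
        p->next = bucket[h];
        bucket[h] = p;
        return p;
    }

    int main(void)
    {
        intern("while")->is_keyword = 1;  /* preload reserved words */
        intern("if")->is_keyword = 1;

        /* The lexer just tests the tag after its one table probe. */
        printf("%d %d\n", intern("while")->is_keyword,
                          intern("x1")->is_keyword);
        return 0;
    }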
rfg@ncd.com (Ron Guilmette) (12/13/90)
In article <9012111913.AA07310@cs-sun-fsa.cpsc.ucalgary.ca>
degroot@cpsc.ucalgary.ca (Adriaan Degroot) writes:
>How does one build a hash function with minimal table size and minimal
>collisions for a known set of hash strings?

As others have already noted, Doug Schmidt wrote a cute program called
gperf (in C++) which is useful for finding perfect hash functions.

Among the possible command line arguments for Doug's program are a set of
options which allow you to specify (for instance) the particular
character positions from the input (search key) strings which you want to
participate in the hashing.  So (for instance) you can tell the program
to hash only characters 1 through 3 and the last 2 characters of each
string.

The unfortunate part (and this is something that I ragged on Doug about
at length) is that the program is not set up to automatically try various
possibilities for the `hash bytes subset' option(s).  In other words, you
may tell the program to go and look for a perfect hash function for a
certain input (key) set, a certain table size, and a certain `hash bytes
subset', and it may simply come back and tell you that it ain't possible.

Also, the program only uses one particular (simple & fast) hash function.
I can't recall exactly what it does, but that doesn't matter for the
point I'm making.  The point is this.  Let's say that the hash function
tries simply XOR'ing the bytes together, and then masking off the low
bits to get the hash key when done.  Well, gperf will never tell you that
you might have been able to get a faster hash function (or even one that
actually works given your stated table size constraints) if (for example)
the hash function had simply *ADDED* the byte values together rather than
XOR'ing them.

In other words, gperf explores only a small area within the
(multidimensional?) space of all possible (or all useful) hashing
functions.  That's not to say that it isn't a nice and/or useful tool.
It is both.
--
// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
rfg@uunet.UU.NET (Ron Guilmette) (12/13/90)
In article <9012112133.AA00452@rbbb.Eng.Sun.COM> chased@Eng.Sun.COM
(David Chase) writes:
+CACM 33:6 (June 1990) "Fast Hashing of Variable Length Strings"
+by Peter K. Pearson.
+
+He discusses use of hash functions of the form:
+
+    h = 0;
+    n = length(string);
+
+    for i = 1 through n do
+        h = Table [ h XOR string [i] ];
+    enddo;
+
+    return h;

Some time back, I wrote a program which used the following hash function
to try (for a given set of input strings) to find an "efficient" hash
function for that set (and some particular table size).  I defined
"efficiency" somewhat arbitrarily as the number of members of the input
(key) set divided by the number of "probes" (or "key comparisons") which
would be needed to find each and every member of the input set in the
hash table exactly once.  (Note that I was assuming that the hash table
in question would be the kind used in compilers & assemblers; i.e. one
which would be fully "precomputed" ahead of time and used only for
lookups, and NEVER for ADDs or DELETEs.)

    static unsigned
    hash (const char *s, unsigned prime)
    {
        unsigned ret_val = 0;

        for (; *s; s++)
            ret_val = (*s ^ ret_val) * prime;
        return ret_val & hash_mask;
    }

This hash function takes a second parameter `prime' which is used to help
produce `well mangled' output keys from the input keys (i.e. strings).
Knuth documents the fact that using prime numbers in hash functions is
known to tend to make these functions more "efficient" (i.e. it tends to
make the output keys more evenly distributed over the range).  (Note also
that this function assumes that `hash_mask' is some constant defined
earlier on.  Typically it has a value like 2**N-1, where N is some
integer; the `primary' hash table size is then 2**N.)

The program I wrote tried various values for `prime' up to some fixed
limit (i.e. 8k).  I ran the program for various sets of input strings
(both large and small) and for various table sizes.  The program reported
the "efficiency" rating of the primes tested.  Curiously, the number 47
always rated in the top 10 regardless of the set of input keys used or
the primary table size used.  I have no explanation for this.

Moral of the story:  If you need a reasonably efficient hashing function
for some arbitrary set of strings (e.g. identifiers or keywords), and if
you need it fast, use the function shown above and substitute `47' for
`prime'.

P.S.  The above code is not copyrighted, but its `look & feel' may be. :-)
--
// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
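A minimal sketch of the kind of search harness described above, under
assumptions of my own: the key set, the table size, and the loop over all
odd multipliers (rather than strictly primes, as Ron's program used) are
purely illustrative.  A bucket holding b keys costs 1 + 2 + ... + b
probes to find all its members once, which gives the efficiency measure:

    #include <stdio.h>

    #define HASH_MASK 63            /* 2**6 - 1: 64-entry table */

    static unsigned hash(const char *s, unsigned prime)
    {
        unsigned ret_val = 0;

        for (; *s; s++)
            ret_val = (*s ^ ret_val) * prime;
        return ret_val & HASH_MASK;
    }

    int main(void)
    {
        /* Illustrative keys; substitute your opcodes or keywords. */
        static const char *keys[] = { "mov", "add", "sub", "jmp",
                                      "cmp", "push", "pop", "call",
                                      "ret", "nop" };
        const unsigned nkeys = sizeof keys / sizeof keys[0];
        unsigned prime;

        for (prime = 3; prime < 8192; prime += 2) {
            unsigned count[HASH_MASK + 1] = {0};
            unsigned probes = 0, k, b;

            for (k = 0; k < nkeys; k++)
                count[hash(keys[k], prime)]++;

            /* Sum the per-bucket lookup costs, assuming a linear
             * scan within each bucket. */
            for (b = 0; b <= HASH_MASK; b++)
                probes += count[b] * (count[b] + 1) / 2;

            printf("multiplier %4u: efficiency %.3f\n",
                   prime, (double) nkeys / probes);
        }
        return 0;
    }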
rfg@ncd.com (Ron Guilmette) (12/13/90)
In article <1990Dec12.221425.569@zoo.toronto.edu> henry@zoo.toronto.edu
(Henry Spencer) writes:
>Actually, I have never figured out why people seem so obsessed with perfect
>hashing, since the hash functions typically seem noticeably expensive, and
>it has always seemed to me that a fast hashing function plus fast resolution
>of occasional collisions would be easier and just as good.

Henry's point is well taken; however, there are *some* places where I
think it is probably worth the effort to try to find a *perfect* hash
function.  Keyword tables in compilers and opcode tables in assemblers
come to mind.  So perfect hash functions may be useful sometimes.

Note however that *minimal* perfect hash functions are not only never
absolutely *required*, but (as Doug Schmidt pointed out to me) they may
actually be *inefficient*!!!  The reason is that with a not-quite-minimal
(but still perfect) hash function, you will never have any "collisions"
(so you don't have to waste time dealing with such cases), and also, if
you happen to look up an input key which does not even have a
corresponding entry in your hash table, then you may hash that input key
to a hash key whose corresponding primary table entry is empty!  In such
cases, you actually SAVE TIME (relative to the case where you have a
*minimal* perfect hash function) because you won't even have to do a
single comparison (of any pair of complete keys).

[In a later message, Ron also points out:]
See also "The Art of Computer Programming" by Donald Knuth, volume 3,
section 6.4.
--
// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
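A minimal sketch of the empty-slot saving, using a toy perfect hash and a
toy key set that I made up for the demo (real tables would come from
gperf or a similar generator):

    #include <stdio.h>
    #include <string.h>

    #define TABLE_SIZE 8    /* 8 slots, 4 keys: perfect, not minimal */

    /* This happens to place {if, do, for, int} in the distinct slots
     * 3, 6, 1, 4 -- a perfect hash for this toy set only. */
    static unsigned toy_hash(const char *s)
    {
        return ((unsigned char) s[0] + (unsigned) strlen(s))
               & (TABLE_SIZE - 1);
    }

    static const char *table[TABLE_SIZE] = {
        NULL, "for", NULL, "if", "int", NULL, "do", NULL
    };

    /* An empty slot rejects an unknown key with zero string
     * comparisons -- the saving described above; a hit costs one. */
    static int lookup(const char *s)
    {
        unsigned h = toy_hash(s);

        if (table[h] == NULL)
            return -1;
        return strcmp(table[h], s) == 0 ? (int) h : -1;
    }

    int main(void)
    {
        printf("%d %d %d\n", lookup("if"), lookup("for"), lookup("xyz"));
        return 0;
    }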
bart@cs.uoregon.edu (Barton Christopher Massey) (12/14/90)
In article <9012122107.aa06454@ICS.UCI.EDU> Ira Baxter
<baxter@zola.ICS.UCI.EDU> writes:
> If you have a lexer independent of the parsing mechanism, so that it
> hasn't got any idea whether the next string is a keyword or not
> (assemblers scanning the opcode field violate this requirement), then
> you'll have to do a symbol/string table probe anyway

You can very efficiently decide whether the next input belongs to a given
set of keywords or is an identifier as you read it in, using a DFA or
some similar recognizer.  Given the wide availability of FLEX, this is
the approach I almost always use -- it dodges both the problem of
efficient keyword hashing and that of separating identifiers from
keywords.

I tend to agree with Henry Spencer that on modern machines, "imperfect
hashing" is likely to be good enough -- after all, how many keywords are
we talking about, anyway?  A couple hundred, max?  Let's just increase
the table size until some really cheap hash function doesn't generate any
collisions.  How big a table will we end up with?  A couple thousand
entries?  That's "only" 8K of pointers on a 32-bit box...

An interesting exercise would be a program which, for a given set of
keywords and a given table size limit, tried more and more expensive hash
functions until the keywords hashed without collisions into the table,
and then emitted the function.  Kind of like gperf with less lofty
ambitions :-).

Bart Massey
bart@cs.uoregon.edu
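A minimal sketch of the "grow the table until a cheap hash is
collision-free" idea from the second paragraph; the hash function, its
multiplier, and the size limit are arbitrary assumptions for the example:

    #include <stdio.h>
    #include <stdlib.h>

    static unsigned cheap_hash(const char *s, unsigned mask)
    {
        unsigned h = 0;

        for (; *s; s++)
            h = h * 37 + (unsigned char) *s;   /* arbitrary cheap mix */
        return h & mask;
    }

    /* Return the smallest power-of-two table size (up to `limit') at
     * which cheap_hash puts every keyword in a distinct slot, or 0 if
     * none works. */
    static unsigned find_collision_free_size(const char **keys,
                                             unsigned n, unsigned limit)
    {
        unsigned size;

        for (size = 1; size < n; size <<= 1)
            ;                           /* first size >= key count */
        for (; size <= limit; size <<= 1) {
            char *used = calloc(size, 1);
            unsigned i, clash = 0;

            for (i = 0; i < n && !clash; i++) {
                unsigned h = cheap_hash(keys[i], size - 1);
                if (used[h])
                    clash = 1;
                used[h] = 1;
            }
            free(used);
            if (!clash)
                return size;
        }
        return 0;
    }

    int main(void)
    {
        static const char *kw[] = { "if", "else", "while", "for", "do",
                                    "switch", "case", "return", "break" };
        unsigned size = find_collision_free_size(kw, 9, 1 << 16);

        printf("collision-free at table size %u\n", size);
        return 0;
    }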
roth@resi.rice.edu (Jerry Roth) (12/15/90)
In article <1990Dec13.201211.6483@cs.uoregon.edu> bart@cs.uoregon.edu
(Barton Christopher Massey) writes:
>You can very efficiently decide whether the next input belongs to a given
>set of keywords or is an identifier as you read it in using a DFA or some
>similar recognizer.  Given the wide availability of FLEX, this is the
>approach I almost always use -- dodges both the problems of efficient
>keyword hashing and separating identifiers from keywords.

I have always wondered which method is truly more efficient.  The two
methods are:

(a) Preload all keywords into the hash table and have the lexer
    distinguish identifiers from keywords via lookup.  Keywords and
    identifiers are then treated the same until after the lookup.

(b) Add a regular expression for each keyword to the lexer, followed by
    a regular expression to catch identifiers.  Keywords are then
    handled directly, while only the identifiers get hashed.

It seems method (a) has the advantage of keeping the lexer simple, in
that it has fewer expressions to attempt to match, while method (b)
avoids hashing keywords.

Can anyone tell me if adding more expressions to my lexer (one per
keyword) causes it to execute less efficiently, assuming that I am using
LEX/FLEX to create it?  If so, is it less efficient than hashing each
keyword when encountered?

Thanks,
Jerry Roth
[It is my impression that the speed at which a flex scanner does what it
does is unrelated to the complexity of the state machine, although its
size goes up, but I'd be glad to hear from someone who knows
definitively. -John]
oz@nexus.yorku.ca (12/15/90)
In article <2977@lupine.NCD.COM> rfg@ncd.com (Ron Guilmette) writes:
>
>As others have already noted, Doug Schmidt wrote a cute program called
>gperf (in C++) which is useful for finding perfect hash functions.

A PD almost-perfect hash generator based on Cichelli's work has been
around since the early 80s.  I include it below.  I have used it
successfully in a few places.  Here is the diagnostic output of the
program for the C keywords (note that in this case it generates a table
of 51 entries for 36 keywords):

| Perfect hash function finder, CDH Ver. 2.9
| Start time = Fri Dec 14 12:11:42 1990
| 36 keywords, 21 distinct letters.
| The associated char value limit is 21
| The table size limit is 100
| The search ordering is
| else double case continue float typeof typedef default
| const short struct sizeof signed static extern inline int
| if for enum volatile void while char register do return
| unsigned __alignof switch auto goto union asm long break
|
| Success! Associated Char Values Follow:
| _ =18, a =11, b = 9, c = 1, d = 0, e = 0, f = 3, g = 9, h =15,
| i = 2, k =21, l =20, m =18, n =12, o =21, r =21, s =10, t = 5, u =21,
| v =20, w =20,
| Hash min = 4, max = 50, spread = 47
| else 4, case 5, double 6, if 7, inline 8,
| continue 9, int 10, const 11, default 12, float 13,
| typeof 14, typedef 15, signed 16, static 17, extern 18,
| sizeof 19, short 20, struct 21, enum 22, do 23,
| void 24, while 25, char 26, for 27, volatile 28,
| unsigned 29, __alignof 30, switch 31, asm 32, long 33,
| goto 34, break 35, auto 36, union 38, return 39,
| register 50,
|
| Total search invocations = 13727, max depth = 35
| Started Fri Dec 14 12:11:42 1990
| Finished Fri Dec 14 12:11:47 1990

enjoy...	oz
ps: If you hack this program to look & work nicer, plz drop me a copy.
---
Internet: oz@nexus.yorku.ca, UUCP: utzoo/utai!yunexus!oz
[I have added the perfect hash program to the comp.compilers library,
since it's kind of big to put in a message.  Drop me a line if you want a
copy. -John]
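The printed values are consistent with a Cichelli-style function of the
form hash = length + assoc(first char) + assoc(last char): for example,
"else" gives 4+0+0 = 4 and "register" gives 8+21+21 = 50, matching the
table above.  Assuming that form (the program's exact formula may
differ), a minimal lookup sketch using those values:

    #include <string.h>

    /* Associated character values from the run above; letters never
     * appearing first or last in a C keyword return -1. */
    static int assoc(char c)
    {
        switch (c) {
        case '_': return 18; case 'a': return 11; case 'b': return 9;
        case 'c': return 1;  case 'd': return 0;  case 'e': return 0;
        case 'f': return 3;  case 'g': return 9;  case 'h': return 15;
        case 'i': return 2;  case 'k': return 21; case 'l': return 20;
        case 'm': return 18; case 'n': return 12; case 'o': return 21;
        case 'r': return 21; case 's': return 10; case 't': return 5;
        case 'u': return 21; case 'v': return 20; case 'w': return 20;
        default:  return -1;        /* cannot be a keyword */
        }
    }

    /* These values place all 36 keywords in slots 4..50 without
     * collision, as in the diagnostic output; a table indexed by the
     * result plus one strcmp completes the lookup. */
    static int kw_hash(const char *s)
    {
        size_t n = strlen(s);
        int a = assoc(s[0]), b = assoc(s[n - 1]);

        if (a < 0 || b < 0)
            return -1;
        return (int) n + a + b;
    }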
evan@Sun.COM (Evan Bigall) (12/16/90)
In article <2980@lupine.NCD.COM> rfg@ncd.com (Ron Guilmette) writes:
Henry's point is well taken, however there are *some* places where I think
that it is probably worth the effort to try to find a *perfect* hash
function. Keyword tables in compilers and opcode tables in assemblers
come to mind.
The time when I always find myself looking for a perfect hash function is
when I am parsing long keyword options (e.g. -ansi) or some other case
where I have a small set of possibilities but want to avoid:
    if (!strcmp("hello", arg))
        {}
    else if (!strcmp("there", arg))
        {}
    else if (!strcmp("world", arg))
        {}
    else
        error;
Maybe I'm just being silly, but the above code, and the tiny hand-crafted
lexers I usually end up with, both make me gag.
Because these situations often come up several times within one program,
I'd much rather put in a quick perfect hash than have to put in the code
to deal with collisions.
I have one program where I ended up having 15 or so different instances of
this situation. I ended up giving them start conditions and just calling
the (already terribly complicated) flex lexer, but that seems like overkill
for everyday use.
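For a set that small, a perfect hash can be trivial.  A minimal sketch
assuming just the three example strings above: their first characters are
already distinct, so arg[0] itself serves as the (perfect) hash, and one
strcmp per call verifies the full spelling:

    #include <string.h>
    #include <stdio.h>

    enum opt { OPT_HELLO, OPT_THERE, OPT_WORLD, OPT_BAD };

    /* Dispatch on the first character, then confirm with one strcmp. */
    static enum opt classify(const char *arg)
    {
        switch (arg[0]) {
        case 'h': return strcmp(arg, "hello") == 0 ? OPT_HELLO : OPT_BAD;
        case 't': return strcmp(arg, "there") == 0 ? OPT_THERE : OPT_BAD;
        case 'w': return strcmp(arg, "world") == 0 ? OPT_WORLD : OPT_BAD;
        default:  return OPT_BAD;
        }
    }

    int main(void)
    {
        printf("%d %d\n", classify("there"), classify("worlds"));
        return 0;
    }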
/Evan
vern@horse.ee.lbl.gov (Vern Paxson) (12/17/90)
In article <1990Dec14.164415.326@rice.edu> roth@resi.rice.edu (Jerry
Roth) writes:
> Can anyone tell me if adding more expressions to my lexer (one per keyword)
> causes it to execute less efficiently, assuming that I am using LEX/FLEX to
> create it....
> ...
> [It is my impression that the speed at which a flex scanner does what it
> does is unrelated to the complexity of the state machine, although its
> size goes up, but I'd be glad to hear from someone who knows
> definitively. -John]

Right, a finite automaton scanner like flex's looks at each character
just once, regardless of the complexity of the patterns it is matching
against (*).  Adding more regular expressions to the rule set makes the
scanner tables bigger but does not change the speed of scanning (except
for second-order effects like page faults).  So if space is not a
concern, it's easiest and fastest to match keywords using a separate rule
for each one.

    Vern

Vern Paxson				vern@helios.ee.lbl.gov
Real Time Systems			ucbvax!helios.ee.lbl.gov!vern
Lawrence Berkeley Laboratory		(415) 486-7504

(*) There are some cases when the scanner must rescan text, though they
don't arise when using individual rules to match keywords plus a generic
"identifier" rule to match all non-keyword identifiers.  flex's -b option
helps identify and eliminate these cases.
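To illustrate the single-pass property, here is a hand-built toy of the
same kind of machine flex generates, distinguishing the keywords "if" and
"int" from other identifiers in one left-to-right scan (the word is
assumed to be already isolated; states and keywords are my own toy
choices):

    #include <stdio.h>

    /* States 2 and 4 accept "if" and "int"; -1 means the word can only
     * be an ordinary identifier.  Each character is examined once. */
    static const char *classify(const char *word)
    {
        int state = 0;
        const char *s;

        for (s = word; *s; s++) {
            switch (state) {
            case 0:  state = (*s == 'i') ? 1 : -1; break;
            case 1:  state = (*s == 'f') ? 2
                           : (*s == 'n') ? 3 : -1; break;
            case 3:  state = (*s == 't') ? 4 : -1; break;
            default: state = -1; break;  /* extra chars kill a match */
            }
            if (state == -1)
                return "identifier";
        }
        return (state == 2) ? "keyword if"
             : (state == 4) ? "keyword int"
             : "identifier";
    }

    int main(void)
    {
        printf("%s / %s / %s\n",
               classify("if"), classify("int"), classify("inn"));
        return 0;
    }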
schmidt@oberkampf.ICS.UCI.EDU (Doug Schmidt) (12/17/90)
In article <14101@june.cs.washington.edu> pardo@cs.washington.edu (David Keppel) writes:
++ See also `gperf', a part of the `libg++' distribution from the Free
++ Software Foundation. The g++ libraries are available via anonymous
++ ftp from `prep.ai.mit.edu'. (Please ftp before/after hours if you
++ can!)
++
++ ;-D oN ( Perfect Chowhouse Hash ) Pardo
My paper describing gperf is available in the proceedings of the
USENIX C++ Workshop held in April 1990. The paper explains the
algorithm used to generate perfect hash functions efficiently.
Doug
--
schmidt@ics.uci.edu (ARPA)
office: (714) 856-4043
preston@ariel.rice.edu (Preston Briggs) (12/17/90)
plx!plxsun!evan@Sun.COM (Evan Bigall) writes:
>The time when I always find myself looking for a perfect hash function is
>when I am parsing long keyword options (e.g. -ansi) or some other case
>where I have a small set of possibilities but want to avoid:
>
>    if (!strcmp("hello", arg))
>        {}
>    else if (!strcmp("there", arg))
>        {}
>    else if (!strcmp("world", arg))

Another poster (Ira Baxter?) mentioned how to handle these efficiently in
a typical compiler-like case.  I'll expand a little.

Since we're building a symbol table anyway, we're going to need to hash
every identifier anyway.  I use one hash table to hash keywords,
identifiers, and strings as part of my scanner.  Each time a new id or
string is encountered, I allocate a new integer for it.  If the integer
is < some small number, the id is actually a keyword.  The rest of the
front-end simply looks at integers, which serve as indices into a string
table.  This makes it easy to switch on "strings" when all are reduced to
integers.

This works especially well with Knuth's WEB system.  WEB has a notion of
preprocessed strings, i.e.

    "a string"

When Tangling, a string table is maintained and all the strings are
replaced in the output text with integers.  Further, the string table is
written to a pool file, suitable for initializing your compiler's hash
table (or TeX's, or whatever).  Preprocessed strings are also a nice way
to handle error messages and so forth, potentially saving much storage.

LISPers will wonder what the fuss is about, as they use symbols as a
matter of course.

Preston Briggs
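A minimal sketch of the integer-interning idea, with the names, the pool
representation, and the keyword cutoff invented for illustration (a real
scanner would hash rather than scan linearly):

    #include <string.h>
    #include <stdlib.h>

    #define N_KEYWORDS 3    /* ids below this value are keywords */

    /* Grow-only list of interned strings; a string's index in this
     * array is the integer the rest of the front end works with. */
    static const char *pool[1024];
    static int npool;

    /* Return the unique small integer for `name', interning it if it
     * is new.  Slots 0..N_KEYWORDS-1 are preloaded, so any result
     * < N_KEYWORDS means `name' is a keyword. */
    static int intern(const char *name)
    {
        int i;

        for (i = 0; i < npool; i++)
            if (strcmp(pool[i], name) == 0)
                return i;
        pool[npool] = strdup(name);
        return npool++;
    }

    static void init_keywords(void)
    {
        intern("if");       /* id 0 */
        intern("while");    /* id 1 */
        intern("return");   /* id 2 */
    }

After init_keywords(), the lexer tests intern(text) < N_KEYWORDS to spot
a keyword, and the very same integer doubles as the string-table index
everywhere else in the front end.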
brain@eos.ncsu.edu (Marshall Brain) (12/19/90)
Cichelli's work in perfect hashing was extended in the paper:

"Near-Perfect Hashing of Large Word Sets," by Brain and Tharp, Software -
Practice and Experience, Vol. 19, No. 10 (October 1989), pp. 967-978.

This algorithm could handle around 1000 words.  An extremely simple
algorithm, also based on Cichelli's work and able to handle up to 5000 or
so words, appeared in the paper:

"Perfect Hashing Using Sparse Matrix Packing," by Brain and Tharp,
Information Systems, Vol. 15, No. 3 (1990), pp. 281-290.
rfpfeifl@watcgl.uwaterloo.ca (Ron Pfeifle) (12/22/90)
[Here's another reference.]
From: "Gordon V. Cormack" <gvcormack@watdragon>
Cormack G.V., Horspool R.N., and Kaiserswerth M.
Practical Perfect Hashing,
Computer Journal 28:1 (1985), 54-58.