johnl@ima.UUCP (01/25/87)
Hi, Has anyone implemented a program for finding minimal perfect hashing functions? The best reference that I have for them is CACM May '85 "A Polynomial Time Generator for Minimal Perfect Hash Functions". The algorithm is somewhat lengthy (and messy) so I would really not like to have to re-invent the wheel. If you have such a beastie, could you please send me a copy? Thanks. And maybe mod.sources would like a copy too. Thanks in advance, Tony ;-) [I haven't seen anything since then, but as always encourage submissions from readers. -John] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (01/28/87)
Original-sender: I am curious to know if any really uses this technique (of minimal perfect hashing). The reason I ask is that it involves polynomials which are rather time consuming to compute. Or am I wrong about that? I always use a binary search method since that comsumes less than 10 compares of strings. Usually the compares fail on the first character. So in about 10 add's and divide by two (hopefully done with a shift) followed by a compare, you are done. That's pretty quick. Also, the keyword list can change easily without much rewrite of the code. What does the typical "real" polynomial come out to be? Does it really take less time than the binary search method. (I would judge this based upon a miss of the table since most identifiers in a program are not keywords.) -- Perry Smith pedz@bobkat {ti-csl,infotel}!pollux!bobkat!pedz -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (02/03/87)
In response to the article suggesting that binary search might be more efficient than a minimal perfect hashing function for keyword search: You may be right about the superiority of binary search over the use of hashing in a keyword table -- it depends crucially on several details which most people tend to overlook when they describe their favorite keyword-determination algorithm. For instance, when you do your "up to 10 string compares", do you call a function each time? If so, it almost certainly loses to a good hashing scheme. (Actually, 10 compares seems too high, since the size of the average keyword table is probably about 50 items, but even 6 compares might be slower than hashing.) As for hashing schemes, the efficiency issues include: time to calculate the hash function; number of keywords that collide ("perfectness"); expected ratio of non-keywords to keywords; density of keywords in the table ("minimalness"). Although it is obviously an advantage to have a perfect hashing function, because there need be no mechanism to resolve collisions, it actually WASTES time to have a minimal sized table, because every non-keyword will then collide with a keyword when hashed, requiring a string compare to verify that it is not a keyword. Unless you are cramped for space, or most of your identifiers are expected to be keywords, the best density for your hash table is less than 1/5 full, so that 4 out of 5 non-keyword identifiers will hash to an empty location, which can be instantly recognized as such without a string compare. Given that you don't want a MINIMAL hash function, the construction of a PERFECT, very fast one is not hard at all. This is how I did it for two different languages: With a list of keywords in front of you, find two or three trivally computable attributes of an identifier that distinguish all the keywords from one another. These attributes can almost always be chosen from the following: first, second or third letters; identifier length; last or second last letters. This should take you about half an hour at most. Now, let's say you decide to use first letter, third letter and identifier length. The next decision is to choose a table size large enough to make most identifiers hash to empty positions, say 257 (a prime number is best). The only remaining task is to write a program that finds values for the variables (i, j, k) in the function below, such that there are no collisions: len = strlen(id); hash = ( i * id[0] + j * id[len>=3 ? 2 : len] + k * len ) % 257; The program can start with, say, (i,j,k) = (1,10,40) and iterate through increasing values in any order you choose. Make sure, however, that i,j and k are large enough to allow at least 257 different values to be pro- duced for the mod (%) operation. You will find that, with less than 50 keywords and a density of less than 1/5, you don't have to try many combinations of i,j and k to hit a perfect hash. My program ran for a couple of minutes in both cases. The above hash function takes three multiply's, two adds, a mod, and a check on the length of the id. Unlike the binary search method, no additional string compare is necessary for 4 out of 5 non-keyword identifiers, although every keyword requires a string compare. (I assume that you would already have in hand the length, "len", of the identifier, if you care at all about efficiency, so I do not include the time to calculate strlen as part of the hash. In the typical lexer that reads identifiers into a token buffer, it can be computed in one instruction by subtracting the token buffer address from the pointer to the end of the token in the buffer.) My program to find a perfect hash function as described above is all of 93 lines of C, including reading the list of keywords from a file into an array, and printing out the values of i, j, k and the hashed values for each keyword. If anyone thinks it worthwhile, I'd be happy to mail it to them. Michael Condict {ihnp4|vax135}!m10ux!mnc AT&T Bell Labs (201)582-5911 MH 3B-416 Murray Hill, NJ -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request