leichter@yale-com.UUCP (Jerry Leichter) (10/04/83)
There is work of two sorts done in this direction. First, there are some articles on generating PERFECT hash functions for given sets of inputs. A perfect hash function has no collisions at all. You would use something like this to produce a very quick hash-table lookup for keywords in a compiler. You can find these articles in the CACM within the last 5 years or so. Second, there is a general approach, due to Carter and Wegman. They introduce "universal classes of hash functions". Such a class has the property that a randomly chosen member is highly likely to give a very good hash ("highly likely" and "very good" having specific meanings in this context) for ANY particular set of inputs to be hashed. The only place I know of in which this has been written up is in Proceedings of the 9th SIGACT (1977), page 106. They prove some practical classes of functions - essentially of the form Ax+B mod C, where C is fixed for the class and A and B vary to give all class members - are universal. Beyond this, getting any sort of guarantees of the quality of a hash function for a given, random set of inputs is extremely difficult. -- Jerry decvax!yale-comix!leichter leichter@yale
condict@csd1.UUCP (Michael Condict) (10/05/83)
Yes, there is are several ways to generate such hash functions. One such method was reported in Communications a few years back (1979?) in an article by my old friend Richard Cichelli (sorry, Richard, you're not that old). If he is on the net I am sure he will be happy to respond to your request. Meanwhile I'll tell you what I can remember. The assumption is slightly different from what you stated in that we allow the hash function to produce any output it wants (between 1 and n), with the objective, of course, of minimizing collisions and maximizing density of the hash table. So the input to the generator is a list of character strings and an allowed range 1..n, rather than a list of string/number pairs. A hash function is said to be perfect if it produces no collisions over the entire set of allowed inputs. It is said to be minimal if it produces a completely dense hash table, that is, if its set of outputs form a contiguous run of numbers. As I recall, Cichelli's method would produce a minimal perfect hash function in a quite reasonable amount of time (a few seconds or minutes to get one for the 32 Pascal reserved words, using a PDP/11). The form of the hash function was very simple. It consisted of h(str) = code1[str[1]] + code2[str[strleng]] + strleng; where str is the array of characters that is the input to the hash function, strleng is its length, and code1, code2 are two arrays that are produced by the generator. They are of type: array [char] of integer; Their purpose is to convert the first and last characters of the string into numbers that, when added with the string length will produce a unique number for each string. Of course one could use the 2nd, 3rd or other characters in str as well, but remember that only the first and last are guaranteed to exist. The problem, clearly, is that for some sets of inputs the combination of first char, last char and length is not unique, in which case the method will fail to produce a perfect hash function. Sorry I cannot remember the algorithm at all, but this should tell you if it is what you are looking for, and it shouldn't be hard to find in CACM. Michael Condict ...!cmcl2!csd1!condict New York U.
charlie@genrad.UUCP (Charlie Havener) (10/05/83)
I have implemented the Cichelli Algorithm for finding Minimal Perfect Hashing Functions in a C program under UNIX 4.1BSD. Is there enough interest for me to post it to net.sources? Send me mail ...!decvax!genrad!charlie