NU013809@NDSUVM1.BITNET (Greg Wettstein) (01/20/89)
I would like to thank everyone who took time to respond to my question about the use of 'HASH'ing and the fundamental algorithms and techniques involved. I received numerous replies and several code examples. I will attempt to paraphrase the salient points of the replies that I received.

The primary purpose of hashing is to implement extremely rapid table lookup of symbols or character strings. 'HASH'ing a string means translating the string into a (hopefully) unique numerical value. The goal of this mapping of an input domain into an output domain is to generate a unique index which can be used to rapidly locate the string at some later time.

The selection of an algorithm to implement the mapping process is critical to any HASH based search technique. If the input domain is explicitly defined, it is possible to develop a mapping (HASH) function which will always generate a unique numerical value for each member of the input domain. In fact Knuth comments in his Volume 3 (Sorting and Searching) that constructing such a function is an interesting problem. In practice these perfect HASH functions are seldom utilized, since the input domain is seldom if ever explicitly defined.

If a perfect 'HASH'ing function cannot be used, the 'HASH'ing and indexing algorithms must implement some type of collision resolution mechanism. A collision is the case where two members of the input domain map to the same member of the output domain.

As a brief aside, it is interesting to note that the collision problem makes a CRC based file protection scheme always suspect. This problem was discussed extensively in the MS-DOS and IBM-PC newsgroups when programs such as FLU-SHOT were being analyzed. Since cyclic redundancy checksum methods are not guaranteed to be collision free, there is always the possibility that two programs can be made to generate the same CRC checksum.
Resolution of the collision problem usually involves some type of linear or linked list search through all members of the input domain which map to an identical output value. The output domains in this instance are typically referred to as 'buckets'. When a string is hashed using this collision resolution technique, the members of the matching bucket must be searched to ensure that the desired table entry matches the input string. The advantage of a HASH based search technique is that it is much faster to search the short linear list in each bucket than to do a linear search through all members of the input domain.

A common question when implementing a HASH based lookup table is how one translates a supposedly random number into a finite set of 'buckets', which are typically the indexes into the lookup table. The usual technique is to divide the HASH value by a prime number and take the remainder of the division to be the bucket number or array index. The overall goal in the process of hashing and bucket index generation is to disperse the input keys as evenly as possible among all the output buckets. The selection of an appropriate algorithm is influenced by factors such as the relative similarity and structure of the input keys.

The following is a typical table lookup algorithm which embodies a collision resolution technique. Assume that the HASH function is used to generate an index into an array. Each member of the array is a pointer to a structure which contains a string, some parameter which is to be associated with the string, and a pointer to another structure which contains a different string that generated the same table index, 'colliding strings' if you will.

1.) Translate the input string into a numerical value using the 'HASH'ing function.

2.) Divide the numerical value obtained in step one by a prime number and take the remainder to yield a 'bucket' value or array index.

3.) Use the value obtained in step two as an index into the array of structure pointers. Compare the string value in the first structure with the input string to see if the two values match. If they do, the search is successful and complete.

4.) If the comparison in step three fails, the pointer in the structure should be used to follow the linked list of structures until a matching string is found. If the end of the list is encountered, the table search fails.

The preceding algorithm is of course a very cursory one, but it does give a general outline of the procedures employed.

Almost all the responses I received recommended a number of good references which can be pursued by anyone with further interest. The references suggested are as follows:

1.) Data Structures and Algorithms by Aho, Hopcroft and Ullman
2.) The C Programming Language, Second Edition, by Kernighan and Ritchie
3.) The Art of Computer Programming, Volume 3: Sorting and Searching by Knuth
4.) Algorithms by Sedgewick
5.) Compilers: Principles, Techniques and Tools by Aho, Sethi and Ullman
6.) Data Structures, Algorithms and Program Style Using C by Korsh and Garrett

Once again I would like to offer my sincere thanks to everyone who took time to respond. The replies were most informative, and a number of the preceding books spent a fair amount of time on the end table during the last couple of weeks.

As always,
G.W. Wettstein                          BITNET: NU013809@NDSUVM1
ray@micomvax.UUCP (Ray Dunn) (01/26/89)
As Greg's posting may be taken as "definitive" and squirrelled away by readers for future reference, it may be best to correct a couple of small points.

In article <1761NU013809@NDSUVM1> NU013809@NDSUVM1.BITNET (Greg Wettstein) writes:
>A common question when implementing a HASH based lookup table is how
>does one translate a supposedly random number into a finite set of
>'buckets' which are typically the indexes into the lookup table. The
>technique used here is to divide the HASH value by a prime number and
>take the modulus of the division to be the bucket number or array index.

In fact the *most common* method is to employ a table whose length is a power of 2. The index is then generated by simply ANDing the hash value with (table size - 1).

>The overall goal in the process of hashing and bucket index generation
>is to develop a process which disperses the input keys as evenly as
>possible among all the output buckets.

This is a very common mistake made when designing hashing algorithms. In fact the goal should be to minimize the *total* time spent in the searching algorithm over a statistically significant number of calls. This involves taking the *frequency of occurrence* of symbols or classes of symbols into account when designing the mechanism.

If the tokens "( ) ; ," each occur significantly more frequently than the tokens "while for switch do", then it would be reasonable to design the algorithm so that the former items are located faster than the latter, i.e. the length of the search chains would be inversely proportional to the frequency of occurrence of the tokens on the chains. This can be achieved in a variety of ways, including ad hoc tests for specific high frequency items prior to applying a generalized routine. In the case of a fixed data-base (for example, the reserved words in a compiler), it does not take much effort to analyze a filesystem or three for token frequencies and then structure and *order* the dictionary accordingly (i.e. order the linked lists with the most frequent tokens first).

It is also important to remember that a naive algorithm may well prove just as good overall as a *time consuming* sophisticated one, particularly if the search chains are short. Equate the time taken by the algorithm to extra "dead" length on the search chains.

This was all debated in this newsgroup as recently as August of last year.
--
Ray Dunn.                    | UUCP: ..!philabs!micomvax!ray
Philips Electronics Ltd.     | TEL : (514) 744-8200 Ext: 2347
600 Dr Frederik Philips Blvd | FAX : (514) 744-6455
St Laurent. Quebec. H4M 2S9  | TLX : 05-824090