[comp.lang.c] HASHING, revisited - A summary.

NU013809@NDSUVM1.BITNET (Greg Wettstein) (01/20/89)

I would like to thank everyone who took time to respond to my question
about the use of 'HASH'ing and the fundamental algorithms and techniques
involved.  I received numerous replies and several code examples.  I
will attempt to paraphrase the salient points of the replies that I
received.

The primary purpose of hashing is to implement extremely rapid table
lookup of symbols or character strings.  The process of 'HASH'ing a
string is the translation of the string into a (hopefully) unique
numerical value.  The goal of this mapping of an input domain into an
output domain is to generate a unique index which can be used to rapidly
locate the string at some later time.
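
As a concrete illustration, a 'HASH'ing function of this kind might look
something like the following sketch, which is similar in spirit to the one
in the second edition of Kernighan and Ritchie (the multiplier 31 is simply
a conventional choice, not the only possible one):

    /* Map a character string onto an unsigned numerical value.  A     */
    /* sketch only; the multiplier 31 is just a conventional choice.   */
    unsigned hash(const char *s)
    {
        unsigned hashval;

        for (hashval = 0; *s != '\0'; s++)
            hashval = (unsigned char)*s + 31 * hashval;
        return hashval;
    }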

The selection of an algorithm to implement the mapping process is
extremely critical to the implementation of a HASH based search
technique.  If the input domain is explicitly defined it is possible to
develop a mapping (HASH) function which will always generate a unique
numerical value for each member of the input domain.  In fact Knuth
comments in his Volume 3 (Sorting and Searching) that constructing such
a function is an interesting problem.  In practice these perfect HASH functions are
seldom utilized since the input domain is seldom if ever explicitly
defined.
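
As a contrived illustration, if the input domain were known to consist of
exactly the three keywords "if", "for" and "while", hashing on the first
character alone happens to give each key its own slot in a four-entry
table; the example collapses as soon as the domain changes, which is
precisely why such functions are rarely practical:

    /* Toy perfect hash for the fixed key set "if", "for" and "while".  */
    /* 'i' % 4 == 1, 'f' % 4 == 2, 'w' % 4 == 3 (in ASCII), so each key */
    /* gets a distinct slot; any other input domain breaks the scheme.  */
    unsigned perfect_hash(const char *s)
    {
        return (unsigned char)s[0] % 4;
    }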

If a perfect 'HASH'ing function cannot be used, the 'HASH'ing and
indexing algorithms must implement some type of collision resolution
mechanism.  Collision is the term used to define the instance where two
members of the input domain map to the same member of the output domain.

   As a brief aside it is interesting to note that the problem of
   collision makes a CRC based file protection scheme always suspect.
   This problem was discussed extensively in the MS-DOS and IBM-PC
   newsgroups when programs such as FLU-SHOT were being analyzed.
   Since a cyclic redundancy checksum maps an arbitrarily long file onto a
   small fixed-size value, collisions are unavoidable, and two different
   programs can always be made to generate the same CRC checksum.

Resolution of the collision problem usually involves the employment of
some type of linear or linked list search technique for all members of
the input domain which map to an identical output value.  The groups of
entries sharing an output value are typically referred to as 'buckets'.
When a string is looked up using this collision resolution technique the
members of the corresponding bucket must be searched to ensure that the
desired table entry
matches the input string.  The advantage of a HASH based search technique
is that it is much faster to search a linear list in each bucket than to
do a linear search through all members of the input domain.
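
In C the buckets are often represented as an array of pointers to singly
linked list nodes; a hypothetical declaration might be:

    #define NBUCKETS 101             /* table size; a prime, chosen arbitrarily */

    struct entry {
        char         *string;        /* the stored key                     */
        int           value;         /* parameter associated with the key  */
        struct entry *next;          /* next entry hashing to this bucket  */
    };

    static struct entry *table[NBUCKETS];   /* one chain per bucket */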

A common question when implementing a HASH based lookup table is how to
translate a supposedly random number into a finite set of 'buckets',
which are typically the indexes into the lookup table.  The usual
technique is to divide the HASH value by the table size, typically a
prime number, and take the remainder of the division as the bucket
number or array index.
The overall goal in the process of hashing and bucket index generation
is to develop a process which disperses the input keys as evenly as
possible among all the output buckets.  The selection of an appropriate
algorithm is influenced by factors such as the relative similarity and
structure of the input keys.
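
Using the hash function and the prime table size sketched above, the
bucket index is simply the remainder of that division:

    /* Reduce a hash value to a bucket number; NBUCKETS is the prime    */
    /* table size declared above.                                       */
    unsigned bucket_index(const char *s)
    {
        return hash(s) % NBUCKETS;
    }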

The following is a typical algorithm used for a table lookup which
embodies a collision resolution technique; a sketch in C follows the
numbered steps.  Assume that the HASH function
is used to generate an index into an array.  Each member of the array is
a pointer to a structure which contains the string, some parameter which
is to be associated with the string and a pointer to another structure
which contains a different string but which generated the same table
index, 'colliding strings' if you will.

   1.) Translate the input string into a numerical value using the
       'HASH'ing function.

   2.) Divide the numerical value obtained in step one by a prime number
       to yield a 'bucket' value or array index.

   3.) Use the value obtained in step two as an index into the array of
       structure pointers.  Compare the string value in the first
       structure with the input string to see if the two values match.
       If they do, the search is successful and complete.

   4.) If the comparison in step three fails the pointer in the
       structure should be used to follow the linked list of
       structures until a matching string is found.  If the end of the
       list is encountered the table search fails.
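
Put together, and building on the hash function and table declarations
sketched above, steps one through four might be coded roughly as follows
(the names are, of course, only illustrative):

    #include <string.h>

    /* Steps 1-4: hash the string, reduce the value to a bucket index,  */
    /* then walk the chain in that bucket looking for a matching key.   */
    /* Returns the matching entry, or NULL if the search fails.         */
    struct entry *lookup(const char *s)
    {
        struct entry *ep;

        for (ep = table[hash(s) % NBUCKETS]; ep != NULL; ep = ep->next)
            if (strcmp(s, ep->string) == 0)
                return ep;           /* steps 3/4: match found       */
        return NULL;                 /* end of chain: search fails   */
    }

Insertion follows the same path: when the search fails, a new node is
simply linked in at the head of the appropriate chain.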

The preceding algorithm is of course a very cursory one but does give a
general outline of the procedures employed.  Almost all the responses I
received recommended a number of good references which can be pursued
if anyone has any further interest.  The references suggested are as
follows:

   1.) Data Structures and Algorithms by Aho, Hopcroft and Ullman

   2.) The C Programming Language (Second Edition) by Kernighan and Ritchie

   3.) The Art of Computer Programming: Volume 3 Sorting and Searching
       by Knuth

   4.) Algorithms by Sedgewick

   5.) Compilers: Principles, Techniques and Tools by Aho, Sethi and
       Ullman

   6.) Data Structures, Algorithms and Program Style Using C by Korsh
       and Garrett.


Once again I would like to offer my sincere thanks to everyone who took
time to respond.  I received a great deal of useful information and
a number of the preceding books spent a fair amount of time on the end
table during the last couple of weeks.  Once again thanks to everyone
for the information.


                                         As always,
                                         G.W. Wettstein
                                         BITNET: NU013809@NDSUVM1

ray@micomvax.UUCP (Ray Dunn) (01/26/89)

As Greg's posting may be taken as "definitive" and squirrelled away by readers
for future reference, it may be best to correct a couple of small points:

In article <1761NU013809@NDSUVM1> NU013809@NDSUVM1.BITNET (Greg Wettstein) writes:
>A common question when implementing a HASH based lookup table is how
>does one translate a supposedly random number into a finite set of
>'buckets' which are typically the indexes into the lookup table.  The
>technique used here is to divide the HASH value by a prime number and
>take the modulus of the division to be the bucket number or array index.

In fact the "most common" method is to employ a table whose length is a
power of 2.  The "index" is then generated by simply ANDing the hash value
with the table size minus 1.
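
For example, with a 256-entry table the index is just the hash value masked
down to its low-order eight bits; the table size here is only a placeholder:

    #define TABSIZE 256              /* table length; must be a power of 2 */

    /* Masking with TABSIZE - 1 keeps only the low-order bits of the    */
    /* hash value; equivalent to hashval % TABSIZE for a power-of-2     */
    /* table, but without a division.                                   */
    unsigned mask_index(unsigned hashval)
    {
        return hashval & (TABSIZE - 1);
    }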

>The overall goal in the process of hashing and bucket index generation
>is to develop a process which disperses the input keys as evenly as
>possible among all the output buckets.

This is a very common mistake made when designing hashing algorithms.
In fact the goal should be to minimize the *total* time spent in the
searching algorithm over a statistically significant number of calls.  This
involves taking the *frequency of occurrence* of symbols or classes of
symbols into account when designing the mechanism.

If the tokens "( ) ; ," each occur significantly more frequently than the
tokens "while for switch do",  then it would be reasonable to design the
algorithm so that the former items were located faster than the latter, i.e.
the length of the search chains would be inversely proportional to the
frequency of occurrence of the tokens on the chains.

This can be achieved in a variety of ways, including ad hoc tests for
specific high frequency items prior to applying a generalized routine.
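
A rough sketch of such an ad hoc pre-test, here for a handful of very
frequent single-character tokens in front of a hypothetical general
routine:

    enum token { TOK_LPAREN, TOK_RPAREN, TOK_SEMI, TOK_COMMA, TOK_OTHER };

    extern enum token general_lookup(const char *s);  /* hypothetical hashed lookup */

    /* Test for a few very high frequency tokens before paying the cost */
    /* of the generalized hashed search.                                */
    enum token lookup_token(const char *s)
    {
        if (s[1] == '\0')            /* single-character token?         */
            switch (s[0]) {
            case '(': return TOK_LPAREN;
            case ')': return TOK_RPAREN;
            case ';': return TOK_SEMI;
            case ',': return TOK_COMMA;
            }
        return general_lookup(s);    /* everything else                 */
    }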

In the case of a fixed data-base, (for example the reserved words in a
compiler), it does not take much effort to analyze a filesystem or three for
token frequencies and then structure and *order* the dictionary accordingly
(i.e. order the linked lists with the most frequent tokens first).

It is also important to remember that a naive algorithm may well prove just
as good overall as a *time-consuming* sophisticated one, particularly if
the search chains are short.  Equate the time taken by the algorithm to
extra "dead" length on the search chains.

This was all debated in this newsgroup as recently as August of last year.
-- 
Ray Dunn.                      |   UUCP: ..!philabs!micomvax!ray
Philips Electronics Ltd.       |   TEL : (514) 744-8200   Ext: 2347
600 Dr Frederik Philips Blvd   |   FAX : (514) 744-6455
St Laurent. Quebec.  H4M 2S9   |   TLX : 05-824090