[comp.lang.c] 'HASH'ing techniques.

NU013809@NDSUVM1.BITNET (Greg Wettstein) (12/27/88)

In following discussions in comp.lang.c I have frequently seen discussions
involving 'hashing' techniques and algorithms.  These discussions were
particularly prevalent when the shift ( << >> ) operators were being discussed.

While I have done a fair amount of programming in C I have never utilized
'hashing' algorithms in any of my applications up until now.  I was given a
piece of application code just before the holiday season which needs some work
to smooth out some bugs which user's have been reporting.  One module of the
code is heavily dependent on a function which purports to 'HASH' a series of
input strings.

As near as I can tell from the code the 'hashing' function seems to generate
a unique numerical value for each string.  Is this the intended purpose of
hashing or have I misinterpreted the code?  The code seems to be dependent on
shifting an input character and on a table of prime numbers.

I would be interested in whatever information the net could provide on hashing
and HASH algorithms.  E-mail or posts to the net would be just fine.  I will
be happy to summarize any e-mail responses I receive.  Thanks in advance for
any primers the net is willing to supply.

                                               As always,
                                               G.W. Wettstein
                                               BITNET: NU013809@NDSUVM1

chris@mimsy.UUCP (Chris Torek) (12/29/88)

In article <1687NU013809@NDSUVM1> NU013809@NDSUVM1.BITNET (Greg Wettstein)
writes:
>I would be interested in whatever information the net could provide
>on hashing and HASH algorithms.

No matter how much you find from the net, there is more in Knuth,
vol. ?, *Sorting and Searching*.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

gregg@ihlpb.ATT.COM (Wonderly) (12/30/88)

From article <1687NU013809@NDSUVM1>, by NU013809@NDSUVM1.BITNET (Greg Wettstein):
>In following discussions in comp.lang.c I have frequently seen discussions
>involving 'hashing' techniques and algorithms.  These discussions were
>particularly prevalent when the shift ( << >> ) operators were being discussed.
> 
>While I have done a fair amount of programming in C I have never utilized
>'hashing' algorithms in any of my applications up until now.  I was given a
>piece of application code just before the holiday season which needs some work
>to smooth out some bugs which user's have been reporting.  One module of the
>code is heavily dependent on a function which purports to 'HASH' a series of
>input strings.
> 
>As near as I can tell from the code the 'hashing' function seems to generate
>a unique numerical value for each string.  Is this the intended purpose of
>hashing or have I misinterpreted the code?  The code seems to be dependent on
>shifting an input character and on a table of prime numbers.

The purpose of hashing is to provide a mapping from a very large in
range, but sparsely populated set to a smaller set in such a way as to
make it both efficient and practical to deal with the sparse data.

Strings happen to be a good example (if not the classic) of a sparse
set.  The classic method of hashing is to 'fold' the string into an
integer by using each character's numeric representation, ASCII, EBCDIC
or other, as a factor of an equation.  Typically, the equation is
recursive in that one of the factors is the previous value of the equation.
Put more simply, and in C

	value = 0;

	while (*s)
		value += *s++;

This fragment merely adds the characters together, yielding an integer
value which is unique (that's the important part) for a particular
string pointed to by 's'.  This equation is rather simple, and can yield
poorly distributed mappings.  However, in the general case it is enough
to do a respectable job.  You can use shifting and bit masking to make
sure that degenerate cases, those where the strings contain the same
characters, still yield varying values.  e.g.  the above method yields
the same value for all of the strings,

	string tsring gnirts gnitsr

and any other string with exactly those characters.

The use of this integer value is the next problem.  There are several
things that this integer can represent.  The most classic is an array
index, in which case the modulo operatation is applied to the value using
the size of the array.  The result is then an index into an array, which
is within the bounds of the array.

The problem with hashing is that the mapping from set to set is onto,
but not one-to-one.  Therefore, more than one element from the original
set may map onto the same element (array index in this example) of the
resulting set (called a collision).

There are many different was to handle collisions.  If you know how
many items will be in the original set, you can make you result array
just big enough to hold all the items.  When a collision occurs, you
apply the equation

	value' = (value + n) % m

to value to get value' which is a different, but well defined, value,
until you find an empty slot.  'n' is chosen to be prime to 'm' (the
array size) so that all elements of the array are examined before there is
any repetition.

It turns out that this is fairly good in terms of performance until the
array gets to be about 60-70% full.  At this point, the collision resolution
starts costing a lot.

The better if not best method is to make the array be an array of
pointers (this method is called chaining) which make up a linear linked
list.  Whether or not a collision occurs, the item is inserted at the
beginning (the list may also be ordered) of the list.

The big plus here is that collisions no longer cause degenerate behavior
for all items inserted in the future.  Only the items on a particular
list will suffer performance hits.

There is a lot more to hashing including how it can be used on secondary
storage to guarantee one and only one disk access.  If you really want
a lot more information, see the textbooks...


-- 
It isn't the DREAM that NASA's missing...  DOMAIN: gregg@ihlpb.att.com
It's a direction!                          UUCP:   att!ihlpb!gregg