NU013809@NDSUVM1.BITNET (Greg Wettstein) (12/27/88)
In following discussions in comp.lang.c I have frequently seen discussions involving 'hashing' techniques and algorithms. These discussions were particularly prevalent when the shift ( << >> ) operators were being discussed. While I have done a fair amount of programming in C I have never utilized 'hashing' algorithms in any of my applications up until now. I was given a piece of application code just before the holiday season which needs some work to smooth out some bugs which user's have been reporting. One module of the code is heavily dependent on a function which purports to 'HASH' a series of input strings. As near as I can tell from the code the 'hashing' function seems to generate a unique numerical value for each string. Is this the intended purpose of hashing or have I misinterpreted the code? The code seems to be dependent on shifting an input character and on a table of prime numbers. I would be interested in whatever information the net could provide on hashing and HASH algorithms. E-mail or posts to the net would be just fine. I will be happy to summarize any e-mail responses I receive. Thanks in advance for any primers the net is willing to supply. As always, G.W. Wettstein BITNET: NU013809@NDSUVM1
chris@mimsy.UUCP (Chris Torek) (12/29/88)
In article <1687NU013809@NDSUVM1> NU013809@NDSUVM1.BITNET (Greg Wettstein) writes: >I would be interested in whatever information the net could provide >on hashing and HASH algorithms. No matter how much you find from the net, there is more in Knuth, vol. ?, *Sorting and Searching*. -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
gregg@ihlpb.ATT.COM (Wonderly) (12/30/88)
From article <1687NU013809@NDSUVM1>, by NU013809@NDSUVM1.BITNET (Greg Wettstein): >In following discussions in comp.lang.c I have frequently seen discussions >involving 'hashing' techniques and algorithms. These discussions were >particularly prevalent when the shift ( << >> ) operators were being discussed. > >While I have done a fair amount of programming in C I have never utilized >'hashing' algorithms in any of my applications up until now. I was given a >piece of application code just before the holiday season which needs some work >to smooth out some bugs which user's have been reporting. One module of the >code is heavily dependent on a function which purports to 'HASH' a series of >input strings. > >As near as I can tell from the code the 'hashing' function seems to generate >a unique numerical value for each string. Is this the intended purpose of >hashing or have I misinterpreted the code? The code seems to be dependent on >shifting an input character and on a table of prime numbers. The purpose of hashing is to provide a mapping from a very large in range, but sparsely populated set to a smaller set in such a way as to make it both efficient and practical to deal with the sparse data. Strings happen to be a good example (if not the classic) of a sparse set. The classic method of hashing is to 'fold' the string into an integer by using each character's numeric representation, ASCII, EBCDIC or other, as a factor of an equation. Typically, the equation is recursive in that one of the factors is the previous value of the equation. Put more simply, and in C value = 0; while (*s) value += *s++; This fragment merely adds the characters together, yielding an integer value which is unique (that's the important part) for a particular string pointed to by 's'. This equation is rather simple, and can yield poorly distributed mappings. However, in the general case it is enough to do a respectable job. You can use shifting and bit masking to make sure that degenerate cases, those where the strings contain the same characters, still yield varying values. e.g. the above method yields the same value for all of the strings, string tsring gnirts gnitsr and any other string with exactly those characters. The use of this integer value is the next problem. There are several things that this integer can represent. The most classic is an array index, in which case the modulo operatation is applied to the value using the size of the array. The result is then an index into an array, which is within the bounds of the array. The problem with hashing is that the mapping from set to set is onto, but not one-to-one. Therefore, more than one element from the original set may map onto the same element (array index in this example) of the resulting set (called a collision). There are many different was to handle collisions. If you know how many items will be in the original set, you can make you result array just big enough to hold all the items. When a collision occurs, you apply the equation value' = (value + n) % m to value to get value' which is a different, but well defined, value, until you find an empty slot. 'n' is chosen to be prime to 'm' (the array size) so that all elements of the array are examined before there is any repetition. It turns out that this is fairly good in terms of performance until the array gets to be about 60-70% full. At this point, the collision resolution starts costing a lot. The better if not best method is to make the array be an array of pointers (this method is called chaining) which make up a linear linked list. Whether or not a collision occurs, the item is inserted at the beginning (the list may also be ordered) of the list. The big plus here is that collisions no longer cause degenerate behavior for all items inserted in the future. Only the items on a particular list will suffer performance hits. There is a lot more to hashing including how it can be used on secondary storage to guarantee one and only one disk access. If you really want a lot more information, see the textbooks... -- It isn't the DREAM that NASA's missing... DOMAIN: gregg@ihlpb.att.com It's a direction! UUCP: att!ihlpb!gregg