djones@megatest.UUCP (Dave Jones) (11/07/90)
I did a little bit of experimentation with the hash-functions which have been posted here. Nothing scientific, mind you, but very interesting.

I wrote a program which takes a set of strings and stores each in a hash-table. (Lookups are equivalent to stores, for our purposes, because both operations hash a key and then look for it using rehashing.) I made two versions, each based on my standard library hash-package: one using the simple little add-and-shift function that I posted here, the other using the CRC-16 algorithm. I then ran each on various data, including source files and the UNIX spelling-checker dictionary "/usr/dict/words".

The result was that the simpler hash function consistently gave much better store/lookup times than the CRC version did. Not even close. Interestingly, the simpler hash-function was not only much quicker to calculate, but it also gave far fewer collisions.

Using the /usr/dict/words data, each hash-function was called 24473 times -- once for each word in the file. The shift-and-add version called the string-comparison routine 39655 times, a collision rate of about 0.6 collisions per lookup -- right around the theoretical rate for this particular algorithm. But the CRC-16 function called the string-comparison routine no fewer than 270,662 times, for a collision rate of about 10.0 collisions (and re-collisions) per lookup!

I think I can rationalize that. The CRC-16 function does indeed spread the set of all strings uniformly over the interval 0 .. 2**16 - 1. But for n < 16, it does not uniformly spread short strings over the interval 0 .. 2**n - 1 in the bottom n bits. Also consider that in ASCII (w/o parity) we are really only dealing with bit-streams in which every eighth bit is a zero -- a very regular pattern for a 16-bit checksum. That may have something to do with it, I dunno.

A little experimenting suggests that it may be possible to devise better polynomials than the CRC-16. I got one that seems to work pretty well. But I don't think it can distribute the keys well enough to compensate for the extra time it takes to calculate a checksum. (Besides that, any hash-function which does not at least take into account the range of the printable characters will probably distribute typical text-strings less than optimally, I suspect.)
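[A minimal sketch, in C, of the add-and-shift style of function described above. The actual posted function does not appear in this thread, so the one-bit shift, the modulo fold, and all the names here are assumptions, not Dave's code:]

#include <stdio.h>

/* Shift the running hash left one bit, then add the next character.
 * The result is folded into the table's index range with a modulo. */
unsigned shift_add_hash(const char *s, unsigned table_size)
{
    unsigned h = 0;

    while (*s)
        h = (h << 1) + (unsigned char)*s++;

    return h % table_size;
}

int main(void)
{
    /* two similar words usually land in different slots */
    printf("%u %u\n", shift_add_hash("word", 1009),
           shift_add_hash("ward", 1009));
    return 0;
}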
djones@megatest.UUCP (Dave Jones) (11/08/90)
I just now tested some hash-functions for an application which is probably more typical, namely a compiler. The simplest function is still a big winner on all counts.

I ran tests on a standard Pascal header-file containing 6290 identifiers, including many duplicates (such as "procedure"). I got 370 collisions using my hash-function, for a rate of only 0.06 collisions per lookup! That was with a real big table -- lots of empty slots. When I reduced the table size so that only about half the slots were empty, I got a little under one collision per lookup, which matches the theoretical rate (1/2 + 1/2*1/2 + ... = 1). Under these circumstances, anything around one collision per lookup should generally be considered pretty good.

Surprisingly, the CRC-16 hash-function was not too bad in this test, giving a collision rate of 1.24. Why it did so very poorly on the dictionary, I don't know. And of course the relative slowness of calculating it is also important, so I don't think it has anything at all going for it as far as hashing char-strings is concerned. I'm wondering why it was even suggested.
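[A hypothetical harness in the same spirit as the tests above -- not Dave's hash-package -- showing one way such a collisions-per-lookup figure can be counted: each string comparison against a slot occupied by some other key is one collision, after which the probe rehashes. The table size, the linear probe step, and the hash are illustrative assumptions:]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 16381        /* illustrative; leaves plenty of empty slots */

static char *table[TABLE_SIZE];
static long collisions;         /* string comparisons that missed */

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = (h << 1) + (unsigned char)*s++;
    return h;
}

/* Store key if absent; count a collision for each occupied slot
 * holding a different key, then rehash (here, by linear probing). */
static void store(const char *key)
{
    unsigned i = hash(key) % TABLE_SIZE;

    while (table[i] != NULL) {
        if (strcmp(table[i], key) == 0)
            return;             /* already stored: a successful lookup */
        collisions++;
        i = (i + 1) % TABLE_SIZE;
    }
    table[i] = strdup(key);
}

int main(void)
{
    static const char *words[] = { "begin", "end", "procedure", "procedure" };
    long lookups = (long)(sizeof words / sizeof words[0]);
    long j;

    for (j = 0; j < lookups; j++)
        store(words[j]);
    printf("%.2f collisions per lookup\n", (double)collisions / lookups);
    return 0;
}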
oz@yunexus.yorku.ca (Ozan Yigit) (11/13/90)
In article <14461@goofy.megatest.UUCP> djones@megatest.UUCP [Dave Jones] writes:

>I just now tested some hash-functions for an application which is
>probably more typical, namely a compiler. The simplest function is
>still a big winner on all counts.

Dave, you may also wish to take a look at the following.

%A B. J. McKenzie
%A R. Harries
%A T. Bell
%T Selecting a Hashing Algorithm
%J Software - Practice and Experience
%V 20
%N 2
%D Feb 1990
%P 209-224

This interesting article evaluates some of the more commonly known hash functions from a few sources: Amsterdam Compiler Kit, ETH Modula-2 Cross Compiler, Gnu-cpp, Gnu-cc1, pcc (4.3bsd), C++ (AT&T) and Icon. Their conclusions suggest that algorithms of the style

    h(0) = 0
    h(i) = k * h(i-1) + ch(i)    [for 1 <= i <= n]
    hash = h(n) MOD N

perform well provided k (a constant multiplier) and N (the table size) are chosen well. They feel that most implementations of hash functions contain far-from-optimal choices [so, what else is new?]. They also state:

    The experiments have shown that very small variations in N can
    produce large variations in the efficiency of the hash-table
    look-up, and that the popular view, that choice of a prime number
    will automatically ensure a good result, is not well founded.
    [pp. 224]

They found that one of the simplest hashing functions (gnu-cpp) performed well, both in speed and number of probes:

[included verbatim from gnu-cpp. k = 4, N = 1403: odd, but not prime]

#define HASHSIZE 1403
#define HASHSTEP(old, c) ((old << 2) + c)
#define MAKE_POS(v) (v & ~0x80000000) /* make number positive */

hashf (name, len, hashsize)
     register U_CHAR *name;
     register int len;
     int hashsize;
{
  register int r = 0;

  while (len--)
    r = HASHSTEP (r, *name++);

  return MAKE_POS (r) % hashsize;
}

[ps: I have not yet tested this function, so I do not know if their results are easily reproducible. For example, would making r unsigned, and avoiding the MAKE_POS crock, slow this function down? A sketch of that variant follows the signature.]

enjoy..	oz
---
First learn your horn and all the theory.    Internet: oz@nexus.yorku.ca
Next develop a style. Then forget all that   uucp: utzoo/utai!yunexus!oz
and just play.        Charlie Parker [?]     York U. CCS: (416) 736 5257
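[An untested sketch of the variant the ps wonders about: accumulating in an unsigned, so the value never goes negative and the MAKE_POS masking disappears entirely. The name hashf2 and the prototype style are made up for illustration; whether this form is actually faster is exactly the open question:]

unsigned hashf2(const unsigned char *name, int len, unsigned hashsize)
{
    unsigned r = 0;

    while (len--)
        r = (r << 2) + *name++;   /* same k = 4 step as gnu-cpp's HASHSTEP */

    return r % hashsize;          /* no sign bit to strip */
}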