torek@elf.ee.lbl.gov (Chris Torek) (06/22/91)
In an article whose referent was deleted by bogus fidonet software
(and whose tabs were changed to spaces), I provided this example
hashing function:

>>      hash = 0;
>>      for (each character)
>>              hash = (hash * 33) + (the character);
>>      hash_index = hash & ((some power of two) - 1);

In article <14491.28619071@stjhmc.fidonet.org>
Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes:
>Is the *33 used to keep short sequences' hash keys more random, or
>does it really help in the distribution of large strings too?

I am not sure what you are asking here.  There are basically three
questions you can ask about a hash function:

 a) Is it fast enough?  (complicated functions often take longer than
    the searches they replace)

 b) Does it produce a good distribution?  (the hash function
    h(input)=0 is very fast but produces horrible distributions)

 c) If (b) is true, why does it work?

The first two can really only be answered in some specific context.
In all the contexts I have tried, this one answers `yes' to both
(which is why I recommend it).

The third question is the hardest.  What *does* that 33 do?  I have no
idea.  All I know is that I have experimented with `easy' functions
(functions for which `fast enough' should generally be true) and this
one also produces a good distribution.  I doubt I still have the mail,
but when Margo Seltzer was writing the database code to replace ndbm,
she tried a bunch of different functions on a bunch of different data.
This one usually worked well---it did not always give the best
distribution, but it never performed poorly; some functions that gave
better distributions on some data gave much worse on others.

Maybe there are some theory people out there who can explain it.
Probably some staring at Knuth vol. 3 would help.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA    Domain: torek@ee.lbl.gov
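For concreteness, the pseudocode above corresponds to C along these
lines (the function name and the 1024-entry table size are
illustrative assumptions, not from the post):

        #include <stddef.h>

        #define HASHSIZE 1024   /* "some power of two"; 1024 is an arbitrary choice */

        /* Multiply-by-33 string hash, masked down to a table index. */
        size_t
        hash33(const char *s)
        {
            unsigned long hash = 0;

            while (*s != '\0')
                hash = hash * 33 + (unsigned char)*s++;
            return (size_t)(hash & (HASHSIZE - 1));
        }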
kpv@ulysses.att.com (Phong Vo[drew]) (06/24/91)
Somewhere in this thread, Chris Torek proposed as a hash function:

> >>    hash = 0;
> >>    for (each character)
> >>            hash = (hash * 33) + (the character);
> >>    hash_index = hash & ((some power of two) - 1);

Chris then asked,

> ... What *does* that 33 do?
>
> Maybe there are some theory people out there who can explain it.
> Probably some staring at Knuth vol. 3 would help.

A good property for a hash function is that if you keep feeding it a
constant character, it performs as if it's a random number generator.
The hash function f(h,c) = 33*h + c is derived from a class of random
number generators called linear congruential.  At the start of Knuth
Vol 2, there is a good discussion of these generators.  Let's apply
some of the theorems in that section and see what we get.  Keep in
mind that we are working with random numbers mod 2^x where x is
usually 16 or 32.

Assuming that c is some fixed odd value, Theorem A guarantees that 33
is good because (33-1)%4 == 0.  By the same reasoning, 31 is not good
since (31-1)%4 == 2 != 0.  Here good means that if you keep feeding c,
you'll cycle thru the entire set of values [0,2^x).  On the other
hand, 33 is not a primitive element for 2^x when x > 3 (Theorem C), so
it isn't a good multiplier if you keep feeding in 0's.  For example,
37 is a better value from that point of view.  To summarize, a good
multiplier is any number whose low order 3 bits are 101.

At this point, we depart from theory.  We typically hash byte strings
and we want the bits to mix in the 16-bit or 32-bit registers
relatively fast.  This suggests that some value larger than 2^8 (but
not too large) may do better as a multiplier.  After a little
experimentation, here's a pretty decent hash function:

        hash = 0;
        for (each byte c)
                hash = (hash<<8) + (hash<<2) + hash + c;

This corresponds to the multiplier 261, whose bit pattern is
100000101.  Try it, you'll like it.
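A minimal C rendering of the 261 hash above (the function name and the
explicit-length interface are illustrative assumptions; the
shift-and-add form is the one given in the post, since
261*h == (h<<8) + (h<<2) + h):

        #include <stddef.h>

        /* Multiply-by-261 hash over a byte string, written with shifts
         * and adds as in the post: 261*h == (h<<8) + (h<<2) + h. */
        unsigned long
        hash261(const unsigned char *s, size_t len)
        {
            unsigned long hash = 0;
            size_t i;

            for (i = 0; i < len; i++)
                hash = (hash << 8) + (hash << 2) + hash + s[i];
            return hash;
        }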
torek@elf.ee.lbl.gov (Chris Torek) (06/24/91)
In an article whose referent is long gone, I described this hash
function:

        for (each character)
                hash = (hash * 33) + (the character);
        hash_index = hash & ((some power of two) - 1);

and noted:

>>Maybe there are some theory people out there who can explain [why 33
>>works well].

In article <15027@ulysses.att.com> kpv@ulysses.att.com (Phong
Vo[drew]) writes:
>A good property for a hash function is that if you keep feeding it a
>constant character, it performs as if it's a random number generator.
>The hash function f(h,c) = 33*h + c is derived from a class of random
>number generators called linear congruential.  At the start of Knuth
>Vol 2, there is a good discussion of these generators. ...

and indeed, if you turn to Knuth vol. 2, chapter 3, you will find all
sorts of interesting stuff, such as a proof that the low-order n bits
of the usual rand() repeat with period $2^n$.  (rand() uses a modulus
of either $2^{15}$ or $2^{31}$.)

>... Assuming that c is some fixed odd value, Theorem A guarantees that
>33 is good because (33-1)%4 == 0. ... Here good means that if you keep
>feeding c, you'll cycle thru the entire set of values [0,2^x).

This is approaching what I was getting at, but is not quite there.
Basically it says that 33 is `good' because it is `not bad'.  An LCRNG
is defined as

        $h_{n+1} = (a h_n + c) mod m$

(where each $h_n$ is a value of the variable `hash' in the for loop,
and, in the quote `>' above, $m = 2^x$).  With a `bad' multiplier
$a$---and what is `bad' depends on $c$ and $m$---the generator will
not produce all $m$ possible values; with a `good' one it will.
Intuitively, then, the `bad' one will miss some of the hash buckets
entirely; the `good' one will not.

But *why* is it true that `A good property is [that] it performs as if
it's a [good] random number generator'?  This is now `obvious' (from
the above result) *if* we feed in constant values of $c$, i.e., if all
the strings are made of the same characters, but are different
lengths.  But we rarely do that---the strings we hash are full of
weird junk.  How do LCRNGs get in there?  Why do the varying values of
$c$ not mess things up?  (Of course, they do, but not enough to
matter.  Why not?)

>At this point, we depart from theory. We typically hash byte strings ...

Actually, I typically hash ASCII text.  33 is the `easiest' `good'
multiplier (as defined above) that is >= 26.  I have no idea whether
this is significant, or merely coincidental.

>... This suggests that some value larger than 2^8 (but not too large)
>may do better as a multiplier. ...
>       hash = 0;
>       for (each byte c)
>               hash = (hash<<8) + (hash<<2) + hash + c;

I will see if I can get this one run on the same data as the `33
hash'.  This is more work than

        hash = (hash << 5) + hash + c;

and distribution improvements may be outweighed by additional hash
time (or may not: who knows?).
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA    Domain: torek@ee.lbl.gov
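The full-period claim is easy to check empirically.  Here is a small
stand-alone test (mine, not from the thread; the constant c = 7 and
the 16-bit modulus are arbitrary choices) that iterates
h = (a*h + c) mod 2^16 with a constant odd c and counts the distinct
values visited; per Theorem A, a = 33 should visit all 65536 values
and a = 31 should not:

        #include <stdio.h>
        #include <string.h>

        /* Count distinct values visited by h = (a*h + c) mod 2^16,
         * starting from h = 0, before the sequence first repeats. */
        static unsigned
        count_distinct(unsigned a, unsigned c)
        {
            static unsigned char seen[1 << 16];
            unsigned h = 0, n = 0;

            memset(seen, 0, sizeof seen);
            while (!seen[h]) {
                seen[h] = 1;
                n++;
                h = (a * h + c) & 0xffff;   /* mod 2^16 */
            }
            return n;
        }

        int
        main(void)
        {
            printf("a=33: %u distinct values\n", count_distinct(33, 7)); /* all 65536 */
            printf("a=31: %u distinct values\n", count_distinct(31, 7)); /* fewer */
            return 0;
        }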
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/25/91)
In article <15027@ulysses.att.com> kpv@ulysses.att.com (Phong
Vo[drew]) writes:
> A good property for a hash function is that if you keep feeding it a
> constant character, it performs as if it's a random number generator.

No, I don't think so.  This is commonly true of good hash functions
and commonly false of bad ones, but keep in mind that pseudo-random
sequences have to keep working 100, 1000, 1000000 elements later,
while simple string hash functions are rarely applied to strings with
more than 80 characters.

[ as a generator mod 2^n, 37 is better than 33 is better than 31 ]

Again, I don't think this is accurate.  Each number has a sufficiently
long period mod 2^n that you don't have to worry about repeats.

> We typically hash byte strings and we want the bits to mix in the
> 16-bit or 32-bit registers relatively fast.
[ so use a multiplier of 261 ]

Again, practically any good multiplier works.  I think you're worrying
about the fact that 31c + d doesn't cover any reasonable range of hash
values if c and d are between 0 and 255.  That's why, when I
discovered the 33 hash function and started using it in my
compressors, I started with a hash value of 5381.  I think you'll find
that this does just as well as a 261 multiplier.

---Dan
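Dan's variant---the 33 multiplier with a starting value of 5381---
reads in C roughly as follows (the function name is an illustrative
assumption, not from the post):

        /* 33-multiplier hash starting from 5381, as described above. */
        unsigned long
        hash33_5381(const char *s)
        {
            unsigned long hash = 5381;
            unsigned char c;

            while ((c = (unsigned char)*s++) != '\0')
                hash = hash * 33 + c;   /* equivalently (hash << 5) + hash + c */
            return hash;
        }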
kpv@ulysses.att.com (Phong Vo[drew]) (06/26/91)
> Again, practically any good multiplier works.  I think you're worrying
> about the fact that 31c + d doesn't cover any reasonable range of hash
> values if c and d are between 0 and 255.  That's why, when I
> discovered the 33 hash function and started using it in my
> compressors, I started with a hash value of 5381.  I think you'll find
> that this does just as well as a 261 multiplier.
>
> ---Dan

261 is just a number that was picked out of a hat that satisfies the
conditions outlined in the theorems in Knuth.  These conditions are
pretty minimal, so yes, there are lots and lots of numbers that would
do just as well.  For compressors, if you work with binary files, it
isn't infrequent to have strings of 0's.  In such a case, you'll want
the multiplier to have the nice properties of a RNG.  For this reason,
numbers of the 261 type may be a little better than 33 (in either
case, don't forget to start the hash sum at some odd value!).

For everyone's entertainment, I did a simple (and admittedly
contrived) test of hashing 1000000 numbers starting at 1000000 into
1000000 buckets.  This is done by treating each number which is stored
in a word as a string of 4 bytes.  With 261, 927300 of these numbers
are in their own buckets; the rest are in 36350 buckets, each with 2
elements.  This is almost as good as a perfect hash function.  The
result for 33 is much worse.  The largest bucket has 64 elements in it
and there are 4050 such buckets.  The key difference is the 8th bit in
261, which causes the mixing to go a little faster.
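A sketch of how such a test might look (this is a reconstruction, not
the original code; the byte order and the use of % 1000000 for the
bucket index are assumptions---note that Dan objects to exactly that
non-power-of-two modulus in the next article):

        #include <stdio.h>

        #define N       1000000u
        #define NBUCKET 1000000u

        static unsigned count[NBUCKET];

        int
        main(void)
        {
            unsigned mult = 261;    /* try 33 for comparison */
            unsigned i, j, worst = 0;

            for (i = N; i < 2 * N; i++) {
                unsigned hash = 0;

                /* Treat the word i as a string of 4 bytes. */
                for (j = 0; j < 4; j++)
                    hash = hash * mult + ((i >> (8 * j)) & 0xff);
                count[hash % NBUCKET]++;
            }
            for (i = 0; i < NBUCKET; i++)
                if (count[i] > worst)
                    worst = count[i];
            printf("multiplier %u: largest bucket has %u elements\n",
                mult, worst);
            return 0;
        }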
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/26/91)
In article <15047@ulysses.att.com> kpv@ulysses.att.com (Phong
Vo[drew]) writes:
> For everyone's entertainment, I did a simple (and admittedly
> contrived) test of hashing 1000000 numbers starting at 1000000 into
> 1000000 buckets.  This is done by treating each number which is
> stored in a word as a string of 4 bytes.
[ results supposedly showing 261 much better than 33 ]

Sorry, but the mod has to be a power of 2.  Also, your test is
remarkably contrived: we're talking about string hash functions, and
typical strings (a) are concentrated on a few character values, not
spread evenly over the entire character set; (b) do not all have the
same high byte; (c) do not all have the same length.

Surely you noticed that a multiplier of 256 will produce perfect
hashing in this case.  Does that mean you're recommending a multiplier
of 256?

The profiling statistics I've saved from various compressors form a
much more realistic sample.  They show 33 as slightly better than
random hashing on typical files, 37 as slightly worse.  I've never
tried 261 or 31, but I'll bet 261 does worse than 33 on text files.

---Dan
evans@syd.dit.CSIRO.AU (Bruce.Evans) (06/26/91)
In article <14583@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek)
writes:
>>>     hash = 0;
>>>     for (each character)
>>>             hash = (hash * 33) + (the character);
>>>     hash_index = hash & ((some power of two) - 1);
>
>In article <14491.28619071@stjhmc.fidonet.org>
>Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes:
>>Is the *33 used to keep short sequences' hash keys more random, or
>>does it really help in the distribution of large strings too?
>
>The third question is the hardest.  What *does* that 33 do?  I have no
>idea.  All I know is that I have experimented with `easy' functions

I don't think there are any theoretical mysteries behind the 33.
33 = 32 + 1.  The 32 is to avoid mixing bits with previous hash bits.
256 would avoid all mixing (with 8-bit chars) but it would throw away
too many bits.  The 1 is to avoid throwing away bits.  Rotating the
bits instead of shifting left would probably be a better way to avoid
this.

The following hash function worked well for me in an assembler.  It
implements the ideas that hashing all the chars in the string is
unnecessary (and maybe even bad if too many bits get shifted out) and
that the middle char is more random than leading or trailing chars.
In the program this was extracted from, the string length was already
known.  Avoiding looking at all the chars in the string is not such a
good idea when strlen has to do it anyway.  Also, this probably does
best with a lot of short strings.

---
extern unsigned strlen();       /* #include <string.h> */

#define SPTSIZE 1024

extern struct sym_s *spt[SPTSIZE];

#define hconv(ch) ((unsigned char) (ch) - 0x41) /* better form for hashing */

struct sym_s **gethashptr(str)
register char *str;
{
        register unsigned hashval;
        register unsigned length;

        /* Hash function is a weighted xor of 1 to 4 chars in the string.
         * This seems to work better than looking at all the chars.
         * It is important that the function be fast.
         * The multiplication by MULTIPLIER should compile as a shift.
         */
#define MULTIPLIER (SPTSIZE / (1 << USEFUL_BITS_IN_ASCII))
#define USEFUL_BITS_IN_ASCII 6

        length = strlen(str);
        if (length <= 3) {
                if (length <= 2)
                        hashval = hconv(str[-1]) * MULTIPLIER;
                else
                        hashval = hconv(str[-2]) * MULTIPLIER,
                        hashval ^= hconv(str[-1]);
        } else
                hashval = hconv(str[-(length / 2)]) * MULTIPLIER,
                hashval ^= hconv(str[-2]) << 2,
                hashval ^= hconv(str[-1]);
        return spt + (hashval ^ (hconv(str[0]) << 1)) % SPTSIZE;
}
---
--
Bruce Evans             evans@syd.dit.csiro.au
dik@cwi.nl (Dik T. Winter) (06/26/91)
In article <17918.Jun2608.17.5091@kramden.acf.nyu.edu>
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> In article <15047@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes:
> > For everyone's entertainment, I did a simple (and admittedly
> > contrived) test of hashing 1000000 numbers starting at 1000000 into
> > 1000000 buckets.  This is done by treating each number which is
> > stored in a word as a string of 4 bytes.
>
> Sorry, but the mod has to be a power of 2.

Why?  (If you are telling me it is for speed, I'll scream.)

> Surely you noticed that a multiplier of 256 will produce perfect
> hashing in this case.  Does that mean you're recommending a
> multiplier of 256?

Only on big-endian machines of course :-)

> The profiling statistics I've saved from various compressors form a
> much more realistic sample.  They show 33 as slightly better than
> random hashing on typical files, 37 as slightly worse.  I've never
> tried 261 or 31, but I'll bet 261 does worse than 33 on text files.

I did some simple tests, hashing all entries of /usr/dict/words and
counting how many entries fall in each bucket.  I did this with the
four multipliers Dan just mentioned, for 16 buckets up to 8192 buckets
(only powers of two done), and also calculated the standard deviations
involved.  I will not bore you with all those digits; the result can
be stated simply: it does not matter very much which multiplier you
use.  For each multiplier there is a number of buckets where that
multiplier has the smallest standard deviation.  So use whatever you
like.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
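For the curious, a test of this shape (a reconstruction, not Dik's
code; the 1024-bucket size shown is just one of the sizes he mentions)
might look like:

        #include <math.h>
        #include <stdio.h>
        #include <string.h>

        #define NBUCKET 1024    /* one of the power-of-two sizes tried */

        int
        main(void)
        {
            static const unsigned mults[] = { 31, 33, 37, 261 };
            static unsigned count[NBUCKET];
            char word[256];
            unsigned m, i;

            for (m = 0; m < sizeof mults / sizeof mults[0]; m++) {
                FILE *fp = fopen("/usr/dict/words", "r");
                unsigned long total = 0;
                double mean, var = 0.0;

                if (fp == NULL)
                    return 1;
                memset(count, 0, sizeof count);
                while (fscanf(fp, "%255s", word) == 1) {
                    unsigned hash = 0;
                    char *p;

                    for (p = word; *p != '\0'; p++)
                        hash = hash * mults[m] + (unsigned char)*p;
                    count[hash & (NBUCKET - 1)]++;
                    total++;
                }
                fclose(fp);
                mean = (double)total / NBUCKET;
                for (i = 0; i < NBUCKET; i++)
                    var += (count[i] - mean) * (count[i] - mean);
                printf("multiplier %3u: stddev %.2f\n",
                    mults[m], sqrt(var / NBUCKET));
            }
            return 0;
        }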
martin@adpplz.UUCP (Martin Golding) (06/27/91)
In <15047@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes:

>For everyone's entertainment, I did a simple (and admittedly
>contrived) test of hashing 1000000 numbers starting at 1000000 into
>1000000 buckets.  This is done by treating each number which is stored
>in a word as a string of 4 bytes.  With 261, 927300 of these numbers
>are in their own buckets; the rest are in 36350 buckets, each with 2
>elements.  This is almost as good as a perfect hash function.  The
>result for 33 is much worse.  The largest bucket has 64 elements in it
>and there are 4050 such buckets.  The key difference is the 8th bit in
>261, which causes the mixing to go a little faster.

Your test is probably irrelevant, being contiguous sequential numbers.
The best hash in this case is

        for (hash = i = 0; i < len(numstring); i++)
                hash = hash * 10 + numstring[i];
        hash = hash % numberofbuckets;

This will yield the best possible distribution for any number of hash
buckets.  The proof is left to the reader.

The point is not that there is a better hash for sequential numbers,
but that the performance of _any_ hash function is highly dependent on
your keys; if you have any idea of the values or kinds of values to
expect, you can choose a better function.  If you can _control_ the
values of your keys you can sometimes get a _much_ better function;
viz. the numbers above.

If all this is old and boring stuff, I apologise for interrupting you
all just to look like an idiot (flame me by email), and I now return
you to your regularly scheduled newsgroup.

Martin Golding  | sync, sync, sync, sank ... sunk:
Dod #0236       | He who steals my code steals trash.
A poor old decrepit Pick programmer.  Sympathize at:
{mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp
martin@adpplz.UUCP (Martin Golding) (06/27/91)
In <843@adpplz.UUCP> martin@adpplz.UUCP (Martin Golding) writes:
A bunch of nonsense, unless you realize I misread the previous post,
and was thinking of a string of ascii digits. Then it all becomes
clear, and I stand by everything I said, all of which was irrelevant.
Sorry.
Martin Golding | sync, sync, sync, sank ... sunk:
Dod #0236 | He who steals my code steals trash.
A poor old decrepit Pick programmer. Sympathize at:
{mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp
torek@elf.ee.lbl.gov (Chris Torek) (06/27/91)
In article <14583@dog.ee.lbl.gov> I wrote:
>>The [`why'] question is the hardest.  What *does* that 33 do?  I have
>>no idea. ...

In article <1991Jun26.112724.20539@syd.dit.CSIRO.AU>
evans@syd.dit.CSIRO.AU (Bruce.Evans) writes:
>I don't think there are any theoretical mysteries behind the 33.
>33 = 32 + 1.  The 32 is to avoid mixing bits with previous hash bits.
>256 would avoid all mixing (with 8-bit chars) but it would throw away
>too many bits.  The 1 is to avoid throwing away bits.

This answer just does not hold up: if this were the case, 65, 129,
etc., should work as well as 33.  They do not.  Why not?

>The following hash function worked well for me in an assembler. ...
>       length = strlen(str);
        ...
>                       hashval = hconv(str[-1]) * MULTIPLIER;

At this point, str[-1] is unlikely to exist.  The function seems to
expect a pointer to the last character, rather than the first.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA    Domain: torek@ee.lbl.gov