raab@novavax.UUCP (Moshe Raab) (08/17/88)
Could anyone recommend a hashing algorithm to store a list of 10-digit
phone numbers (area code plus 7 digits)?  It should have little or no
overflow and as few collisions as possible.  The list will contain more
than one area code, but about 1000 numbers per area code (i.e. not a
totally random sample, but one with a relatively common prefix).

Thank you very much.
decot@hpisod2.HP.COM (Dave Decot) (08/17/88)
> Could anyone recommend a hashing algorithm to store a list of 10-digit
> phone numbers (area code plus 7 digits)?  It should have little or no
> overflow and as few collisions as possible.  The list will contain more
> than one area code, but about 1000 numbers per area code (i.e. not a
> totally random sample, but one with a relatively common prefix).
> Thank you very much.

Treat the ten digits as an integer, n, and compute the index as
(n % HASH_SIZE), where HASH_SIZE is the size of your hash table.

Store the number, and the record for that number, in the first empty
hash table location at or after the index you've computed.

HASH_SIZE should be about four times the maximum number of phone
numbers you expect to store, to reduce collisions.

What data is associated with each phone number?  What do you intend to
do with the list?  Do you want to look up things by phone number, or
by some other key?

Dave
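Dave's suggestion can be sketched as below.  This is an illustrative
reading of his scheme, not code from the post: the table size (8009, a
prime, roughly 4x the expected entries) and the names put_number and
find_number are invented here, and since a ten-digit number can exceed
2^31, the sketch stores keys as long long.

```c
#define HASH_SIZE 8009              /* a prime, ~4x the expected entries */
#define EMPTY     0LL               /* 0 is never a valid 10-digit number */

static long long table[HASH_SIZE];

/* Insert a phone number, stored as its 10-digit integer value.
 * Returns the slot used, or -1 if the table is full. */
int put_number(long long n)
{
    int i = (int)(n % HASH_SIZE);
    int probes;

    for (probes = 0; probes < HASH_SIZE; probes++) {
        if (table[i] == EMPTY || table[i] == n) {
            table[i] = n;           /* first empty slot at or after index */
            return i;
        }
        i = (i + 1) % HASH_SIZE;
    }
    return -1;                      /* table full */
}

/* Return the slot holding n, or -1 if n is absent. */
int find_number(long long n)
{
    int i = (int)(n % HASH_SIZE);
    int probes;

    for (probes = 0; probes < HASH_SIZE; probes++) {
        if (table[i] == n)
            return i;
        if (table[i] == EMPTY)      /* an empty slot ends the search */
            return -1;
        i = (i + 1) % HASH_SIZE;
    }
    return -1;
}
```

A prime table size matters here: with a round HASH_SIZE such as a power
of ten, only the low-order digits of the number would affect the index.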
djones@megatest.UUCP (Dave Jones) (08/18/88)
From article <2550078@hpisod2.HP.COM>, by decot@hpisod2.HP.COM (Dave Decot):
>> Could anyone recommend a hashing algorithm to store a list of 10-digit
>> phone numbers (area code plus 7 digits)?  It should have little or no
>> overflow and as few collisions as possible.  The list will contain more
>> than one area code, but about 1000 numbers per area code (i.e. not a
>> totally random sample, but one with a relatively common prefix).
>> Thank you very much.
>
> Treat the ten digits as an integer, n, and compute the index
> as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.
>
> Store the number, and the record for that number, in the first empty
> hash table location at or after the index you've computed.
...

There are a couple of rather serious problems with the algorithm
suggested.  One is that, unless HASH_SIZE is chosen carefully, the
high-order digits of the phone numbers will hardly affect the hash --
with a power-of-ten HASH_SIZE they don't affect it at all.  Say, for
example, you store all of radio station KGO's talk numbers:
(415) 263-TALK, (408) 469-TALK, etc. (or whatever they are).  With
HASH_SIZE = 10000 they all map to the same slot.

Secondly, the "first-empty-slot-after" algorithm tends to build up
clumps, which the subsequent linear searches have to trace through,
adding to the size of the clump, and thus the likelihood that it will
get hit next time.

I have devised a method for hashing which has not been published
previously.  It is very, very fast.  Here is the WORLD PREMIERE!!

First, let's settle the business of how to hash integers into the
range 0..M-1.  I suggest that you use

	hash(I) = (I*32821) % M

The number 32821 has some magic properties that spatter things about
nicely.  See "Algorithms" by Sedgewick.

My hashing technique uses a table size M which is always a power of
two.  This makes the modulo function (%) very fast on a two's
complement machine, if you are willing to accept a tiny bit of machine
dependence.  (I never expect to use a machine that is not two's
complement, so I'm not too scared.)
You code the mod function as

	H & mask

rather than

	H % table_size

where mask is table_size-1.

For a "rehash" function, rather than looking at the next slot, I use
the function defined by the following recursive definition:

	rehash(0, slot)   = slot
	rehash(i+1, slot) = ((rehash(i, slot) + 1) * 3) mod M

When M is a power of two, this function cycles after precisely M/2
steps.  I have proved as much, but my proof is tedious.  The range of
the sequence is the set of numbers in 0..M-1 which are congruent to
slot or slot+3 modulo 4.  Don't ask me how I discovered this; I think
it came to me in a vision.

When the table gets half full, I double its size and hash everything
into the new table.  The routines allow for removing entries, as well
as adding them and looking them up, but I have omitted the removal
routine, which is quite tricky.

Here is an excerpt from the actual code.  (For purposes of the
telephone number problem, we may assume that "entry" is a record which
contains a telephone number and some other stuff, that obj->eq is a
pointer to a function that returns true when two entries have the same
telephone number in them and false otherwise, and that obj->hash is a
pointer to a function that multiplies the telephone number in an entry
by 32821.)

The following provably terminates, because the rehash sequence hits
exactly half the slots and the table is kept less than half full.

/* CAVEAT: The following assumes that integers are two's complement.
 */

#define HASH(cont)  (((*(obj->hash))(cont)) & obj->mask)
#define REHASH(num) ((((num) + 1) * 3) & obj->mask)

Ptr
Hash_put(obj, entry)
	register Hash* obj;
	Ptr entry;
{
	register int bucket_number;
	register Ptr* bucket;

	bucket_number = HASH(entry);
	while (1) {
		bucket = obj->hash_table + bucket_number;

		if (*bucket == (Ptr)0) {
			*bucket = entry;
			obj->num_entries++;
			if (obj->num_entries > obj->max_entries)
				Hash_overflow(obj); /* double size of table */
			return (Ptr)0;     /* <== added new entry */
		}

		if (!obj->eq(entry, *bucket)) {
			bucket_number = REHASH(bucket_number);
			continue;          /* <== search some more (collision) */
		}

		/* Found old Ptr.  Replace. */
		{
			Ptr old = *bucket;
			*bucket = entry;
			return old;        /* <== replaced old entry */
		}
	}
}

Ptr
Hash_get(obj, entry)
	register Hash* obj;
	Ptr entry;
{
	register int bucket_number;
	register Ptr* bucket;

	bucket_number = HASH(entry);
	while (1) {
		bucket = obj->hash_table + bucket_number;

		if (*bucket == (Ptr)0)
			return (Ptr)0;     /* <== entry not found */

		if (!obj->eq(entry, *bucket)) {
			bucket_number = REHASH(bucket_number);
			continue;          /* <== search some more (collision) */
		}

		return *bucket;            /* <== found entry */
	}
}
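The M/2 cycle claim is easy to check empirically with a helper like the
sketch below (the function name and parameters are invented).  One way
to see why it holds: substituting y = 2*slot + 3 turns the rehash into
y -> 3y mod 2M, and 3 has multiplicative order M/2 modulo 2M when M is
a power of two, M >= 4.

```c
/* Count how many steps the sequence
 *     next = ((cur + 1) * 3) & (M - 1)
 * takes to return to its starting slot; the claim above is that this
 * is exactly M/2 for any start when M >= 4 is a power of two.
 * Returns -1 if the sequence never returns (it always does). */
int rehash_cycle_length(int m, int start)
{
    int mask = m - 1;
    int cur = start;
    int steps = 0;

    do {
        cur = ((cur + 1) * 3) & mask;
        steps++;
    } while (cur != start && steps <= m);

    return (cur == start) ? steps : -1;
}
```

Running it for a few table sizes and starting slots confirms the M/2
cycle regardless of the starting slot.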
zeeff@b-tech.UUCP (Jon Zeeff) (08/19/88)
In article <2550078@hpisod2.HP.COM> decot@hpisod2.HP.COM (Dave Decot) writes:
>> Could anyone recommend a hashing algorithm to store a list of
>> 10-digit phone numbers (area code plus 7 digits)?
>Treat the ten digits as an integer, n, and compute the index
>as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.

I don't second Dave's recommendation.  What seems like a fairly random
hash function usually doesn't work very well when you actually measure
its performance.  Here is one stolen from pathalias that works well.

/* This is a simplified version of the pathalias hashing function.
 * Thanks to Steve Bellovin and Peter Honeyman.
 *
 * Hash a string into a long int: 31-bit crc (from andrew appel).
 * The crc table is computed at run time by crcinit() -- we could
 * precompute, but it takes 1 clock tick on a 750.
 *
 * This fast table calculation works only if POLY is a prime polynomial
 * in the field of integers modulo 2.  Since the coefficients of a
 * 32-bit polynomial won't fit in a 32-bit word, the high-order bit is
 * implicit.  IT MUST ALSO BE THE CASE that the coefficients of orders
 * 31 down to 25 are zero.  Happily, we have candidates, from
 * E. J. Watson, "Primitive Polynomials (Mod 2)", Math. Comp.
 * 16 (1962):
 *	x^32 + x^7 + x^5 + x^3 + x^2 + x^1 + x^0
 *	x^31 + x^3 + x^0
 *
 * We reverse the bits to get:
 *	111101010000000000000000000000001	but drop the last 1
 *	   f   5   0   0   0   0   0   0
 *	010010000000000000000000000000001	ditto, for 31-bit crc
 *	   4   8   0   0   0   0   0   0
 */

#define POLY 0x48000000L	/* 31-bit polynomial (avoids sign problems) */

static long CrcTable[128];

static void
crcinit()
{
	register int i, j;
	register long sum;

	for (i = 0; i < 128; ++i) {
		sum = 0L;
		for (j = 7 - 1; j >= 0; --j)
			if (i & (1 << j))
				sum ^= POLY >> j;
		CrcTable[i] = sum;
	}
	DEBUG("crcinit: done\n");
}

static long
hash(name, size)
	register char *name;
	register int size;
{
	register long sum = 0L;

	while (size--) {
		sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];
	}
	DEBUG("hash: returns (%ld)\n", sum % INDEX_SIZE);
	return (sum % INDEX_SIZE);
}
-- 
Jon Zeeff			Branch Technology,
uunet!umix!b-tech!zeeff		zeeff%b-tech.uucp@umix.cc.umich.edu
bill@proxftl.UUCP (T. William Wells) (08/19/88)
In article <654@novavax.UUCP> raab@novavax.UUCP (Moshe Raab) writes:
: could anyone recommend a hashing algorithm to store a list of 10
: digit phone numbers (area code plus 7 digits).
: It should have little or no overflow and no collisions (as few as
: possible). The list will contain more than one area code but
: about 1000 numbers per area code (i.e. not a totally random sample
: but one which has a relatively common prefix)
: thank you very much.
You might want to try a hash trie (sic). The Programming Pearls
column in the June 1986 Communications of the ACM describes one
way to do this.  I have not tried it, so I can give you no estimates
of the space needed, but access to a hash trie is very fast, though
perhaps not as fast as a very sparse hash table.
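A plain 10-way digit trie -- a simpler relative of the hash trie from
that column, not the column's structure itself -- is easy to sketch for
this data set.  All names here (TrieNode, trie_insert, trie_member) are
invented, and allocation failures are unchecked for brevity.

```c
#include <stdlib.h>

/* Each level consumes one digit, so membership tests take exactly
 * ten steps, and shared area-code prefixes share path nodes --
 * which suits the "common prefix" sample described in the question. */
typedef struct TrieNode {
    struct TrieNode *child[10];
    int is_member;               /* set on the node for the 10th digit */
} TrieNode;

TrieNode *new_node(void)
{
    return (TrieNode *)calloc(1, sizeof(TrieNode));
}

void trie_insert(TrieNode *root, const char digits[10])
{
    int i;
    for (i = 0; i < 10; i++) {
        int d = digits[i] - '0';
        if (root->child[d] == NULL)
            root->child[d] = new_node();
        root = root->child[d];
    }
    root->is_member = 1;
}

int trie_member(TrieNode *root, const char digits[10])
{
    int i;
    for (i = 0; i < 10; i++) {
        int d = digits[i] - '0';
        if (root->child[d] == NULL)
            return 0;
        root = root->child[d];
    }
    return root->is_member;
}
```

There are no collisions at all; the cost is the node storage, which the
shared prefixes keep well below ten nodes per number.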
rlb@polari.UUCP (rlb) (08/20/88)
In article <723@goofy.megatest.UUCP>, djones@megatest.UUCP (Dave Jones) writes:
> From article <2550078@hpisod2.HP.COM>, by decot@hpisod2.HP.COM (Dave Decot):
> >> Could anyone recommend a hashing algorithm to store a list of
> >> 10-digit phone numbers (area code plus 7 digits)?
> > ...
> > Treat the ten digits as an integer, n, and compute the index
> > as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.

At this point, no one has pointed out that modulo is a better hash
function if the table size is a prime number.

> ...
> I have devised a method for hashing which has not been published
> previously.  It is very, very fast.  Here is the WORLD PREMIERE!!
>
> First, let's settle the business of how to hash integers into the
> range 0..M-1.  I suggest that you use hash(I) = (I*32821) % M
>
> My hashing technique uses a table size M which is always a power of two.
> This makes the modulo function (%) very fast on a two's complement machine,

Well, yes, but you're not counting the cost of multiplying by 32821
elsewhere, so, in reality, your way seems to be a wee bit slower than
(key % prime).

> ... long discussion/demonstration of rehashing technique ...

To summarize: You've created a hashing scheme.  Forming the hash
itself isn't faster than the simple (key % prime) algorithm.
Rehashing and deletions require more code than the simple "put it in
the next empty slot" method.  It requires more storage than the simple
algorithm (I'm assuming it "breaks" if you let the table get more than
half full; I didn't actually figure it out).  The principal advantage
seems to be that access time is flatter (not flat, but flatter, and
probably not much flatter) than the simple "put it in the next empty
slot" method.

If you're not going to let the table get more than half full, then
(assuming a fairly successful hash) there's only about a 50% chance
that you'll have to rehash.
So, even with a bit of clumping, doesn't it seem likely that the simplest rehashing scheme compares quite favorably with the one you've created? Agree? Disagree?
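The question lends itself to a direct experiment.  The sketch below is
editorial, not code from either post: it fills two power-of-two tables
to half full with pseudo-random keys, hashing with the *32821
multiplier from the earlier article, and counts probes for linear
probing versus the ((i+1)*3) rehash.  The table size, key generator,
and all names are arbitrary choices.

```c
#define M    4096               /* power-of-two table size (arbitrary) */
#define MASK (M - 1)

static long linear_tab[M], rehash_tab[M];

static int home_slot(long k)
{
    /* the multiplier suggested in the earlier post */
    return (int)(((unsigned long)k * 32821UL) & MASK);
}

/* Insert k by linear probing; returns the probe count. */
int insert_linear(long k)
{
    int i = home_slot(k), probes = 1;
    while (linear_tab[i] != 0) {
        i = (i + 1) & MASK;
        probes++;
    }
    linear_tab[i] = k;
    return probes;
}

/* Insert k with the ((i+1)*3) rehash; returns the probe count. */
int insert_rehash(long k)
{
    int i = home_slot(k), probes = 1;
    while (rehash_tab[i] != 0) {
        i = ((i + 1) * 3) & MASK;
        probes++;
    }
    rehash_tab[i] = k;
    return probes;
}

/* Fill one table to half full with LCG keys; return average probes.
 * Call at most once per insert function: the tables are not reset. */
double average_probes(int (*insert)(long))
{
    unsigned long seed = 12345;
    long total = 0;
    int n;

    for (n = 0; n < M / 2; n++) {
        seed = (seed * 1103515245UL + 12345UL) & 0x7fffffffUL;
        total += insert((long)(seed | 1));   /* keep keys nonzero */
    }
    return (double)total / (M / 2);
}
```

With a decent multiplier and random keys, both averages come out well
under two probes at half full, which supports the "compares quite
favorably" suspicion; skewed real data may behave differently, so the
experiment is worth rerunning on your own keys.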
daveb@llama.rtech.UUCP (Dave Brower) (08/23/88)
In article <4712@b-tech.UUCP> zeeff@b-tech.UUCP (Jon Zeeff) writes:
>In article <2550078@hpisod2.HP.COM> decot@hpisod2.HP.COM (Dave Decot) writes:
>>> Could anyone recommend a hashing algorithm to store a list of
>>> 10-digit phone numbers (area code plus 7 digits)?
>>Treat the ten digits as an integer, n, and compute the index
>>as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.
>
>I don't second Dave's recommendation.  What seems like a fairly random
>hash function usually doesn't work very well when you actually measure
>its performance.  Here is one stolen from pathalias that works well.
> ...
>static long
>hash(name, size)
>register char *name;
>register int size;
>{
>	register long sum = 0L;
>
>	while (size--) {
>		sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];
>	}
>	DEBUG("hash: returns (%ld)\n", sum % INDEX_SIZE);
>	return(sum % INDEX_SIZE);
>}

The constant factors add up.  It is often the case that a better hash
isn't worth the effort.  The expression:

	sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];

takes many more instructions to compute than the naive:

	sum += *name++;

If the collision chains generated by the naive function are short, it
may be cheaper to search the chain than to use a smarter hash.

What I found in a real application, with about 3000 entries in the
table, was that naive beat smart in real execution time.  The naive
collision chains were typically 4-7 entries long.  I could not reduce
the time spent in the smart hash function enough to make it worth
using.

The wall was high enough to get me to try splay trees instead of
hashing.  When it was all over, my actual cpu figures were the same
for tree and hash, giving ordering in the tree for free.

Moral: Try several ways and *measure*.  The results may be surprising.

I posted the tree library in comp.sources.unix, Archive-name v15i005.
Thanks to Doug Jones@uiowa for his original Pascal implementation.

-dB
"Ready when you are Raoul!"
{amdahl, cpsc6a, mtxinu, sun, hoptoad}!rtech!daveb daveb@rtech.com <- FINALLY!
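For reference, the naive function under discussion reduces to a sketch
like this (the bucket count is an invented illustrative value; the
point is how little work the hash itself does compared to the CRC):

```c
#define NBUCKETS 1024            /* arbitrary illustrative bucket count */

/* The "naive" hash: just sum the bytes of the key.  Distribution is
 * mediocre, but each step is one load and one add -- if the resulting
 * chains stay short, the saved hashing time can win overall. */
unsigned naive_hash(const char *name, int size)
{
    unsigned sum = 0;

    while (size--)
        sum += (unsigned char)*name++;
    return sum % NBUCKETS;
}
```

Whether this beats the CRC depends entirely on the key set, which is
exactly the article's point: measure both on your own data.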
daveb@llama.rtech.UUCP (Dave Brower) (08/23/88)
In article <2392@rtech.rtech.com> daveb@llama.UUCP I say:
>
>I posted the tree library in comp.source.unix Archive-name v15i005.
>Thanks to Doug Jones@uiowa for his original Pascal implementation.

Lies!  It is really:

	Subject: v14i087: Splay tree library
	Posting-number: Volume 14, Issue 87
	Archive-name: splay-tree

-dB
"Ready when you are Raoul!"
{amdahl, cpsc6a, mtxinu, sun, hoptoad}!rtech!daveb
daveb@rtech.com <- FINALLY!
djones@megatest.UUCP (Dave Jones) (08/23/88)
I recently posted an original hash-table routine here.  It is the
result of a great deal of research and experimentation.  I never
expected anyone to say, "Thanks," and sure enough, no one has.  That's
okay.

This note is just to indicate that I do not intend to respond to
postings discussing the relative values of hashing methods, although I
would be happy to see other algorithms, especially compilable C code.
It's a fascinating subject, which has been the topic of extensive
research over the years, but I just don't have the time or inclination
right now.

When I did the comparisons between various methods, I did not save the
statistics, and I have no interest in repeating them.  Nor would they
necessarily be applicable to all machines, if I still had them.  The
constraints that I was under were roughly these:

  0. Multiple hash-table "objects" were essential.
  1. Multiplication by a variable was slow.
  2. Division by a variable was about four times as slow as
     multiplication by a variable.
  3. Multiplication by the constant 3 was implemented by the compiler
     as a shift and add, and was therefore fast.
  4. Malloc() was too slow to consider.
  5. The machine was, and was likely to remain forever, a two's
     complement machine.
pardo@june.cs.washington.edu (David Keppel) (08/24/88)
djones@megatest.UUCP (Dave Jones) writes:
>[ posted hash algorithm, no thanks from anybody ]

Personally, I don't have any use for hashing right now (but I'll
probably be back to you in a few weeks :-).  I *do* appreciate various
source postings to various groups.  They encourage me to expand my
freeware libraries and help me write better code, faster.  I hope I
speak for the net when I say "thanks for being thoughtful enough to
share your code".

	;-D on  ( Freeware is free where the free wear )  Pardo
-- 
pardo@cs.washington.edu
{rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
henry@utzoo.uucp (Henry Spencer) (08/24/88)
In article <2392@rtech.rtech.com> daveb@llama.UUCP (Dave Brower) writes:
>If the collision chains generated by the naive function are short, it
>may be cheaper to search the chain than use a smarter hash.
>
>What I found in some real application, with about 3000 entries in the
>table, was that naive beat smart in real execution time.  The naive
>collision chains were typically 4-7 entries long.  I could not reduce
>the time spent in the smart hash function enough to make it worth using.

I second this.  Those of you who have looked over that masterpiece of
brilliant coding :-), the C News expire, will have noticed that it
hashes newsgroups simply by the length of the name.  Crude and
inelegant... but the longest collision chain is only a handful of
entries long, and none of the fancier hash functions did significantly
better.
-- 
Intel CPUs are not defective,	|  Henry Spencer at U of Toronto Zoology
they just act that way.		|  uunet!attcan!utzoo!henry henry@zoo.toronto.edu
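The expire trick reduces, in sketch form, to the one-liner below (the
bucket count is an invented value; the real expire's details may
differ).  It works because newsgroup-name lengths happen to spread over
a few dozen values with no single length dominating.

```c
#include <string.h>

#define NGHASH 64                /* illustrative bucket count */

/* Hash a newsgroup name by nothing more than its length. */
unsigned length_hash(const char *name)
{
    return (unsigned)strlen(name) % NGHASH;
}
```

For a general-purpose string table this would be a terrible hash, but
for this particular skewed key set the chains stay short enough, which
is the whole point of the posts above: match the hash to the data.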
lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (08/25/88)
In article <4712@b-tech.UUCP>, zeeff@b-tech.UUCP (Jon Zeeff) writes:
> /* This is a simplified version of the pathalias hashing function.
>  * Thanks to Steve Belovin and Peter Honeyman
>  */
> static long
> hash(name, size)
> register char *name;
> register int size;
> {
> 	register long sum = 0L;
>
> 	while (size--) {
> 		sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];
> 	}
> 	DEBUG("hash: returns (%ld)\n", sum % INDEX_SIZE);
> 	return(sum % INDEX_SIZE);
> }

If you change the routine to this:

static long
hash(name, size, sum)
register char *name;
register int size;
register long sum;
{
	while (size--) {
		sum = (sum >> 7) ^ CrcTable[(sum ^ (*name++)) & 0x7f];
	}
	return (sum);
}

where the initial value of sum is 0, you can use hash() with data
objects that won't fit in memory.  (Note that sum must be a long, and
that the routine must now return the raw sum rather than
sum % INDEX_SIZE: reducing modulo INDEX_SIZE between calls would
corrupt the running crc, so take the modulus once, after the last
call.)  That is:

	long sum = 0L;
	char iobuf[BUFSIZ];

	while ((nbytes = read(fd, iobuf, sizeof iobuf)) > 0)
		sum = hash(iobuf, nbytes, sum);
	sum %= INDEX_SIZE;

After the last call to hash, the value of sum is the sum for the whole
file.  This is not an original idea of mine; I saw it in a piece of
ten-year-old code.
-- 
Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc
lvc@cbnews.ATT.COM
ray@micomvax.UUCP (Ray Dunn) (09/01/88)
In article <2392@rtech.rtech.com> daveb@llama.UUCP (Dave Brower) writes:
> ...
> If the collision chains generated by the naive function are short, it
> may be cheaper to search the chain than use a smarter hash.
> ...

Definitely!

> Moral: Try several ways and *measure*.  The results may be surprising.

Better moral: Whenever possible, analyze the properties and
distributions of the data you want to handle and *then* design the
hashing algorithm.  At that point it is then reasonable to measure and
tinker.

This discussion is getting very like those on sorting - please realize
there is no *general* best method; there are only best methods in
specific application environments.

The goal should not normally be to minimize the search time for an
individual item, but to minimize the total time spent searching in a
given run session, or series of runs, of the application - a fact
often ignored in discussions of minimizing chain lengths etc.

If the tokens "(", ")", ";" and "," each occur (i.e. require to be
"looked up") approximately n times more frequently than the tokens
"while", "for", "switch" and "do", then wouldn't it be reasonable to
design your algorithm so that the former items were located
approximately n times faster than the latter?

This can be achieved in a variety of ways, including, let us not
forget, ad hoc tests for specific high-frequency items prior to
applying a generalized routine.

In the case of a fixed database, for example the reserved words in a
compiler, it does not take much effort to analyze a filesystem or
three for token frequencies and then structure and *order* the
dictionary accordingly.
-- 
Ray Dunn.                      | UUCP: ..!philabs!micomvax!ray
Philips Electronics Ltd.       | TEL : (514) 744-8200 Ext: 2347
600 Dr Frederik Philips Blvd   | FAX : (514) 744-6455
St Laurent. Quebec. H4M 2S9    | TLX : 05-824090
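The frequency-ordered fixed dictionary suggested above can be sketched
as follows.  The ordering shown is an assumption for illustration, not
measured data, and the names are invented: a real compiler would order
the table from counts gathered over actual source trees.

```c
#include <string.h>

/* A small fixed dictionary ordered by (assumed) descending frequency,
 * searched linearly so that common tokens are found in the first few
 * comparisons and rare ones pay the full scan. */
static const char *keywords_by_frequency[] = {
    "if", "return", "int", "else",          /* assumed common */
    "for", "while", "switch", "do",         /* assumed rarer */
    0
};

/* Return the token's rank in the dictionary, or -1 if absent. */
int keyword_rank(const char *tok)
{
    int i;

    for (i = 0; keywords_by_frequency[i] != 0; i++)
        if (strcmp(tok, keywords_by_frequency[i]) == 0)
            return i;
    return -1;
}
```

The same idea extends to ad hoc pre-tests: check the single-character
tokens "(", ")", ";", "," with one comparison each before falling back
to any generalized lookup.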