vijay@bradley.UUCP (09/18/89)
Hello, I hope you can help me. Does anyone out there have a perfect hash function? The key is a string of characters (maximum length is 4, minimum is 2) that make up the opcode set of an assembler that I am in the process of writing. I tried some home-brewed hash functions, but got some collisions. I also tried to read some articles on the subject in ACM, but you need a PHD in maths to make any sense out of them!! Any help would be appreciated. Thanks in advance (you can e-mail me directly, if you please..) -- Vijay K. Gurbani | {uiucdcs, inhp4, cepu}!bradley!vijay | "Learn how to Distributed Systems | vijay@bradley.edu | avoid RIPOFFS - Bradley University | (309)677-2339 | Send $10.00 for Peoria, IL 61605 | | details!!"
jeffrey@algor2.algorists.com (Jeffrey Kegler) (09/21/89)
In article <9900014@bradley> vijay@bradley.UUCP writes: >Does anyone out there have a perfect hash function? The key is a >string of characters (maximum length is 4, minimum is 2) that make up >the opcode set of an assembler that I am in the process of writing. >I also tried to read some articles on the subject in ACM, but you >need a PHD in maths to make any sense out of them!! I wrote one of those articles (_Communications of the ACM_, June 1986), so maybe I can help. Perfect hashing is almost never practical. For any number of keys, some type of faster and better method can be devised. [ The secret to this is realizing perfect hashes do not run in constant time. ] In your case a binary search looks like a winner. If you do come up with a perfect hash of your opcodes, and code it up, what happens when an opcode is added (or deleted)? You need a new perfect hash. It certainly makes the job of maintaining the assembler interesting. Another interesting approach for the opcode search is a self-organizing search, which will take advantage of the fact the some of the opcodes are used far more often than others. -- Jeffrey Kegler, Independent UNIX Consultant, Algorists, Inc. jeffrey@algor2.ALGORISTS.COM or uunet!algor2!jeffrey 1762 Wainwright DR, Reston VA 22090
djones@megatest.UUCP (Dave Jones) (09/23/89)
From article <9900014@bradley>, by vijay@bradley.UUCP: > > Hello, > > I hope you can help me. Does anyone out there have a perfect > hash function? The key is a string of characters (maximum length > is 4, minimum is 2) ... How about another idea? For a similar application, I recently wrote a little program which generates a big switch statement for recognizing strings. You give it a list of strings, each paired with its enumeration value, and it spits out the switch. It's real fast. It helps if you optimize away jumps to jumps, of course. It only took about an hour to write the thing. Here's the first few lines of a switch it generated for recognizing Pascal keywords in lower case: switch(*str++) /* */ { default: break; case 'a': /* a */ switch(*str++) /* a */ { default: break; case 'n': /* an */ switch(*str++) /* an */ { default: break; case 'd': /* and */ if(length==3)type = 307; break; } break; case 'r': /* ar */ switch(*str++) /* ar */ { default: break; case 'r': /* arr */ switch(*str++) /* arr */ { default: break; case 'a': /* arra */ switch(*str++) /* arra */ { default: break; case 'y': /* array */ if(length==5)type = 257; break; } break; } break; } break; } break; etc...
schmidt@glacier.ics.uci.edu (Doug Schmidt) (09/24/89)
In article <9900014@bradley>, vijay@bradley writes: > >Hello, > > I hope you can help me. Does anyone out there have a perfect >hash function? The key is a string of characters (maximum length >is 4, minimum is 2) that make up the opcode set of an assembler >that I am in the process of writing. I tried some home-brewed >hash functions, but got some collisions. I also tried to read some >articles on the subject in ACM, but you need a PHD in maths to >make any sense out of them!! Any help would be appreciated. > >Thanks in advance (you can e-mail me directly, if you please..) If you've got access to anonymous ftp you should pick up my perfect hash function generator program `gperf.' It is available from ics.uci.edu (128.195.1.1) in ~ftp/pub/cperf-1.8.tar.Z. Written in C, this program is used to automatically generate the reserved word lookup routines used in the GNU C, GNU C++, and GNU Pascal compilers, as well as the GNU indent program. The distribution comes with extensive documentation, and should compile easily with standard K&R C compilers. If you don't have access to anonymous ftp I can send you a version via email, or you can wait until gperf is posted to comp.sources.unix (shortly, I hope!). Doug -- schmidt@ics.uci.edu (ARPA) | Per me si va nella citta' dolente. office: (714) 856-4043 | Per me si va nell'eterno dolore. | Per me si va tra la perduta gente. | Lasciate ogni speranza o voi ch'entrate.
flynn@clotho.acm.rpi.edu (Kevin Lincoln Flynn) (09/25/89)
As long as people are talking hashing, can anyone think of a GOOD way to hash 16-bit numbers determined dynamically at runtime into a fast way to look them up? Specifically, I'm talking about the attribute dictionary of an object in an OO system we're working on here -- the tags of attributes are 16-bit handles about which nothing is known at runtime. Someone suggested closed hashing, with the number of buckets growing by roughly doubling primes.... ie start with two buckets, then 3, then 7, then 17, etc..... any ideas out there? (oh yeah, the hash function as well as the strategy'd by nice. [ :) ]) Thanks in advance!! - Flynn Kevin Lincoln Flynn flynn@acm.rpi.edu, userfwvl@mts.rpi.edu 147 1st Street H (518) 273-6914 W (518) 447-8561 Troy, NY 12180 ...Argue for your limitations, and sure enough they're yours.
bill@twwells.com (T. William Wells) (09/26/89)
In article <1989Sep24.214153.8867@rpi.edu> flynn@acm.rpi.edu (Kevin Lincoln Flynn) writes:
: As long as people are talking hashing, can anyone think of a GOOD way to hash
: 16-bit numbers determined dynamically at runtime into a fast way to look them
: up? Specifically, I'm talking about the attribute dictionary of an object in
: an OO system we're working on here -- the tags of attributes are 16-bit handles
: about which nothing is known at runtime. Someone suggested closed hashing,
: with the number of buckets growing by roughly doubling primes.... ie start
: with two buckets, then 3, then 7, then 17, etc..... any ideas out there?
: (oh yeah, the hash function as well as the strategy'd by nice. [ :) ])
Here's my invention, which is not quite a hash table, but will do
quite as well: create a binary tree in which the numbers are used to
control the descent of the tree. In other words, if you had three
numbers (of four bits): 0100, 0110, 1000, your tree would look like
this:
*
/ \
* 1000
/ \
0100 0110
To find 0110, look at the left bit, a zero, which means go left; the
next bit, a 1 means go right, and then, since we are at a leaf, we
are done. If the item is not in the tree, one runs into a leaf that
is not equal to the item or a null leaf pointer. Both insertion and
deletion should be fairly obvious. The times for search, insert, and
delete are all bounded and are linear in the maximum depth of the
tree.
There are no collisions and it is very fast to search. In fact, the
only drawback I know of is that you really should dynamically allocate
the node structures, unless you are willing to accept a difficult to
compute upper bound on the number of things you can store in the tree.
You can, however, add additional information to the nodes and keep the
number of nodes to no more than 50%, more or less, of the number of
leaves.
There are, of course, many variations on this. The one I invented it
with generates hash code for strings, uses the hash code to control
the tree traversal, and stores the items in small buckets at the
leaves.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com
djones@megatest.UUCP (Dave Jones) (09/27/89)
From article <1989Sep24.214153.8867@rpi.edu>, by flynn@clotho.acm.rpi.edu (Kevin Lincoln Flynn): > As long as people are talking hashing, can anyone think of a GOOD way to hash > 16-bit numbers determined dynamically at runtime into a fast way to look them > up? Specifically, I'm talking about the attribute dictionary of an object in > an OO system we're working on here -- the tags of attributes are 16-bit handles > about which nothing is known at runtime. Someone suggested closed hashing, > with the number of buckets growing by roughly doubling primes.... ie start > with two buckets, then 3, then 7, then 17, etc..... any ideas out there? > (oh yeah, the hash function as well as the strategy'd by nice. [ :) ]) > Reluctant as I am to contribute to your delinquency by aiding and abetting dynamic attribute binding, well okay. :-| For a hash-function, it depends on the form of those 16-bit numbers. If all the bits are more or less equally important, you may just want to multiply by a peculiar number. Try 9821. Read Sedgewick's "Algorithms" or something by Knuth on pseudo-random number generation. Do some experiments to see what kind of collision rates you get with different numbers and different table-sizes. Numbers that end with x21, where x is even may avoid some problems. I don't pretend to understand this. Now then. Concerning the strategy.... I have never been convinced that it is advantageous to use prime numbers for sizing hash tables. Nobody has ever been able to do much more by way of convincing than to wave books at me. But when you read those books, they just say it's a Good Thing. No analysis of why it is a Good Thing. My reasoning is that if the hash-function is good, then the algebraic properties of the size of the table don't matter. Random squared is random. Or in this case, pseudo-random squared is pseudo-random. Why am I telling you all this? Well, even on some modern machines, dividing is expensive. I just ran a little test on my Sun 3/60 and it did 10,000,000 bitwise "&" operations five and a half times as fast as it did the same number of divide operations. I have one hash-table package that I use almost exclusively and it seems to give very good results. It keeps tables that are powers of two. Therefore to mod out the size of the table, it masks the hash-value with one & operation. I keep the table strictly less than half full, in order to keep the number of collisions low. The less-than-half-full property allows me to use a rather peculiar rehashing scheme that I dreamed up. It is based on a very remarkable property that I discovered, I don't know how. Too much coffee, probably. The property is that if M is a power of two, then the sequence S(0) = 0 S(n+1) = ((S(n)+1)*3) % M repeats after exactly M/2 steps. Since the "% M" can be done with an & operation, and the "*3" can be done with a shift and add, this is a fast rehash to calculate. And the algorithm does not have to check to see if it has cycled, because by keeping the table half full, we have guaranteed that the this sequence will have an empty slot in it. I've got a routine that puts something in the table if nothing else equivalent was there, or returns the equivalent thing. And a routine that replaces the equivalent thing and returns the old value, a routine that just looks for something, etc., etc... I even have a routine that removes entries. The average time (which should be always, if the hash- function is good) is O(1) for both insertion and removal. Worst case for insertion is O(n) and for removal is O(n**2), but that doesn't matter, of course. Here's a sample. (The way the bit-mask for modding out the power of two is calculated assumes a two's complement machine. I don't ever expect to use the other kind.) And oh yes. Please forgive the dynamic binding of the "hash" and "eq" attributes. In the OO attribute looker-upper you will no doubt want to do the right thing and expand the hash and eqivalence functions in line. #define HASH(cont) (((*(obj->hash))(cont)) & obj->mask ) #define REHASH(num) (((((num)+1)*3) & obj->mask) ) static Ptr* new_table(); static int round_up_to_pwr2(); static void Hash_overflow(); Hash* Hash_init_sized(obj, hash, eq, initial_table_size) register Hash* obj; int (*hash)(); Bool (*eq)(); { if(initial_table_size < 4) initial_table_size = 4; obj->size = round_up_to_pwr2(initial_table_size); obj->num_entries = 0; obj->max_entries = obj->size/2 - 1; obj->mask = obj->size - 1; obj->hash_table = new_table(obj->size); obj->hash = hash; obj->eq = eq; return obj; } /* Put a new entry into the table. If there was previously an * equivalent entry in the table, return it. * If there was not, return (Ptr)0. */ Ptr Hash_put(obj, entry ) register Hash* obj; Ptr entry; { register int bucket_number; register Ptr* bucket; bucket_number = HASH(entry); while(1) { bucket = obj->hash_table + bucket_number; if ( *bucket == (Ptr)0 ) { *bucket = entry; obj->num_entries++; if ( obj->num_entries > obj->max_entries ) Hash_overflow(obj); return (Ptr)0; /* <======== added new entry */ } if ( !obj->eq( entry, *bucket ) ) { bucket_number = REHASH(bucket_number); continue; /* <====== search some more (collision) */ } /* Found old Ptr. Replace. */ { Ptr old = *bucket; *bucket = entry; return old; /* <============== replaced entry */ } } }
oz@yunexus.UUCP (Ozan Yigit) (09/27/89)
In article <1989Sep26.133339.2890@twwells.com> bill@twwells.com (T. William Wells) writes: >Here's my invention, which is not quite a hash table, but will do >quite as well: create a binary tree in which the numbers are used to >control the descent of the tree. Nice trie !! :-) Actually, it is nice to be able to invent things (I did that once or twice), even if they are about a decade old. See "radix search tries" or "compressed tries" or "digital search tries" . Also see Knuth, or "Handbook of Algorithms and Data Structures" by G. H. Gonnet, Addison-Wesley ISBN 0-201-14218-X. oz -- The king: If there's no meaning Usenet: oz@nexus.yorku.ca in it, that saves a world of trouble ......!uunet!utai!yunexus!oz you know, as we needn't try to find any. Bitnet: oz@[yulibra|yuyetti] Lewis Carroll (Alice in Wonderland) Phonet: +1 416 736-5257x3976
piet@cs.ruu.nl (Piet van Oostrum) (09/27/89)
In article <1989Sep24.214153.8867@rpi.edu>, flynn@clotho (Kevin Lincoln Flynn) writes:
`As long as people are talking hashing, can anyone think of a GOOD way to hash
`16-bit numbers determined dynamically at runtime into a fast way to look them
`up? Specifically, I'm talking about the attribute dictionary of an object in
`an OO system we're working on here -- the tags of attributes are 16-bit handles
`about which nothing is known at runtime.
Maybe you should look in the OOPSLA proceedings, to see what kind of
implementations people have tried. (Sorry, I don't have mine handy -- seems
somebody borrowed them, so I can't give you any refs). Also you could wait
a few days and repost your question on comp.object.
--
Piet van Oostrum, Dept of Computer Science, University of Utrecht
Padualaan 14, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands.
Telephone: +31-30-531806 Internet: piet@cs.ruu.nl
Telefax: +31-30-513791 Uucp: uunet!mcvax!hp4nl!ruuinf!piet
bill@twwells.com (T. William Wells) (09/29/89)
In article <3987@yunexus.UUCP> oz@yunexus.UUCP (Ozan Yigit) writes: : In article <1989Sep26.133339.2890@twwells.com> bill@twwells.com (T. William Wells) writes: : >Here's my invention, which is not quite a hash table, but will do : >quite as well: create a binary tree in which the numbers are used to : >control the descent of the tree. : : Nice trie !! :-) Actually, it is nice to be able to invent things (I did : that once or twice), even if they are about a decade old. See "radix : search tries" or "compressed tries" or "digital search tries" . Also see : Knuth, or "Handbook of Algorithms and Data Structures" by G. H. Gonnet, : Addison-Wesley ISBN 0-201-14218-X. I guess I was unclear. The trie (which I know about, most of my data compression work deals with them) is actually a degenerate case of my "invention". The neat trick was in first generating a hash code of something to be stored in a table, then using the hash code to search the trie. (Ugh, I hate that word!) Later, a friend of mine was writing a disassembler and wanted a way to translate addresses to symbols and I realized that the identity hash function :-) would do his job quite nicely. (And this "hash" also preserves ordering, which made the job of finding the symbol below or equal to an address fairly easy.) And that version is what I was suggesting. Anyway, I developed the original hash method by a consideration of the hash method wherein the hash table is doubled when a bucket fills and each element of the table is duplicated adjacent to itself. I realized that a transformation of that would have nodes representing the original entries with children nodes representing the newly duplicated nodes. Unlike that method, using the trie means that a full bucket only affects the current subtree, resulting in both space and time savings, with the drawback of doing memory allocations. This is my standard hash routine. Maybe someday I'll stop being a perfectionist about the code and release it. --- Bill { uunet | novavax | ankh | sunvice } !twwells!bill bill@twwells.com
oz@yunexus.UUCP (Ozan Yigit) (09/30/89)
In article <1989Sep29.021823.12598@twwells.com> bill@twwells.com (T. William Wells) writes: >I guess I was unclear. The trie (which I know about, most of my data >compression work deals with them) is actually a degenerate case of my >"invention". The neat trick was in first generating a hash code of >something to be stored in a table, then using the hash code to search >the trie. Well, maybe I was unclear.... It is a pretty useful, but still, a very old trick, dating back to 1978: Per-Ake Larson "Dynamic Hashing", BIT 18 (1978). This is exactly what dbm/ndbm, and my pd sdbm uses. Details of ndbm's way of doing this trie traversal via the hash value has been discussed in this group before. >.. using the trie means that a >full bucket only affects the current subtree, resulting in both space >and time savings, with the drawback of doing memory allocations. Right, but you have to be very careful with the hash function: (back to where we started) it has to be "bit-randomizing" [In fact, larson uses a hash-number seeded random binary-number generator] so that the trie will not "one way branch" due to a large number of common bits in the hash values. oz -- The king: If there's no meaning Usenet: oz@nexus.yorku.ca in it, that saves a world of trouble ......!uunet!utai!yunexus!oz you know, as we needn't try to find any. Bitnet: oz@[yulibra|yuyetti] Lewis Carroll (Alice in Wonderland) Phonet: +1 416 736-5257x3976