[comp.lang.c] Perfect HASH functions.....

vijay@bradley.UUCP (09/18/89)

Hello,      

   I hope you can help me. Does anyone out there have a perfect
hash function? The key is a string of characters (maximum length
is 4, minimum is 2) that make up the opcode set of an assembler
that I am in the process of writing. I tried some home-brewed
hash functions, but got some collisions. I also tried to read some
articles on the subject in ACM, but you need a PHD in maths to
make any sense out of them!! Any help would be appreciated.

Thanks in advance (you can e-mail me directly, if you please..)
--
Vijay K. Gurbani    | {uiucdcs, ihnp4, cepu}!bradley!vijay | "Learn how to 
Distributed Systems | vijay@bradley.edu                    |  avoid RIPOFFS -
Bradley University  | (309)677-2339                        |  Send $10.00 for
Peoria, IL 61605    |                                      |  details!!"

jeffrey@algor2.algorists.com (Jeffrey Kegler) (09/21/89)

In article <9900014@bradley> vijay@bradley.UUCP writes: 

>Does anyone out there have a perfect hash function? The key is a
>string of characters (maximum length is 4, minimum is 2) that make up
>the opcode set of an assembler that I am in the process of writing.
>I also tried to read some articles on the subject in ACM, but you
>need a PHD in maths to make any sense out of them!!

I wrote one of those articles (_Communications of the ACM_, June
1986), so maybe I can help.  Perfect hashing is almost never
practical.  For any number of keys, some type of faster and better
method can be devised.  [ The secret to this is realizing perfect hashes
do not run in constant time. ]  In your case a binary search looks
like a winner.
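For what it's worth, the binary search Jeffrey suggests is only a few
lines with the standard bsearch(): keep the opcode table sorted by name
and let the library do the work. The table below is made up for
illustration (your real opcode set will differ); a sketch only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical opcode table; entries MUST be sorted by name
 * for bsearch() to work. */
struct opcode { const char *name; int value; };

static const struct opcode optab[] = {
    { "add",  0x01 },
    { "jmp",  0x02 },
    { "ld",   0x03 },
    { "nop",  0x04 },
    { "st",   0x05 },
};

static int opcmp(const void *k, const void *e)
{
    return strcmp((const char *)k, ((const struct opcode *)e)->name);
}

/* Return the opcode value for name, or -1 if unknown. */
int opcode_lookup(const char *name)
{
    const struct opcode *p = (const struct opcode *)bsearch(name, optab,
        sizeof optab / sizeof optab[0], sizeof optab[0], opcmp);
    return p ? p->value : -1;
}
```

For a few dozen opcodes this is five or six string compares per lookup,
and adding an opcode is just adding a line and keeping the table sorted.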

If you do come up with a perfect hash of your opcodes, and code it up,
what happens when an opcode is added (or deleted)?  You need a new
perfect hash.  It certainly makes the job of maintaining the assembler
interesting.

Another interesting approach for the opcode search is a
self-organizing search, which takes advantage of the fact that some
of the opcodes are used far more often than others.
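A move-to-front list is the simplest self-organizing search: every
successful lookup moves the entry to the head, so the common opcodes
drift toward the front on their own. A minimal sketch (the names are
invented, not from the original post):

```c
#include <string.h>

/* Move-to-front list node. */
struct node { const char *name; int value; struct node *next; };

/* Look up name; on a hit, move the found node to the head of
 * the list so frequently used entries stay near the front.
 * Returns the value, or -1 if not found. */
int mtf_lookup(struct node **head, const char *name)
{
    struct node *prev = 0, *p;
    for (p = *head; p; prev = p, p = p->next) {
        if (strcmp(p->name, name) == 0) {
            if (prev) {                 /* unlink and move to front */
                prev->next = p->next;
                p->next = *head;
                *head = p;
            }
            return p->value;
        }
    }
    return -1;                          /* not found */
}
```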

-- 

Jeffrey Kegler, Independent UNIX Consultant, Algorists, Inc.
jeffrey@algor2.ALGORISTS.COM or uunet!algor2!jeffrey
1762 Wainwright DR, Reston VA 22090

djones@megatest.UUCP (Dave Jones) (09/23/89)

From article <9900014@bradley>, by vijay@bradley.UUCP:
> 
> Hello,      
> 
>    I hope you can help me. Does anyone out there have a perfect
> hash function? The key is a string of characters (maximum length
> is 4, minimum is 2) ...

How about another idea?

For a similar application, I recently wrote a little program which
generates a big switch statement for recognizing strings. You give
it a list of strings, each paired with its enumeration value, and
it spits out the switch.  It's real fast. It helps if you optimize
away jumps to jumps, of course. It only took about an hour to
write the thing. Here's the first few lines of a switch it generated
for recognizing Pascal keywords in lower case:

switch(*str++) /*  */
{ default: break;
  case 'a': /* a */
   switch(*str++) /* a */
   { default: break;
     case 'n': /* an */
      switch(*str++) /* an */
      { default: break;
        case 'd': /* and */
         if(length==3)type = 307;
        break;
      }
     break;
     case 'r': /* ar */
      switch(*str++) /* ar */
      { default: break;
        case 'r': /* arr */
         switch(*str++) /* arr */
         { default: break;
           case 'a': /* arra */
            switch(*str++) /* arra */
            { default: break;
              case 'y': /* array */
               if(length==5)type = 257;
              break;
            }
           break;
         }
        break;
      }
     break;
   }
  break;


etc...

schmidt@glacier.ics.uci.edu (Doug Schmidt) (09/24/89)

In article <9900014@bradley>, vijay@bradley writes:
>
>Hello,      
>
>   I hope you can help me. Does anyone out there have a perfect
>hash function? The key is a string of characters (maximum length
>is 4, minimum is 2) that make up the opcode set of an assembler
>that I am in the process of writing. I tried some home-brewed
>hash functions, but got some collisions. I also tried to read some
>articles on the subject in ACM, but you need a PHD in maths to
>make any sense out of them!! Any help would be appreciated.
>
>Thanks in advance (you can e-mail me directly, if you please..)

If you've got access to anonymous ftp you should pick up my perfect
hash function generator program `gperf.'  It is available from
ics.uci.edu (128.195.1.1) in ~ftp/pub/cperf-1.8.tar.Z.  

Written in C, this program is used to automatically generate the
reserved word lookup routines used in the GNU C, GNU C++, and GNU
Pascal compilers, as well as the GNU indent program.

The distribution comes with extensive documentation, and should
compile easily with standard K&R C compilers.

If you don't have access to anonymous ftp I can send you a version via
email, or you can wait until gperf is posted to comp.sources.unix
(shortly, I hope!).

  Doug
--
schmidt@ics.uci.edu (ARPA) |   Per me si va nella citta' dolente.
office: (714) 856-4043     |   Per me si va nell'eterno dolore.
                           |   Per me si va tra la perduta gente.
                           |   Lasciate ogni speranza o voi ch'entrate.

flynn@clotho.acm.rpi.edu (Kevin Lincoln Flynn) (09/25/89)

As long as people are talking hashing, can anyone think of a GOOD way to hash
16-bit numbers determined dynamically at runtime into a fast way to look them
up?  Specifically, I'm talking about the attribute dictionary of an object in
an OO system we're working on here -- the tags of attributes are 16-bit handles
about which nothing is known at runtime.  Someone suggested closed hashing,
with the number of buckets growing by roughly doubling primes....  ie start
with two buckets, then 3, then 7, then 17, etc.....  any ideas out there?
(oh yeah, the hash function as well as the strategy'd be nice.  [ :) ])

  Thanks in advance!!
    - Flynn
Kevin Lincoln Flynn    flynn@acm.rpi.edu, userfwvl@mts.rpi.edu
147 1st Street         H (518) 273-6914  W (518) 447-8561
Troy, NY  12180        
...Argue for your limitations, and sure enough they're yours.

bill@twwells.com (T. William Wells) (09/26/89)

In article <1989Sep24.214153.8867@rpi.edu> flynn@acm.rpi.edu (Kevin Lincoln Flynn) writes:
: As long as people are talking hashing, can anyone think of a GOOD way to hash
: 16-bit numbers determined dynamically at runtime into a fast way to look them
: up?  Specifically, I'm talking about the attribute dictionary of an object in
: an OO system we're working on here -- the tags of attributes are 16-bit handles
: about which nothing is known at runtime.  Someone suggested closed hashing,
: with the number of buckets growing by roughly doubling primes....  ie start
: with two buckets, then 3, then 7, then 17, etc.....  any ideas out there?
: (oh yeah, the hash function as well as the strategy'd by nice.  [ :) ])

Here's my invention, which is not quite a hash table, but will do
quite as well: create a binary tree in which the numbers are used to
control the descent of the tree. In other words, if you had three
numbers (of four bits): 0100, 0110, 1000, your tree would look like
this:

		      *
		     / \
		    *  1000
		   / \
		0100 0110

To find 0110, look at the left bit, a zero, which means go left; the
next bit, a 1 means go right, and then, since we are at a leaf, we
are done. If the item is not in the tree, one runs into a leaf that
is not equal to the item or a null leaf pointer. Both insertion and
deletion should be fairly obvious. The times for search, insert, and
delete are all bounded and are linear in the maximum depth of the
tree.
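The 4-bit example above translates fairly directly into C. Below is a
minimal sketch for 16-bit keys, high bit first; when an insert lands on
a leaf holding a different key, the leaf is pushed one level down and
the descent continues. No deletion is shown, and the calloc() results
are unchecked.

```c
#include <stdlib.h>

/* Digital search tree over the bits of a 16-bit key, as described
 * above: a 0 bit goes left, a 1 bit goes right, and the full keys
 * live at the leaves. */
#define KEYBITS 16
#define BIT(key, depth) (((key) >> (KEYBITS - 1 - (depth))) & 1u)

struct tnode {
    struct tnode *kid[2];
    int is_leaf;
    unsigned key;
};

static struct tnode *new_node(void)
{
    return (struct tnode *)calloc(1, sizeof(struct tnode));
}

/* Return 1 if key is present, 0 if not. */
int trie_search(const struct tnode *n, unsigned key)
{
    int depth = 0;
    while (n && !n->is_leaf)
        n = n->kid[BIT(key, depth++)];
    return n != 0 && n->key == key;
}

/* Insert key; return 1 if added, 0 if it was already there. */
int trie_insert(struct tnode **rootp, unsigned key)
{
    int depth = 0;
    for (;;) {
        struct tnode *n = *rootp;
        if (n == 0) {                   /* empty slot: drop a leaf here */
            n = new_node();
            n->is_leaf = 1;
            n->key = key;
            *rootp = n;
            return 1;
        }
        if (n->is_leaf) {
            struct tnode *inner;
            if (n->key == key)
                return 0;               /* duplicate */
            /* Push the old leaf one level down; if both keys agree
             * on this bit, the loop splits again at the next level. */
            inner = new_node();
            *rootp = inner;
            inner->kid[BIT(n->key, depth)] = n;
            rootp = &inner->kid[BIT(key, depth)];
            depth++;
            continue;
        }
        rootp = &n->kid[BIT(key, depth)];
        depth++;
    }
}
```

Note that two keys sharing a long common prefix produce a chain of
one-way branches, which is exactly the hazard oz raises later in the
thread.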

There are no collisions and it is very fast to search. In fact, the
only drawback I know of is that you really should dynamically allocate
the node structures, unless you are willing to accept a
difficult-to-compute upper bound on the number of things you can store
in the tree.
You can, however, add additional information to the nodes and keep the
number of nodes to no more than 50%, more or less, of the number of
leaves.

There are, of course, many variations on this. The one I invented it
with generates hash code for strings, uses the hash code to control
the tree traversal, and stores the items in small buckets at the
leaves.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com

djones@megatest.UUCP (Dave Jones) (09/27/89)

From article <1989Sep24.214153.8867@rpi.edu>, by flynn@clotho.acm.rpi.edu (Kevin Lincoln Flynn):
> As long as people are talking hashing, can anyone think of a GOOD way to hash
> 16-bit numbers determined dynamically at runtime into a fast way to look them
> up?  Specifically, I'm talking about the attribute dictionary of an object in
> an OO system we're working on here -- the tags of attributes are 16-bit handles
> about which nothing is known at runtime.  Someone suggested closed hashing,
> with the number of buckets growing by roughly doubling primes....  ie start
> with two buckets, then 3, then 7, then 17, etc.....  any ideas out there?
> (oh yeah, the hash function as well as the strategy'd by nice.  [ :) ])
> 


Reluctant as I am to contribute to your delinquency by aiding and abetting
dynamic attribute binding, well okay. :-|

For a hash-function, it depends on the form of those 16-bit numbers.
If all the bits are more or less equally important, you may just want
to multiply by a peculiar number. Try 9821. Read Sedgewick's "Algorithms"
or something by Knuth on pseudo-random number generation. Do some experiments
to see what kind of collision rates you get with different numbers and
different table-sizes. Numbers that end with x21, where x is even, may
avoid some problems. I don't pretend to understand this.
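A minimal sketch of the multiply-and-mask hash described above, using
Dave's 9821 and an &-mask for a power-of-two table (the constant and
table sizes here are illustrative, and worth testing against your own
key distribution as he suggests):

```c
/* Multiplicative hash for a 16-bit key: multiply by a "peculiar"
 * odd constant (9821, from the post) and mask the product down to
 * a power-of-two table size.  mask must be table_size - 1. */
unsigned hash16(unsigned key, unsigned mask)
{
    return (key * 9821u) & mask;
}
```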

Now then. Concerning the strategy....

I have never been convinced that it is advantageous to use prime numbers
for sizing hash tables. Nobody has ever been able to do much more by way
of convincing than to wave books at me. But when you read those books, they
just say it's a Good Thing. No analysis of why it is a Good Thing. My
reasoning is that if the hash-function is good, then the algebraic properties
of the size of the table don't matter. Random squared is random. Or in this
case, pseudo-random squared is pseudo-random.

Why am I telling you all this?

Well, even on some modern machines, dividing is expensive. I just ran a
little test on my Sun 3/60 and it did 10,000,000 bitwise "&" operations
five and a half times as fast as it did the same number of divide
operations.

I have one hash-table package that I use almost exclusively
and it seems to give very good results. It keeps tables that are
powers of two. Therefore to mod out the size of the table, it masks the
hash-value with one & operation. I keep the table strictly less than
half full, in order to keep the number of collisions low.

The less-than-half-full property allows me to use a rather peculiar
rehashing scheme that I dreamed up. It is based on a very remarkable
property that I discovered, I don't know how. Too much coffee, probably.
The property is that if M is a power of two, then the sequence

               S(0) = 0
               S(n+1) = ((S(n)+1)*3) % M

repeats after exactly M/2 steps. Since the "% M" can be done with
an & operation, and the "*3" can be done with a shift and add, this
is a fast rehash to calculate. And the algorithm does not have to check to
see if it has cycled, because by keeping the table half full, we have
guaranteed that this sequence will have an empty slot in it.
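The claimed cycle length is easy to check by brute force. The little
function below walks the sequence until it returns to 0; for
power-of-two M >= 4 it comes back after exactly M/2 steps. (For M = 2
the cycle length is 2, but the sample code later in this post never
builds a table smaller than 4 anyway.)

```c
/* Count the steps before the probe sequence
 *   S(0) = 0,  S(n+1) = ((S(n)+1)*3) % M
 * returns to 0, with the "% M" done as an & for power-of-two M. */
unsigned cycle_length(unsigned M)
{
    unsigned mask = M - 1;
    unsigned s = 0, n = 0;
    do {
        s = ((s + 1) * 3) & mask;
        n++;
    } while (s != 0);
    return n;
}
```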

I've got a routine that puts something in the table if nothing else
equivalent was there, or returns the equivalent thing. And a routine
that replaces the equivalent thing and returns the old value, a routine
that just looks for something, etc., etc...  I even have a routine that
removes entries. The average time is O(1) for both insertion and
removal, and if the hash-function is good, the average is what you
get in practice. Worst case for insertion is O(n) and for removal is
O(n**2), but that doesn't matter, of course.  Here's a sample. (The
way the bit-mask for modding out the
power of two is calculated assumes a two's complement machine. I don't
ever expect to use the other kind.)

And oh yes. Please forgive the dynamic binding of the "hash" and "eq"
attributes. In the OO attribute looker-upper you will no doubt
want to do the right thing and expand the hash and equivalence functions
in line.


#define HASH(cont) (((*(obj->hash))(cont)) & obj->mask )
#define REHASH(num)  (((((num)+1)*3) & obj->mask) )

static Ptr* new_table();
static int round_up_to_pwr2();
static void Hash_overflow();

Hash*
Hash_init_sized(obj, hash, eq, initial_table_size)
     register Hash* obj;
     int (*hash)();
     Bool (*eq)();
     int initial_table_size;
{
  if(initial_table_size < 4) initial_table_size = 4;
  obj->size = round_up_to_pwr2(initial_table_size);
  obj->num_entries = 0;
  obj->max_entries = obj->size/2 - 1;
  obj->mask = obj->size - 1;
  obj->hash_table = new_table(obj->size);
  obj->hash = hash;
  obj->eq = eq;
  return obj;
}

/* Put a new entry into the table. If there was previously an
 * equivalent entry in the table, return it.
 * If there was not, return (Ptr)0.
 */


Ptr
Hash_put(obj, entry )
  register Hash* obj;
  Ptr entry;
{

  register int bucket_number;
  register Ptr* bucket;

  bucket_number = HASH(entry);

  while(1)
    {
      bucket = obj->hash_table + bucket_number;

      if ( *bucket == (Ptr)0 )
	{ 
	  *bucket = entry;
	  obj->num_entries++;
	  if ( obj->num_entries > obj->max_entries )
	    Hash_overflow(obj);
	  return (Ptr)0;  /* <======== added new entry */
	}
      
      if ( !obj->eq( entry, *bucket ) )
	{ 
	  bucket_number = REHASH(bucket_number);
	  continue; /* <====== search some more (collision) */
	}

      /* Found old Ptr. Replace. */
      { 
	Ptr old = *bucket;
	*bucket = entry;
	return old; /* <============== replaced entry */
      }

    }
  
}
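The sample declares new_table() and round_up_to_pwr2() without showing
them, but their contracts seem clear from how they are called. Here is
one plausible ANSI-C sketch; the Ptr typedef is a guess, and Dave's
originals may well differ.

```c
#include <stdlib.h>

typedef char *Ptr;   /* assumption: Ptr is some generic pointer type */

/* Round n up to the next power of two (n >= 1 assumed). */
static int round_up_to_pwr2(int n)
{
    int p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Allocate a zero-filled table of `size` buckets, so every bucket
 * starts out as (Ptr)0, which Hash_put treats as empty. */
static Ptr *new_table(int size)
{
    return (Ptr *)calloc((size_t)size, sizeof(Ptr));
}
```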

oz@yunexus.UUCP (Ozan Yigit) (09/27/89)

In article <1989Sep26.133339.2890@twwells.com> bill@twwells.com (T. William Wells) writes:
>Here's my invention, which is not quite a hash table, but will do
>quite as well: create a binary tree in which the numbers are used to
>control the descent of the tree. 

Nice trie !! :-) Actually, it is nice to be able to invent things (I did
that once or twice), even if they are about a decade old. See "radix
search tries" or "compressed tries" or "digital search tries". Also see
Knuth, or "Handbook of Algorithms and Data Structures" by G. H. Gonnet,
Addison-Wesley ISBN 0-201-14218-X.

oz
-- 
The king: If there's no meaning	   	    Usenet:    oz@nexus.yorku.ca
in it, that saves a world of trouble        ......!uunet!utai!yunexus!oz
you know, as we needn't try to find any.    Bitnet: oz@[yulibra|yuyetti]
Lewis Carroll (Alice in Wonderland)         Phonet: +1 416 736-5257x3976

piet@cs.ruu.nl (Piet van Oostrum) (09/27/89)

In article <1989Sep24.214153.8867@rpi.edu>, flynn@clotho (Kevin Lincoln Flynn) writes:
 `As long as people are talking hashing, can anyone think of a GOOD way to hash
 `16-bit numbers determined dynamically at runtime into a fast way to look them
 `up?  Specifically, I'm talking about the attribute dictionary of an object in
 `an OO system we're working on here -- the tags of attributes are 16-bit handles
 `about which nothing is known at runtime. 

Maybe you should look in the OOPSLA proceedings, to see what kind of
implementations people have tried. (Sorry, I don't have mine handy -- seems
somebody borrowed them, so I can't give you any refs). Also you could wait
a few days and repost your question on comp.object.
-- 
Piet van Oostrum, Dept of Computer Science, University of Utrecht
Padualaan 14, P.O. Box 80.089, 3508 TB Utrecht,  The Netherlands.
Telephone: +31-30-531806      Internet: piet@cs.ruu.nl
Telefax:   +31-30-513791      Uucp: uunet!mcvax!hp4nl!ruuinf!piet

bill@twwells.com (T. William Wells) (09/29/89)

In article <3987@yunexus.UUCP> oz@yunexus.UUCP (Ozan Yigit) writes:
: In article <1989Sep26.133339.2890@twwells.com> bill@twwells.com (T. William Wells) writes:
: >Here's my invention, which is not quite a hash table, but will do
: >quite as well: create a binary tree in which the numbers are used to
: >control the descent of the tree.
:
: Nice trie !! :-) Actually, it is nice to be able to invent things (I did
: that once or twice), even if they are about a decade old. See "radix
: search tries" or "compressed tries" or "digital search tries" . Also see
: Knuth, or "Handbook of Algorithms and Data Structures" by G. H. Gonnet,
: Addison-Wesley ISBN 0-201-14218-X.

I guess I was unclear. The trie (which I know about, most of my data
compression work deals with them) is actually a degenerate case of my
"invention". The neat trick was in first generating a hash code of
something to be stored in a table, then using the hash code to search
the trie. (Ugh, I hate that word!) Later, a friend of mine was
writing a disassembler and wanted a way to translate addresses to
symbols and I realized that the identity hash function :-) would do
his job quite nicely. (And this "hash" also preserves ordering, which
made the job of finding the symbol below or equal to an address
fairly easy.) And that version is what I was suggesting.

Anyway, I developed the original hash method by a consideration of
the hash method wherein the hash table is doubled when a bucket fills
and each element of the table is duplicated adjacent to itself. I
realized that a transformation of that would have nodes representing
the original entries with children nodes representing the newly
duplicated nodes. Unlike that method, using the trie means that a
full bucket only affects the current subtree, resulting in both space
and time savings, with the drawback of doing memory allocations.

This is my standard hash routine. Maybe someday I'll stop being a
perfectionist about the code and release it.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com

oz@yunexus.UUCP (Ozan Yigit) (09/30/89)

In article <1989Sep29.021823.12598@twwells.com> bill@twwells.com (T. William Wells) writes:
>I guess I was unclear. The trie (which I know about, most of my data
>compression work deals with them) is actually a degenerate case of my
>"invention". The neat trick was in first generating a hash code of
>something to be stored in a table, then using the hash code to search
>the trie. 

Well, maybe I was unclear.... It is a pretty useful, but still, a very
old trick, dating back to 1978: Per-Ake Larson "Dynamic Hashing", BIT
18 (1978). This is exactly what dbm/ndbm, and my pd sdbm, uses. Details
of ndbm's way of doing this trie traversal via the hash value have been
discussed in this group before.

>.. using the trie means that a
>full bucket only affects the current subtree, resulting in both space
>and time savings, with the drawback of doing memory allocations.

Right, but you have to be very careful with the hash function: (back to
where we started) it has to be "bit-randomizing" [In fact, Larson uses a
hash-number seeded random binary-number generator] so that the trie will
not "one way branch" due to a large number of common bits in the hash
values.
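A concrete example of a "bit-randomizing" function: the 32-bit
finalizer from MurmurHash3, a much later design than Larson's seeded
generator, shown here only to illustrate the idea. Each step is
invertible, so distinct inputs stay distinct, and the xor-shift and
multiply rounds spread low-order differences into the high bits that
drive the trie descent.

```c
#include <stdint.h>

/* 32-bit avalanche mixer (MurmurHash3 finalizer).  Every input bit
 * influences every output bit, so even small consecutive keys yield
 * hash values without long runs of common leading bits. */
uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}
```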

oz
-- 
The king: If there's no meaning	   	    Usenet:    oz@nexus.yorku.ca
in it, that saves a world of trouble        ......!uunet!utai!yunexus!oz
you know, as we needn't try to find any.    Bitnet: oz@[yulibra|yuyetti]
Lewis Carroll (Alice in Wonderland)         Phonet: +1 416 736-5257x3976