[mod.compilers] Extendible hashing

johnl@ima.UUCP (02/14/87)
Gus,

I have a small collection of papers on extendible hashing, but I agree
with John that their focus is too much on disk-resident tables.  Instead
of pointing you to the papers, let me give you the general outline of the 
in-core extendible hashing method I plan to use.  It is not very much
more complicated than hashing with a fixed table size.

(1)  Choose your hash function.  It can be any function you want,
     subject to the following constraints:

     a.  It must return a non-negative integer that is chosen from a
         range that is large enough to index a maximal-size table.
         In most languages, a long integer is sufficient, since it
         usually has at least as many bits as a memory address.

     b.  The results of the hash function applied to a reasonable sample
         of key values hould be evenly distributed over the entire range.
         In this algorithm, unlike others, the distribution of the high-
         order bits is more critical than that of the low-order bits.

(2)  Choose your collision-handling mechanism.  Any reasonable method will
     work.  The important thing is that you understand it well enough to
     know how to detect the point at which performance degrades so far that
     you'd be better off expanding the table.  You also should be able to
     find in a reasonable time, for a given hash cell, all the entries that
     hash to that cell.
 
(3)  Choose your page size.  This is the unit by which the table is
     extended or shrunk.  It should probably be a nice round (binary)
     number of cells, like 256.  Call this value PSIZE.

(4)  Choose your page indexing method.  You probably don't want to require
     that successive pages be stored contiguously, but you do want them to
     be retrievable in constant time.  Something like an array of pointers
     to pages seems to be the best compromise.

(5)  Initialize your table to empty by setting the current table size
     (TSIZE) to zero, and setting things up so that no pages are currently
     allocated.  Your insertion algorithm should recognize this
     condition as equivalent to "table full enough for expansion."  You
     also need to keep track of the number of significant bits in the
     table size.  Call this value BITS.  It is easy to keep track of 
     incrementally.

(6)  To look up an entry, compute the hash function of your key and call
     the result HASH.  Extract the BITS most significant bits of HASH.
     if it is larger than TSIZE, divide it by two to drop the least
     significant bit.  This is your hash table index.  Split it into
     a page number and offset, and look for your entry.

(7)  To extend the table, allocate the next sequential page.  Add PSIZE 
     to TSIZE and adjust BITS if necessary.  Let NEWPAGE be the number
     of your new page (we assume the first page is numbered 0).  Let
     OLDPAGE be NEWPAGE divided by 2.  Examine every entry that hashes
     to a cell on OLDPAGE.  For those entries, look at the bit we would
     have discarded in the lookup algorithm before the table was extended.
     This is the BITS'th-most high-order bit of the hash value.  If it is
     zero, leave the entry on OLDPAGE.  Otherwise, move it to NEWPAGE.  
      
(8)  To shrink the table, reverse the process.  Let OLDPAGE be the current
     highest-numbered page, and NEWPAGE be OLDPAGE divided by 2.  Merge all
     entries from NEWPAGE into OLDPAGE by simply dividing their table
     indices by 2, and resolving any collisions that result.  Then reduce
     TSIZE by PSIZE and de-allocate the newly-emptied page.

Best of luck with your implementation.    
                                        -- Jens Dill 
                                           dill@svax.cs.cornell.edu
[This is similar to the scheme that the DBM routines use to manage disk files,
except that DBM splits pages as they fill up and keeps a bit map of them
rather than splitting the page in the middle as this scheme does. - john]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request