[net.unix-wizards] Discussion of "dbm" data base system

v.wales%ucla-locus@sri-unix.UUCP (01/09/84)

From:            Rich Wales <v.wales@ucla-locus>

As part of a local effort here at UCLA to find a good set of random-
access file routines for UNIX, I recently did an analysis of the "dbm"
data base package distributed as part of 4.1BSD (as well as several
other UNIX versions, I believe).  Although the "dbm" algorithm is very
attractive in many respects, it suffers from some very serious flaws
which -- in my opinion -- make it unsuitable for medium- or large-sized
data bases.

Below is a description of the "dbm" algorithm and my conclusions.  I am
sending this out because I assume it will be of interest to lots of
people.  Also, I would appreciate any comments on my analysis or
conclusions.  Specifically:

(1) Does anyone know who originally developed this algorithm, and where
    (if anywhere) it has been published?  (I have heard a rumor that it
    was published in CACM some time ago, but I haven't yet had time to
    go looking for it.)

(2) Are there any hash-coding experts out there who can explain the
    theoretical basis of "dbm"'s hashing scheme?  (There is not a
    solitary shred of commentary in the entire source file, and the
    hashing routine seems to depend on what appears for all the world
    to be a set of totally random numbers.)

(3) Has anyone else tried to debug or "clean up" the "dbm" package?  If
    so, I would be interested in hearing about your fixes and the extent
    (if any) to which they might make "dbm" more usable.

(4) Does anyone have a better random-access file package -- a B-tree
    package, for instance -- that they would be willing to share?

-- Rich Wales <v.wales@UCLA-LOCUS>

========================================================================

		ANALYSIS OF THE "dbm" DATA BASE PACKAGE

"dbm" is a simple data-base system which allows storage and retrieval of
essentially arbitrary-length key/data pairs.  Both keys and data may be
arbitrary byte strings (not necessarily ASCII).  In a small- to medium-
sized data base, a given record can be located (or determined not to
exist at all) in about one disk read.

A "dbm" data base consists of two files -- a "data" file (whose name
ends in ".pag") and a "bit-map" file (whose name ends in ".dir").

(1) The data file consists of 1024-byte pages, each of which may contain
    zero or more records.  There is a size restriction:  any given rec-
    ord must fit entirely in a single page.  Further (as discussed in
    more detail below), it turns out that all the records whose keys
    hash to the same value must fit together in a single page as well.

(2) The bit-map file is a two-dimensional triangular array of bits.  For
    a given first subscript "m" (which can be any non-negative integer),
    the second subscript "n" can range from 0 to (2**m)-1.  This two-
    dimensional array is linearized into a one-dimensional array for
    storage purposes via the function B(m,n) = (2**m)+n-1.  (Bits which
    would be stored beyond the actual end of the bit-map file are as-
    sumed to be zero.)  The interpretation of the bits in the bit-map
    file is described later.

"dbm" uses a hashing scheme to place and locate records.  The hashing
algorithm used maps the key into a 32-bit integer.  At the present time,
I have not figured out the theoretical basis of the hashing algorithm,
and unfortunately the "dbm" source itself is totally undocumented!

The low-order "m" bits of the hash code (for a variable value of "m")
are used as the page number in the data file.  Assuming that the bit-map
file has been set up correctly (see below), the algorithm is as follows
(written in a C-like pseudo-code):

	    h = hash_code (key);
	    m = 0;
	    n = h & ((2**m)-1);
	    while (bit_map[m,n] == 1)
	    {   m = m+1;
		n = h & ((2**m)-1);
	    }
	    /* "n" is the desired page number.  "m" is saved for later
	     * reference in case page "n" fills up and must be split.
	     */

Initially, the entire bit-map file contains zeros; hence, the first
record or records will all go into page 0 of the data file.  Eventually,
though, page 0 will fill up.  When this happens, bit 0 of the bit-map
file (i.e., bit_map[0,0]) is set to 1, and all records in page 0 whose
keys hash to an odd number are moved to page 1.  The effect of turning
on bit 0 of the bit-map file is that subsequent records will be placed
(or searched for) in either page 0 or page 1, depending on the low-order
bit of the hash code.

As more and more records are placed in the data base, more and more
pages of the data file are split in a manner similar to the above exam-
ple.  In general, if page "n" needs to be split, the bit "bit_map[m,n]"
is set to 1, and all records in page "n" for which the m'th bit of the
hash code is a 1 (i.e., hash_code(key) & (2**m) != 0) are moved to the
page numbered n+(2**m).

An informal interpretation of the bit map is that, if the low-order "m"
bits of the hash code are "n", and bit_map[m,n] is 1, the algorithm must
examine at least the next bit of the hash code (by incrementing "m").

Records may be deleted from the data base.  Deletion of records does not
affect the bit-map file in any way, but the space formerly occupied by a
deleted record can be reused by subsequent additions.

The "dbm" scheme sounds very elegant at first glance, but closer exami-
nation reveals two critically serious difficulties:

(1) In certain cases where hash codes differ only in the high-order bit
    or bits, the page-splitting process could easily create an unaccept-
    ably huge (though sparse) file.  Since most UNIX utilities (includ-
    ing "dump" and "tar") do not understand sparse files, this creates
    a severe problem when trying to make backup copies of a "dbm" data
    base or the file system containing it.

(2) Since the hash code determines the one and only page in which "dbm"
    will look for a record with a given key, all records whose keys hash
    to the same value must fit in a single page of the data file.  Since
    the user has no way of predicting whether this condition will arise,
    and no recourse if it does, I must consider this an unacceptable re-
    striction in the "dbm" design.
    
    To make matters even worse, the "dbm" system as originally distrib-
    uted in 4.1BSD does not check for violations of this restriction.
    Rather, it will go into an endless loop trying to split data-file
    pages in order to make room for a new record.  If the hash code in
    question is very large, the data base may be suddenly transformed
    into a giant sparse file in the course of this vain effort.

In addition to the above fundamental flaws, the package also has a few
other bugs.  One of the most serious (but probably easily fixable) is
that you can't manipulate multiple data bases in a single run.  Not
only does the "dbminit" routine not close the previous set of files be-
fore opening a new set, but the routines that read pages from the ".dir"
and ".pag" files use a caching scheme with "static" internal variables
that "dbminit" cannot reinitialize.

========================================================================

ron%brl-vgr@sri-unix.UUCP (01/10/84)

From:      Ron Natalie <ron@brl-vgr>

I took the DBM subroutines and renamed them and reworked them so that
you can access multiple database files (doing a second dbminit causes
unreliable results) by passing a structure with all the file specific
info in it (i.e. all the things that were static internal variables
before).  It is trivial to do, but I will send the source out to those
who are lazy.

-Ron

thomas@utah-gr.UUCP (Spencer W. Thomas) (01/16/84)

Gosling (of emacs fame) did some work on the dbm package (producing
something called ndbm).  His changes include the ability to have
multiple data bases open at once.  I don't know about the internals,
but an ndbm data base consists of 3 files: a ".dat", a ".dir", and a
".pag" file.  I think he tried to solve the page-splitting problem by
moving the actual data into the third file; the ".pag" file (the one
which, in dbm, would contain the data) now holds only keys and
pointers into the data file, making it much more likely that all
records hashing to one page will fit into the page.  On the minus
side, there is no reclamation of space in
the data file (this is also a plus, if you screw up and want to
recover), so you have to dump and reload the database every now and then.

Can anybody out there enlarge on this?

=Spencer

leichter@yale-com.UUCP (Jerry Leichter) (01/16/84)

The algorithm used in the "dbm" system was described in TODS (Transactions on
Database Systems) about 4 years ago in an article by a couple of people at
IBM (whose names will not come back to me now) titled "Extendible Hashing".
It's a beautiful algorithm, and sparked a series of other articles on hashing
schemes that can grow and shrink their hash tables dynamically.  The TODS
article contains some theoretical and simulation results that indicate that
asymptotically extendible hashing and B-trees require about the same amount
of space, on average.  Extendible hashing has the big advantage that it takes
at most 2 disk accesses to get to any data item, independent of the size of
the data-base; B-trees have logarithmically-growing access times, and typically
require 3 or 4 accesses for real-life, large databases.

The problem of blocks filling with records with identical hash keys is an
exact analogue of the B-tree problem of blocks filling with identically
keyed records.  The solution is the same in both cases:  You "logically
extend" the leaf block with a continuation pointer.  This costs you at
least one extra data access, so you hope it is rare.

It can be shown that for large numbers of bits in the hash, the chance of this
happening is very small; of course, the code MUST be there; dbm is just a
poor implementation.

I don't know the origin of the dbm "magic hash function", but if the author
followed the advice of the authors of the "Extendible Hashing" paper, he would
have read a paper on "Universal Classes of Hash Functions", by Carter and
Wegman, which shows how to construct excellent hash functions "almost all the
time".

If you want further information, like the exact references, write.
							-- Jerry
					decvax!yale-comix!leichter leichter@yale