v.wales%ucla-locus@sri-unix.UUCP (01/09/84)
From: Rich Wales <v.wales@ucla-locus>
As part of a local effort here at UCLA to find a good set of random-
access file routines for UNIX, I recently did an analysis of the "dbm"
data base package distributed as part of 4.1BSD (as well as several
other UNIX versions, I believe). Although the "dbm" algorithm is very
attractive in many respects, it suffers from some very serious flaws
which -- in my opinion -- make it unsuitable for medium- or large-sized
data bases.
Below is a description of the "dbm" algorithm and my conclusions. I am
sending this out because I assume it will be of interest to lots of
people.  Also, I would appreciate any comments on my analysis or conclu-
sions. Specifically:
(1) Does anyone know who originally developed this algorithm, and where
(if anywhere) it has been published? (I have heard a rumor that it
was published in CACM some time ago, but I haven't yet had time to
go looking for it.)
(2) Are there any hash-coding experts out there who can explain the
theoretical basis of "dbm"'s hashing scheme? (There is not a soli-
tary shred of commentary in the entire source file, and the hashing
routine seems to depend on what appears for all the world to be a
set of totally random numbers.)
(3) Has anyone else tried to debug or "clean up" the "dbm" package? If
so, I would be interested in hearing about your fixes and the extent
(if any) to which they might make "dbm" more usable.
(4) Does anyone have a better random-access file package -- a B-tree
package, for instance -- that they would be willing to share?
-- Rich Wales <v.wales@UCLA-LOCUS>
========================================================================
ANALYSIS OF THE "dbm" DATA BASE PACKAGE
"dbm" is a simple data-base system which allows storage and retrieval of
essentially arbitrary-length key/data pairs. Both keys and data may be
arbitrary byte strings (not necessarily ASCII). In a small- to medium-
sized data base, a given record can be located (or determined not to
exist at all) in about one disk read.
A "dbm" data base consists of two files -- a "data" file (whose name
ends in ".pag") and a "bit-map" file (whose name ends in ".dir").
(1) The data file consists of 1024-byte pages, each of which may contain
zero or more records. There is a size restriction: any given rec-
ord must fit entirely in a single page. Further (as discussed in
more detail below), it turns out that all the records whose keys
hash to the same value must fit together in a single page as well.
(2) The bit-map file is a two-dimensional triangular array of bits. For
a given first subscript "m" (which can be any non-negative integer),
the second subscript "n" can range from 0 to (2**m)-1. This two-
dimensional array is linearized into a one-dimensional array for
storage purposes via the function B(m,n) = (2**m)+n-1. (Bits which
would be stored beyond the actual end of the bit-map file are as-
sumed to be zero.) The interpretation of the bits in the bit-map
file is described later.
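To make the linearization and the end-of-file convention concrete, here
is one way a single bit of the map might be fetched.  This is a sketch
of my own, not code from the package -- the names "getbit", "dirbuf",
and "dirsize" are invented, and I have not checked how the real code
numbers bits within a byte:

    static char  dirbuf[4096];   /* assumed in-core image of ".dir"      */
    static long  dirsize;        /* number of bytes actually in the file */

    int
    getbit (int m, long n)
    {
        long  b = (1L << m) + n - 1;   /* B(m,n) = 2**m + n - 1           */

        if (b / 8 >= dirsize)
            return 0;                  /* bits past end of file read as 0 */
        return (dirbuf[b / 8] >> (b % 8)) & 1;
    }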
"dbm" uses a hashing scheme to place and locate records. The hashing
algorithm used maps the key into a 32-bit integer. At the present time,
I have not figured out the theoretical basis of the hashing algorithm,
and unfortunately the "dbm" source itself is totally undocumented!
The low-order "m" bits of the hash code (for a variable value of "m")
are used as the page number in the data file. Assuming that the bit-map
file has been set up correctly (see below), the algorithm is as follows
(written in a C-like pseudo-code):
    h = hash_code (key);            /* 32-bit hash of the key             */
    m = 0;
    n = h & ((1L << m) - 1);        /* (1L << m) is 2**m                  */
    while (bit_map[m][n] == 1)      /* bit B(m,n) of the bit-map file     */
    {   m = m + 1;
        n = h & ((1L << m) - 1);    /* low-order m bits of the hash code  */
    }
    /* "n" is the desired page number.  "m" is saved for later
     * reference in case page "n" fills up and must be split.
     */
Initially, the entire bit-map file contains zeros; hence, the first
record or records will all go into page 0 of the data file. Eventually,
though, page 0 will fill up. When this happens, bit 0 of the bit-map
file (i.e., bit_map[0,0]) is set to 1, and all records in page 0 whose
keys hash to an odd number are moved to page 1. The effect of turning
on bit 0 of the bit-map file is that subsequent records will be placed
(or searched for) in either page 0 or page 1, depending on the low-order
bit of the hash code.
As more and more records are placed in the data base, more and more
pages of the data file are split in a manner similar to the above exam-
ple. In general, if page "n" needs to be split, the bit "bit_map[m,n]"
is set to 1, and all records in page "n" for which the m'th bit of the
hash code is a 1 (i.e., hash_code(key) & (2**m) != 0) are moved to the
page numbered n+(2**m).
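In code, the splitting step might look roughly like the following.
This is again a sketch of my own, not the dbm source: pages are shown
as simple in-core arrays, the record layout is invented, and "setbit"
stands for whatever routine turns on a bit of the map:

    struct record {
        unsigned long  hash;         /* hash_code(key) for this record  */
    };

    struct page {
        int            count;
        struct record  rec[64];      /* arbitrary in-core capacity      */
    };

    /* Split page "n" (reached with "m" hash bits): set bit B(m,n),
     * then move every record whose hash code has bit "m" set into
     * page n + 2**m.  The caller must have room for that page.
     */
    void
    split_page (struct page pg[], int m, unsigned long n,
                void (*setbit)(int, unsigned long))
    {
        struct page  *oldp = &pg[n];
        struct page  *newp = &pg[n + (1UL << m)];
        int           i, kept = 0;

        (*setbit)(m, n);                        /* bit_map[m,n] = 1 */
        for (i = 0; i < oldp->count; i++) {
            if (oldp->rec[i].hash & (1UL << m))
                newp->rec[newp->count++] = oldp->rec[i];
            else
                oldp->rec[kept++] = oldp->rec[i];
        }
        oldp->count = kept;
    }

In the real package the pages live on disk, of course, and the split is
repeated if the new arrangement still does not fit -- which is where
the trouble described below begins.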
An informal interpretation of the bit map is that, if the low-order "m"
bits of the hash code are "n", and bit_map[m,n] is 1, the algorithm must
examine at least the next bit of the hash code (by incrementing "m").
Records may be deleted from the data base. Deletion of records does not
affect the bit-map file in any way, but the space formerly occupied by a
deleted record can be reused by subsequent additions.
The "dbm" scheme sounds very elegant at first glance, but closer exami-
nation reveals two critically serious difficulties:
(1) In certain cases where hash codes differ only in the high-order bit
or bits, the page-splitting process could easily create an unaccept-
ably huge (though sparse) file. Since most UNIX utilities (includ-
ing "dump" and "tar") do not understand sparse files, this creates
a severe problem when trying to make backup copies of a "dbm" data
base or the file system containing it.
(2) Since the hash code determines the one and only page in which "dbm"
will look for a record with a given key, all records whose keys hash
to the same value must fit in a single page of the data file. Since
the user has no way of predicting whether this condition will arise,
and no recourse if it does, I must consider this an unacceptable re-
striction in the "dbm" design.
To make matters even worse, the "dbm" system as originally distrib-
uted in 4.1BSD does not check for violations of this restriction.
Rather, it will go into an endless loop trying to split data-file
pages in order to make room for a new record. If the hash code in
question is very large, the data base may be suddenly transformed
into a giant sparse file in the course of this vain effort.
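To put rough numbers on the first problem, consider two records which
will not fit together in one page and whose hash codes happen to agree
in their low-order 20 bits.  The figures below are my own illustration,
not anything taken from the dbm source:

    #include <stdio.h>

    int
    main (void)
    {
        unsigned long  h1 = 0x00000CDEUL;  /* hypothetical hash codes   */
        unsigned long  h2 = 0x00100CDEUL;  /* differing only at bit 20  */
        int            m = 0;

        /* Splitting cannot separate the two records until the
         * low-order m bits of their hash codes differ.
         */
        while ((h1 & ((1UL << m) - 1)) == (h2 & ((1UL << m) - 1)))
            m++;

        printf ("records separate at m = %d\n", m);    /* prints 21        */
        printf ("page %lu, file offset %lu bytes\n",   /* about 1 gigabyte */
                h2 & ((1UL << m) - 1),
                (h2 & ((1UL << m) - 1)) * 1024UL);
        return 0;
    }

If the two hash codes were identical rather than merely close, the loop
above would never terminate -- which is exactly the endless splitting
just described.  If they merely agree in many low-order bits, the data
file balloons to an enormous, mostly-empty size before the two records
finally land in different pages.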
In addition to the above fundamental flaws, the package also has a few
other bugs. One of the most serious (but probably easily fixable) is
that you can't manipulate multiple data bases in a single run. Not
only does the "dbminit" routine not close the previous set of files be-
fore opening a new set, but the routines that read pages from the ".dir"
and ".pag" files use a caching scheme with "static" internal variables
that "dbminit" cannot reinitialize.
========================================================================
ron%brl-vgr@sri-unix.UUCP (01/10/84)
From: Ron Natalie <ron@brl-vgr>

I took the DBM subroutines and renamed them and reworked them so that
you can access multiple database files (doing a second dbminit causes
unreliable results) by passing a structure with all the file specific
info in it (i.e. all the things that were static internal variables
before).  It is trivial to do, but I will send the source out to those
who are lazy.

-Ron
thomas@utah-gr.UUCP (Spencer W. Thomas) (01/16/84)
Gosling (of emacs fame) did some work on the dbm package (producing
something called ndbm).  His changes include the ability to have
multiple data bases open at once.  I don't know about the internals,
but an ndbm data base consists of 3 files: a ".dat", a ".dir", and a
".pag" file.  I think he tried to solve the page-splitting problem by
moving the actual data into a third file, pointed to from the ".pag"
file (the one which, in dbm, would contain the data); the ".pag" file
now contains only keys and pointers, making it much more likely that
all records hashing to one page will fit into the page.

On the minus side, there is no reclamation of space in the data file
(this is also a plus, if you screw up and want to recover), so you
have to dump and reload the data base every now and then.  Can anybody
out there enlarge on this?

=Spencer
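A guess at the sort of layout Spencer describes -- not Gosling's actual
code; the field names and sizes here are invented.  A ".pag" entry
would hold just the key and a pointer into the ".dat" file:

    struct pag_entry {
        short  keylen;               /* length of the key bytes          */
        long   datoff;               /* offset of the data in ".dat"     */
        long   datlen;               /* length of the data               */
        /* "keylen" bytes of key follow immediately after this header */
    };

Since the ".dat" file would only ever be appended to under such a
scheme, deleted data is never reclaimed, which matches the
dump-and-reload behavior described above.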
leichter@yale-com.UUCP (Jerry Leichter) (01/16/84)
The algorithm used in the "dbm" system was described in TODS
(Transactions on Database Systems) about 4 years ago, in an article by
a couple of people at IBM (whose names will not come back to me now)
titled "Extendible Hashing".  It's a beautiful algorithm, and sparked a
series of other articles on hashing schemes that can grow and shrink
their hash tables dynamically.  The TODS article contains some
theoretical and simulation results that indicate that, asymptotically,
extendible hashing and B-trees require about the same amount of space,
on average.  Extendible hashing has the big advantage that it takes at
most 2 disk accesses to get to any data item, independent of the size
of the data-base; B-trees have logarithmically growing access times,
and typically require 3 or 4 accesses for real-life, large databases.

The problem of blocks filling with records with identical hash keys is
an exact analogue of the B-tree problem of blocks filling with
identically keyed records.  The solution is the same in both cases:
you "logically extend" the leaf block with a continuation pointer.
This costs you at least one extra disk access, so you hope it is rare.
It can be shown that for large numbers of bits in the hash, the chance
of this happening is very small; of course, the code MUST be there;
dbm is just a poor implementation.

I don't know the origin of the dbm "magic hash function", but if the
author followed the advice of the authors of the "Extendible Hashing"
paper, he would have read a paper on "Universal Classes of Hash
Functions", by Carter and Wegman, which shows how to construct
excellent hash functions "almost all the time".  If you want further
information, like the exact references, write.

			-- Jerry
			   decvax!yale-comix!leichter
			   leichter@yale
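A rough picture of the "logical extension" Jerry describes, as it
might look in a dbm-style 1024-byte page.  This is a sketch only,
assuming 4-byte longs and 2-byte shorts; it is not taken from the TODS
paper or from dbm:

    struct leaf_page {
        long   overflow;             /* page number of a continuation
                                      * page, or -1 if there is none    */
        short  nrecords;             /* records packed into this page   */
        char   records[1018];        /* 1024 - 4 - 2 bytes of key/data  */
    };

A lookup that finds the page full and does not find its key there
follows the overflow pointer, at the cost of the one extra disk access
mentioned above.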