[comp.unix.questions] questions regarding databases created with dbm and ndbm routines

sundar@wheaties.ai.mit.edu (Sundar Narasimhan) (06/09/88)

Hi: I have a question regarding the size of dbm and ndbm databases. In
the bugs section of the man pages for these routines there is a
comment indicating that these database files (the .dir and .pag
files) cannot be copied, tar'ed etc without filling the holes in them.
Does this comment also apply to the dump program that does file system
backups?

Also, I have about 1000 records, each ~600 bytes. The ndbm database
created from these records is rather big (see below):

-rw-rw-rw-  1 sundar       196608 Jun  8 03:59 1984/WREQ.dir
-rw-rw-rw-  1 sundar   1604270080 Jun  8 03:59 1984/WREQ.pag

Is this normal for ndbm databases?  How does this compare with
databases created with dbm routines?

Although the size is large, the database is fast for lookups. However,
I have a need for a search capability that may need to iterate over
all the records. This tends to be slow. Is there any version of these
routines that is faster than the std. routines in 4.2/3/Sun OS 3.5?

One last question, regarding reclamation of file space upon deletion.
Why is this hard to do?  And has anyone added this capability?

-Sundar
(sundar@wheaties.ai.mit.edu)

chris@mimsy.UUCP (Chris Torek) (06/14/88)

In article <16103@brl-adm.ARPA> sundar@wheaties.ai.mit.edu (Sundar
Narasimhan) writes:
>In the bugs section of the man pages [for dbm and ndbm] there is a
>comment indicating that these database files ... cannot be copied,
>tar'ed etc without filling the holes in them.

(Note that this `merely' wastes disk space.  Also, it is not hard
to write a variant of `cp' that recreates holes.)
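
(Such a variant would look something like this---a sketch only, not an
actual program: copy in fixed-size chunks, and lseek() past any chunk
that is entirely zero instead of writing it, so the output file gets a
hole wherever the input had a sufficiently long run of zeros.

	#include <sys/types.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <string.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define	CHUNK	8192		/* ideally the file system block size */

	int
	main(int argc, char **argv)
	{
		char buf[CHUNK], zero[CHUNK];
		int ifd, ofd;
		ssize_t n;

		if (argc != 3) {
			fprintf(stderr, "usage: %s from to\n", argv[0]);
			exit(1);
		}
		if ((ifd = open(argv[1], O_RDONLY)) < 0 ||
		    (ofd = open(argv[2], O_WRONLY|O_CREAT|O_TRUNC, 0666)) < 0) {
			perror("open");
			exit(1);
		}
		memset(zero, 0, sizeof zero);
		while ((n = read(ifd, buf, sizeof buf)) > 0) {
			if (n == CHUNK && memcmp(buf, zero, CHUNK) == 0)
				lseek(ofd, (off_t)n, SEEK_CUR);	/* leave a hole */
			else if (write(ofd, buf, n) != n) {
				perror("write");
				exit(1);
			}
		}
		/* fix the length in case the file ends in a hole */
		if (ftruncate(ofd, lseek(ofd, (off_t)0, SEEK_CUR)) < 0)
			perror("ftruncate");
		exit(0);
	}

The real thing would want better error handling, and would use the
output file's st_blksize instead of a fixed 8192.)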

>Does this comment also apply to the dump program that does file system
>backups? 

(ref V7/4BSD dump+restor[e])  No.

>Is [huge size] normal for ndbm databases?

The .pag files grow more or less by powers of two (imagine a tree with
the nodes numbered thus:

level	      tree		hash mask (in binary)
0:		0			0000
1:	    0	    1			0001
2:	  0   2   1   3			0011
3:       0 4 2 6 1 5 3 7		0111
	 ------- -------
	  these	   these
	 hash to  hash to
	  XXX0	   XXX1

etc.).  In a fresh database, you start at level 0, and to store an
item, you compute a hash number for it, then use no bits; all items
thus wind up in node 0.  When node zero gets full, you `split' to level
1, recomputing the hash for each item in 0; you now use one bit to move
about half the data to node 1.  If node 0 gets full again, you start
using two bits, and (since the things in it all end with a zero bit)
about half the items move to node 2; if node 1 gets full, you start
using two bits there and about half move to 3.  It takes one bit per
node per level to remember what was split; this bit can be named by
(hash & mask) + mask.  If you get (un)lucky with the hash function or
have enough data, the hashing could lean on one path through the tree,
making the apparent database size much larger.

The heart of the whole thing is the loop

	hash = calculate_hash(item);
	for (hash_mask = 0;; hash_mask = (hash_mask << 1) + 1) {
		split_memory_bit = (hash & hash_mask) + hash_mask;
		if (!bit_is_set(split_memory_bit, split_map_table))
			break;
	}
	block = hash & hash_mask;
	/* item is either in node `block' or not there at all */
	/* read that node and search for it sequentially */

and, of course, the hashing function.  (Pseudocode for the split
routine is left as an exercise for the student :-) .)
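
(Oh, all right, a very rough sketch of that split, in the same
pseudocode style as the loop above---`for_each_pair', `move_pair',
and `set_bit' are made-up helpers here, not the real dbm internals:

	split(block, hash_mask)
		long block, hash_mask;
	{
		long new_mask = (hash_mask << 1) + 1;
		long new_block;
		datum key, dat;

		/* remember the split: the bit named (hash & mask) + mask,
		   which for everything in this node is block + hash_mask */
		set_bit(block + hash_mask, split_map_table);

		/* rehash each key with one more mask bit; pairs whose new
		   bit is 1 move to the new node, the rest stay where they are */
		for_each_pair(key, dat, block) {
			new_block = calculate_hash(key) & new_mask;
			if (new_block != block)
				move_pair(key, dat, block, new_block);
		}
	}

If the node is still too full afterwards---everything hashed to the
same side---the insert code simply splits again with the larger mask.)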

In dbm and ndbm, tuples are stored as (key,datum) with `key' being the
item hashed above, and the key and its datum being stored next to each
other.  The format of each node (block) is:

  (s = sizeof(short), o_d => `offset of datum', o_k => `offset of key')
	     0s: item_count (always even)
	     1s: offset of key 0
	     2s: offset of datum 0
	     3s: o_k_1
	     4s: o_d_1
		 ...
	(2n+1)s: o_k_n
	(2n+2)s: o_d_n
		 <free space>
	  o_d_n: datum n
	  o_k_n: key n
		 ...
	  o_d_0: datum 0
	  o_k_0: key 0
		 <end of .pag file block>

The length of any item can be computed as (offset of previous item) -
(offset of this item), where the `offset' of item number -1 is the block
size.
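
To make the layout concrete, here is a sketch of walking one such
block (`walk_block' is just an illustrative name, not part of dbm):

	#include <stdio.h>

	/* print the offset and length of each item in one .pag block */
	void
	walk_block(char *block, int blocksize)
	{
		short *sp = (short *)block;
		int n = sp[0];		/* item_count: keys plus data, always even */
		int i, off, prev;

		prev = blocksize;	/* the ``offset'' of item -1 */
		for (i = 0; i < n; i++) {
			off = sp[i + 1];	/* header entry: offset of item i */
			printf("%s %d: offset %d, length %d\n",
			    i & 1 ? "datum" : "key", i / 2, off, prev - off);
			prev = off;
		}
	}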

>How does this compare with databases created with dbm routines?

In 4.3BSD, dbm is just ndbm with a single static `DBM' variable to hold
the (single) database.  The 4.3BSD ndbm is a tweaked version of the V7
dbm, with the block sizes expanded.  A larger block size means that
fewer splits occur (fewer tree levels used), and larger items can be
stored, but more items must be searched with each lookup.  Fewer splits
are desirable to keep the split table small; if the whole table fits
in memory, only one disk read ever needs to be done to find an item.
Larger items are desirable for the obvious reason.

>Although the size is large, the database is fast for lookups. However,
>I have a need for a search capability that may need to iterate over
>all the records. This tends to be slow. Is there any version of these
>routines that is faster than the std. routines in 4.2/3/Sun OS 3.5?

Not as far as I know.  In my `multi-keyed' dbm, the format of a block
is different.  Items are still hashed as always, but instead of storing
(key,datum) pairs, when an item is to be a key, I store its datum's
hash value and its datum's `in index' in the item's `out hash' and `out
index'; when an item is to be a datum, it acquires an `in index'.
(Actually, all items get in indices always.)  Each item also has a link
(reference) count since it might be used as a datum by any number of
keys.  But the loop I described as the heart of the algorithm must run
twice---once to hash and find the key, and again to find the datum
given the key's `out hash' and `out index'---so lookups generally take
two disk reads instead of one.
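
(For reference, the only way to sweep every record with the stock ndbm
routines is the firstkey/nextkey interface, which visits the keys in
hash order, reading each block as it comes---exactly why a full scan
feels slow.  A minimal sketch, using the poster's database name just
for illustration:

	#include <ndbm.h>
	#include <fcntl.h>
	#include <stdio.h>

	int
	main()
	{
		DBM *db;
		datum key, dat;
		long nrec = 0;

		if ((db = dbm_open("1984/WREQ", O_RDONLY, 0)) == NULL) {
			perror("dbm_open");
			return (1);
		}
		for (key = dbm_firstkey(db); key.dptr != NULL;
		    key = dbm_nextkey(db)) {
			dat = dbm_fetch(db, key);
			/* ... look at dat.dptr / dat.dsize here ... */
			nrec++;
		}
		printf("%ld records\n", nrec);
		dbm_close(db);
		return (0);
	}

No matter how you write that loop, it still has to touch every block.)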

>One last question, regarding reclamation of file space upon deletion. 
>Why is this hard to do?

You would need to `un-split' blocks.  That would not be hard, but
without `ftruncate' you could never shorten the file, and even with
ftruncate, you cannot get the kernel to replace a disk block with a
hole, so it really would not save space after all.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris