[net.unix-wizards] 'dbm' hash table software

massar@think.UUCP (07/15/86)

Does anyone out there understand dbm?

The manual page says:

   The sum of the sizes of a key/content pair must not exceed the internal
   block size (currently 1024 bytes).  Moreover all key/content pairs that
   hash together must fit on a single block.  Store will return an error in
   the event that a disk block fills with inseparable data.

However, after looking at the source and having had some experience using
dbm I am not convinced that the last sentence is really implemented.
Further, if it is implemented, what exactly does it mean?  Will my
already existent data be destroyed as dbm 'fills the disk block with
inseparable data' or will dbm not write the data that would cause the
disk block to fill up?

The reason I am suspicious is that after creating a large hash table
and apparently not getting any errors, certain entries in the database
don't exist, for no reason I understand.

While I am at it, does anyone know of a dbm-like implementation without
this limitation on filling up disk blocks?

-- 
-- JP Massar, Thinking Machines Corporation, Cambridge, MA
-- ihnp4!think!massar, massar@think.com
-- 617-876-1111

jordan@nike.UUCP (07/16/86)

	The reason I am suspicious is that after creating a large hash
	table and apparently not getting any errors, certain entries in
	the database don't exist, for no reason I understand.

dbm was a hack to get 4.2 out the door -- 4.3 has a new implementation
that does the right thing (i.e., dbm_open() returns a DBM *) and the size
restriction was raised to 4096 bytes ... however, I think there
are still some inconsistancies in the man page.

/jordan

chris@umcp-cs.UUCP (07/17/86)

In article <385@nike.UUCP> jordan@nike.UUCP (Jordan Hayes) writes:
>dbm was a hack to get 4.2 out the door ...

Not quite: dbm was in V7, and was apparently written by Ken Thompson
in some connection with a chess program.

>4.3 has a new implementation that does the right thing (i.e.,
>dbm_open() returns a DBM *) and the size restriction was raised to
>4096 bytes ...

Why is this `the right thing'?

My `mdbm' implementation allows both the data and the bitmap block
size to be specified at database creation.  A larger data block
allows storing larger items and/or more that hash to the same value,
but also decreases `store' performance.

In any case, to answer the original question, when store() runs
out of room in a data block, it splits it by using one more hash
bit.  This normally results in two nearly-half-full blocks, and
leaves enough room for the new item.  In pathological cases, dbm
may have to split blocks some number of times before discovering
that there is no way to fit the new item into its eventual destination.
This happens only when several large items hash to the same value,
which is rather rare, and in this case the new item is left out of
the database and an error is returned.  The database might then
appear extremely large (as much as 2^32 bytes), but should still
be properly formed internally.

Of course, if the operating system rejects any particular transfer,
for whatever reason, the database can become corrupted.  In such
cases store should also return an error.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

shannon@sun.uucp (Bill Shannon) (07/19/86)

> dbm was a hack to get 4.2 out the door

dbm has nothing to do with getting 4.2 out the door.  dbm has
been around since V7, although AT&T seems to have lost their
copy...

ian@sq.UUCP (07/21/86)

In article <5273@sun.uucp> shannon@sun.uucp (Bill Shannon) writes:
>> dbm was a hack to get 4.2 out the door
>
>dbm has nothing to do with getting 4.2 out the door.  dbm has
>been around since V7, although AT&T seems to have lost their
>copy...

AT&T (USG) never got a copy from Bell, since System III -> V -> Vr2 -> \(if
doesn't come from V7, but from V6. (I think we've been over this before;
apologies to those who've heard it before :=) )

guy@sun.uucp (Guy Harris) (07/22/86)

> AT&T (USG) never got a copy (of "dbm") from Bell, since System III -> V
> -> Vr2 -> \(if doesn't come from V7, but from V6. (I think we've been
> over this before; apologies to those who've heard it before :=) )

If UNIX/TS 1.0 -> PWB/UNIX 2.0 -> UNIX 3.0 -> UNIX 3.0.1 (S3) -> ... came
from V6, how come it has:

	1) The V7 file system

	2) The V7 "stat" structure

	3) The V7 "lseek" system call

	4) Environment variables

	5) The Bourne shell

etc., when V6 had *none* of those?  The USG/USDL UNIX series came from a
UNIX somewhere between what was sent out the door as V6 and what was sent
out the door as V7; it was closer to V7 than to V6, though.  (The fact that
it lacked "dbm" and a few other things is hardly as important as the fact
that it had the V7 features listed above.)

I'm not sure what you mean by "AT&T" and "Bell" here.  It's not relevant,
though; the Research group that did V6, V7, etc. and the USG/USDL that did
S3, S5, etc. were both part of Bell Laboratories, so they *could* have
gotten a copy of "dbm" if they'd wanted to.  The pipeline between Research
and USG/USDL was less than smooth (consider that a bug in "fgrep", fixed in
an addendum to V7, is *still* not fixed in S5!), but it could conceivably
have been made to flow better.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)