[comp.lang.c] Low-Level Database File Formats

RMC100@psuvm.psu.edu (Randy Carraghan) (02/06/90)

This post is probably more appropriate for comp.databases, but it seems the
best hackers 'live' in this group, so...  Has anybody had much experience
tearing apart the structures of popular PC databases (R:base, dBASE, Paradox)?
I've spent a fair amount of time deciphering the R:base file format and will
soon move on to dBASE.  From what I've seen in Jeff Walden's "File Formats
for Popular PC Software" (1986 edition), the dBASE II and III file formats
seem pretty straightforward, but I haven't seen the specs for III+ or IV yet.
Are there other, more up-to-date references for these versions and/or Paradox?
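
For concreteness, here's roughly how I'd read a III-style header, going by
the layout in Walden's book -- I haven't verified any of these offsets
against III+ or IV files, so treat them as assumptions to check:

    #include <stdio.h>

    /* Sketch of reading a dBASE III .dbf header.  Offsets are from
     * published descriptions of the III format, unverified by me
     * against III+/IV.  Bytes are picked apart by hand to avoid
     * struct packing and byte-order surprises across machines.
     */
    static unsigned long get_u32(const unsigned char *p)
    {
        return (unsigned long)p[0] | ((unsigned long)p[1] << 8) |
               ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
    }

    static unsigned get_u16(const unsigned char *p)
    {
        return (unsigned)p[0] | ((unsigned)p[1] << 8);
    }

    int dump_dbf_header(FILE *fp)
    {
        unsigned char hdr[32];

        if (fread(hdr, 1, sizeof hdr, fp) != sizeof hdr)
            return -1;
        printf("version byte:  0x%02x\n", hdr[0]);        /* 0x03 = dBASE III */
        printf("last update:   %d/%d/%d (MM/DD/YY)\n",
               hdr[2], hdr[3], hdr[1]);
        printf("record count:  %lu\n", get_u32(hdr + 4)); /* little-endian */
        printf("header bytes:  %u\n",  get_u16(hdr + 8));
        printf("record bytes:  %u\n",  get_u16(hdr + 10));
        return 0;
    }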

Does anybody know what type of encryption dBASE uses to protect its files?
R:base's protection was very simple, but I imagine dBASE will be more of a
challenge (perhaps not even breakable if DES or some similar technique is
used).  I'd guess, though, that the protection is something much less
CPU-intensive, since speed is a primary consideration for a database.
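
Just to illustrate the sort of cheap scheme I have in mind -- and this is
pure speculation on my part, not a claim about what dBASE actually does --
a repeating-key XOR over each record costs almost nothing per byte and is
its own inverse:

    #include <stddef.h>

    /* Hypothetical record masking: XOR with a repeating key.
     * Applying it a second time with the same key restores the
     * original bytes.  Illustrative only; dBASE's real scheme
     * is exactly what I'm asking about.
     */
    void xor_mask(unsigned char *buf, size_t len,
                  const unsigned char *key, size_t keylen)
    {
        size_t i;

        for (i = 0; i < len; i++)
            buf[i] ^= key[i % keylen];
    }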

My final question concerns the use of B+ trees.  Keys in the R:base index
file are four bytes.  If the key field is an integer/date/time, its value is
the key.  If the key field is a string, the lowest-order two bytes are its
first two characters and the highest-order two bytes are a function of the
remaining 3..n characters of the string.  If there is a collision, the
duplicates are stored in another block.  The scheme for key definition seems
very efficient: only four bytes are used, and for strings sharing the same
first two characters there is still 65K+ resolution in the remaining two
bytes.  I was under the impression some databases used up to 30 characters
(just straight ASCII) as the key, but that would allow far fewer keys per
block.  Is anyone aware of other popular techniques employed?
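
In C, the string-key construction I described would look something like the
sketch below; note that the 16-bit mixing function is my own stand-in, since
I haven't pinned down the function R:base actually applies to characters 3..n:

    #include <string.h>

    /* Build a 4-byte index key in the style described above:
     * key[0..1] hold the first two characters, key[2..3] hold a
     * 16-bit hash of the rest.  The multiply-and-add hash is a
     * placeholder, not R:base's real function.  Storing the hash
     * high byte first lets a plain memcmp() over the four bytes
     * order keys by the leading characters first, then the hash.
     */
    void make_key(const char *s, unsigned char key[4])
    {
        unsigned hash = 0;
        size_t i, n = strlen(s);

        key[0] = (n > 0) ? (unsigned char)s[0] : 0;
        key[1] = (n > 1) ? (unsigned char)s[1] : 0;
        for (i = 2; i < n; i++)
            hash = hash * 31 + (unsigned char)s[i];
        key[2] = (unsigned char)(hash >> 8);
        key[3] = (unsigned char)(hash & 0xff);
    }
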
Also, it's difficult to tell what parameters a particular application uses
to construct its B+ tree (when to split, how keys are distributed on a split,
key movement as in Knuth's B* tree, etc.), but it seems that if you were to
recreate a B-tree for an application using your own technique, it could
still be used by their algorithm.  Likewise, a B-tree created by their
application could still be read by your own.  That is, two algorithms which
build their trees in two different ways (resulting in two distinct trees)
could still use, reference, and 'grow' each other's B-trees.  Is this the
case, or am I missing some points?

And finally (I know, I said I was done before...), what is generally the
best sorting method to use for large (disk-based) files?  Does anyone know
what the commercially implemented algorithms are?  I've read about so many
different techniques (an internal merge or tournament sort to form runs,
followed by one of many varieties of merge sort to disk).
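
For the merge phase, the simplest picture I have is a k-way merge over the
sorted runs, with a selection (tournament) tree replacing the linear minimum
scan once k gets large.  A sketch, assuming line-oriented records for
simplicity:

    #include <stdio.h>
    #include <string.h>

    #define MAXRUNS 16
    #define MAXLINE 256

    /* Merge k sorted run files into one output by repeatedly
     * emitting the smallest current line.  The linear scan over
     * the k candidates stands in for the tournament tree the
     * literature recommends; with small k it hardly matters.
     * Caller passes k <= MAXRUNS open, already-sorted run files.
     */
    void merge_runs(FILE *runs[], int k, FILE *out)
    {
        char line[MAXRUNS][MAXLINE];
        int live[MAXRUNS], i, best;

        for (i = 0; i < k; i++)       /* prime one record per run */
            live[i] = (fgets(line[i], MAXLINE, runs[i]) != NULL);

        for (;;) {
            best = -1;
            for (i = 0; i < k; i++)   /* smallest live line wins */
                if (live[i] &&
                    (best < 0 || strcmp(line[i], line[best]) < 0))
                    best = i;
            if (best < 0)
                break;                /* all runs exhausted */
            fputs(line[best], out);
            live[best] =
                (fgets(line[best], MAXLINE, runs[best]) != NULL);
        }
    }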

Thanks for any help/advice/references!

Randy Carraghan (rmc100@psuvm.bitnet)