[comp.databases] clipper internals

curtin@cbnewse.ATT.COM (C.S.Curtin) (02/23/90)

       
       this posting is for a friend without net access. I will forward
       e-mail and redirect flames > /dev/null.


       dBase/Clipper gurus

       I am doing an application that is having speed problems when
       many  records  are  entered  into  the  database.  The  time
       seek()ing records using the index do not seem to be  linear.
       Sure  with  additional records the seek time should increase
       but the factor seems to be  x**2.   This  generates  several
       questions  for the data base internals gurus to answer.  Are
       there any good books on  the  internals  of  pc  bases  data
       bases? (mainly clipper/dbase)

         1.  How are the index files  structured?   The  data  base
             will  have  approximately 20,000 records and will have
             several indexes active at a time. currently  the  size
             of  the  index  file  is  250K bytes. How is the index
             internally accessed by clipper, since the index cannot
             be  kept  in  core memory?  Does it require clipper to
             access the index file on disk first?

               a.  does anyone have any  hints  on  increasing  the
                   speed of the index access?

               b.  Does index key data type have that  much  to  do
                   with  it?   I  am  assuming  that it just does a
                   string compare between the requested key and the
                   key in the index file.

               c.  What does it cost to index on various data types
                   (ie.   string+int)  or  should  all  indexes  be
                   performed on just string "types" ?  key

         2.  How is  the  skip  command  performed?  What  actually
             happens  when  a skip command is given?  Does the file
             pointer just increment X bytes (whatever the width  of
             the  record)  number  of  bytes, or is the record read
             into the buffer and flushed by clipper?  It appears as
             though the record is read into the buffer and the file
             pointer not just incremented


                     is it?
                             lseek(dbf_fd, (num_skips)*sizeof(record), SEEK_CUR);

                     or?

                             while (num_skips--)
                                     read(dbf_fd, rec, sizeof(struct rec);


       Any assistance on these issues  or  others  related  to  the
       internals  of  clipper/dbase  would  be  appreciated.  I  am
       concerned about the speed of the data access  and  currently
       cannot  see how a system with many records can be maintained
       using either of these databases.

awd@dbase.A-T.COM (Alastair Dallas) (02/27/90)

Congratulations!  You've managed to hit on precisely what my management's
lawyers mean when they speak of "proprietary information."  Sorry for being
flip, but this is in reply to mail that asked question after question
pertaining to the exact nature of Clipper (and by extension dBASE) operations
and there's just no way I can be forthcoming.

I can say that the main speed cost in any PC database system is reading the
disk.  Nothing else (string compare vs numeric compare) comes close to 
affecting the bottom line speed so profoundly as being able to avoid 
"hitting the disk" even once.  Therefore, by keeping your index keys
small you allow the system to pack more of them into a fixed-length block
(dBASE IV supports adjustable block sizes), which ultimately reduces the
number of disk reads (especially for SKIP operations).  If you want to get
really tricky, write code that hashes your key values into a 4-byte long
and index on a UDF that uses this value to build a 4-byte Character string.
That'll let you SKIP 40 times or so without reading another index node.

The other thing I _can_ say is that you might look at Knuth's "Art of 
Computer Programming," Vol. 3: Sorting and Searching.  It describes the
operation of Clipper's and dBASE's indexing in sufficient abstraction so
as not to perturb the lawyers.

Hope it helps.

/alastair/