sommar@enea.se (Erland Sommarskog) (06/11/89)
Any RDB gurus out there who have the time to give me a hint? I hope so. (Please post to comp.os.vms or mail me. I don't read comp.databases.)

In our system we have one relation that is fairly big, 430000 records, and it will probably grow to 600000-700000 by the end of the year. (Then we can unload all records from 1988.) In everyday usage records are only added, never modified or erased. But they are read a lot, and that's where our problems begin.

The records are 102 bytes each, and the fields of interest here are:

    Customer id:   integer, 4 bytes
    Depot number:  integer, 2 bytes
    Date:          text, 6 bytes
    Time:          text, 8 bytes

To simplify, I'll refer to the first two together as id or customer, and the last two together as time. The query I want to speed up is:

    FOR a IN rel WITH Cust_id = a.cust_id
                 AND  Depot = a.depot_no
                 AND  (Date < a.date OR (Date = a.date AND Time < a.Time))
    SORTED BY a.date DESCENDING, a.time DESCENDING

This gives us the latest records for a customer, and when we press "previous page" we get the records before those, and so on. Also of interest is the analogue:

    FOR a IN rel WITH Cust_id = a.cust_id
                 AND  Depot = a.depot_no
                 AND  (Date > a.date OR (Date = a.date AND Time > a.Time))
    SORTED BY a.date ASCENDING, a.time ASCENDING

This gives us the records starting from a certain date, but the most interesting thing is that this query gives us a hint about what we could gain from adding negative time fields to get away from the descending sort.

The number of records varies a lot from customer to customer. The average is 500 records, but one customer (26) has 11200 records. Number two has only half of that, number three is down to 3300 records, and so on. For a "normal" customer there's no problem; the answer is there within a few seconds. (And those seconds are mostly TDMS.) But for customer 26 the response time is 15-30 seconds, or even up to two minutes when the system is loaded. And this could be acceptable, but... of course, the customer most often looked at is this one. The second query is answered more rapidly, but still takes up to a minute under load, so a negative time field alone wouldn't suffice, it seems.

We have a sorted index on id and time. Duplicates are for some reason allowed, although there are no duplicates and there shouldn't be any. Node size is 1900, page size is 4 blocks, and fill is set to 85%. The index is in one storage area and the data in another, both with mixed page format.

I thought the obvious thing was to use the "PLACEMENT VIA index" feature, so I defined an index on the id only and used this index as the placement index. (Including the time part here didn't seem like a good idea to me, or should it be included?) I set node size to 900, fill to 70%, and page size to 2 blocks. I placed customers with low ids in a separate area, since the most commonly looked-at customers seemed to be those below 300. I used uniform page format for all areas.

The gain was unnoticeable. Or even worse, since I couldn't use PCA any longer to measure performance. (I got strange errors from SORT/MERGE; probably I ran out of some resource.) Still, a "normal" customer gave an instant answer, whereas customer 26 caused a half-minute delay. One thing I noticed was that with this latter configuration the unique index wasn't used for the forward query; the placement index was used there as well, which would mean that ascending or descending no longer matters for performance. It should be added that the areas were sized very tightly, which could be one reason for the bad result.
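To make the negative time field idea concrete: the thought is to store an extra field, say rev_stamp (the name is just for illustration), built by taking the nines' complement of every digit of date and time, so that descending date/time order becomes ascending order on rev_stamp. E.g. the date "890611" would become "109388"; any separators in the time text can be left alone, assuming the field has a fixed layout, since equal bytes in equal positions don't affect the ordering. With rev_stamp at the end of the sorted index, the backward query would collapse to a single range predicate and an ascending sort, roughly:

    FOR a IN rel WITH Cust_id = a.cust_id
                 AND  Depot = a.depot_no
                 AND  Rev_stamp > a.rev_stamp
    SORTED BY a.rev_stamp ASCENDING

But as I said, the ascending query isn't fast enough either, so this alone probably wouldn't save customer 26.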
The areas were tight because I'm using a test copy of the full-scale database here at our development site, and I wanted to be nice to those on the same disk as the database.

So what to do? I have some more or less desperate ideas, like cutting the relation in two: one small relation with the records from the last few weeks and one with the rest. But I'll stay away from that if I can.

Any ideas, hints, advice etc. to improve the situation are appreciated.
--
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se
Bowlers on strike!