sommar@enea.se (Erland Sommarskog) (06/11/89)
Any RDB gurus out there who have the time to give me a hint? I hope so. (Please post to comp.os.vms or mail me. I don't read comp.databases.)

In our system we have one relation that is fairly big, 430000 records, and it will probably grow to 600000-700000 by the end of the year. (Then we can unload all records from 1988.) In everyday usage records are only added, never modified or erased. But they are read a lot, and that's where our problems begin.

The records are 102 bytes each, and the fields of interest here are:

    Customer id:   integer, 4 bytes
    Depot number:  integer, 2 bytes
    Date:          text, 6 bytes
    Time:          text, 8 bytes

To simplify, I'll refer to the first two together as id or customer, and the last two together as time. The query I want to speed up is:

    FOR a IN rel WITH Cust_id = a.cust_id
                 AND  Depot = a.depot_no
                 AND  (Date < a.date OR (Date = a.date AND Time < a.Time))
    SORTED BY a.date DESCENDING, a.time DESCENDING

This gives us the latest records for a customer, and when we press "previous page" we get the records before those, and so on. Also of interest is the analogue:

    FOR a IN rel WITH Cust_id = a.cust_id
                 AND  Depot = a.depot_no
                 AND  (Date > a.date OR (Date = a.date AND Time > a.Time))
    SORTED BY a.date ASCENDING, a.time ASCENDING

This gives us the records starting from a certain date, but the most interesting thing is that this query gives us a hint about what we could gain from adding negative time fields to get away from the descending sort.

The number of records varies a lot from customer to customer. The average is 500 records, but one customer (26) has 11200 records. Number two has only half of that, number three is down to 3300 records, and so on. For a "normal" customer there's no problem; the answer is there within a few seconds. (And those seconds are mostly TDMS.) But for customer 26 the response time is 15-30 seconds, or even up to two minutes when the system is loaded. And this could be acceptable, but... of course, the customer most often looked at is this one. The second query is answered more rapidly, but still takes up to a minute under load, so a negative time field alone wouldn't suffice, it seems.

We have a sorted index on id and time. Duplicates are for some reason allowed, although there are no duplicates and there shouldn't be any. Node size is 1900, page size is 4 blocks, and fill is set to 85%. The index is in one storage area and the data in another, both with mixed page format.

I thought the obvious thing was to use the "PLACEMENT VIA index" feature, so I defined an index on the id only and used this index as the placement index. (Including the time part here didn't seem like a good idea to me, or should it be included?) I set node size to 900, fill to 70%, and page size to 2 blocks. I placed customers with low ids in a separate area, since the most commonly looked-at customers seemed to be those below 300. I used uniform page format for all areas.

The gain was unnoticeable. Or even worse, since I couldn't use PCA any longer to measure performance. (I got strange errors from SORT/MERGE; probably I ran out of some resource.) Still, a "normal" customer gave an instant answer, whereas customer 26 caused a half-minute delay. One thing I noticed was that with this latter configuration the unique index wasn't used for the forward query; the placement index was used there as well, which would mean that ascending or descending no longer matters for performance. It should be added that the areas were sized very tightly, which could be one reason for the bad result.
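To make the negative time field idea concrete: the thought is to store an extra field, say rev_stamp (the name is just for illustration), built by taking the nines' complement of every digit of date and time, so that descending date/time order becomes ascending order on rev_stamp. E.g. the date "890611" would become "109388"; any separators in the time text can be left alone, assuming the field has a fixed layout, since equal bytes in equal positions don't affect the ordering. With rev_stamp at the end of the sorted index, the backward query would collapse to a single range predicate and an ascending sort, roughly:

    FOR a IN rel WITH Cust_id = a.cust_id
                 AND  Depot = a.depot_no
                 AND  Rev_stamp > a.rev_stamp
    SORTED BY a.rev_stamp ASCENDING

But as I said, the ascending query isn't fast enough either, so this alone probably wouldn't save customer 26.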
The areas were tight because I'm using a test copy of the full-scale database here at our development site, and I wanted to be nice to those on the same disk as the database.

So what to do? I have some more or less desperate ideas, like cutting the relation in two: one small relation with the records from the last few weeks and one with the rest. But I'll stay away from that if I can.

Any ideas, hints, advice etc. to improve the situation are appreciated.
--
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se
Bowlers on strike!