[comp.databases] Optimal SQL query needed

greg@harvey.qualcomm.com (greg Noel) (03/14/91)

Please forgive me if this is the wrong news group; I'm not normally embroiled
in SQL problems.  If this is not the right newsgroup, I'd appreciate being told
what the correct one is.

I have an Oracle data base table, with three columns, which I will call A, B,
and C.  There are multiple thousands of value A.  For each value A, there are
a handful of records, perhaps as many as a dozen.  Within each value A, value
B is almost always unique (it's really a time value when the record became
active).  If value B is not unique within A, the tie is broken by value C,
which is unique within each A/B combination (it's the timestamp of when the
record was modified).

Here's the problem:  I want to define a view on this table that only consists
of the most recent active record for each value of A.  How do I define this
view such that the lookup is optimal?  Intuitivly, what I would like to tell
the data base engine is to look up all records with the same value of A, riffle
through them, and select the one with the largest B.C value.

What I am working with right now is
	select * from TBL T1 where A = 'value' and B =
		(select MAX(B) from TBL T2 where T2.B = T1.B)
which works except for the rare case when there is more than one maximum value
of B, which I handle programmatically.  This query is also not optimal as it
requires two indexed lookups within TBL.

Is there a way to do do the query more optimally or to do the whole query in
one request, so that the execptions don't need to be handled by the program?

Tks,

miket@blia.sharebase.com (Mike Tossy) (03/15/91)

In article <1991Mar13.224746.5293@qualcomm.com>, greg@harvey.qualcomm.com (greg Noel) writes:
> ...
> I have an Oracle data base table, with three columns, which I will call A, B,
> and C.  There are multiple thousands of value A.  For each value A, there are
> a handful of records, perhaps as many as a dozen.  Within each value A, value
> B is almost always unique (it's really a time value when the record became
> active).  If value B is not unique within A, the tie is broken by value C,
> which is unique within each A/B combination (it's the timestamp of when the
> record was modified).
> 
> Here's the problem:  I want to define a view on this table that only consists
> of the most recent active record for each value of A.  How do I define this
> view such that the lookup is optimal?  Intuitivly, what I would like to tell
> the data base engine is to look up all records with the same value of A, riffle
> through them, and select the one with the largest B.C value.
> 
> What I am working with right now is
> 	select * from TBL T1 where A = 'value' and B =
> 		(select MAX(B) from TBL T2 where T2.B = T1.B)
> which works except for the rare case when there is more than one maximum value
> of B, which I handle programmatically.  This query is also not optimal as it
> requires two indexed lookups within TBL.
> 
> Is there a way to do do the query more optimally or to do the whole query in
> one request, so that the execptions don't need to be handled by the program?
> 
> Tks,

This is a fairly common problem.  I believe it is caused by a database design
problem.  The problem is that B (and C) tells when a particular record became
valid, but the time when the record became invalid is stored in another record
(or implied by the absence of any record with a higher time stamp).  Notice
that your SQL demonstrates this: it joins the table to itself to create the
record you need.  

I suggest that the solution is to add another column B'.  Column B is the
time the record became active and B' is the time the record became inactive
(was replaced by a more current record).  The active row should have a null
value (or a VERY LARGE value) for B'. Depending on your system, this may
elimate the need for C or you may have to fix C in an analogous way.  Your SQL
becomes (probably not exact Oracle syntax):

 	select * from TBL T1 where A = 'value' 
	   and B <= now() and B' = null;
or
 	select * from TBL T1 where A = 'value' 
	   and B <= now() and now() <= B';

This should be a very fast query and it is easy to build the view you want.

This solution is not without cost.  Inserting a new record becomes a longer
process - you must update the old current record also.  But, the update
should be very fast.  The database does becomes larger.  But, before you reject
this solution out of hand because of the "extra disk space" required.  Do
yourself a favor and do a back of the envolope calculation of how much extra
space is required.  Disk space is cheap.  If it really is only "multiple
thousands" of values of A, you will problably never even notice the extra disk
space utilized by this solution.

I suggest that my solution is the correct one from a database design point
of view and you should only use your solution for an overwhelming reason.

--
      >>>>>>  The above statements are only my opinions <<<<<<

Mike Tossy					ShareBase Coropration
miket@sharebase.com				14600 Wichester Blvd
(408) 378-7575 ext2200				Los Gatos, CA 95030
	(ShareBase is a subsidiary of Teradata Corportation)

manatt@sirius.llnl.gov (Doug Manatt) (03/16/91)

In article <13659@blia.sharebase.com>, miket@blia.sharebase.com (Mike Tossy) writes:
>In article <1991Mar13.224746.5293@qualcomm.com>, greg@harvey.qualcomm.com (greg Noel) writes:
>> ...
>>Intuitivly, what I would like to tell
>>the data base engine is to look up all records with the same value of A riffle
>> through them, and select the one with the largest B.C value.
>>...
>> Is there a way to do do the query more optimally or to do the whole query in
>> one request, so that the execptions don't need to be handled by the program?
>...
>I suggest that the solution is to add another column B'.  Column B is the
>time the record became active and B' is the time the record became inactive


	The quick and cheap solution is to add a 1 byte boolean column that
	tells only if the record is currently valid or invalid.  In our
	design, many of our timestamped entries have a supercede column
	that takes on the two values ("T" or "F") and when an existing entry
	is updated all "F"s are changed to "T"s before the insert.


		Doug Manatt
		Lawrence Livermore Natl Lab
		Human -- (415) 422-7257  FAX -- (415) 422-3160
		manatt@lll-winken.llnl.gov
--
		Doug Manatt
		Lawrence Livermore Natl Lab
		Human -- (415) 422-7257  FAX -- (415) 422-3160
		manatt@lll-winken.llnl.gov

greg@harvey.qualcomm.com (Greg Noel) (03/21/91)

I wrote:
>I have an Oracle data base table, with three columns, which I will call A, B,
>and C.  ....  I want to define a view on this table that only consists
>of the most recent active record for each value of A.  How do I define this
>view such that the lookup is optimal?  Intuitively, what I would like to tell
>the data base engine is to look up all records with the same value of A,
>riffle through them, and select the one with the largest B.C value.

I received several replies to this; my thanks for the interest.  Here's what
was suggested:

J. Pitt <jmp@stan.xx.swin.OZ.AU> (isn't OZ.AU redundant???) asks:
>Have you tried the obvious ?
>	SELECT A, B, MAX(C)
>	FROM   TBL
>	GROUP BY A,B;
>(include other columns as required of course)

Unfortunately, it's the last line that's the killer -- it doesn't seem to
be possible to include other columns, except as summary functions (such as
the MAX or MIN functions).  What I need is the record corresponding to
MAX(B).MAX(C).

Mike Tossy <miket@blia.sharebase.com> suggests adding another column B' that
contains the end of the range for which the record is valid.  He notes:
>This solution is not without cost.  Inserting a new record becomes a longer
>process - you must update the old current record also.  But, the update
>should be very fast.  ....

In a similar vein, someone at U.Iowa <cmdkrj@pmvax.weeg.uiowa.edu> and Doug
Manatt <manatt@sirius.llnl.gov> both suggest adding a flag to the table that
marks the current record.

These are nice solutions, but ones that won't work in this case.  Originally,
we had a historical table containing date ranges and a current table with the
most-recent value.  It was a maintenance nightmare, and this design is an
attempt to eliminate some of those difficulties.  (Mike worries that we might
be put off by the extra disk space involved; that's the least of our worries!)
There are two reasons why these won't work:
    -	I'm not the data base administrator; I can't add another column just
	to improve the processing of a single program.  And because of the
	maintenance hassle described in the next point, it's probably not
	worth the complexity elsewhere in the system.
    -	There are several programs that update this table, and because of the
	order in which the data may be processed, the update is not always to
	the "current" record -- that is, sometimes it is the history that is
	being changed.  The updating programs must be very careful to notice
	when they are not updating the current record and correctly splice in
	the dates.
I do agree with the point Mike makes that this is the "true" solution to the
data base problem; we may consider adopting it at a later date.  Cmdkrj also
quotes Codd, "The only time not to normalize is when you can make an efficiency
gain."  I agree, and this might be one of the times.

Lastly, Edward Screven <escreven@us.oracle.com> suggests:
>    select A, B, C from T t1
>    where not exists
>    (select 'x' from T t2
>     where t2.A =  t1.A
>       and (t2.B > t1.B or
>            (t2.B >= t1.B and 
>             t2.C >  t1.C)));

I've looked at this until I'm cross-eyed and still have no intuition as to
why this works, but it does.  It correctly selects the most-recent record.
It's not optimal in that it joins the table with itself (requiring two indexed
lookups to select on a value from column A), but it does capture both of the
constraints from columns B and C in a single sub-query.

For the moment, I'm using this last solution, but once things settle down a
bit from this conversion, I plan to examine a solution involving date ranges.

Thanks for all the response!
-- 
Greg Noel, Unix Guru         greg@qualcomm.com  or  greg@noel.cts.com