[comp.object] Query optimization in OODB

zmhasan@watdragon.waterloo.edu (Ziaul Masum Hasan) (10/05/89)

Could any one out there explain me why the use of a method in a query         
predicate (<attribute-name operator value>), as an attribute-name or
as an operator, cause difficulties with query compilation and optimization ?

M Hasan.

jack@odi.com (Jack Orenstein) (10/06/89)

In article <16958@watdragon.waterloo.edu> zmhasan@watdragon.waterloo.edu (Ziaul Masum Hasan) writes:
>Could any one out there explain me why the use of a method in a query         
>predicate (<attribute-name operator value>), as an attribute-name or
>as an operator, cause difficulties with query compilation and optimization ?

Query optimization depends on the use of pre-computed results, usually
in the form of an index. For example, suppose you have a set of
people, indexed by name. Then the name index is a data structure that
provides for rapid search on name, and returns a reference to one
person, (or more, if names are not unique). In order to be accurate,
the name index must be kept up to date. If a new person is added to
the set of people, a new entry is needed in the index. If a person's
name changes, this must be reflected in the index.

It is the need to keep an index up to date that makes it difficult to
index on the output from a method or function. Suppose that the same
set of people is indexed by income(), where income() is a function
computed from various pieces of information associated with each
person. It is difficult to predict when the output from the income()
function changes as a result of changes to information about a person.
For example, income() is sensitive to changes in interest rates given
by banks where the person has an account. In order for the index to be
accurate, it would have to be updated in response to each such change.

Another difficulty is due to functions whose value changes rapidly.
Even if the index could, in principle, be kept up to date, it would
not be feasible to do so. E.g., an index on age(), where age is
computed from a stored birthdate and a time-of-day clock would have to
be updated for each clock tick (unless the index maintainer is very
smart). 

If all dependencies can be located, e.g. by analyzing the source code
of the income() function, and of functions it calls, then index
updating would be possible, but I don't know of anyone doing this in
the general case. (Also, this approach just doesn't work if the
function value depends on random numbers, input, or anything not
stored in the database.) Indexes on stored values, which may be
references to other objects, are much easier to update since the
relevant updates are fairly obvious. For further information, see the
paper by Maier & Stein (Stein & Maier?) in the book edited by Shriver
& Wegner (Wegner & Shriver?) and the paper by Shekita and Carey from
the EXODUS group in the 1989 SIGMOD proceedings. These papers talk
about index updating more than optimization.

As for a method appearing as an operator: this is less of a problem,
as long as certain properties are guaranteed. For example, =, <, and >
can be defined for rational numbers, and an index keyed by instances
of a rational number class can be created. As long as the optimizer
knows (because the class definer asserts it) that the functions
implementing =, <, and > for rational numbers behave "as expected",
optimization can proceed. See some of the early POSTGRES papers for
more information along these lines.

Finally, I'm not sure what you mean by difficulties in compilation.
Optimization is the hard problem, but if the optimizer gets stuck, a
"brute-force" strategy can be used.

Jack Orenstein
Object Design, Inc.

mao@eden (Mike Olson) (10/10/89)

In <1989Oct6.153505.24492@odi.com>, jack@odi.com (Jack Orenstein) writes

> As for a method appearing as an operator: this is less of a problem,
> as long as certain properties are guaranteed. For example, =, <, and >
> can be defined for rational numbers, and an index keyed by instances
> of a rational number class can be created. As long as the optimizer
> knows (because the class definer asserts it) that the functions
> implementing =, <, and > for rational numbers behave "as expected",
> optimization can proceed. See some of the early POSTGRES papers for
> more information along these lines.

mr. orenstein is correct, but he raises an interesting point that some
readers may have overlooked.  in order to optimize queries using <, the
optimizer has to understand what < means.  this isn't a problem for the
operators most vendors supply -- rationals have several well-defined partial
orderings, for example.

this isn't always true, though, and therefore it's not always easy to
optimize queries using methods (or comparison operators).  for example,
the initial postgres rtree index implementation provides operators for
"box contains", "box is contained by", "box is left {or right or above or
below} box," and so on.  making the optimizer smart enough to deal with
all this is a lot of trouble, and is actually a bad thing.  the idea is that
users can write their own access methods, and you don't want users hacking
your optimizer.

in general, the "certain properties" you need to guarantee for optimizablity
are:
	+  operator is a partial ordering on the relation;
	+  you're willing to build knowledge of the ordering into
	   your optimizer.

also, it helps a lot if your index has some useful adjacency properties, but
you can survive without them.

since you have to build all these smarts into your access method, it's a
lot of trouble to put them into the optimizer, as well.

					mike olson
					postgres research group
					uc berkeley
					mao@postgres.Berkeley.EDU

scholl@inf.ethz.ch (Marc Scholl) (10/10/89)

In article <16958@watdragon.waterloo.edu> zmhasan@watdragon.waterloo.edu (Ziaul Masum Hasan) writes:
>Could any one out there explain me why the use of a method in a query         
>predicate (<attribute-name operator value>), as an attribute-name or
>as an operator, cause difficulties with query compilation and optimization ?
>
>M Hasan.

In addition to Jack Orenstein's response that has been focussing around
indexing of "encapsulated attributes" (those that are implemented as
methods):

A further problem for query optimization stems from the fact that
typical (relational) optimizers assume the cost for predicate
evaluation to be zero, i.e., once we have loaded the tuple no further
costs will be incured by accessing the "<attribute>" named nor by evaluating
the predicate "<operator>". However, if either of them is a (user supplied)
method, this is no longer true. Hence, a smart OODB query optimizer should
estimate the cost of these actions, too. Of course, this means that the
implementor of the correspondsing methods has to provide cost estimation
functions to assist the optimizer. (This is another example for the need
of extensible optimizers.)

In the extreme, such situations could make the optimizer postpone
a selection past a join (if the join is quite selective and the selection
predicate is expensive to evaluate). This is obviously in contrast to
the well-known heuristics of pushing selections dwon the expression tree.

-Marc