zmhasan@watdragon.waterloo.edu (Ziaul Masum Hasan) (10/05/89)
Could any one out there explain me why the use of a method in a query predicate (<attribute-name operator value>), as an attribute-name or as an operator, cause difficulties with query compilation and optimization ? M Hasan.
jack@odi.com (Jack Orenstein) (10/06/89)
In article <16958@watdragon.waterloo.edu> zmhasan@watdragon.waterloo.edu (Ziaul Masum Hasan) writes: >Could any one out there explain me why the use of a method in a query >predicate (<attribute-name operator value>), as an attribute-name or >as an operator, cause difficulties with query compilation and optimization ? Query optimization depends on the use of pre-computed results, usually in the form of an index. For example, suppose you have a set of people, indexed by name. Then the name index is a data structure that provides for rapid search on name, and returns a reference to one person, (or more, if names are not unique). In order to be accurate, the name index must be kept up to date. If a new person is added to the set of people, a new entry is needed in the index. If a person's name changes, this must be reflected in the index. It is the need to keep an index up to date that makes it difficult to index on the output from a method or function. Suppose that the same set of people is indexed by income(), where income() is a function computed from various pieces of information associated with each person. It is difficult to predict when the output from the income() function changes as a result of changes to information about a person. For example, income() is sensitive to changes in interest rates given by banks where the person has an account. In order for the index to be accurate, it would have to be updated in response to each such change. Another difficulty is due to functions whose value changes rapidly. Even if the index could, in principle, be kept up to date, it would not be feasible to do so. E.g., an index on age(), where age is computed from a stored birthdate and a time-of-day clock would have to be updated for each clock tick (unless the index maintainer is very smart). If all dependencies can be located, e.g. by analyzing the source code of the income() function, and of functions it calls, then index updating would be possible, but I don't know of anyone doing this in the general case. (Also, this approach just doesn't work if the function value depends on random numbers, input, or anything not stored in the database.) Indexes on stored values, which may be references to other objects, are much easier to update since the relevant updates are fairly obvious. For further information, see the paper by Maier & Stein (Stein & Maier?) in the book edited by Shriver & Wegner (Wegner & Shriver?) and the paper by Shekita and Carey from the EXODUS group in the 1989 SIGMOD proceedings. These papers talk about index updating more than optimization. As for a method appearing as an operator: this is less of a problem, as long as certain properties are guaranteed. For example, =, <, and > can be defined for rational numbers, and an index keyed by instances of a rational number class can be created. As long as the optimizer knows (because the class definer asserts it) that the functions implementing =, <, and > for rational numbers behave "as expected", optimization can proceed. See some of the early POSTGRES papers for more information along these lines. Finally, I'm not sure what you mean by difficulties in compilation. Optimization is the hard problem, but if the optimizer gets stuck, a "brute-force" strategy can be used. Jack Orenstein Object Design, Inc.
mao@eden (Mike Olson) (10/10/89)
In <1989Oct6.153505.24492@odi.com>, jack@odi.com (Jack Orenstein) writes > As for a method appearing as an operator: this is less of a problem, > as long as certain properties are guaranteed. For example, =, <, and > > can be defined for rational numbers, and an index keyed by instances > of a rational number class can be created. As long as the optimizer > knows (because the class definer asserts it) that the functions > implementing =, <, and > for rational numbers behave "as expected", > optimization can proceed. See some of the early POSTGRES papers for > more information along these lines. mr. orenstein is correct, but he raises an interesting point that some readers may have overlooked. in order to optimize queries using <, the optimizer has to understand what < means. this isn't a problem for the operators most vendors supply -- rationals have several well-defined partial orderings, for example. this isn't always true, though, and therefore it's not always easy to optimize queries using methods (or comparison operators). for example, the initial postgres rtree index implementation provides operators for "box contains", "box is contained by", "box is left {or right or above or below} box," and so on. making the optimizer smart enough to deal with all this is a lot of trouble, and is actually a bad thing. the idea is that users can write their own access methods, and you don't want users hacking your optimizer. in general, the "certain properties" you need to guarantee for optimizablity are: + operator is a partial ordering on the relation; + you're willing to build knowledge of the ordering into your optimizer. also, it helps a lot if your index has some useful adjacency properties, but you can survive without them. since you have to build all these smarts into your access method, it's a lot of trouble to put them into the optimizer, as well. mike olson postgres research group uc berkeley mao@postgres.Berkeley.EDU
scholl@inf.ethz.ch (Marc Scholl) (10/10/89)
In article <16958@watdragon.waterloo.edu> zmhasan@watdragon.waterloo.edu (Ziaul Masum Hasan) writes: >Could any one out there explain me why the use of a method in a query >predicate (<attribute-name operator value>), as an attribute-name or >as an operator, cause difficulties with query compilation and optimization ? > >M Hasan. In addition to Jack Orenstein's response that has been focussing around indexing of "encapsulated attributes" (those that are implemented as methods): A further problem for query optimization stems from the fact that typical (relational) optimizers assume the cost for predicate evaluation to be zero, i.e., once we have loaded the tuple no further costs will be incured by accessing the "<attribute>" named nor by evaluating the predicate "<operator>". However, if either of them is a (user supplied) method, this is no longer true. Hence, a smart OODB query optimizer should estimate the cost of these actions, too. Of course, this means that the implementor of the correspondsing methods has to provide cost estimation functions to assist the optimizer. (This is another example for the need of extensible optimizers.) In the extreme, such situations could make the optimizer postpone a selection past a join (if the join is quite selective and the selection predicate is expensive to evaluate). This is obviously in contrast to the well-known heuristics of pushing selections dwon the expression tree. -Marc