laurent@ecrcvax.UUCP (Laurent Vieille) (06/12/90)
As a follow-up to several recent mails discussing query evaluation in deductive (recursive) databases, this note discusses the two major approaches to this issue and briefly presents the EKS-V1 deductive database system, which has recently become operational.

I) ISSUES AND SOLUTIONS IN DEDUCTIVE QUERY ANSWERING:
=====================================================

The problem of answering deductive (recursive) queries in database systems has several facets, which we illustrate with the classical academic example:

    anc(X,Y) <- father(X,Y).
    anc(X,Y) <- father(X,Z), anc(Z,Y).

A) COMPLETENESS and TERMINATION of query evaluation - even in the case of cyclic data! This is always achievable in the Datalog case.

B) LOGICAL EFFICIENCY. By this, I mean that the evaluation has to focus on the data relevant to the query. In the case of "anc(X,b)", the evaluation should essentially consider only the descendants of "b". This corresponds to the "push selections first" heuristic in (relational) database systems.

C) PHYSICAL EFFICIENCY. The size of a database kept in secondary storage, and the peculiarities of I/O with secondary storage, call for specialized techniques to perform, for instance, joins. Further, type-checking and consultation of the database schema can be done at compile time in a database system, rather than dynamically at query evaluation time. My understanding is that this is why database implementors 1) prefer to think in a set-oriented manner, which permits the implementation of, e.g., sort-merge joins; and 2) are critical of using, as the low-level engine for a database system, a simple connection of a file system to the WAM - at least in its current form.

FORWARD-CHAINING vs BACKWARD-CHAINING APPROACHES:
=================================================

Over the last couple of years, two different approaches to these issues have been pursued: 1) improving forward-chaining methods, and 2) improving backward-chaining methods.
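To make the discussion concrete, here is a minimal Python sketch (not EKS-V1, LDL or NAIL! code; the relation contents and function names are invented for illustration) of semi-naive bottom-up evaluation of the anc rules, reading father(X,Y) as "the father of X is Y":

```python
# Semi-naive bottom-up (forward-chaining) evaluation of
#   anc(X,Y) <- father(X,Y).
#   anc(X,Y) <- father(X,Z), anc(Z,Y).
# Each round joins father only against the tuples derived in the
# previous round (the "delta"), so no derivation is repeated; the
# fixpoint is reached even on cyclic data, which gives point (A).

def semi_naive_anc(father):
    anc = set(father)       # base rule: every father tuple is an anc tuple
    delta = set(father)     # tuples new in the current round
    while delta:
        # recursive rule restricted to the new tuples:
        # father(X,Z), delta_anc(Z,Y)  =>  anc(X,Y)
        new = {(x, y) for (x, z1) in father
                      for (z2, y) in delta if z1 == z2}
        delta = new - anc   # keep only genuinely new tuples
        anc |= delta
    return anc
```

Note that, without a rewriting of the rules for a particular query, this computes the whole anc relation - exactly the weakness on point (B) that rewriting methods such as Magic Sets are designed to repair.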
Curiously, it seems that the first approach is better known in the logic programming community than the second one. Let us comment briefly on both of them.

FORWARD-CHAINING methods a priori fail on point (B): in general, they will compute everything. Hence the general framework advocated by this approach: 1) rely on a bottom-up evaluator (e.g. semi-naive evaluation); 2) rewrite the query and rules into a set of rules which, when evaluated by this bottom-up evaluator, correctly addresses point (B) above. Magic Sets is the best known of these rewriting methods (see in particular the NAIL! project at Stanford, the LDL project at MCC, and the Aditi project in Melbourne). I will not discuss them any further here, as they have recently been discussed on the net several times (see also Ullman's paper quoted below).

The BACKWARD-CHAINING approach has had to address the two other issues (completeness and termination, physical efficiency) along the two following lines:

1. Algorithmic aspects. The issue here is to design formal algorithms based on backward chaining that guarantee termination and completeness of the evaluation - at least for Datalog. This issue has received many similar solutions, all based on a quasi-identical dynamic programming scheme: answer a given subgoal only once and reuse its answers for all occurrences of this subgoal. These solutions include OLDT and SLD-AL resolution, extension tables, QSQ methods, generalizations of parsing methods (Earley), memoization, etc. Other issues, such as coordinating the resolution of answers and subqueries, or various optimizations (e.g. tail-recursion optimization), have been less discussed in papers but are being worked on in the corresponding prototypes (see below). It should be stressed that these algorithms guarantee termination and completeness whatever (local) selection function is applied to choose subgoals from the body of a clause. As an example, consider the query 'anc(X,b)'.
Using these methods, the query evaluator can first choose the literal 'anc(Z,b)' (from the body of the recursive rule) without entering an infinite loop. This property is crucial for solving issues (A) and (B) together. (Indeed, if 'father(X,Z)' is selected first from the recursive rule, as with Prolog's left-to-right order, then the whole database is actually consulted - and the evaluation scheme badly fails on point (B).)

2. Engineering aspects. Implementation techniques for this dynamic programming scheme first include WAM-based work (e.g. the XWAM of D.S. Warren - NACLP'89), which enhances the WAM engine with a tabulation technique implementing the dynamic programming scheme mentioned above. Another approach, more database-oriented, consists of reusing compilation and data manipulation techniques from database systems. These techniques are now used for backward-chaining evaluation of deductive queries over large amounts of data. An example is our work on the DedGin* system (Lefebvre/Vieille - DOOD'89 - Kyoto, December 1989). In this system, all manipulation of data is done by the database subsystem, using join/difference algorithms implemented directly on the file system (see below).

Formally, the rationale behind this second kind of engineering solution is that backward chaining is not bound to the Prolog/WAM depth-first search of the SLD tree (i.e. corresponding to nested-loop joins); other search strategies can be used (corresponding, for instance, to sort-merge joins). Engineering-wise, the long-term goal of this research is to develop efficient compilation/optimization techniques for logic queries over large databases. There is still quite a lot to do!
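To illustrate the dynamic programming scheme on the query 'anc(X,b)', here is a minimal Python sketch (invented names; not the XWAM or DedGin* implementation, and the index on father stands in for a database access path). With the recursive literal anc(Z,b) selected first, every recursive subgoal is the same subgoal as the query itself, so its answers are taken from a single growing answer table instead of being re-derived: the evaluation terminates on cyclic data and, once father is indexed, visits only the descendants of b - points (A) and (B) together.

```python
from collections import defaultdict

# Tabled backward-chaining evaluation of the query anc(X, b) for
#   anc(X,Y) <- father(X,Y).
#   anc(X,Y) <- father(X,Z), anc(Z,Y).
# reading father(X,Y) as "the father of X is Y".  The table holds the
# answers of the single subgoal pattern anc(_, b); each answer is
# produced once and reused (the dynamic programming scheme).

def tabled_anc(b, father):
    children = defaultdict(set)        # index: parent -> {x : father(x, parent)}
    for (x, y) in father:
        children[y].add(x)
    answers = set(children[b])         # base rule: father(X, b)
    worklist = list(answers)           # answers not yet fed back
    while worklist:
        z = worklist.pop()
        # recursive rule with anc(Z, b) selected first: Z is an
        # already-tabled answer, then father(X, Z) yields X
        for x in children[z] - answers:
            answers.add(x)
            worklist.append(x)
    return answers
```

A tuple unrelated to b (e.g. a 'lot'/'haran' pair in another family) is indexed but never visited by the loop, which is the focusing behaviour of point (B).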
WORKS COMPARING THE TWO APPROACHES:
===================================

Rather than give an exhaustive list of references to these works (I am bound to forget some), I will just point to a couple of recent papers discussing the relationship between (and/or the equivalence of) forward-chaining methods with rewriting and backward-chaining methods; further references can be found there - or ask us. Indeed, the field is now mature enough for these comparisons to be fruitful:

  Bry F.: "Query Evaluation in Recursive Databases: Bottom-up and Top-down Reconciled", DOOD'89
  Miyazaki N.: "Distribution of Selections: The Missing Link between Strategies for Relational and Deductive Databases", DOOD'89
  Seki H.: "On the Power of Alexander Templates", PODS'89
  Ullman J.D.: "Bottom-up beats Top-down", PODS'89

  DOOD = Conference on Deductive and Object-Oriented Databases.
  PODS = ACM Conference on the Principles Of Database Systems.

To clarify my position with regard to the above discussion, let me add that I have investigated (and implemented) backward-chaining methods. See for instance:

  Vieille L.: "Recursive Query Processing: the Power of Logic", Theoretical Computer Science, Vol. 69 (1), 1989, pp. 1-53.

II) EKS-V1: A DEDUCTIVE DATABASE SYSTEM:
========================================

I would also like to take advantage of this note to publicize the EKS-V1 system (EKS = ECRC Knowledge base management System), which was recently demonstrated at international database conferences (EDBT - Extending Database Technology - Venice, Italy, March 1990; ACM-SIGMOD, Atlantic City, USA, May 1990). EKS-V1 is essentially a deductive database system and offers extensive declarative deductive capabilities over a (secondary-storage-based) database:

1) Its declarative rule language (without cut) is essentially an extension of Datalog and includes: full recursion, negation, universal/existential quantifiers, aggregates, and evaluable predicates.
Recursion and aggregates can be mixed, allowing, for instance, the 'bill-of-materials' query to be expressed in compact form.

2) It handles and enforces ANY INTEGRITY CONSTRAINT over the explicit and/or implicit database (i.e. a constraint is any closed formula referring to base and/or virtual - recursive/aggregate - predicates). The system then rejects any update or transaction violating these constraints. To help the user write high-level transactions, EKS-V1 provides sophisticated primitives for PRE/POST-CONDITIONAL UPDATES and for HYPOTHETICAL REASONING (the 'assume' primitive).

The core inference engine of EKS-V1 is a rule evaluator derived from the DedGin* prototype mentioned above: rules given to the database are NOT compiled into WAM-like code, but into set-oriented operations to be evaluated by the database subsystem at query evaluation time. The user interacts with EKS-V1 from within a Prolog-based environment (MEGALOG, developed at ECRC by Jorge Bocca et al.) whose memory-management design - full garbage collection - allows safe interaction with the fact base. The file system used is the BANG multi-dimensional file system (developed at ECRC by Mike Freeston et al.).

EKS-V1 should be seen as one of the results of a more global project, called EKS, which aims at designing and developing a fully-fledged Knowledge-Base Management System (including, in particular, semantic modelling aspects). More information about these systems can be obtained from us.

Laurent Vieille

ACKNOWLEDGMENTS: Alexandre Lefebvre made comments on early versions of this note. The works and arguments presented here benefited from many discussions with our fellow researchers of the KB group at ECRC.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ECRC, Arabellastr. 17
8000 Munich 81, West Germany
Tel: (+49) 89 92 699 117
FAX: (+49) 89 92 699 170
E-Mail:  USA     laurent%ecrc.de@pyramid.com
         EUROPE  laurent@ecrc.de