[comp.databases] Automatic Query Generation

bagron@cs.vu.nl (Rene Baart) (02/24/90)

I am faced with an intruiging yet difficult problem. I'm working on
a user-friendly front-end to a relational database that's supposed
to isolate the user from the underlying data model of the database.

The user should be able to just list the attributes he or she wants
in a report (such as EMPLOYEE_NAME and SALARY) and have the system
generate the appropriate SQL query. Since the listed attributes may
be in different relations, the system has to figure out how to join
the involved tables. This is where things get difficult.

The information available to the program is the set of tables in the
database (T1..Tn), the set of tables from which attributes are being
queried (Q1..Qm, with m<n) and a list of "arrows" which indicate
that a certain table has a foreign key in another table (Tx-->Ty).

Graphically, the problem can be represented like this:

                                      +-------  SALARY
                                      |
                                      v
    CHILD  <---  EMPLOYEE  --->  POSITION  <--  TITLE
                                      ^
                                      |
                                      +-------  DEPARTMENT

I am in need of a general algorithm for finding the set of tables that
is needed to perform the join. What's more, if there is more than one
possible join, the system should provide a list of alternatives for the
user to choose from. So far, I haven't been able to come up with a
solution that handles every possible situation correctly, and I'm
starting to wonder if there exist one at all.

So, if there's anyone out there in NetLand who has thought about this
problem before and who knows of a solution, I'd love to hear about it.
Or if anyone could point me towards some books or articles dealing with
this kind of problem, I'd be grateful too. (I suppose every existing
natural-language front-end faces a similar task?) Thanks in advance!

+--------------+  Domain : bagron@cs.vu.nl, RBAART@neabbs.UUCP
|    Rene      |  Snail  : Wibautlaan 32, 1181 XW Amstelveen, The Netherlands
|    Baart     |  UUCP   : ..!mcvax!neabbs!rbaart, ..!mcvax!botter!bagron
+--------------+  Fido   : RENE BAART at 2:280/2    Voice  : (020)-454191

jkrueger@dgis.dtic.dla.mil (Jon) (02/27/90)

bagron@cs.vu.nl (Rene Baart) writes:

>The information available to the program is the set of tables in the
>database (T1..Tn), the set of tables from which attributes are being
>queried (Q1..Qm, with m<n) and a list of "arrows" which indicate
>that a certain table has a foreign key in another table (Tx-->Ty).

>I am in need of a general algorithm for finding the set of tables that
>is needed to perform the join. What's more, if there is more than one
>possible join, the system should provide a list of alternatives for the
>user to choose from. 

Table definitions are generated from database design, you're trying to
"decompile" the other way.  You lack master-detail information,
referential integrities, domains, etc.  You can make some good guesses
but in general you can't be sure.  This is as much a measure of the
power of the relational model to ask interesting questions as it is of
the poverty of commercial products to express them.  Consider how
aggregates aren't handled in typical forms-based canned query tools,
and when they are, how much the user must know about the underlying
database design.  Or consider questions answerable only by queries
involving self-joins: how will this possibility be spotted by your
tool?  But don't feel bad, most third-party front ends don't try
anything of the sort; in general, they're tabular, not relational.

-- Jon
-- 
Jonathan Krueger    jkrueger@dtic.dla.mil   uunet!dgis!jkrueger
The Philip Morris Companies, Inc: without question the strongest
and best argument for an anti-flag-waving amendment.

davidm@uunet.UU.NET (David S. Masterson) (02/28/90)

In article <774@dgis.dtic.dla.mil> jkrueger@dgis.dtic.dla.mil (Jon) writes:
   bagron@cs.vu.nl (Rene Baart) writes:
   >The information available to the program is the set of tables in the
   >database (T1..Tn), the set of tables from which attributes are being
   >queried (Q1..Qm, with m<n) and a list of "arrows" which indicate
   >that a certain table has a foreign key in another table (Tx-->Ty).
   >
   >I am in need of a general algorithm for finding the set of tables that
   >is needed to perform the join. What's more, if there is more than one
   >possible join, the system should provide a list of alternatives for the
   >user to choose from. 
   >
   Table definitions are generated from database design, you're trying to
   "decompile" the other way.  You lack master-detail information,
   referential integrities, domains, etc.  You can make some good guesses
   but in general you can't be sure.

Wait a second, now.  With a knowledge of the foreign key relationships between
tables (his "arrows"), he has all the information he needs to get an answer.
Its basically a variation on the traveling salesman problem: given a couple of
points on a graph find all the routes between those points.  I was never any
good at solving that problem, so I can't provide the algorithm for the
original question, but all the information is there.  After all, he isn't
asking for a single query to do this, but the general algorithm!

					  This is as much a measure of the
   power of the relational model to ask interesting questions as it is of
   the poverty of commercial products to express them.  Consider how
   aggregates aren't handled in typical forms-based canned query tools,
   and when they are, how much the user must know about the underlying
   database design.  Or consider questions answerable only by queries
   involving self-joins: how will this possibility be spotted by your
   tool?  But don't feel bad, most third-party front ends don't try
   anything of the sort; in general, they're tabular, not relational.

Why the diatribe?!?

The key to answering this query is not the front-end, but the back-end.  Many
newer relational systems are keeping information in the data dictionary that
help the system understand the data model (things like referential integrity
constraints, domains, etc.).  Since the data dictionary of a relational
database is queriable like any other part of the relational database (at least
with any relational system worth its salt), answering questions about the data
model of the database, though not exactly trivial, should be doable.
--
===================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mt. View, CA  94043
===================================================================
"If someone thinks they know what I said, then I didn't say it!"

bagron@cs.vu.nl (Rene Baart) (03/01/90)

In article <CIMSHOP!DAVIDM.90Feb27104331@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>In article <774@dgis.dtic.dla.mil> jkrueger@dgis.dtic.dla.mil (Jon) writes:
[ some stuff from my original posting deleted..]
>
>Wait a second, now.  With a knowledge of the foreign key relationships between
>tables (his "arrows"), he has all the information he needs to get an answer.
>Its basically a variation on the traveling salesman problem: given a couple of
>points on a graph find all the routes between those points.  I was never any
>good at solving that problem, so I can't provide the algorithm for the
>original question, but all the information is there.  After all, he isn't
>asking for a single query to do this, but the general algorithm!
>

The join problem indeed seems similar to that Traveling Salesman problem,
but the big difference is that in solving the salesman problem you're
trying to find the cheapest or shortest path. With the joins, a shorter
path isn't neccesarily a better one. (Or the one the user intended.)

But then again, you couldn't present a user with ALL possible paths 
for a given query since that set of solutions will definitely explode
as the database gets more complex.

The "solution" (note the quotes) I came up with after looking at a lot
of output from a little program that prints all the possible paths
between any two nodes is that paths with the least number of relations
are more desireable. I define a relation in this context as a node that
has two arrows pointing inwards, like this: EMPL.--> POSITION <-- DEPT.

In my current implementation, I let the user choose between all paths
that have a minimum number of relations. This SEEMS to work well in 
practice. (It sure cuts down the number of paths)

But I don't have too much confidence in this emperical solution, hence
the reason for my posting. But by now I'm pretty convinced that the
problem cannot be solved without introducing extra semantics (like
a data dictionary). I'll have to look into that. 

+--------------+  Domain : bagron@cs.vu.nl, RBAART@neabbs.UUCP
|     Rene     |  Snail  : Wibautlaan 32, 1181 XW Amstelveen, The Netherlands
|     Baart    |  UUCP   : ..!mcvax!neabbs!rbaart, ..!mcvax!botter!bagron
+--------------+  Fido   : RENE BAART at 2:280/2    Voice  : (020)-454191