bagron@cs.vu.nl (Rene Baart) (02/24/90)
I am faced with an intruiging yet difficult problem. I'm working on a user-friendly front-end to a relational database that's supposed to isolate the user from the underlying data model of the database. The user should be able to just list the attributes he or she wants in a report (such as EMPLOYEE_NAME and SALARY) and have the system generate the appropriate SQL query. Since the listed attributes may be in different relations, the system has to figure out how to join the involved tables. This is where things get difficult. The information available to the program is the set of tables in the database (T1..Tn), the set of tables from which attributes are being queried (Q1..Qm, with m<n) and a list of "arrows" which indicate that a certain table has a foreign key in another table (Tx-->Ty). Graphically, the problem can be represented like this: +------- SALARY | v CHILD <--- EMPLOYEE ---> POSITION <-- TITLE ^ | +------- DEPARTMENT I am in need of a general algorithm for finding the set of tables that is needed to perform the join. What's more, if there is more than one possible join, the system should provide a list of alternatives for the user to choose from. So far, I haven't been able to come up with a solution that handles every possible situation correctly, and I'm starting to wonder if there exist one at all. So, if there's anyone out there in NetLand who has thought about this problem before and who knows of a solution, I'd love to hear about it. Or if anyone could point me towards some books or articles dealing with this kind of problem, I'd be grateful too. (I suppose every existing natural-language front-end faces a similar task?) Thanks in advance! +--------------+ Domain : bagron@cs.vu.nl, RBAART@neabbs.UUCP | Rene | Snail : Wibautlaan 32, 1181 XW Amstelveen, The Netherlands | Baart | UUCP : ..!mcvax!neabbs!rbaart, ..!mcvax!botter!bagron +--------------+ Fido : RENE BAART at 2:280/2 Voice : (020)-454191
jkrueger@dgis.dtic.dla.mil (Jon) (02/27/90)
bagron@cs.vu.nl (Rene Baart) writes: >The information available to the program is the set of tables in the >database (T1..Tn), the set of tables from which attributes are being >queried (Q1..Qm, with m<n) and a list of "arrows" which indicate >that a certain table has a foreign key in another table (Tx-->Ty). >I am in need of a general algorithm for finding the set of tables that >is needed to perform the join. What's more, if there is more than one >possible join, the system should provide a list of alternatives for the >user to choose from. Table definitions are generated from database design, you're trying to "decompile" the other way. You lack master-detail information, referential integrities, domains, etc. You can make some good guesses but in general you can't be sure. This is as much a measure of the power of the relational model to ask interesting questions as it is of the poverty of commercial products to express them. Consider how aggregates aren't handled in typical forms-based canned query tools, and when they are, how much the user must know about the underlying database design. Or consider questions answerable only by queries involving self-joins: how will this possibility be spotted by your tool? But don't feel bad, most third-party front ends don't try anything of the sort; in general, they're tabular, not relational. -- Jon -- Jonathan Krueger jkrueger@dtic.dla.mil uunet!dgis!jkrueger The Philip Morris Companies, Inc: without question the strongest and best argument for an anti-flag-waving amendment.
davidm@uunet.UU.NET (David S. Masterson) (02/28/90)
In article <774@dgis.dtic.dla.mil> jkrueger@dgis.dtic.dla.mil (Jon) writes: bagron@cs.vu.nl (Rene Baart) writes: >The information available to the program is the set of tables in the >database (T1..Tn), the set of tables from which attributes are being >queried (Q1..Qm, with m<n) and a list of "arrows" which indicate >that a certain table has a foreign key in another table (Tx-->Ty). > >I am in need of a general algorithm for finding the set of tables that >is needed to perform the join. What's more, if there is more than one >possible join, the system should provide a list of alternatives for the >user to choose from. > Table definitions are generated from database design, you're trying to "decompile" the other way. You lack master-detail information, referential integrities, domains, etc. You can make some good guesses but in general you can't be sure. Wait a second, now. With a knowledge of the foreign key relationships between tables (his "arrows"), he has all the information he needs to get an answer. Its basically a variation on the traveling salesman problem: given a couple of points on a graph find all the routes between those points. I was never any good at solving that problem, so I can't provide the algorithm for the original question, but all the information is there. After all, he isn't asking for a single query to do this, but the general algorithm! This is as much a measure of the power of the relational model to ask interesting questions as it is of the poverty of commercial products to express them. Consider how aggregates aren't handled in typical forms-based canned query tools, and when they are, how much the user must know about the underlying database design. Or consider questions answerable only by queries involving self-joins: how will this possibility be spotted by your tool? But don't feel bad, most third-party front ends don't try anything of the sort; in general, they're tabular, not relational. Why the diatribe?!? The key to answering this query is not the front-end, but the back-end. Many newer relational systems are keeping information in the data dictionary that help the system understand the data model (things like referential integrity constraints, domains, etc.). Since the data dictionary of a relational database is queriable like any other part of the relational database (at least with any relational system worth its salt), answering questions about the data model of the database, though not exactly trivial, should be doable. -- =================================================================== David Masterson Consilium, Inc. uunet!cimshop!davidm Mt. View, CA 94043 =================================================================== "If someone thinks they know what I said, then I didn't say it!"
bagron@cs.vu.nl (Rene Baart) (03/01/90)
In article <CIMSHOP!DAVIDM.90Feb27104331@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes: >In article <774@dgis.dtic.dla.mil> jkrueger@dgis.dtic.dla.mil (Jon) writes: [ some stuff from my original posting deleted..] > >Wait a second, now. With a knowledge of the foreign key relationships between >tables (his "arrows"), he has all the information he needs to get an answer. >Its basically a variation on the traveling salesman problem: given a couple of >points on a graph find all the routes between those points. I was never any >good at solving that problem, so I can't provide the algorithm for the >original question, but all the information is there. After all, he isn't >asking for a single query to do this, but the general algorithm! > The join problem indeed seems similar to that Traveling Salesman problem, but the big difference is that in solving the salesman problem you're trying to find the cheapest or shortest path. With the joins, a shorter path isn't neccesarily a better one. (Or the one the user intended.) But then again, you couldn't present a user with ALL possible paths for a given query since that set of solutions will definitely explode as the database gets more complex. The "solution" (note the quotes) I came up with after looking at a lot of output from a little program that prints all the possible paths between any two nodes is that paths with the least number of relations are more desireable. I define a relation in this context as a node that has two arrows pointing inwards, like this: EMPL.--> POSITION <-- DEPT. In my current implementation, I let the user choose between all paths that have a minimum number of relations. This SEEMS to work well in practice. (It sure cuts down the number of paths) But I don't have too much confidence in this emperical solution, hence the reason for my posting. But by now I'm pretty convinced that the problem cannot be solved without introducing extra semantics (like a data dictionary). I'll have to look into that. +--------------+ Domain : bagron@cs.vu.nl, RBAART@neabbs.UUCP | Rene | Snail : Wibautlaan 32, 1181 XW Amstelveen, The Netherlands | Baart | UUCP : ..!mcvax!neabbs!rbaart, ..!mcvax!botter!bagron +--------------+ Fido : RENE BAART at 2:280/2 Voice : (020)-454191