[net.database] Problem with relational theory

ecec@ur-tut.UUCP (Eric Carleen) (08/14/86)

I've found that apparently I have a severe misunderstanding of
the use of a relational data base, specifically Ingres.  I'm sending
out a summary of my question because I've found that a number of
other people have the same misunderstanding.

Suppose that I have 3 tables of names, call them table1, table2, and 
table3.  The names in tables 2 or 3 may or may not be in table1.  I
would like to delete all those records in table1 that are already listed
in the other two tables.

If I give the command:

	delete table1 where table1.name = table2.name
						or table1.name = table3.name

then I get what I was after: entries that match EITHER table2 OR table3
are deleted.

unless... table3 is empty.  If table3 is empty, then nothing is deleted
from table1, regardless of whether or not there are any matching entries
in table2.  (The work-around is obvious, but I thought unnecessary.)

Technical support at Relational Technologies insists that this is the
correct way for a deletion to occur.

I guess that I need to learn more about relational databases - so would
people please mail me names of references that are readable?  Thanks much.


I want to emphasize that I am generally very happy with Ingres, just
surprised at the behavior of this particular delete command.


                            Eric Carleen
                            University of Rochester Medical Center
                            UUCP:  {seismo|allegra}!rochester!ur-msbvax!edc
                            Bitnet: heartedc@uorhbv
                            Phone: (716)-275-5391

hoffman@hdsvx1.UUCP (Richard Hoffman) (08/26/86)

Eric Carleen at Univ. of Rochester Medical Center writes [edited slightly]:
> Suppose that I have 3 tables of names, call them t1, t2, and t3.
> The names in tables 2 or 3 may or may not be in t1. 
> I would like to delete all those records in t1 that are already listed
> in the other two tables.
> 
> If I give the command:
> 
> 	delete t1 where t1.name = t2.name or t1.name = t3.name
> 
> then I get what I was after: entries that match EITHER t2 OR t3
> are deleted.
> 
> unless... t3 is empty.  If t3 is empty, then nothing is deleted
> from t1, regardless of whether or not there are any matching entries
> in t2.  (The work-around is obvious, but I thought unnecessary.)
> 
> Technical support at Relational Technologies insists that this is the
> correct way for a deletion to occur.

And they are, unfortunately, right.  This is because of the way that the
relational theory (upon which Ingres is based) works.

Relational theory starts with the assumption of sets, called domains,
which contain all the legal values for a given attribute.  For instance,
the domain of sex = {male, female}.  The domain of primary colors =
{red, blue, yellow}.  The domain of the national debt is probably the
set containing all positive real numbers taken to two decimal places.
Any set may be a domain.

A table is defined as the subset of a cartesian product of domain sets.
So consider a family whose members are listed in the domain F = {Sara,
David, Jon}. Each of these members can be associated with a sex chosen
from S = {M, F}.  The Cartesian product of FxS is {(Sara, M), (Sara, F),
(David, M), (David, F), (Jon, M), (Jon, F)}.  However, a table which
gave the sexes of the members of these families would be a subset of
FxS, namely {(Sara, F), (David, M), (Jon, M)}.

This formulation sheds light on a number of practices common in relational
data bases, such as the use of domains for guaranteeing integrity (this
should be required, but is often allowed to default to meaninglessly broad
domains such as "all integers", "all strings", etc.) and the refusal of
most RDBs to safeguard order inside a table (the table is just a set, and
order is unimportant in a set).  It also shows you why the Ingres way of
handling your query is correct.

What you are really doing when you run your query is forming a cartesian
product on the sets represented by t1.name, t2.name and t3.name.  Then
you are removing the subset of this set in which t1.name = t2.name or
t1.name = t3.name.  If t3 is empty, it is an EMPTY SET!  The cartestian
product of any number of sets with an empty set MUST BE EMPTY!  So there
is nothing to remove.  

It's not that Ingres doesn't process your query
correctly, it's just that it abandons processing as soon as it spots the
empty set in the cartesian product.  Sort of like a compiler whose optimizer
was clever enough to spot that hundreds of operations on x followed by
x = 0 meant that the hundreds of operations could be omitted.  But the
compiler would probably flag this, and RTI probably errs slightly by not
pointing out joins involving empty sets.

I'm sure you know the two work-arounds: either make t3 non-empty,
with a dummy value, or make two queries, so that the emptiness of 
either t2 or t3 does not effect the deletions based on
the other table.  This last procedure will probably be more efficient 
anyway (even for non-empty sets) for sufficiently large
tables, because the single query method will require on the order
of S1xS2xS3 operations (where Sn is the size in rows of table n),
while the double query method will only require on the order of
S1x(S2+S3) operations, plus a couple of extraneous file opens and
closes.  

I hear that RTI has really cleaned up their join optimizer 
in their most recent release, so the performance difference
between the two queries may be much smaller than I report.  
Also, primary and secondary indexes may make a big difference. Still,
if the tables are large enough to obviate the extra file opens
and closes, you're always safe using the two-query method.
-- 
 Richard Hoffman                | "Oh life is a wonderful cycle of song,
 Schlumberger Well Services     |     A medley of extemporanea.
 hoffman%hdsvx1@slb-doll.csnet  | And Love is a thing that can never go wrong
 PO Box 2175, Houston, TX 77252 | ... And I am Marie of Roumania." --D. PARKER

avg@navajo.STANFORD.EDU (Allen Van Gelder) (09/06/86)

I disagree with Hoffman's attempted justification of the way
Eric Carleen's query is handled.  Having a calculus-based language
loses its point if the user has to know how the system is going
to translate it to relational algebra, possibly incorrectly.

His query, essentially:
	retrieve t1  where t1.name = t2.name  or t1.name = t3.name
has an obvious meaning in relational calculus, and he has a right
to expect the answer to be based on this meaning.

The correct way to translate an "or" in a selection is as a union.
This is shown in "Principles of DB Systems," by J.D. Ullman in
proving that relational algebra has the same power as relational
calculus.  If some system chooses a different route, it is that
system's responsibility to get it right.  Anything else is like
defining the meaning of a program to be the machine code produced
by the compiler -- easy to implement, but not very useful.
Maybe IBM can run this bluff, but I doubt RTI can.

rts@unc.UUCP (Rick Snodgrass) (09/08/86)

In article <819@navajo.STANFORD.EDU> avg@navajo.UUCP (Allen Van Gelder) writes:
>I disagree with Hoffman's attempted justification of the way
>Eric Carleen's query is handled.  Having a calculus-based language
>loses its point if the user has to know how the system is going
>to translate it to relational algebra, possibly incorrectly.
>
>His query, essentially:
>	retrieve t1  where t1.name = t2.name  or t1.name = t3.name
>has an obvious meaning in relational calculus, and he has a right
>to expect the answer to be based on this meaning.
>
>The correct way to translate an "or" in a selection is as a union.
>This is shown in "Principles of DB Systems," by J.D. Ullman in
>proving that relational algebra has the same power as relational
>calculus.

One must be careful in deciding what the meaning of a statement in
any language is. The best way is to go back to the formal semantics
of the language. In his book, Ullman gives the semantics for Quel
(Section 6.4). For this query, it is
	{u | (Exists t1)(Exists t2)(Exists t3)
		(R(t1) and R(t2) and R(t3) and u = t1
			and (t1[name] = t2[name] or t1[name] = t3[name])
		) }
This semantics, which supports Hoffman's reply, is independent of
the relational algebra.

Rick Snodgrass
Department of Computer Science, University of North Carolina at Chapel Hill

hoffman@hdsvx1.UUCP (Richard Hoffman) (09/09/86)

In article <819@navajo.STANFORD.EDU> avg@navajo.UUCP (Allen Van Gelder) writes:
>I disagree with Hoffman's attempted justification of the way
>Eric Carleen's query is handled.  Having a calculus-based language
>loses its point if the user has to know how the system is going
>to translate it to relational algebra, possibly incorrectly.

I wasn't attempting to justify it, merely to explain it.  I agree with you
that QUEL should be kinder to the non-RC-aware user; nevertheless, the
story I told is the reason that Eric's query was not handled the way he
expected it to be.

>[Eric's] query, essentially:
>	retrieve t1  where t1.name = t2.name  or t1.name = t3.name
>has an obvious meaning in relational calculus, and he has a right
>to expect the answer to be based on this meaning.

>The correct way to translate an "or" in a selection is as a union.

RTI implements "or" across multiple tables as a join (or, as I explained,
cartesian product followed by selection).  They do not implement unions
at all (although I imagine that they will have to do so before they finish
their fully distributed data base system).  There you have part of the
problem with Ingres -- it does not fully support the relational model, in
that important operators such as "union" and "difference" and "divide" are
missing (note that most of these can be obtained through user work-around,
or multiple queries, but that's not the point).  In fact, after 15 years
of work and talk and usage, I believe there is still not a single commercially
available RDB which successfully implements the entire relational model.

After listening to a steady stream of user complaints about and problems
with Ingres for almost three years, I am well aware of its many faults.
However, I have yet to find a better overall alternative in terms of
cost, performance, ease-of-use and maturity of the human interface.
-- 
 Richard Hoffman                | "Oh life is a wonderful cycle of song,
 Schlumberger Well Services     |     A medley of extemporanea.
 hoffman%hdsvx1@slb-doll.csnet  | And Love is a thing that can never go wrong
 PO Box 2175, Houston, TX 77252 | ... And I am Marie of Roumania." --D. PARKER

ark@ut-sally.UUCP (Arthur M. Keller) (09/12/86)

Query:
	retrieve t1  where t1.name = t2.name  or t1.name = t3.name

In article <316@unc.unc.UUCP> rts@unc.UUCP (Rick Snodgrass) writes:
>One must be careful in deciding what the meaning of a statement in
>any language is. The best way is to go back to the formal semantics
>of the language. In his book, Ullman gives the semantics for Quel
>(Section 6.4). For this query, it is
>	{u | (Exists t1)(Exists t2)(Exists t3)
>		(R(t1) and R(t2) and R(t3) and u = t1
>			and (t1[name] = t2[name] or t1[name] = t3[name])
>		) }
>This semantics, which supports Hoffman's reply, is independent of
>the relational algebra.
>
>Rick Snodgrass
>Department of Computer Science, University of North Carolina at Chapel Hill

The problem is where the quantifiers go.  Consider:

{u | (Exists t1)(R(t1) and u = t1 and
	[(Exists t2) (S(t2) and t1[name]=t2[name]) or
	(Exists t3) (T(t3) and t1[name]=t3[name])])}

This query has ordinary union semantics, even if either S or T is
empty.  It is not clear that when given a quantifier-free expression
that quantifiers should always be introduced at the beginning.

Arthur
-- 
------------------------------------------------------------------------------
Arpanet: ARK@SALLY.UTEXAS.EDU
UUCP:    {gatech,harvard,ihnp4,pyramid,seismo}!ut-sally!ark

rts@unc.UUCP (Rick Snodgrass) (09/15/86)

In article <5717@ut-sally.UUCP> ark@sally.UUCP (Arthur M. Keller) writes:
>The problem is where the quantifiers go.  Consider:
>
>{u | (Exists t1)(R(t1) and u = t1 and
>	[(Exists t2) (S(t2) and t1[name]=t2[name]) or
>	(Exists t3) (T(t3) and t1[name]=t3[name])])}
>
>This query has ordinary union semantics, even if either S or T is
>empty.  It is not clear that when given a quantifier-free expression
>that quantifiers should always be introduced at the beginning.

The Quel semantics states very clearly that quantifiers, one per
tuple variable, are to be introduced at the beginning--see (6.5)
on page 190 in Ullman's book. Allowing them to be put in arbitrary
positions may solve this one confusion but introduces others, and
is counter to the design of Quel.

aubery@oracle.UUCP (Eric aubery) (09/15/86)

In article <595@ur-tut.UUCP> ecec@ur-tut.UUCP (Eric Carleen) writes:
>
>
>I've found that apparently I have a severe misunderstanding of
>the use of a relational data base, specifically Ingres.  I'm sending
>out a summary of my question because I've found that a number of
>other people have the same misunderstanding.
>
>Suppose that I have 3 tables of names, call them table1, table2, and 
>table3.  The names in tables 2 or 3 may or may not be in table1.  I
>would like to delete all those records in table1 that are already listed
>in the other two tables.
>
>If I give the command:
>
>	delete table1 where table1.name = table2.name
>						or table1.name = table3.name
>
>then I get what I was after: entries that match EITHER table2 OR table3
>are deleted.
>
>unless... table3 is empty.  If table3 is empty, then nothing is deleted
>from table1, regardless of whether or not there are any matching entries
>in table2.  (The work-around is obvious, but I thought unnecessary.)
>

Here is one way to specify the delete in question in Oracle, and presumably
with any SQL implementation.

delete from table1
where name in (
	select table1.name
	from table1,table2
	where
	table1.name = table2.name
	union
	select table1.name
	from table1,table3
	where
	table1.name = table3.name
)

-- 
Eric N. Aubery
ORACLE Corporation				(415)598-8022
Belmont, California				hplabs!oracle!aubery