[comp.databases] outerjoins

aaron@grad2.cis.upenn.edu (Aaron Watters) (08/14/90)

The various varieties of outerjoins are interesting operations.
In particular, they are interesting because they don't act like
relational operations in the sense that
	(r outerjoin s) minus t
or	(r outerjoin s) join t
don't seem to be well defined.  To be fanciful, I would hypothesize
that the outerjoin lies on the fringes of the standard relational paradigm,
and its usefulness indicates useful directions in which the paradigm
should be broadened.  Comments?

What are interesting (simple) examples of the uses for outerjoins? -aaron

swfc@ulysses.att.com (Shu-Wie F Chen) (08/15/90)

In article <28462@netnews.upenn.edu>, aaron@grad2.cis.upenn.edu (Aaron
Watters) writes:
|>The various varieties of outerjoins are interesting operations.

The only reference that I could find on outer joins was in Elmasri and
Navathe's Fundamentals of Database Systems (page 164-165).  Since I
didn't know what they were, I am providing a definition:

"The OUTER JOIN operation was developed to take the union of tuples from
two relations that are not union compatible.  This operation will take
the UNION of tuples in two relations that are partially compatible,
meaning that only some of their attributes are union compatible. The
attributes that are not union compatible from either relation are kept
in the result, and tuples that have no values for these attributes are
padded with null values..."

|>In particular, they are interesting because they don't act like
|>relational operations in the sense that
|>	(r outerjoin s) minus t
|>or	(r outerjoin s) join t
|>don't seem to be well defined.  To be fanciful, I would hypothesize

I don't see how they are not well-defined given the above definition.

|>that the outerjoin lies on the fringes of the standard relational paradigm,
|>and its usefulness indicates useful directions in which the paradigm
|>should be broadened.  Comments?

I don't quite see why the outerjoin lies on the fringes of the *standard
relational paradigm*.  What differentiates it from other operators with
respect to the paradigm?

It is certainly useful, but adding new operators within the relational
model does not broaden the paradigm.

|>
|>What are interesting (simple) examples of the uses for outerjoins? -aaron

Elmasri and Navathe give several examples:

1. Suppose we want a list of all employee names and also the name of the
departments they manage *if they happen to manage a department*.  The
input relations are EMPLOYEE(ENAME, ESSN,...) and DEPARTMENT(DNAME, MGRSSN...).

2. Suppose we want a list of all the people at a school with their name,
SSN, and department, their advisor if they are students, and their rank
if they are faculty.  The input relations are STUDENT(NAME, SSN,
DEPARTMENT, ADVISOR) and FACULTY(NAME, SSN, DEPARTMENT, RANK).

Cheers,
*swfc

aaron@grad2.cis.upenn.edu (Aaron Watters) (08/15/90)

In article <13592@ulysses.att.com> swfc@ulysses.att.com (Shu-Wie F Chen) writes:
>
>I don't quite see why the outerjoin lies on the fringes of the *standard
>relational paradigm*.  What differentiates it from other operators with
>respect to the paradigm?
>
The outerjoin(s) introduce null values.  If you treat null
values as you would any other value then then the outerjoin doesn't
cause any problems.  More intelligent systems (for example Informix
-- since this is the only manual I have handy) treat null values as
a special `unknown' value and, justifiably, do not define certain
relational operations on relations that include nulls in certain
circumstances.  Consider a single relation
	Parental(parent,child)
where either parent or child is allowed to be null whenever the
parents or children for an individual aren't known.  Using datalog
type notatoin the natural way to find grandparents is with the rule
	Grandparent(Person) if Parental(Person,Parent) and
			       Parental(Parent,Child).
But this will give the wrong answer if we treat nulls as ordinary
values, what you need, apparently, is
	CertainGparent(Person) if Parental(Person,Parent) and
			          Parental(Parent,Child) and
			          Parent<>NULL.
which gives you people who are certainly grandparents, or
alternatively
	MaybeGparent(Person) if Parental(Person,NULL).
	MaybeGparent(Person) if Parental(Person,Parent) and
				Parental(Parent,Child).
which gives you people who might well be grandparents.
Since the first rule is dangerous and misleading in the presence of
NULLs, some relational systems require the user to explicitly handle
NULLs, as in the latter two examples (correctly, I think).

If anyone is interested in extensions to the relational approach
that deal with nulls in a more automatic and uniform manner, let me
know.			-aaron

(PS:  
Apologies to those not used to datalog-type notations.
Also, this problem occurs if we interpret NULLs in other
ways.)

tom@eds.com (Tom H. Meyer) (08/16/90)

In article <28462@netnews.upenn.edu> aaron@grad2.cis.upenn.edu.UUCP (Aaron Watters) writes:
>The various varieties of outerjoins are interesting operations.
>In particular, they are interesting because they don't act like
>relational operations in the sense that
>	(r outerjoin s) minus t
>or	(r outerjoin s) join t
>don't seem to be well defined.  To be fanciful, I would hypothesize
>that the outerjoin lies on the fringes of the standard relational paradigm,
>and its usefulness indicates useful directions in which the paradigm
>should be broadened.  Comments?
>
>What are interesting (simple) examples of the uses for outerjoins? -aaron

I recommend you get a hold of this year's SIGMOD proceedings. Arnon
Rosenthal and Cesar Galindo-Legaria presented a paper,
"Query Graphs, Implementing Trees, and Treely-Reorderable Outerjoins"
which addresses your questions directly.

To summarize, outerjoins are useful when you wish to see all the fields
values of the join keys whether they matched something in the other table
or not. As an example, perhaps a query which joined classes to students
might not display all the classes if they had no students enrolled in
them. If this is not acceptable, an outerjoin will solve the problem.

A good introduction to outerjoin theory can be found in

C.J. Date, "The Outer Join", Proc @nd International Conf on Databases
(ICOD-2), Cambridge, England, Sept. 1983

There are many types of outer joins and Date introduces them all in this
paper. However, the "left" and "right" outerjoins (which some SQL
implementations denote: A.a *= B.a and A.a =* B.a) have some nasty
properties like non-associativity and distributivity. This makes query
optimization of outerjoins extremely hard. Dr. Rosenthal presents a 
complete algebra of these outerjoins including a characterization of
when they are well behaved.

Hope this helps

tom meyer, EDS Research               | If I don't see you in the future
tom@edsr.eds.com or ...uunet!edsr!tom | I'll see you in the pasture

davidm@uunet.UU.NET (David S. Masterson) (08/16/90)

In article <28462@netnews.upenn.edu> aaron@grad2.cis.upenn.edu (Aaron Watters)
writes:

   The various varieties of outerjoins are interesting operations.
   In particular, they are interesting because they don't act like
   relational operations in the sense that
	   (r outerjoin s) minus t
   or	(r outerjoin s) join t
   don't seem to be well defined.  To be fanciful, I would hypothesize
   that the outerjoin lies on the fringes of the standard relational paradigm,
   and its usefulness indicates useful directions in which the paradigm
   should be broadened.  Comments?

Read Codd's new book on RMV2 before saying that outer-joins are on the fringes
of the relational model.  The outer join also goes far beyond just the outer
equijoin to include all operators (<, >, !=, etc).

   What are interesting (simple) examples of the uses for outerjoins?

When dealing with object-orientation in a relational database, normalization
of the database may tend to break up an object into several tables.  Not all
of these tables may be filled in for an object.  Trying to put the object back
together at any point in time will be best done with an outer equijoin (based
on the key of the object).
--
====================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

garyp@cognos.UUCP (Gary Puckering) (08/21/90)

In article <28462@netnews.upenn.edu> aaron@grad2.cis.upenn.edu.UUCP (Aaron Watters) writes:
>What are interesting (simple) examples of the uses for outerjoins? -aaron

Examples are abundant in real data processing applications.  Almost
any application where there are master files and transaction files
will yield outer join queries, because it is often the case that
some masters will not have any transactions but must still be
included in the processing.

Consider a time reporting system where Employees work on Projects
and Time Transactions are generated for the hours worked.  The
database might look like this:

	EMPLOYEES ( EMP_NO, EMP_NAME, ... )
	PROJECTS ( PROJECT_NO, PROJECT_NAME, ... )
	TIME_TXN ( EMP_NO, PROJECT_NO, HOURS )

One of the application requirements is to produce a report which
shows the hours worked on each project.  A simple join query would
do the trick:

	SELECT PROJECT_NAME, SUM(HOURS)
	FROM PROJECTS P, TIME_TXN T
	GROUP BY PROJECT
	ORDER BY PROJECT
	WHERE P.PROJECT_NO = T.PROJECT_NO

Unfortunately, this query will not report on those projects which
have no time transactions in the current reporting period.  In
applications like this one it is often very important to report, for
example, that Project 123 had no activity.  Users sometimes get
upset when they can't find what they are looking for.

The simplest way to address this requirement is to use an outer join
for this query, keeping the unmatched rows in the PROJECTS table.
Without the outer join operation, it would be necessary to write
this request as a parameterized query on TIME_TXN nested within a
query on PROJECTS.  For such a simple application, this is
considerably more work -- and might not be executed very efficiently
by the database engine.

An advantage to having an outer join operation is that applications
that need them can express them within a single query, thus
providing the optimizer with more semantics about the request.  If
an outer join has to be broken up into two queries, the optimizer
can't know what is really going on and therefore can't do a good job
of optimizing such situations.  This could become even more
important if the database is distributed sometime in the future.

The outer join I've described here, where unmatched rows in the
leftmost table are preserved, seems to be the most useful of the
various forms of outer join.  The need for preserving unmatched rows
in the left-hand table, or in both tables, seems far less
prevalent.  I can't even think of any good examples off the top of
my head.  Can anyone else?
-- 
Gary Puckering                             Cognos Incorporated
  VOICE: (613) 738-1338 x6100              P.O. Box 9707
  UUCP:  uunet!mitel!sce!cognos!garyp      Ottawa, Ontario
  INET:  garyp%cognos.uucp@uunet.uu.net    CANADA  K1G 3Z4

aaron@grad2.cis.upenn.edu (Aaron Watters) (08/24/90)

In article <8720@cognos.UUCP> garyp@cognos.UUCP (Gary Puckering) writes:
>
>The outer join I've described here, where unmatched rows in the
>leftmost table are preserved, seems to be the most useful of the
>various forms of outer join.  The need for preserving unmatched rows
>in the left-hand table, or in both tables, seems far less
>prevalent.  I can't even think of any good examples off the top of
>my head.  Can anyone else?

Well, if you have two tables (say Faculty and Student) sharing
a common key (SS#) and representing the same kind of object (person)
but having different attribute sets ([ss#, department, status] and
[ss#, year, phone] respectively]) you might want to create a
`union' of these tables, listing all faculty and students.
However, if we assume that ss# identifies an individual uniquely
you might want any entries with the same ss# to be combined.

This is a hypothetical example for the `both table' case.	-aaron.