[comp.databases] Outerjoin implementation?

gus@plains.NoDak.edu (jim gustafson) (09/10/90)

To my knowledge, outer joins (left/right) are typically implemented using
either a simple nested-loop join or else a sort-merge join.  I think Oracle
uses a variation on sort-merge join.  Can someone provide references to
outerjoin implementations?

In particular, I'm looking for outerjoin variations on conventional
hash join techniques such as Shapiro's hybrid-hash join [TODS 1986].
Thanks, Jim.
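
For concreteness, here is a minimal sketch (in Python, all names invented;
this is not Shapiro's algorithm, whose hybrid hash adds disk partitioning
on top) of how the outer-join twist fits into a plain in-memory build/probe
hash join: unmatched probe rows are emitted once, padded with NULLs.

```python
# Hypothetical sketch of a left outer hash join.  The only change from
# a plain hash join is the else branch: probe rows with no match are
# emitted once, padded with None (standing in for NULL).

def hash_left_outer_join(left, right, key_l, key_r, pad_width):
    # Build phase: hash the right (inner) relation on its join key.
    table = {}
    for row in right:
        table.setdefault(row[key_r], []).append(row)
    # Probe phase: each left row either joins with all its matches or,
    # if none exist, survives with NULL padding.
    out = []
    for row in left:
        matches = table.get(row[key_l])
        if matches:
            for m in matches:
                out.append(row + m)
        else:
            out.append(row + (None,) * pad_width)
    return out

parts = [(1, "bolt"), (2, "nut"), (3, "washer")]
orders = [(1, 100), (1, 50), (3, 20)]   # (partno, quantity)
result = hash_left_outer_join(parts, orders, key_l=0, key_r=0, pad_width=2)
# 'nut' survives with NULL padding even though it has no orders
```
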
-- 
Jim Gustafson	gus@plains.nodak.edu	
uunet!plains!gus (UUCP)	gus@plains (Bitnet)

pcg@cs.aber.ac.uk (Piercarlo Grandi) (09/19/90)

On 10 Sep 90 05:32:22 GMT, gus@plains.NoDak.edu (jim gustafson) said:

gus> To my knowledge, outer joins (left/right) are typically implemented using
gus> either a simple nested-loop join or else a sort-merge join.  I think Oracle
gus> uses a variation on sort-merge join.  Can someone provide references to
gus> outerjoin implementations?

Frankly, I believe the options are exactly the same as for ordinary joins.

I will actually argue that outer joins are a bogosity, as are null
values.

It may be preferable to avoid the difficult issues posed by null values,
and the definition of an outer join operation, by attaching instead to
each relation (or to the underlying domains) a default value chosen by
the schema designer, and/or to make an outerjoin just a join where a
suitable matching default value is automatically supplied.

Take the classic outer join example of a parts(partno*,description*)
relation and an orders(date*,partno*,quantity,customerno) relation,
where we want a report showing *all* parts (not just those with orders
outstanding) and the quantity on order for each, with parts not on
order marked as such.  Generalize to taste :-).
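
A minimal sketch of that suggestion on this very example (Python; the
relation layout and the choice of 0 as the designer-chosen default are my
assumptions, not anything specified above): the "outer join" becomes a
plain join against an automatically supplied default order row.

```python
# Sketch: instead of an outer join producing NULLs, supply a
# schema-chosen default row for parts that have no orders.

DEFAULT_ORDER = {"quantity": 0}   # designer-chosen default: "none on order"

def join_with_default(parts, orders, default):
    by_part = {}
    for o in orders:
        by_part.setdefault(o["partno"], []).append(o)
    report = []
    for p in parts:
        # A part with no orders joins with one synthesized default order.
        for o in by_part.get(p["partno"], [dict(default, partno=p["partno"])]):
            report.append({"partno": p["partno"],
                           "description": p["description"],
                           "quantity": o["quantity"]})
    return report

parts = [{"partno": 1, "description": "bolt"},
         {"partno": 2, "description": "nut"}]
orders = [{"partno": 1, "quantity": 100}]
report = join_with_default(parts, orders, DEFAULT_ORDER)
# every part appears; 'nut' shows quantity 0, marking it as not on order
```
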

The implementation would then really look the same as for ordinary joins.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

davidm@uunet.UU.NET (David S. Masterson) (09/20/90)

In article <PCG.90Sep18192004@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
(Piercarlo Grandi) writes:

   It may be preferable to avoid the difficult issues posed by null values,
   and the definition of an outer join operation, by attaching instead to
   each relation (or to the underlying domains) a default value chosen by
   the schema designer, and/or to make an outerjoin just a join where a
   suitable matching default value is automatically supplied.

This method *may* obviate the need for null records that occur in outer joins,
but it doesn't obviate the need for NULLs as attribute values.  In fact, this
is the method documented by Ingres for doing outer joins in Quel.  Using
default values, though, has always had the problem of finding a default value
that will *never* be used as a live value.
--
====================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

clh@tfic.bc.ca (Chris Hermansen) (09/20/90)

In article <CIMSHOP!DAVIDM.90Sep19100915@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>In article <PCG.90Sep18192004@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
>(Piercarlo Grandi) writes:
>
>   It may be preferable to avoid the difficult issues posed by null values,

Yay!  Someone else who hates nulls!!

>   and the definition of an outer join operation, by attaching instead to
>   each relation (or to the underlying domains) a default value chosen by
>   the schema designer, and/or to make an outerjoin just a join where a
>   suitable matching default value is automatically supplied.
>
>This method *may* obviate the need for null records that occur in outer joins,
>but it doesn't obviate the need for NULLs as attribute values.  In fact, this
>is the method documented by Ingres for doing outer joins in Quel.  Using
>default values, though, has always had the problem of finding a default value
>that will *never* be used as a live value.

NULLs almost guarantee strange `gotchas', don't you think?  For instance,
instead of writing a statement like

	update foo set bar = 37 where qqsv != 0

the SQLer has to decide if

	update foo set bar = 37 where qqsv != 0 or qqsv is null

is what s/he REALLY means.  Put another way,

	update foo set bar = 37 where qqsv != 0 and qqsv isnt null

provides no extra degree of refinement over the first update statement, even though
it appears to (for the casual observer, at least).

It's also really handy that 

	select qqsv,count(*) from foo
	group by qqsv

produces a separate line for every row in foo that has a null in qqsv,
don't you think?

Note that I use Informix SQL, so your lack of mileage may vary on the above

Chris Hermansen                         Timberline Forest Inventory Consultants
Voice: 1 604 733 0731                   302 - 958 West 8th Avenue
FAX:   1 604 733 0634                   Vancouver B.C. CANADA
clh@tfic.bc.ca                          V5Z 1E5

C'est ma facon de parler.

gordon@meaddata.com (Gordon Edwards) (09/21/90)

In article <1990Sep20.161204.13014@tfic.bc.ca>, clh@tfic.bc.ca (Chris Hermansen) writes:
|> In article <CIMSHOP!DAVIDM.90Sep19100915@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
|> >In article <PCG.90Sep18192004@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
|> >(Piercarlo Grandi) writes:
|> >
|> >   It may be preferable to avoid the difficult issues posed by null values,
|> 
|> Yay!  Someone else who hates nulls!!
|> 
|> >   and the definition of an outer join operation, by attaching instead to
|> >   each relation (or to the underlying domains) a default value chosen by
|> >   the schema designer, and/or to make an outerjoin just a join where a
|> >   suitable matching default value is automatically supplied.


If default values are appropriate, then they should be used.  If no
default exists, or the default is inappropriate, then you are
compromising the accuracy of your model.


|> >
|> >This method *may* obviate the need for null records that occur in outer joins,
|> >but it doesn't obviate the need for NULLs as attribute values.  In fact, this
|> >is the method documented by Ingres for doing outer joins in Quel.  Using
|> >default values, though, has always had the problem of finding a default value
|> >that will *never* be used as a live value.


Absolutely; as a matter of fact, I think NULL is completely necessary for
the accurate representation of data.  If, for example, you have an integer
attribute called status, and you don't currently know its value, what
do you assign to status?  I guess you could look through the possible domain
of status and then pick a value outside that domain to represent unknown,
but then you have to document this heavily so maintenance people know what
you did, don't you think?


|> 
|> NULLs almost guarantee strange `gotchas', don't you think?  For instance,
|> instead of writing a statement like
|> 
|> 	update foo set bar = 37 where qqsv != 0
|>
|> the SQLer has to decide if
|> 
|> 	update foo set bar = 37 where qqsv != 0 or qqsv is null
|> 
|> is what s/he REALLY means.  Put another way,
|> 
|> 	update foo set bar = 37 where qqsv != 0 and qqsv isnt null
|> 
|> provides no extra degree of refinement over the first update
|> statement, even though
|> it appears to (for the casual observer, at least).


Well, the above is really placing two distinct constraints on the update.
Constraint 1:  qqsv cannot have a known value of 0.
Constraint 2:  qqsv cannot have an unknown value.

If qqsv were guaranteed to be known, you could always declare it as *not
null*.  Then Constraint 2 would become unnecessary.  If you suggest an
arbitrary value for unknowns, what would the statement be?

	update foo set bar = 37 where qqsv != 0 and qqsv != -99999999

Is this easier than using null?  I don't think so.


|> 
|> [text deleted...]
|> 
|> Note that I use Informix SQL, so your lack of mileage may vary on the above


????? The above statement makes no sense. ?????

Maybe you have difficulty understanding the proper use of null.  It means, "I
don't know what this value is."  I'll use an example from Date: "... a
shipment record might contain a null quantity value (we know that the
shipment exists, but we do not know the quantity shipped)..."

If you find that you have lots of nulls, then maybe your schema is not
properly normalized.  Try checking for transitive dependencies; I have
found that they can contribute to an excessive number of nulls.

Null is a very important semantic concept.  There have been many times I would
have loved to have null when writing 3GL applications (I had to rely on -999).
As for additional constraints, anytime you deal with an unknown you have to
take extra measures to account for it.

What do you think?

--
Gordon S. Edwards             gordon@meaddata.com             Mead Data Central

lugnut@sequent.UUCP (Don Bolton) (09/21/90)

In article <1990Sep20.161204.13014@tfic.bc.ca> clh@tacitus.UUCP (Chris Hermansen) writes:
>In article <CIMSHOP!DAVIDM.90Sep19100915@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>>In article <PCG.90Sep18192004@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
>>(Piercarlo Grandi) writes:
>>
>>   It may be preferable to avoid the difficult issues posed by null values,
>
>Yay!  Someone else who hates nulls!!
>
>>   and the definition of an outer join operation, by attaching instead to
>>   each relation (or to the underlying domains) a default value chosen by
>>   the schema designer, and/or to make an outerjoin just a join where a
>>   suitable matching default value is automatically supplied.
>>
>>This method *may* obviate the need for null records that occur in outer joins,
>>but it doesn't obviate the need for NULLs as attribute values.  In fact, this
>>is the method documented by Ingres for doing outer joins in Quel.  Using
>>default values, though, has always had the problem of finding a default value
>>that will *never* be used as a live value.
>
>NULLs almost guarantee strange `gotchas', don't you think?  For instance,
>instead of writing a statement like
>
>	update foo set bar = 37 where qqsv != 0
>
>the SQLer has to decide if
>
>	update foo set bar = 37 where qqsv != 0 or qqsv is null
>
>is what s/he REALLY means.  Put another way,
>
>	update foo set bar = 37 where qqsv != 0 and qqsv isnt null
>
>provides no extra degree of refinement over the first update statement, even though
>it appears to (for the casual observer, at least).
>
>It's also really handy that 
>
>	select qqsv,count(*) from foo
>	group by qqsv
>
>produces a separate line for every row in foo that has a null in qqsv,
>don't you think?
>
ARRGGH, I hate it when that happens: you have to update the nulls with the
string value "NULL", run the select, and then "unupdate", or else run several
selects and cut and paste.  Either way it's messy.

>Note that I use Informix SQL, so your lack of mileage may vary on the above
>
Also with Informix, a select count(*) from foo where qqsv is null
will produce a SINGLE row with the count.  Can you say inconsistent?
I knew you could :-)

"null is an unknown value, and as such each one is unique"

Always thought I had an odd sense of humor...:-)

The "O" company's database groups the count in either instance.
Although they profess the same quote above, at least consistency
is present.

davidm@uunet.UU.NET (David S. Masterson) (09/22/90)

In article <1990Sep20.161204.13014@tfic.bc.ca> clh@tfic.bc.ca
(Chris Hermansen) writes:

   NULLs almost guarantee strange `gotchas', don't you think?

Only (IMHO) if you don't understand the concept.

   For instance,
   instead of writing a statement like

	   update foo set bar = 37 where qqsv != 0

   the SQLer has to decide if

	   update foo set bar = 37 where qqsv != 0 or qqsv is null

   is what s/he REALLY means.  Put another way,

	   update foo set bar = 37 where qqsv != 0 and qqsv isnt null

   provides no extra degree of refinement over the first update statement,
   even though it appears to (for the casual observer, at least).

Use of these statements depends on the definition of the data.  I would hope
that the "casual user" would understand the data he is dealing with.  For
instance, in the above statements, it would seem that qqsv might be the
primary key of foo, in which case it can never be NULL, so all 3 statements
degenerate into the first.  If it's not the primary key, then there must be a
reason for qqsv to be NULL and (most likely) a completely different reason why
qqsv might be 0, which would influence the use of "IS NULL".

   It's also really handy that

	   select qqsv,count(*) from foo group by qqsv

   produces a separate line for every row in foo that has a null in qqsv,
   don't you think?

I don't have a database handy, but I think it also produces a separate line
for every row in foo that has a 1 in qqsv, so what's your point?  There's no
substitute for an intelligent query.
--
====================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

jeffl@sybase.Sybase.COM (Jeff Lichtman) (09/24/90)

> 	select qqsv,count(*) from foo
> 	group by qqsv
> produces a separate line for every row in foo that has a null in qqsv...

It's not supposed to.  This is one of those cases where nulls are supposed
to act like they're equal to each other.  Does Informix SQL treat each null
as a separate group?
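
A quick check of the behavior Jeff describes, using SQLite as a stand-in
(not Informix; table and column names follow the thread's example): in
standard SQL all NULLs fall into one group under GROUP BY, so the query
yields a single row for the NULL group.

```python
# Demonstrate that GROUP BY collapses all NULLs into a single group.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table foo (qqsv integer)")
con.executemany("insert into foo values (?)",
                [(1,), (1,), (None,), (None,), (None,)])
rows = con.execute("select qqsv, count(*) from foo group by qqsv").fetchall()
# two result rows: one for qqsv = 1 (count 2) and ONE for all three NULLs
```
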

BTW, I agree that nulls in SQL leave a lot to be desired.  I'm not convinced,
though, that default values are adequate in place of nulls.  There's the
problem of having to choose a default that's guaranteed never to be represented
by known data values.  There's the problem of accidentally using a default
value in a calculation as if it were a true value (calculations with null
always yield null).  There's the problem of storage space - many database
systems don't use any space on disk for null values, which can save a lot
in sparsely populated tables.  There's the problem of naive users incorrectly
interpreting default values as true data.  And on and on....
---
Jeff Lichtman at Sybase
{mtxinu,pacbell}!sybase!jeffl  -or- jeffl@sybase.com
"Saints should always be judged guilty until they are proved innocent..."

pcg@cs.aber.ac.uk (Piercarlo Grandi) (09/24/90)

On 19 Sep 90 17:09:15 GMT, cimshop!davidm@uunet.UU.NET (David S.
Masterson) said:

davidm> In article <PCG.90Sep18192004@odin.cs.aber.ac.uk>
davidm> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> It may be preferable to avoid the difficult issues posed by null
pcg> values, and the definition of an outer join operation, by attaching
pcg> instead to each relation (or to the underlying domains) a default
pcg> value chosen by the schema designer, and/or to make an outerjoin
pcg> just a join where a suitable matching default value is
pcg> automatically supplied.

davidm> This method *may* obviate the need for null records that occur
davidm> in outer joins, but it doesn't obviate the need for NULLs as
davidm> attribute values.

davidm> Using default values, though, has always had the problem of
davidm> finding a default value that will *never* be used as a live
davidm> value.

Only if you use in-band default values (I mean, default values that are
not specifically tagged as being defaults).  Even using in-band default
values is not too bad, because in virtually all the applications where
they would be used, you need in practice to define an in-band value as
the default; otherwise you get the equivalent of "dangling pointers", i.e.
referential integrity is violated.
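
One way to picture the in-band/out-of-band distinction (my illustration
only, in Python; the names are invented): carry a tag alongside each value,
so the default is recognizable without reserving a magic in-band sentinel
like -999.

```python
# Out-of-band default: the flag, not a reserved value, marks defaults.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tagged:
    value: int
    is_default: bool = False

DEFAULT_QTY = Tagged(0, is_default=True)   # schema-chosen default

row = {"partno": 2, "quantity": DEFAULT_QTY}
# arithmetic still works on row["quantity"].value, and the tag records
# that the value was defaulted -- no sentinel collides with live data
```
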

One of the reasons for which NULLs are a bogosity is that they are in
essence unspecified value and type defaults -- this creates a lot of
special casing, because it violates domaining.

They thus introduce a flavour of pointerdom into a flat model, a model
that only manipulates values, not their absence. NULLs introduce all the
problems of references found in some OO languages, while the relational
model has not been designed for references.

davidm> In fact, this is the method documented by Ingres for doing outer
davidm> joins in Quel.

There are those who like Quel, and those who like trouble and use SQL :-).



--
Piercarlo "Peter" Grandi           | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

lugnut@sequent.UUCP (Don Bolton) (09/25/90)

In article <11034@sybase.sybase.com> jeffl@sybase.Sybase.COM (Jeff Lichtman) writes:
>> 	select qqsv,count(*) from foo
>> 	group by qqsv
>> produces a separate line for every row in foo that has a null in qqsv...
>
>It's not supposed to.  This is one of those cases where nulls are supposed
>to act like they're equal to each other.  Does Informix SQL treat each null
>as a separate group?
>
YEP, in the above example they sure do.
A select count(*) from foo where qqsv is null then produces one row with the
total of nulls.

>BTW, I agree that nulls in SQL leave a lot to be desired.  I'm not convinced,
>though, that default values are adequate in place of nulls.  There's the
>problem of having to choose a default that's guaranteed never to be represented
>by known data values.  There's the problem of accidentally using a default
>value in a calculation as if it were a true value (calculations with null
>always yield null).  

                    >There's the problem of storage space - many database
>systems don't use any space on disk for null values, which can save a lot
>in sparsely populated tables. 

This is one that gets me about Informix (aka Datachow).  I may be mistaken,
but I believe it pads nulls with spaces; at least that's what comes out on the
selects.  Also, the columns are all filled to their defined size regardless
of the length of the data string; select max(length(column)) ain't in the
vocabulary.  Handy to have and use when you've inherited a used DB that had
all of the design forethought of, well, the idea is there :-)

>There's the problem of naive users incorrectly
>interpreting default values as true data.  And on and on....

Like the mail thats returned addressed to Bozo Clown @ Large Feet Inc. :-)
OR TEST RECORD NOT FOR MAILING TEST RECORD DO NOT MAIL :-)

Oh *naive* users.

davidm@uunet.UU.NET (David S. Masterson) (09/26/90)

In article <PCG.90Sep24152920@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
(Piercarlo Grandi) writes:

   davidm> Using default values, though, has always had the problem of
   davidm> finding a default value that will *never* be used as a live
   davidm> value.

   Only if you use in-band default values (I mean, default values that are
   not specifically tagged as being default).

Isn't an "out-of-band" default (at least part of) the definition of a NULL?

						 Even using in-band default
   values is not too bad, because in virtually all the applications where
   they would be used, in practice you need to define an in-band value as
   default, otherwise you get the equivalent of "dangling pointers", i.e.
   referential integrity is violated.

How would "in-band" defaults be recognized by constraints for the enforcement
of referential integrity?

   One of the reasons for which NULLs are a bogosity is that they are in
   essence unspecified value and type defaults -- this creates a lot of
   special casing, because it violates domaining.

   They thus introduce a flavour of pointerdom into a flat model, a model
   that only manipulates values, not their absence. NULLs introduce all the
   problems of references found in some OO languages, while the relational
   model has not been designed for references.

You'll have to explain this more; I don't catch the point yet.  Three- and
four-valued logic seem to be a part of the relational model according to Codd
(see "The Relational Model for Database Management: Version 2").

   davidm> In fact, this is the method documented by Ingres for doing outer
   davidm> joins in Quel.

   There are those who like Quel, and those who like trouble and use SQL :-).

No doubt about SQL, but there's always been a bit of trouble in the "outerjoin
implementation" in Quel to the point that Ingres had documented the workaround
in a special appendix.  ;-)
--
====================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

aaron@grad2.cis.upenn.edu (Aaron Watters) (09/27/90)

In article <PCG.90Sep24152920@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>
>One of the reasons for which NULLs are a bogosity is that they are in
>essence unspecified value and type defaults -- this creates a lot of
>special casing, because it violates domaining.
>
what on earth does this mean?

Nulls are certainly NOT a bogosity.  As pointed out by many, they certainly
DO cause havoc in the relational paradigm -- but this indicates that
the relational paradigm is inadequate, NOT that nulls are ``bad.''
[Aside: there is an amazing tendency in the database community to
disallow anything that doesn't fit into one's favorite theoretical
structure -- e.g., no unions in views, nulls are bad, etcetera.]

Enough abstraction.  Let's consider an example relation

PEOPLE [NAME, AGE]
       [john,  24]
       [jane,  ??]
       [mary,  13]

Then what should we get if we select the people with age 24?
TWO ANSWERS:
the `outer' answer    and	the `inner' answer
       [john,  24]		[john, 24]
       [jane,  24]
Of course, two answers to a single query don't fit into the
relational model, so this must be a bad idea, right?

How about if we `group by age'?
AGAIN TWO ANSWERS
       `outer'           `inner'
       [john,  24]       [john,  24]
       [jane,  24]
		         [jane,  ??]
       [mary,  13]
       [jane,  13]       [mary,  13]

       [jane,  ??]
Roughly `outer' answers give `all possibilities' and `inner'
answers give `only that which is certain.'  If you find this
intriguing at all, let me know.		-aaron.
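
Aaron's two answers to the age-24 selection can be sketched like this (my
illustration only; the UNKNOWN marker and function names are invented):
the 'inner' answer keeps only certain matches, while the 'outer' answer
also keeps rows whose unknown age *could* be 24.

```python
# 'inner' vs 'outer' answers to a selection over a table with unknowns.
UNKNOWN = object()   # stands in for ?? (an unknown age)

people = [("john", 24), ("jane", UNKNOWN), ("mary", 13)]

def select_age(rows, age):
    inner = [(n, a) for n, a in rows if a == age]                # certain
    outer = inner + [(n, age) for n, a in rows if a is UNKNOWN]  # possible
    return inner, outer

inner, outer = select_age(people, 24)
# inner == [("john", 24)]
# outer == [("john", 24), ("jane", 24)]
```
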

asylvain@felix.UUCP (Alvin E. Sylvain) (09/29/90)

In article <1990Sep20.161204.13014@tfic.bc.ca> clh@tacitus.UUCP (Chris Hermansen) writes:
>In article <CIMSHOP!DAVIDM.90Sep19100915@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>>In article <PCG.90Sep18192004@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
>>(Piercarlo Grandi) writes:
>>   It may be preferable to avoid the difficult issues posed by null values,
[ ... ]
>Yay!  Someone else who hates nulls!!
[ ... ]
>NULLs almost guarantee strange `gotchas', don't you think?  For instance,
>instead of writing a statement like
>
>	update foo set bar = 37 where qqsv != 0
>
>the SQLer has to decide if
>
>	update foo set bar = 37 where qqsv != 0 or qqsv is null
>
>is what s/he REALLY means.  Put another way,
>
>	update foo set bar = 37 where qqsv != 0 and qqsv isnt null
[ ... ]

In Oracle, you can say:

        update foo set bar = 37 where nvl (qqsv, 1) != 0

The "nvl" function performs a NULL-value substitution: in this example, if
qqsv is NULL, use 1 instead, so the NULL rows pass the test and you get all
the rows of your second example.  (Substituting 0 would instead exclude the
NULL rows, like your third.)

I'm sure that if you check the Informix manual, you'll find something
similar.  I can't say if this is "standard" SQL, or merely Oracle's little
implementation goodie, but it's probably the easiest, cleanest way of
handling that particular potential snafu.
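
For what it's worth, the standard spelling of this substitution is
COALESCE; a quick check using SQLite as a stand-in for Oracle (table and
column names follow the thread's example) of how the substituted value
decides whether NULL rows survive the test:

```python
# coalesce(qqsv, 1) != 0 keeps the NULL rows; coalesce(qqsv, 0) != 0
# drops them along with the genuine zeros.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table foo (qqsv integer)")
con.executemany("insert into foo values (?)", [(5,), (0,), (None,)])

with_nulls = con.execute(
    "select count(*) from foo where coalesce(qqsv, 1) != 0").fetchone()[0]
without_nulls = con.execute(
    "select count(*) from foo where coalesce(qqsv, 0) != 0").fetchone()[0]
# with_nulls == 2 (the 5 row and the NULL row); without_nulls == 1
```
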

As to strange 'gotchas', any tool in existence has them, some more
than others.  Take C pointers, for example (Yowch!).  It's just another
thing the tool-user must be aware of and careful with.   ("Hey, Charlie,
don't leave your hand there!  You'll cut yer silly finger off!" ;-)
------------------------------------------------------------------------
"I got protection for my    |               Alvin "the Chipmunk" Sylvain
affections, so swing your   |   Natch, nobody'd be *fool* enough to have
bootie in my direction!"    |   *my* opinions, 'ceptin' *me*, of course!
-=--=--=--"BANDWIDTH??  WE DON'T NEED NO STINKING BANDWIDTH!!"--=--=--=-

clh@tfic.bc.ca (Chris Hermansen) (09/29/90)

In article <CIMSHOP!DAVIDM.90Sep21104753@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>In article <1990Sep20.161204.13014@tfic.bc.ca> clh@tfic.bc.ca
>(Chris Hermansen) writes:
>
[my examples deleted...]
>
>Use of these statements depend on the definition of the data.  I would hope
>that the "casual user" would understand the data he is dealing with.  For
>instance, in the above statements, it would seem that qqsv might be the
>primary key of foo, in which case, it can never be NULL, so all 3 statements
>degenerate into the first.  If its not the primary key, then there must be a
>reason for qqsv to be NULL and (most likely) a completely different reason why
>qqsv might be 0 which would influence the use of "IS NULL". 

Sorry, I guess I didn't make myself clear enough.  It's not the fact that
NULL is different from 0 (I think this is neat, as I'm sure most people do)
that bothers me.  It's the (IMHO) weird three-state logical semantic nausea
that hangs around NULLs.  What I mean is that NULL values occurring in
arithmetic or logical expressions cause those expressions to behave
differently than `normal' values do.  One gets into `extra' testing and such
that doesn't happen when a non-NULL value is used to represent the absence
of a realistic value.

>
>   It's also really handy that
>
>	   select qqsv,count(*) from foo group by qqsv
>
>   produces a separate line for every row in foo that has a null in qqsv,
>   don't you think?
>
>I don't have a database handy, but I think it also produces a separate line
>for every row in foo that has a 1 in qqsv, so what's your point?  There's no
>substitute for an intelligent query.


No, no, no!  The whole point of `group by qqsv' is that you get ONE row
for each value of qqsv; the exception here is that rows containing NULL in
qqsv do not group together because neither NULL = NULL nor NULL <> NULL is
true.  My rather trite example could normally be used to get a count of the
number of rows with different values of qqsv, but as you say, a more
intelligent solution is required when there are NULLs lurking about:

	select qqsv,count(*) from foo where qqsv isnt null group by qqsv;
	select count(*) from foo where qqsv is null

Just to rant and rave for a moment; suppose that a careful programmer sets
up a system of forms and reports, the forms carefully ensuring that the
user cannot enter NULL values into fields.  Suppose that two years later,
a not-so-careful programmer comes along and modifies one of the forms,
inadvertently allowing NULLs to be entered.  Come year end, the report programs,
formerly shielded from lurking NULLs, now produce a monumental pile of
garbage, and the original programmer gets dumped all over because s/he
"didn't allow for the occurrence of NULLs in the database".

You be the judge.

Chris Hermansen                         Timberline Forest Inventory Consultants
Voice: 1 604 733 0731                   302 - 958 West 8th Avenue
FAX:   1 604 733 0634                   Vancouver B.C. CANADA
clh@tfic.bc.ca                          V5Z 1E5

C'est ma facon de parler.

clh@tfic.bc.ca (Chris Hermansen) (09/29/90)

In article <1422@meaddata.meaddata.com> meaddata!gordon@uunet.uu.net writes:
>
>In article <1990Sep20.161204.13014@tfic.bc.ca>, clh@tfic.bc.ca (Chris Hermansen) writes:
>|> In article <CIMSHOP!DAVIDM.90Sep19100915@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>|> >In article <PCG.90Sep18192004@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
>|> >(Piercarlo Grandi) writes:

[stuff deleted regarding use of nulls, particularly in outer joins]

>Absolutely, as a matter of fact, I think the NULL is completely necessary for
>the accurate representation of data.  If, for example, you have a integer
>attribute called status, and you don't currently know the value, what
>do you assign to status?  I guess you could look through the possible domain
>of status and then pick a value outside that domain to represent unknown,
>but then you have to heavily document this so maintenance people know what
>you did, don't you think?

OK, I'll bite... suppose I have a variable called status.  Why is NULL
better than a value of "unknown" (assuming it's a character value)?  I would
claim that in fact "unknown" is superior: it is at least somewhat
self-documenting (NULL, by definition, is NOT); I can easily spot the
"unknown" rows in any printout or report; and I can count the rows of each
particular status in the table with one select statement (whereas I need two
with NULL values allowed, since rows with NULL values do not group together).

[more stuff deleted]

>Well the above is really placing two distinct constraints on the update.
>Constraint 1:  qqsv cannot have a known value of 0.
>Constraint 2:  qqsv cannot have an unknown value.
>
>If qqsv was guaranteed to be known, you can always declare it as *not null*.

This may relate to the Informix SQL comment below.

>Then, Constraint 2 would become unecessary.  If you suggest an arbitrary value
>for unknowns, what would the statement be?
>
>	ubdate foo set bar = 37 where qqsv != 0 and qqsv != -99999999
>
>This is easier than using null?  I don't think so.

I don't think it's any more difficult!  Maybe this isn't the best example, but
I guess it depends on why the NULLs are in qqsv in the first place.  If they
got there because of a low level of quality control in the data input phase,
I might prefer to see the data input stuff (screen forms or whatever) cleaned
up; if they got there because someone genuinely wanted to show that qqsv didn't
have a value entered into it, I guess it's really a tossup as to whether one
uses NULL or some other value (that could potentially have more `local' meaning
to the application).

As you say above, outer joins make a strong case for the existence of a
"well defined" undefined value.

>
>|> 
>|> [text deleted...]
>|> 
>|> Note that I use Informix SQL, so your lack of mileage may vary on the above
>
>
>????? The above statement makes no sense. ?????

Aww, c'mon!  All I'm saying is that I'm basing my comments on the RDBMS I use,
not on the ANSI standard, nor on Quel/Ingres which was mentioned earlier on.

>Maybe you have difficulty understanding the proper use of null.  It means, "I
>don't know what this value is."  I'll use an example from Date, "... a 
>shipment record might contain a null quantity value (we know that the 
>shipment exists, but we do not know the quantity shipped)..."

I'd be the first to admit that I may have difficulty understanding the use of
NULL.  But think about your example; why should the input screen have let
someone enter the fact that a shipment was made without knowing the number
of items shipped?  How would you propose to send the customer a bill, if
you don't know whether you sent her/him 3 SPARCStations or 3000?  Now suppose
that you're the boss of the shipping department, and you find out that your
programmer decided to allow NULLs in the amount_shipped field.  You casually
note that your warehouse contains no more SPARCStations, because someone
on the loading dock didn't bother to fill in the waybill properly and the
fast-knuckled keypuncher blithely skipped over the amount_shipped field.
Now not only are you going to have to call 200 customers to find out how
many you sent them, but you're going to have to call head office and
tell them that your order for more stock should have had more than zero
items on it.

And in any case, I would claim that a zero would do as well as a NULL here, since
most shipping departments would hardly bother to ship zero units (let alone
generate a shipping report with zero units on it).


If you're doing an outer join, a better(?) example of NULLs cropping up might be
orders entered but not yet shipped...

>
>If you find that you have lots of nulls, then maybe your schema is not
>properly 
>normalized.  Try checking for transitive dependencies, I have found that they
>can contribute to an excessive amount of nulls.

Naw, all my schemae are perfectly normalized :-) :-) :-)

>Null is a very important semantic concept.  There have been many times I would
>have loved to have null when writing 3GL applications (I had to rely on -999).
>As for additional constraints, anytime you deal with an unknown you have to
>take extra measures to account for it.
>
>What do you think?

Well, back to the status = "unknown" vs status is null; I don't feel that there
is any innate superiority of the null value here, as long as one has control over
the value of the field (ie, it's not the result of an outer join).  Seems to me
that one is semantically using two fields in a lot of cases: qqsv and qqsv_status.

Finally, if I have a table with 6,000,000 records in it, with status NULL in 5,999,999
and status = "complete" in the other, and I use ISQL to execute the following:

	output to printer
	select status,count(*) from foo
	group by status
	order by status;

I will end up with 100,000 PAGES of output on my laser printer.  On the other hand,
if those 5,999,999 records have status = "unknown", I'll get one page.

Again, I (feel I) should emphasize that this particular behaviour may not occur in
ALL SQLs, or even be predicted by the ANSI standard.
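[For what it's worth, an engine that follows the ANSI rule collapses all the
NULLs into a single group, so the report above comes out at one line either
way.  A small sketch -- Python with sqlite3, which happens to implement the
ANSI grouping behaviour; the table and data are invented for illustration:]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (status TEXT)")
# five rows with NULL status, one with "complete"
con.executemany("INSERT INTO foo VALUES (?)", [(None,)] * 5 + [("complete",)])

# Under the ANSI rule, all NULLs fall into one group -- two output rows,
# not six.
rows = con.execute(
    "SELECT status, COUNT(*) FROM foo GROUP BY status ORDER BY status"
).fetchall()
print(rows)  # [(None, 5), ('complete', 1)]
```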

Chris Hermansen                         Timberline Forest Inventory Consultants
Voice: 1 604 733 0731                   302 - 958 West 8th Avenue
FAX:   1 604 733 0634                   Vancouver B.C. CANADA
clh@tfic.bc.ca                          V5Z 1E5

C'est ma facon de parler.

clh@tfic.bc.ca (Chris Hermansen) (09/29/90)

In article <11034@sybase.sybase.com> jeffl@sybase.Sybase.COM (Jeff Lichtman) writes:
>> 	select qqsv,count(*) from foo
>> 	group by qqsv
>> produces a separate line for every row in foo that has a null in qqsv...
>
>It's not supposed to.  This is one of those cases where nulls are supposed
>to act like they're equal to each other.  Does Informix SQL treat each null
>as a separate group?

Yeah, it does.  That's why I attached the disclaimer - I had no idea what the
standard said, let alone what Oracle, Ingres, you guys, etc etc did.

>
>BTW, I agree that nulls in SQL leave a lot to be desired.  I'm not convinced,
>though, that default values are adequate in place of nulls.  There's the
>problem of having to choose a default that's guaranteed never to be represented
>by known data values.  There's the problem of accidentally using a default
>value in a calculation as if it were a true value (calculations with null
>always yield null).  There's the problem of storage space - many database
>systems don't use any space on disk for null values, which can save a lot
>in sparsely populated tables.  There's the problem of naive users incorrectly
>interpreting default values as true data.  And on and on....

Well, at the risk of belaboring the point, I think that whenever the implied
semantics of a field suggest a default value, then it's a GOOD THING to use one.
For example, a field named widget_count is never (realistically) going to
have a value less than zero, so if you need an "undefined" value for it, use
-1, or -99999999, or some other nice number.  Similarly, a widget_status
value of "unknown" tells me more than NULL - with NULL, I can't decide whether
the status is genuinely unknown, or whether the system just doesn't know it.

So I don't think that defaults are necessarily adequate (in place of NULLs),
but I don't necessarily agree that NULLs tell you any more than default values
either (except in some cases like outer join).

Chris Hermansen                         Timberline Forest Inventory Consultants
Voice: 1 604 733 0731                   302 - 958 West 8th Avenue
FAX:   1 604 733 0634                   Vancouver B.C. CANADA
clh@tfic.bc.ca                          V5Z 1E5

C'est ma facon de parler.

davidm@uunet.UU.NET (David S. Masterson) (09/30/90)

In article <30203@netnews.upenn.edu> aaron@grad2.cis.upenn.edu (Aaron Watters)
writes:
>Nulls are certainly NOT a bogosity.  As pointed out by many, they certainly
>DO cause havoc in the relational paradigm -- but this indicates that
>the relational paradigm is inadequate, NOT that nulls are ``bad.''

No, it indicates that current relational implementations are not complete, not
that NULLs don't fit the paradigm.

>PEOPLE [NAME, AGE]
>       [john,  24]
>       [jane,  ??]
>       [mary,  13]
>
>Then what should we get if we select the people with age 24?
>
>TWO ANSWERS:
>the `outer' answer    and	the `inner' answer
>	[john,  24]		[john, 24]
>	[jane,  24]
>
>[similar grouping problem deleted]
>
>Roughly `outer' answers give `all possibilities' and `inner'
>answers give `only that which is certain.'  If you find this
>intriguing at all, let me know.		-aaron.

This is better known as 3-valued logic which is a part of the relational model
(as is 4-valued logic).  Both answers (to both problems) are syntactically
correct, the true answer is determined by the semantics of the problem (which
you touched on at the end).  To paraphrase Codd's terminology, is the first
question:

	all people who definitely have an age = 24
			or
	all people who MAYBE have an age = 24		?
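[To make the two readings concrete -- an editorial Python/sqlite3 sketch, not
part of the original exchange: the "definite" answer is the plain equality
test, and one rendering of the MAYBE reading adds an IS NULL disjunct, which
yields a superset of the definite answer.]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.executemany("INSERT INTO people VALUES (?, ?)",
                [("john", 24), ("jane", None), ("mary", 13)])

# Definite: rows known to satisfy age = 24.
definite = con.execute(
    "SELECT name FROM people WHERE age = 24").fetchall()

# Maybe: rows that might satisfy it -- the unknown age could be 24.
maybe = con.execute(
    "SELECT name FROM people WHERE age = 24 OR age IS NULL").fetchall()

print(definite)  # [('john',)]
print(maybe)     # [('john',), ('jane',)]
```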


BTW, do current OODBs treat 3 or 4 valued logic?  I imagine they could, but do
current implementations even address this?
--
====================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

davidm@uunet.UU.NET (David S. Masterson) (09/30/90)

In article <1990Sep28.225442.4995@tfic.bc.ca> clh@tfic.bc.ca (Chris Hermansen)
writes:

   In article <1422@meaddata.meaddata.com> meaddata!gordon@uunet.uu.net writes:
   >Absolutely, as a matter of fact, I think the NULL is completely necessary
   >for the accurate representation of data.  If, for example, you have a
   >integer attribute called status, and you don't currently know the value,
   >what do you assign to status?  I guess you could look through the possible
   >domain of status and then pick a value outside that domain to represent
   >unknown, but then you have to heavily document this so maintenance people
   >know what you did, don't you think?

   OK, I'll bite... suppose I have a variable called status.  Why is NULL
   better than a value of "unknown" (assuming it's a character value)?  I
   would claim that in fact "unknown" is superior, as it is at least somewhat
   self documenting (NULL, by definition, is NOT); furthermore, I can easily
   spot the "unknown" rows in any printout or report; finally, I can easily
   count the rows of particular status in the table with one select statement
   (whereas I need two with NULL values allowed, as rows with NULL values do
   not group together).

Wait a sec.  Think about that a little.  NULL, by definition, is an "unknown"
value, so it is self-documenting.  Going back to the example of an unknown
integer (where 3-valued logic plays an even more important role) in the role
of quantity shipped (just hasn't been entered yet), what would you do with a
question like "What is the sum of the known amounts shipped?".  With the
database understanding what a NULL is, the answer can easily be arrived at
with just "SELECT SUM(qty) FROM table" with no special checking for -999.
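The point is easy to see with a toy table (a Python/sqlite3 sketch; the table
and the -999 convention are invented for illustration, following the example
in the thread):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shipments (qty INTEGER)")
# two known quantities and one not-yet-entered quantity
con.executemany("INSERT INTO shipments VALUES (?)", [(3,), (7,), (None,)])

# SUM skips NULLs, so no special checking is needed:
total = con.execute("SELECT SUM(qty) FROM shipments").fetchone()[0]
print(total)  # 10

# With a -999 sentinel instead of NULL, every aggregate needs a guard:
con.execute("UPDATE shipments SET qty = -999 WHERE qty IS NULL")
guarded = con.execute(
    "SELECT SUM(qty) FROM shipments WHERE qty != -999").fetchone()[0]
print(guarded)  # 10
```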

   I'd be the first to admit that I may have difficulty understanding the use
   of NULL.  But think about your example; why should the input screen have
   let someone enter the fact that a shipment was made without knowing the
   number of items shipped?  How would you propose to send the customer a
   bill, if you don't know whether you sent her/him 3 SPARCStations or 3000?
   Now suppose that you're the boss of the shipping department, and you find
   out that your programmer decided to allow NULLs in the amount_shipped field
   by casually noting that your warehouse contained no more SPARCStations
   because someone on the loading dock didn't bother to fill in the waybill
   properly and the fast-knuckled keypuncher blithely skipped over the
   amount_shipped field, so not only are you going to have to call 200
   customers to find out how many you sent them, but you're going to have to
   call head office and tell them that your order for more stock should have
   had more than zero items on it.

In some places, this can (and is expected to) happen on a temporary basis.
Usually, though, there are physical (ie. non-computer) checks to make sure
that this is only a temporary phenomenon.  In such organizations, it's better
to
allow them to enter "No Value" than to force them to reinvent the way they do
business.

   And in any case, I would claim that a zero would do as well as a NULL here,
   since most shipping departments would hardly bother to ship zero units (let
   alone generate a shipping report with zero units on it).

Placing a zero here would confuse Accounting, though, as it would look like a
fulfilled shipment.  The point being, zero (or any "in-band" data value) has other
connotations that programs outside the shipping department might treat
differently.  Everyone, though, would view a NULL as a NULL.

Also, don't forget to factor in the idea of constraints into the use of NULLs.
With the single value, all referential constraints remain in balance ("I am
NULL, therefore I don't refer to anything"), so there is no special processing
for semantics ("I'm a status value, therefore if I am 'unknown', I don't refer
to the list of known, possible statuses").  BTW, constraints are the way you
should ensure that the wrong value is never entered (as in "NOT NULL").
--
====================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

aland@informix.com (alan denney) (10/01/90)

In article <42743@sequent.UUCP> lugnut@sequent.UUCP (Don Bolton) writes:
|In article <1990Sep20.161204.13014@tfic.bc.ca> clh@tacitus.UUCP (Chris Hermansen) writes:
|>In article <CIMSHOP!DAVIDM.90Sep19100915@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
|>>In article <PCG.90Sep18192004@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
|>>(Piercarlo Grandi) writes:
|>
|>NULLs almost guarantee strange `gotchas', don't you think?  For instance,
|>instead of writing a statement like
|>	update foo set bar = 37 where qqsv != 0
|>the SQLer has to decide if
|>	update foo set bar = 37 where qqsv != 0 or qqsv is null
|>is what s/he REALLY means.  Put another way,
|>	update foo set bar = 37 where qqsv != 0 and qqsv isnt null
|>provides no extra degree of refinement over the first update statement,
|> even though it appears to (for the casual observer, at least).

This has to do with the very definition of NULL -- it requires that
NULL values fail any "not equal" filter.  Since a NULL is by definition
an "unknown" value, then you don't know that it's not 0, right?

In any case, you can avoid the whole issue by disallowing NULLs in the
column(s) in question (by creating the column with NOT NULL).
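[A toy demonstration of the filter behaviour -- Python/sqlite3, which
implements the three-valued comparison; column names borrowed from the
example above:]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (qqsv INTEGER, bar INTEGER)")
con.executemany("INSERT INTO foo VALUES (?, 0)", [(0,), (5,), (None,)])

# "qqsv != 0" evaluates to UNKNOWN for the NULL row, so that row is
# filtered out -- a NULL is neither equal nor unequal to 0.
con.execute("UPDATE foo SET bar = 37 WHERE qqsv != 0")

rows = con.execute("SELECT qqsv, bar FROM foo ORDER BY qqsv").fetchall()
print(rows)  # [(None, 0), (0, 0), (5, 37)]
```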

|>It's also really handy that 
|>
|>	select qqsv,count(*) from foo
|>	group by qqsv
|>
|>produces a separate line for every row in foo that has a null in qqsv,
|>don't you think?
|>
|ARRGGH, I hate it when that happens, have to update the nulls with the
|string value "NULL" run the select and then "unupdate" or run several
|selects and cut and paste. Either way its messy.

There was quite a bit of theoretical debate about which method was
"proper".  Now that ANSI has spoken, Informix engines have been changed
to group NULL values together (the way you apparently prefer).  This
change is effective in the current release - 4.0.

|Also with informix, a select count(*) from foo where qqsv is null
|will produce a SINGLE row with the count. Can you say inconsistent?
|I knew you could :-)

Uh, can you say "pilot error"?  [-:  

How can you call that inconsistent?  The response you complain about
is exactly what you asked for: how many rows have NULL for qqsv.
No GROUPing is involved.

--
Alan S. Denney # Informix # aland@informix.com # {pyramid|uunet}!infmx!aland

 "The driver says, 'One more cup of coffee and I'll be all right...'
  'Pop a Bennie, another Bennie'..."                - The Bobs, "Bus Plunge"

aaron@grad2.cis.upenn.edu (Aaron Watters) (10/01/90)

[Chris Hermanson argues that nulls are bad...]

In article <1990Sep28.231042.5117@tfic.bc.ca> clh@tacitus.UUCP (Chris Hermansen) writes:
>For example, a field named widget_count is never (realistically) going to
>have a value less than zero, so if you need an "undefined" value for it, use
>-1, or -99999999, or some other nice number.  Similarly, a widget_status
>value of "unknown" tells me more than NULL - with NULL, I can't decide whether
>the status is genuinely unknown, or whether the system just doesn't know it.

While admitting that this last point may be too deep for me, let's
once again consider an example.  What if we have two stockrooms
each of which contains a number of widgets.  Room A has 10 widgets
>and no one knows how many widgets are in room B.  The total number of
widgets in both rooms is therefore either 9 or -99999989, of course.

That was a joke, I presume you realize, but how does one get around
this summation anomaly?  It seems you need to have something analogous
to the following:
	If all widgetcounts are positive then X=sum(widgetcount)
	otherwise X=-1.
Furthermore, if you have a large number of conventions like this
more complex queries (especially if you have complex views) can
become quite baroque.  If you use actual null values, however,
the queries are still somewhat baroque, but easier to understand.
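[The summation anomaly is easy to reproduce -- an editorial Python/sqlite3
sketch; the sentinel follows Chris's suggestion, and the CASE expression is
one possible rendering of Aaron's If...otherwise rule:]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rooms (room TEXT, widgetcount INTEGER)")
# -99999999 is the sentinel meaning "count unknown" (room B)
con.executemany("INSERT INTO rooms VALUES (?, ?)",
                [("A", 10), ("B", -99999999)])

# The naive sum happily adds the sentinel in:
naive = con.execute("SELECT SUM(widgetcount) FROM rooms").fetchone()[0]
print(naive)  # -99999989, the nonsense total from the joke

# The guard: a real sum only if every count is non-negative, else -1.
guarded = con.execute(
    "SELECT CASE WHEN MIN(widgetcount) >= 0 THEN SUM(widgetcount) "
    "ELSE -1 END FROM rooms").fetchone()[0]
print(guarded)  # -1
```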

A personal interest of mine is the question of whether the
programmer can ignore the possibility of nulls entirely and have
the system handle them in a logically uniform way -- which, I
argue is not currently done automatically.		-aaron.

aaron@grad2.cis.upenn.edu (Aaron Watters) (10/01/90)

In response to my claim that difficulties with nulls show that...
>>the relational paradigm is inadequate, NOT that nulls are ``bad.''

In article <CIMSHOP!DAVIDM.90Sep30013925@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>
>No, it indicates that current relational implementations are not complete, not
>that NULLs don't fit the paradigm.
...
>This is better known as 3-valued logic which is a part of the relational model
>(as is 4-valued logic).  Both answers (to both problems) are syntactically
>correct, the true answer is determined by the semantics of the problem (which
>you touched on at the end).  To paraphrase Codd's terminology, is the first
>question:
>
>	all people who definitely have an age = 24
>			or
>	all people who MAYBE have an age = 24		?
>
First, why do you say these are not different aspects to the correct
answer to a single query?  Imagine, if you will,
that I pose a very complex query
built up from complex views whose definitions I do not know.
Suppose I want to know those tuples known to satisfy the query
and those not-known to not-satisfy the query.  Do I have to pose
two queries?  Do each of the views I use have to be defined twice,
once for definiteness and once for maybeness?  Why?  Roughly, I
would argue that it's okay if the system allows the user to ask
for maybes and definites, but if s/he doesn't specify s/he should
get both.

Second, note that according to the `semantic completeness' metric
no reasonable database system will fill the bill -- because the
problem is NP-complete for a fixed query and varying data. (Teaser.)

I disagree that the `relational paradigm' includes `three valued
logic' any more sensibly than Pascal `includes Prolog.' (IE, you
can certainly implement one in the other, but that's beside the
point.)

		-aaron.

gordon@meaddata.com (Gordon Edwards) (10/01/90)

In article <1990Sep28.225442.4995@tfic.bc.ca>, clh@tfic.bc.ca (Chris
Hermansen) writes:
|> In article <1422@meaddata.meaddata.com> meaddata!gordon@uunet.uu.net writes:
|> 
[lots of stuff deleted. -gordon]

|> in the table with one select statement (whereas I need two with NULL
|> values allowed, as rows with NULL values do not group together).

Really?  This must be a quirk of Informix.  I have used Informix a little, but
most of my time has been spent in Sybase and Ingres, both of which group
NULLs.  
|> 
|> [more stuff deleted]
|> 
|> >Well the above is really placing two distinct constraints on the update.
|> >Constraint 1:  qqsv cannot have a known value of 0.
|> >Constraint 2:  qqsv cannot have an unknown value.
|> >
|> >If qqsv was guaranteed to be known, you can always declare it as
|> >*not null*.
|> 
|> This may relate to the Informix SQL comment below.
|> 
|> >Then, Constraint 2 would become unnecessary.  If you suggest an arbitrary
|> >value for unknowns, what would the statement be?
|> >
|> >	update foo set bar = 37 where qqsv != 0 and qqsv != -99999999
|> >
|> >This is easier than using null?  I don't think so.
|> 
|> I don't think it's any more difficult!  Maybe this isn't the best example,
|> but I guess it depends on why the NULLs are in qqsv in the first place.  If
|> they got there because of a low level of quality control in the data input
|> phase, I might prefer to see the data input stuff (screen forms or whatever)
|> cleaned up; if they got there because someone genuinely wanted to show that
|> qqsv didn't have a value entered into it, I guess it's really a tossup as to
|> whether one uses NULL or some other value (that could potentially have more
|> `local' meaning to the application).

As the DBA, it might not be possible to control applications.  After the system
is running for two months, a manager might want to interface a spreadsheet to 
the DBMS.  At that point, unless you have constraints in your DBMS, you have no
control.

|> 
|> As you say above, outer joins make a strong case for the existence of a
|> "well defined" undefined value.
|> 
|> >
|> >|> 
|> >|> [text deleted...]
|> >|> 
|> >|> Note that I use Informix SQL, so your lack of mileage may vary on
|> >|> the above
|> >
|> >
|> >????? The above statement makes no sense. ?????
|> 
|> Aww, c'mon!  All I'm saying is that I'm basing my comments on the
RDBMS I use,
|> not on the ANSI standard, nor on Quel/Ingres which was mentioned earlier on.
|> 

OK.  It sort of sounded like, "Informix is great and you other losers..."  I
just wanted to double check.

|> >Maybe you have difficulty understanding the proper use of null.  It
|> >means, "I
|> >don't know what this value is."  I'll use an example from Date, "... a 
|> >shipment record might contain a null quantity value (we know that the 
|> >shipment exists, but we do not know the quantity shipped)..."

My previous comment was a bit unfair; I apologize.

|> 
|> I'd be the first to admit that I may have difficulty understanding the
|> use of
|> NULL.  But think about your example; why should the input screen have let
|> someone enter the fact that a shipment was made without knowing the number
|> of items shipped?  How would you propose to send the customer a bill, if
|> you don't know whether you sent her/him 3 SPARCStations or 3000? Now suppose
|> that you're the boss of the shipping department, and you find out that your
|> programmer decided to allow NULLs in the amount_shipped field by casually
|> noting that your warehouse contained no more SPARCStations because someone
|> on the loading dock didn't bother to fill in the waybill properly and the
|> fast-knuckled keypuncher blithely skipped over the amount_shipped field,
|> so not only are you going to have to call 200 customers to find out how
|> many you sent them, but you're going to have to call head office and
|> tell them that your order for more stock should have had more than zero
|> items on it.

OK, so Date's example sucks. :-)  Once again, as the DBA, you might not have
control over front-end constraint enforcement.  In your example above, I would
have a number of checks to catch such situations (eg. end of day routines to
catch NULL shipment entries, dock verification of shipment quantities, etc).

|> 
|> And in any case, I would claim that a zero would do as well as a NULL
|> here, since
|> most shipping departments would hardly bother to ship zero units (let alone
|> generate a shipping report with zero units on it).
|> 
|> 
|> If you're doing an outer join, a better(?) example of NULLs cropping
|> up might be
|> orders entered but not yet shipped...
|> 
|> >
|> >If you find that you have lots of nulls, then maybe your schema is not
|> >properly 
|> >normalized.  Try checking for transitive dependencies, I have found that
|> >they
|> >can contribute to an excessive amount of nulls.
|> 
|> Naw, all my schemae are perfectly normalized :-) :-) :-)

Isn't everyone's? 8-O

|> 
|> >Null is a very important semantic concept.  There have been many times I
|> >would have loved to have null when writing 3GL applications (I had to rely
|> >on -999).  As for additional constraints, anytime you deal with an unknown
|> >you have to take
|> >extra measures to account for it.
|> >
|> >What do you think?
|> 
|> Well, back to the status = "unknown" vs status is null; I don't feel that
|> there is any innate superiority of the null value here, as long as one has
|> control over the value of the field (ie, it's not the result of an outer
|> join).  Seems to me that one is semantically using two fields in a lot of
|> cases: qqsv and qqsv_status.
|> 
|> Finally, if I have a table with 6,000,000 records in it, with status NULL
|> in 5,999,999 and status = "complete" in the other, and I use ISQL to
|> execute the following:
|> 
|> 	output to printer
|> 	select status,count(*) from foo
|> 	group by status
|> 	order by status;
|> 
|> I will end up with 100,000 PAGES of output on my laser printer.  On the
|> other hand, if those 5,999,999 records have status = "unknown", I'll get
|> one page.
|> 
|> Again, I (feel I) should emphasize that this particular behaviour may not
|> occur in ALL SQLs, or even be predicted by the ANSI standard.
|> 

Your previous example may indeed be specific to Informix; if this were true
globally, I would put in "unknown" also. :-)

I still maintain NULLs are important, however, in light of past conversations,
I think I should add some remarks.

1.  As I said before, if defaults are appropriate, use them.

2.  I agree with those who say NULLs are not perfect.

3.  If anything, specific UNKNOWN and INAPPROPRIATE values should be added.

I thought your previous post was arguing that the very concept of NULL was
wrong.  If you are arguing that NULL needs to be more specific, then I
probably agree with you.

Sorry for any misunderstanding.

-- Gordon     (gordon@meaddata.com)

clh@tfic.bc.ca (Chris Hermansen) (10/03/90)

In article <30354@netnews.upenn.edu> aaron@grad2.cis.upenn.edu.UUCP (Aaron Watters) writes:
>[Chris Hermanson argues that nulls are bad...]

Geeze,  ^^^^^^^^^ it's HermansEn, ok? :-) :-) :-)

>In article <1990Sep28.231042.5117@tfic.bc.ca> clh@tacitus.UUCP (Chris Hermansen) writes:
>>For example, a field named widget_count is never (realistically) going to
>>have a value less than zero, so if you need an "undefined" value for it, use
>>-1, or -99999999, or some other nice number.  Similarly, a widget_status
>>value of "unknown" tells me more than NULL - with NULL, I can't decide whether
>>the status is genuinely unknown, or whether the system just doesn't know it.
>
>While admitting that this last point may be too deep for me, let's
>once again consider an example.  What if we have two stockrooms
>each of which contains a number of widgets.  Room A has 10 widgets
>and no one knows how many widgets are in room B.  The total number of
>widgets in both rooms is therefore either 9 or -99999989, of course.

That's definitely a problem.  One solution is to make the widget count
NULL in room B, but then you have either 10 (I think this is the `standard')
or NULL (ie undefined; I'm pretty sure this is not the `standard') for the sum.

Putting myself in the shoes of the person writing the query, I would say
that none of the above answers is very enlightening, unless all I really
want to know is if we have five or six (but not eleven!) widgets.

My argument was not really that NULLs are per se bad, just that I
personally don't like the three-state logic that goes with them (and other
out-of-band data values) in most cases.  I guess I also started to flame
about undefined values in general.

At the risk of overstating my case, think of the manager who discovers
that only ten widgets are available (at $50,000 each) when s/he needs
twenty; s/he orders ten more from his/her supplier, ships the ten s/he has,
indicates a back-order for his/her customer, and the next time s/he checks
the database, finds out there were thirty more in room B.  Meanwhile,
his/her customer is p*ssed off at the back order and re-orders the other
ten from the competition.

OK, so that's a bit of a joke, too; my point is that MANY (not ALL)
applications shouldn't admit a default value for fields, because the
implied semantics of a field (eg widget_count) tend to suggest to the
user that there will always be a `reasonable' value.  David Masterson(?sp)
suggested that it might be better to allow defaults (eg NULLs) at
data entry time (since, presumably, there are many cases where the
data entry person may be unable to supply the missing value, and may
even do something weird like enter 5 just to complete the form).  I
would claim that the best thing to do is bounce the form back to the
originator for fixing.

>That was a joke, I presume you realize, but how does one get around
>this summation anomaly?  It seems you need to have something analogous
>to the following:
>	If all widgetcounts are positive then X=sum(widgetcount)
>	otherwise X=-1.
>Furthermore, if you have a large number of conventions like this
>more complex queries (especially if you have complex views) can
>become quite baroque.  If you use actual null values, however,
>the queries are still somewhat baroque, but easier to understand.
>
>A personal interest of mine is the question of whether the
>programmer can ignore the possibility of nulls entirely and have
>the system handle them in a logically uniform way -- which, I
>argue is not currently done automatically.		-aaron.

I would agree; I think your If... is the best solution (ie, if there
are NULLs or other out-of-band values in the set, then the sum,
average, etc should probably be out-of-band as well).  I don't think
this is the way the standard reads; I'm also fairly sure that

select sum(widget_count) from fooey

and

select avg(widget_count) from fooey

will behave as though the rows with NULLs in widget count were not really
in the database.  Certainly this is logical; whether you can therefore
ignore the presence of NULLs is really up to you :-)
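[That is indeed how the standard aggregates behave; an editorial sketch
(Python/sqlite3, data invented) showing SUM and AVG skipping the NULL row
while COUNT(*) still sees it:]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fooey (widget_count INTEGER)")
con.executemany("INSERT INTO fooey VALUES (?)", [(10,), (None,), (20,)])

# SUM and AVG act as though the NULL row weren't in the table...
total = con.execute("SELECT SUM(widget_count) FROM fooey").fetchone()[0]
mean = con.execute("SELECT AVG(widget_count) FROM fooey").fetchone()[0]

# ...but COUNT(*) counts it, so the row is certainly still there.
nrows = con.execute("SELECT COUNT(*) FROM fooey").fetchone()[0]

print(total, mean, nrows)  # 30 15.0 3
```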

Chris Hermansen                         Timberline Forest Inventory Consultants
Voice: 1 604 733 0731                   302 - 958 West 8th Avenue
FAX:   1 604 733 0634                   Vancouver B.C. CANADA
clh@tfic.bc.ca                          V5Z 1E5

C'est ma facon de parler.

davidm@uunet.UU.NET (David S. Masterson) (10/03/90)

In article <30357@netnews.upenn.edu> aaron@grad2.cis.upenn.edu (Aaron Watters)
writes:

   In article <CIMSHOP!DAVIDM.90Sep30013925@uunet.UU.NET> (I) write:
   > 
   >	all people who definitely have an age = 24
   >			or
   >	all people who MAYBE have an age = 24		?
   >

   First, why do you say these are not different aspects to the correct
   answer to a single query?  

First, I'm not sure I understand your statement.  Second, the answer to the
first query is a subset of the answer to the second query.  These two queries
have different meanings where it just so happens that their results are in the
same domain.  If the second query had been "all people who definitely have an
age = 36", then they would have been recognized as two different queries.  The
same is true of the above queries.

			Imagine, if you will, that I pose a very complex
   query built up from complex views whose definitions I do not know.

Then, as far as you (or the model) are concerned, the views can (and should)
be treated as basic relations.

								Suppose
   I want to know those tuples known to satisfy the query and those not-known
   to not-satisfy the query.

The double negative here is throwing me.

   Do I have to pose two queries?

Perhaps, depending on what the double negative means.  On the other hand,
MAYBE qualification results in a superset of definite answer.

   Do each of the views I use have to be defined twice, once for definiteness
   and once for maybeness?  Why?

The VIEW construction is done with primitive operations (like those at the
beginning of the message).  If you want values in the view that are not
precise, then use the MAYBE qualifier, but that will still include the
definite subset.

   Roughly, I would argue that it's okay if the system
   allows the user to ask for maybes and definites, but if s/he doesn't
   specify s/he should get both.

Precisely. 

   Second, note that according to the `semantic completeness' metric
   no reasonable database system will fill the bill -- because the problem is
   NP-complete for a fixed query and varying data. (Teaser.)

You seem to have been making the assumption that definitiveness and maybeness
are distinct sets.  I contend that maybeness is a superset of definitiveness.
What does that do to the NP-complete problem (I never really understood NP
terminology). 

   I disagree that the `relational paradigm' includes `three valued logic' any
   more sensibly than Pascal `includes Prolog.' (IE, you can certainly
   implement one in the other, but that's beside the point.)

Set theory, by definition, must contend with NULL (especially where domains
are concerned) and the NULL-set.  If operations are defined upon sets (as the
relational model does), then those operations must be consistent where NULL is
concerned.  Three-valued logic is the expression of how to specify the results
of standard operators with "unknown" (NULL) operands.  Therefore, three-valued
logic should be a part of the definition of operations against sets.  So, to
be consistent, the relational 'paradigm' should deal with NULLs completely and
consistently.

Having said that, I must say that I am speaking only of the model and not any
one particular implementation.  No implementation (to my knowledge) is yet
complete where NULLs are concerned (can they, therefore, be called "completely
relational"?).
--
====================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

asylvain@felix.UUCP (Alvin E. Sylvain) (10/05/90)

In article <CIMSHOP!DAVIDM.90Sep30013925@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
[ ... ]
>This is better known as 3-valued logic which is a part of the relational model
>(as is 4-valued logic).  Both answers (to both problems) are syntactically
>correct, the true answer is determined by the semantics of the problem (which
>you touched on at the end).
[ ... ]
>BTW, do current OODBs treat 3 or 4 valued logic?  I imagine they could, but do
>current implementations even address this?

Whoa, wait a minute!  Can you explain to this meager humble dum-dum
what is 4 valued logic?  If this is explained much deeply in some
periodic list of FAQ that I failed to locate, please to forgive!
--
=======================Standard Disclaimers Apply=======================
"We're sorry, but the reality you have dialed is no   |            Alvin
longer in service.  Please check the value of pi,     |   "the Chipmunk"
or pray to your local deity for assistance."          |          Sylvain

lugnut@sequent.UUCP (Don Bolton) (10/06/90)

In article <1990Oct1.070456.17087@informix.com> aland@informix.com (alan denney) writes:
>In article <42743@sequent.UUCP> lugnut@sequent.UUCP (Don Bolton) writes:
>|In article <1990Sep20.161204.13014@tfic.bc.ca> clh@tacitus.UUCP (Chris Hermansen) writes:
>|>In article <CIMSHOP!DAVIDM.90Sep19100915@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>|>>In article <PCG.90Sep18192004@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
>|>>(Piercarlo Grandi) writes:
>|>
>|>NULLs almost guarantee strange `gotchas', don't you think?  For instance,
A whole lot of stuff deleted.

>This has to do with the very definition of NULL -- it requires that
>NULL values fail any "not equal" filter.  Since a NULL is by definition
>an "unknown" value, then you don't know that it's not 0, right?
>
>In any case, you can avoid the whole issue by disallowing NULLs in the
>column(s) in question (by creating the column with NOT NULL).

This works great in many applications; however, something like a leads
tracking system, which has many columns that may legitimately be empty,
is still a problem.  NOT NULL is sometimes NOT FEASIBLE :-)
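The NULL comparison behavior alan describes above can be sketched in a
few lines of Python; this is just an illustration of SQL-style
three-valued comparison (with None standing in for NULL), not any
particular engine's code:

```python
# Sketch of SQL-style three-valued comparison: any comparison with
# NULL (modeled here as Python's None) yields UNKNOWN, and a WHERE
# clause keeps only rows whose predicate is definitely True.

UNKNOWN = "unknown"  # the third truth value

def sql_eq(a, b):
    """Return True, False, or UNKNOWN, as SQL's '=' would."""
    if a is None or b is None:
        return UNKNOWN
    return a == b

def sql_ne(a, b):
    """SQL's '!=' -- also UNKNOWN whenever either operand is NULL."""
    eq = sql_eq(a, b)
    return UNKNOWN if eq is UNKNOWN else not eq

def where(rows, predicate):
    """Keep only rows where the predicate is definitely True."""
    return [r for r in rows if predicate(r) is True]

rows = [{"qty": 0}, {"qty": 5}, {"qty": None}]

# The NULL row fails BOTH filters -- it is neither equal nor
# not-equal to 0, because we don't know what its value is.
print(where(rows, lambda r: sql_eq(r["qty"], 0)))   # [{'qty': 0}]
print(where(rows, lambda r: sql_ne(r["qty"], 0)))   # [{'qty': 5}]
```

Note that the NULL row shows up in neither result set, which is exactly
why "disallow NULLs with NOT NULL" is the usual advice when it works.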

>|>It's also really handy that 
>|>
>|>	select qqsv,count(*) from foo
>|>	group by qqsv
>|>
>|>produces a separate line for every row in foo that has a null in qqsv,
>|>don't you think?
>|>
>|ARRGGH, I hate it when that happens, have to update the nulls with the
>|string value "NULL" run the select and then "unupdate" or run several
>|selects and cut and paste. Either way its messy.
>
>There was quite a bit of theoretical debate about which method was
>"proper".  Now that ANSI has spoken, Informix engines have been changed
>to group NULL values together (the way you apparently prefer).  This
>change is effective in the current release - 4.0.
>
And there was *much* rejoicing. 

>|Also with informix, a select count(*) from foo where qqsv is null
>|will produce a SINGLE row with the count. Can you say inconsistent?
>|I knew you could :-)
>
>Uh, can you say "pilot error"?  [-:  
>
>How can you call that inconsistent?  The response you complain about
>is exactly what you asked for: how many rows have NULL for qqsv.
>No GROUPing is involved.
>
Neither was any grouping involved in the other example where nulls were
concerned. :-)

But you see, the result from the "select count(*) from foo where qqsv
is null" example was what I had been raised to expect from the GROUP BY
as well.  By my training they BOTH should have returned a SINGLE row
with the count.

Anyway, it's moot since it's now fixed in the current release.
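For what it's worth, the two GROUP BY behaviors under discussion can be
sketched in Python (my own toy model, not Informix's code; None stands
in for NULL):

```python
# Two ways an engine might GROUP BY a nullable column.
# Pre-4.0 behavior described above: every NULL row is its own group.
# ANSI behavior: all NULL rows collapse into a single group.

rows = [{"qqsv": "a"}, {"qqsv": None}, {"qqsv": "a"}, {"qqsv": None}]

def group_count_nulls_separate(rows, col):
    """Each NULL starts a fresh group; non-NULLs group normally."""
    groups, seen = [], {}
    for r in rows:
        key = r[col]
        if key is None:
            groups.append([key, 1])        # one group per NULL row
        elif key in seen:
            seen[key][1] += 1
        else:
            entry = [key, 1]
            seen[key] = entry
            groups.append(entry)
    return groups

def group_count_ansi(rows, col):
    """All NULLs share one group (the way ANSI settled it)."""
    counts = {}                            # dicts preserve insertion order
    for r in rows:
        key = r[col]
        counts[key] = counts.get(key, 0) + 1
    return [[k, v] for k, v in counts.items()]

print(group_count_nulls_separate(rows, "qqsv"))
# [['a', 2], [None, 1], [None, 1]]  -- one line per NULL row
print(group_count_ansi(rows, "qqsv"))
# [['a', 2], [None, 2]]             -- NULLs grouped together
```

The single-row `count(*) ... where qqsv is null` result is consistent
with the second convention, which is presumably why it felt like the
GROUP BY should match.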

aaron@grad2.cis.upenn.edu (Aaron Watters) (10/08/90)

One variety of 4 valued logic is unknown/true/false/error

eg.

`bush is president' is true.
`dukakis is president' is false.
`all space aliens are green' is unknown.
`the king of North America is bald' is error.

unknown means `this may be true or false or error, but we don't
  know which.'

error means `this can't have any truth value' -- in the example
  above because any truth value requires the existence of someone
  who (last I heard) doesn't exist.		-aaron
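Aaron's four values can be modeled as data; here is a toy sketch (my
own, not from any text) where `error` absorbs negation, since a
statement with no truth value has no negation either:

```python
# Toy model of the unknown/true/false/error variety of 4-valued logic.
# 'error' means the statement cannot have a truth value at all;
# 'unknown' means it has one, but we don't know which.

TRUE, FALSE, UNKNOWN, ERROR = "true", "false", "unknown", "error"

statements = {
    "bush is president": TRUE,
    "dukakis is president": FALSE,
    "all space aliens are green": UNKNOWN,
    "the king of North America is bald": ERROR,   # no such person
}

def negate(v):
    """NOT in this logic: unknown and error are fixed points."""
    if v == TRUE:
        return FALSE
    if v == FALSE:
        return TRUE
    return v   # unknown stays unknown, error stays error

print(negate(statements["dukakis is president"]))               # true
print(negate(statements["all space aliens are green"]))         # unknown
print(negate(statements["the king of North America is bald"]))  # error
```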

davidm@uunet.UU.NET (David S. Masterson) (10/08/90)

In article <152008@felix.UUCP> asylvain@felix.UUCP (Alvin E. Sylvain) writes:

   Whoa, wait a minute!  Can you explain to this meager humble dum-dum
   what is 4 valued logic?  If this is explained much deeply in some
   periodic list of FAQ that I failed to locate, please to forgive!

Hehe, yeah, it threw me for a while when I first saw it in Codd's new
book (it's still a bit confusing, but I really haven't studied it).
Basically, as I understand it, it breaks down like this:

2-valued logic: defines the results of applying relational operators to TRUE
or FALSE operands.

3-valued logic: same as 2-valued logic except operands might be TRUE, FALSE,
or UNKNOWN.

4-valued logic: same as 3-valued logic except operands might be TRUE, FALSE,
NOT-APPLICABLE, or UNKNOWN-BUT-APPLICABLE (I forget if that's the right
terminology).

Check Codd's book, "The Relational Model for Database Management Version 2"
for more in-depth reasoning on 4-valued logic's applicability to relational
databases. 
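As a rough illustration of how the valences stack up, here is AND over
three truth values, extended to four.  This is my own sketch, using one
plausible convention for how NOT-APPLICABLE combines; it is not
transcribed from Codd's tables:

```python
# Sketch of AND in 3- and 4-valued logic, following the breakdown
# above: T, F, UNK (unknown-but-applicable), plus NA (not-applicable).

T, F, UNK, NA = "true", "false", "unknown", "n/a"

def and3(a, b):
    """Three-valued AND: FALSE dominates, then UNKNOWN."""
    if a == F or b == F:
        return F               # FALSE AND anything is FALSE
    if a == UNK or b == UNK:
        return UNK             # can't decide without the operand
    return T

def and4(a, b):
    """Four-valued AND: FALSE still dominates; here NA outranks UNK
    (an assumption on my part, not necessarily Codd's convention)."""
    if a == F or b == F:
        return F
    if a == NA or b == NA:
        return NA
    if a == UNK or b == UNK:
        return UNK
    return T

print(and3(T, UNK))   # unknown
print(and3(F, UNK))   # false
print(and4(T, NA))    # n/a
print(and4(F, NA))    # false
```

The point of the extension is that "no value applies" and "a value
applies but is missing" propagate differently through queries, which is
the distinction 3-valued logic cannot express.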
--
====================================================================
David Masterson					Consilium, Inc.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) (10/08/90)

Just a nit:

>>>>> On 7 Oct 90 20:23:37 GMT, aaron@grad2.cis.upenn.edu (Aaron Watters) said:
Aaron> unknown means `this may be true or false or error, but we don't
Aaron>   know which.'

Aaron> error means `this can't have any truth value'

So far so good.

Aaron>   -- in the example above ["the king of North America is bald"]
Aaron>   because any truth value requires the existence of someone who
Aaron>   (last I heard) doesn't exist.

Clarification: A statement requiring the existence of the non-existent to
be true is simply false, not an error.  However, a domain violation _is_
an error.  (e.g. "I am green years old.")

If "king" is an out-of-band attribute, then the statement is an error; if
"king" is within the defined domain, then the statement is simply false.

	Hope this helps,

Chuck Phillips  MS440
NCR Microelectronics 			chuck.phillips%ftcollins.ncr.com
2001 Danfield Ct.
Ft. Collins, CO.  80525   		...uunet!ncrlnk!ncr-mpd!bach!chuckp