[comp.databases] Ingres Question: Duplicate Rows

steve@violet.berkeley.edu (Steve Goldfield) (08/09/90)

I've just been debugging a discrepancy between two reports.
In one report, the table it drew from had enough columns to
have all rows unique. In the other report, I only took a few
columns because I was just looking for cumulative totals. I
found that the discrepancy was caused by Ingres' suppression of
duplicate rows.

The Ingres/Quel reference manual (we are using Release 5.0)
says that duplicate rows are not allowed except in a heap
structure. The table in question was a heap, but duplicates
appear to have been suppressed anyway.

Does anyone know whether the unwanted suppression of duplicate
rows is a bug or a feature? If it is a feature, it seems
an undesirable one to me, since I can think of lots of
situations where I would like to allow the possibility of
duplicate rows. It seems bothersome to have to include an
unwanted column in a table just to keep the rows unique.

Any comments? Is there (should there be) a user flag to
either turn on or turn off duplicate permission?

Steve Goldfield
College of Engineering
UC Berkeley

mao@eden (Mike Olson) (08/09/90)

In <1990Aug8.230913.2897@agate.berkeley.edu>, steve@violet.berkeley.edu
(Steve Goldfield) writes:

> Does anyone know whether the unwanted suppression of duplicate
> rows is a bug or a feature? If it is a feature, it seems
> an undesirable one to me, since I can think of lots of
> situations where I would like to allow the possibility of
> duplicate rows. It seems bothersome to have to include an
> unwanted column in a table just to keep the rows unique.

i can't comment on what ingres is doing in its heap access method, but
as a general statement, duplicate rows in relational systems are a major
no-no.  codd explains why in grueling detail in his books; basically,
it comes down to the fact that different access paths can yield different
results if duplicates are permitted.  the fact that most commercial systems
(well, okay, all commercial systems) permit dups is testimony to the
difficulty of detecting and removing them, and not to their desirability.

you shouldn't be encoding meaning into dups.  you can add a "number of
occurrences" column to rows in the table if you really want to know how
many occurrences there are.
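
for example (sql-ish sketch; table and column names invented):

    -- hypothetical schema: one row per distinct call, plus a count
    create table calls
    (   extension   char(4),
        dialed      char(12),
        duration    integer,
        n_occur     integer     -- how many identical calls were seen
    );

    -- bump the count instead of inserting a duplicate row
    update calls
    set    n_occur = n_occur + 1
    where  extension = '2200'
    and    dialed    = '14155551212'
    and    duration  = 1;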

					mike olson
					postgres research group
					uc berkeley
					mao@postgres.berkeley.edu

bg0l+@andrew.cmu.edu (Bruce E. Golightly) (08/09/90)

Sounds like the report used a sort. This is a "feature" of sorting that
turns up in all sorts of irritating places, not only Ingres.

One work-around is to retrieve something that is sure to be unique, like
the tuple ID, and simply not write it in the report.
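
Something like this (QUEL from memory; table and column names invented):

    range of c is calls
    retrieve (c.extension, c.number, c.duration, rowid = c.tid)
        sort by extension

Since no two tids are ever equal, nothing gets folded together, and you
simply leave that last column out of the report layout.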

#############################################################################
                                     !                                      
Bruce E. Golightly                   !  bg0l@andrew.cmu.edu (BeeGeeZeroEl)  
Chief of Data Base Development       !  (412)268-8560                       
Telecommunications Group             !                                      
Carnegie Mellon University           !  UCC 117, 4910 Forbes Ave.           
                                     !  Pittsburgh PA 15213-3890           
President, Western Penna IUA         !
#############################################################################

lindahl@violet.berkeley.edu (Ken Lindahl 642-0866) (08/09/90)

In article <1990Aug8.230913.2897@agate.berkeley.edu> steve@violet.berkeley.edu (Steve Goldfield) writes:
>...
>The Ingres/Quel reference manual (we are using Release 5.0)
>says that duplicate rows are not allowed except in a heap
>structure. The table in question was a heap, but duplicates
>appear to have been suppressed anyway.

This is just a guess, but let's see if this is the same feature
that bit me. Are you doing a "retrieve ... sort by ..."? If so,
the result is stored as a cheapsort and duplicate rows are removed
(cf. the "retrieve" entry in the INGRES/QUEL Reference Manual).

>Does anyone know whether the unwanted suppression of duplicate
>rows is a bug or a feature? 

If my guess is correct, it's a feature, since it's in the manual, right? :-)

>...
>Steve Goldfield
>College of Engineering
>UC Berkeley

Ken Lindahl				lindahl@violet.berkeley.edu
Advanced Technology Planning,
Information Systems and Technology
University of California at Berkeley

miket@blia.BLI.COM (Mike Tossy) (08/11/90)

In article <26867@pasteur.Berkeley.EDU>, mao@eden (Mike Olson) writes:
 
> i can't comment on what ingres is doing in its heap access method, but
> as a general statement, duplicate rows in relational systems are a major
> no-no.  codd explains why in grueling detail in his books; basically,
> it comes down to the fact that different access paths can yield different
> results if duplicates are permitted.  the fact that most commercial systems
> (well, okay, all commercial systems) permit dups is testimony to the
> difficulty of detecting and removing them, and not to their desirability.

The ANSI Standard specifically allows duplicates. This is probably unfortunate.
(Luckily :->  they aren't very useful, since the standard provides no
way to interactively update only one of them.)  This is one of those areas
where QUEL semantics and SQL semantics are incompatible.  We (in Britton Lee
days) also had a QUEL-based engine and this kind of problem was common.  I
suspect that, like ourselves, the Ingres folks are having to fix their RDBMS
engine to not drop duplicates.

(I'm not sure I like the ANSI SQL semantics, but at least they are consistent;
something the QUEL semantics never were.)
 
> you shouldn't be encoding meaning into dups.  you can add a "number of
> occurrences" column to rows in the table if you really want to know how
> many occurrences there are.

Absolutely correct!  Duplicates in base tables are a poor idea even when
permitted.  However, unless you really pay attention, it is fairly easy to
create temporary derived tables (say, for decision support) which do have
duplicates because you didn't include all the base tables' primary keys.
The SQL semantics are much more likely to give you the desired answer in
this case.
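
For example (made-up call-detail table, where call_id is the primary key):

    -- call_id is left out of the projection, so two distinct calls that
    -- happen to share extension, dialed number, and duration come out
    -- as identical rows
    select extension, dialed, duration
    from   calls
    where  call_date = '08/08/90';

Under the SQL semantics both rows survive, so a SUM(duration) over the
result is still right; a QUEL-style engine that quietly folds them into
one row hands you a wrong total, which is just the kind of discrepancy
that started this thread.
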
                                                Teradata Corporation
Mike Tossy                                      ShareBase Division
miket@blia.bli.com                              14600 Winchester Blvd
(408) 378-7575 ext2200                          Los Gatos, CA 95030
					(Formerly: Britton Lee, Inc.)

	 These are only my opinions.

bg0l+@andrew.cmu.edu (Bruce E. Golightly) (09/24/90)

While the definition of RDBMS may preclude duplicate rows, the real world
certainly does not! During the original development of the telecommunications
data bases for the university (over four years ago!), this fact cost 
me some gray hairs that I could ill afford. 

The granularity of the telephone reporting mechanisms available from
Bell permits apparent duplicates. The development of an RDBMS-based
system to replace COBOL and flat files was held up at one point because
there really were duplicates in the data. As far as the phone system was
concerned, there were two calls that originated from the same extension
to the same external number at the same time and with the same duration.
This seemingly impossible phenomenon was the result of two extremely
short calls. Other examples can surely be contributed by the community.

The ANSI standard allows duplicates because the real world must cope
with things that are identical for all intents and purposes. If the
model doesn't allow the developer and/or user to get close enough to
handling the real world, it must ultimately be discarded.

gordon@meaddata.com (Gordon Edwards) (09/26/90)

In article <UazGSjK00Uh_M1XWMj@andrew.cmu.edu>, bg0l+@andrew.cmu.edu
(Bruce E. Golightly) writes:
|> While the definition of RDBMS may preclude duplicate rows, the real world
|> certainly does not! During the original development of the telecommunications
|> data bases for the university (over four years ago!), this fact cost 
|> me some gray hairs that I could ill afford. 
|> 
[example deleted. -gordon]
|> 
|> The ANSI standard allows duplicates because the real world must cope
|> with things that are identical for all intents and purposes. If the
|> model doesn't allow the developer and/or user to get close enough to
|> handling the real world, it must ultimately be discarded.

The relational model is based heavily on set theory, which, as you probably 
know, doesn't allow/consider duplicates.  To satisfy the relational model and
handle duplicates at the same time, provide a system-maintained (or surrogate)
key at the time the record is inserted into the table.  Then, use the surrogate
to perform foreign key mapping.  A number of DBMS products (e.g., Informix,
Ingres) have built-in system-maintained keys.
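
A sketch (how the key actually gets generated varies from product to
product; the table and column names here are invented):

    -- call_id is the surrogate, assigned by the system (or by a
    -- sequence/trigger) when the row is inserted
    create table calls
    (   call_id     integer not null unique,
        extension   char(4),
        dialed      char(12),
        duration    integer
    );

    -- other tables refer to a call through the surrogate rather than
    -- through the (possibly duplicated) natural attributes
    create table call_charges
    (   call_id     integer not null,   -- foreign key to calls.call_id
        amount      decimal(7,2)
    );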

-- Gordon     (gordon@meaddata.com)

cdm@gem-hy.Berkeley.EDU (Dale Cook) (09/28/90)

In article <UazGSjK00Uh_M1XWMj@andrew.cmu.edu>, bg0l+@andrew.cmu.edu
(Bruce E. Golightly) writes:
|> While the definition of RDBMS may preclude duplicate rows, the real world
|> certainly does not! During the original development of the telecommunications
|> data bases for the university (over four years ago!), this fact cost 
|> me some gray hairs that I could ill afford. 
|> 
|> The granularity of the telephone reporting mechanisms available from
|> Bell permits apparent duplicates. The development of an RDBMS-based
|> system to replace COBOL and flat files was held up at one point because
|> there really were duplicates in the data. As far as the phone system was
|> concerned, there were two calls that originated from the same extension
|> to the same external number at the same time and with the same duration.
|> This seemingly impossible phenomenon was the result of two extremely
|> short calls. Other examples can surely be contributed by the community.
|> 
|> The ANSI standard allows duplicates because the real world must cope
|> with things that are identical for all intents and purposes. If the
|> model doesn't allow the developer and/or user to get close enough to
|> handling the real world, it must ultimately be discarded.

Sorry, but I disagree.  If you have two identical rows, you have not
sufficiently identified your primary key.  Data analysis techniques
I am familiar with would indicate a dumb sequence number as the best
alternative for the situation you describe.

Note that there is no requirement that you ever use this primary key,
but its existence is a logical requirement, for lack of a better one.

Dale Cook

<Sorry about the length of the disclaimer which follows>

Neither the United States Government nor the Idaho National
Engineering Laboratory nor any of their employees, makes any warranty,
expressed or implied, or assumes any legal liability or responsibility
for the accuracy, completeness, or usefulness of any information,
product, or process disclosed, or represents that its use would not
infringe privately owned rights.  Reference herein to any specific
commercial products, process, or service by trade name, trademark,
manufacturer, or otherwise, does not necessarily constitute or imply
its endorsement, recommendation, or favoring by the United States
Government or the Idaho National Engineering Laboratory.  The views and
opinions of authors expressed herein do not necessarily state or reflect
those of the United States Government nor the Idaho National Engineering
Laboratory, and shall not be used for advertising or product endorsement
purposes.

bg0l+@andrew.cmu.edu (Bruce E. Golightly) (09/28/90)

A number of people have pointed out the obvious - use some form of surrogate
key to resolve the duplicate row problem. This is, obviously, exactly
what I did. The real point of the original message was that the real
world really does have to cope with duplicate items. Any model that does
not permit the existence of duplicates has to be "fixed" to allow the
handling of real cases. If it cannot be fixed/hacked/monkeyed with, then
it should be discarded.
 

jeff@Unify.com (Jeff Mischkinsky) (10/05/90)

In article <12314@blia.BLI.COM>, miket@blia.BLI.COM (Mike Tossy) writes:
> In article <26867@pasteur.Berkeley.EDU>, mao@eden (Mike Olson) writes:
>  
> > ...   the fact that most commercial systems
> > (well, okay, all commercial systems) permit dups is testimony to the
> > difficulty of detecting and removing them, and not to their desirability.

Actually I think the reason we don't remove them is because our customers
demand them.  There is nothing particularly hard about detecting and
removing duplicates, although it may be costly.  If you don't want
dups then use DISTINCT, UNIQUE, or whatever keyword your favorite
query language supports in your queries, and/or specify NOT NULL, UNIQUE
in your schema definitions.
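
For instance (table and column names made up):

    -- keep dups out of a query result
    select distinct extension, dialed
    from   calls;

    -- keep dups out of the table in the first place
    create table extensions
    (   extension   char(4) not null unique,
        owner       char(30)
    );
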
> 
> The ANSI Standard specificly allows duplicates. This is probably unfortunate.
> (Luckily :->  they aren't very useful since that the standard provides no 
> way to interactively update only one of them.)  This is one of those areas

Although you can update them individually using a cursor.
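
Roughly like this (embedded SQL sketch; host variables, table and column
names invented):

    exec sql declare one_dup cursor for
        select extension, dialed, duration
        from   calls
        where  extension = :ext
        for update of duration;

    exec sql open one_dup;
    exec sql fetch one_dup into :e, :d, :len;   /* stop on the row you want */
    exec sql update calls
        set   duration = :newlen
        where current of one_dup;
    exec sql close one_dup;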

...deleted...

> Absolutely correct!  Duplicates in base tables are a poor idea even when
> permitted.  However unless you really pay attention, it is fairly easy to
> create temporary derived tables (say for decision support) which do have
> duplicates because you didn't include all the base tables' primary keys.
> The SQL semantics are much more likely to give you the desired answer in
> this case.
If you really care, use DISTINCT to get rid of the dups.

--
Jeff Mischkinsky		internet: jeff@unify.com
Unify Corporation		          ...!{pyramid,csusac}!unify!jeff
3870 Rosin Court		voice: (916) 920-9092 fax: (916) 921-5340
Sacramento, CA 95834		ICBMS: 38 38 40 N / 120 28 10 W

dc9l+@andrew.cmu.edu (David H. Campbell) (10/06/90)

The issue with duplicates for me (personally, not officially) is that
they are often needed in order to model reality.  In some cases there
may be two exactly duplicate transactions which are BOTH important.  It
is not proper (in the REAL world, not some relational theory) to
ARBITRARILY delete or ignore one of them.  This type of behavior makes
it virtually impossible to use a relational system for financial
applications, and even when workarounds are possible the most powerful
functions of SQL are not usable.  I don't personally advocate having
duplicate transactions in some cases, but as a designer I have to work
with the situation as it exists.  I don't think it's proper for a
relational system to IMPOSE a rigid methodology on the world.  Computers
are supposed to serve people and organizations, not vice versa.


                                                David Campbell

P.S. These opinions are my own and not necessarily my employer's.