steve@violet.berkeley.edu (Steve Goldfield) (08/09/90)
I've just been debugging a discrepancy between two reports. In one report, the table it drew from had enough columns to have all rows unique. In the other report, I only took a few columns because I was just looking for cumulative totals. I found the discrepancy caused by Ingres' suppression of duplicate rows. The Ingres/Quel reference manual (we are using Release 5.0) says that duplicate rows are not allowed except in a heap structure. The table in question was a heap, but duplicates appear to have been suppressed anyway. Does anyone know whether the unwanted suppression of duplicate rows is a bug or a feature? If it is a feature, it seems an undesireable one to me, since I can think of lots of situations where I would like to allow the possibility of duplicate rows. It seems bothersome to have to include an unwanted column in a table just to keep the rows unique. Any comments? Is there (should there be) a user flag to either turn on or turn off duplicate permission? Steve Goldfield College of Engineering UC Berkeley
mao@eden (Mike Olson) (08/09/90)
In <1990Aug8.230913.2897@agate.berkeley.edu>, steve@violet.berkeley.edu (Steve Goldfield) writes: > Does anyone know whether the unwanted suppression of duplicate > rows is a bug or a feature? If it is a feature, it seems > an undesireable one to me, since I can think of lots of > situations where I would like to allow the possibility of > duplicate rows. It seems bothersome to have to include an > unwanted column in a table just to keep the rows unique. i can't comment on what ingres is doing in its heap access method, but as a general statement, duplicate rows in relational systems are a major no-no. codd explains why in grueling detail in his books; basically, it comes down to the fact that different access paths can yield different results if duplicates are permitted. the fact that most commercial systems (well, okay, all commercial systems) permit dups is testimony to the difficulty of detecting and removing them, and not to their desirability. you shouldn't be encoding meaning into dups. you can add a "number of occurrences" column to rows in the table if you really want to know how many occurrences there are. mike olson postgres research group uc berkeley mao@postgres.berkeley.edu
bg0l+@andrew.cmu.edu (Bruce E. Golightly) (08/09/90)
Sounds like the report used a sort. This is a "feature" of sorting that turns up in all sorts of irritating places, not only Ingres. One work-around is to retrieve something that is sure to be unique, like the tuple ID, and simply not write it in the report. ############################################################################# ! Bruce E. Golightly ! bg0l@andrew.cmu.edu (BeeGeeZeroEl) Chief of Data Base Development ! (412)268-8560 Telecommunications Group ! Carnegie Mellon University ! UCC 117, 4910 Forbes Ave. ! Pittsburgh PA 15213-3890 President, Western Penna IUA ! #############################################################################
lindahl@violet.berkeley.edu (Ken Lindahl 642-0866) (08/09/90)
In article <1990Aug8.230913.2897@agate.berkeley.edu> steve@violet.berkeley.edu (Steve Goldfield) writes: >... >The Ingres/Quel reference manual (we are using Release 5.0) >says that duplicate rows are not allowed except in a heap >structure. The table in question was a heap, but duplicates >appear to have been suppressed anyway. This is just a guess, but let's see if this is the same feature that bit me. Are you doing a "retrieve ... sort by ..."? If so, the result is stored as a cheapsort and duplicate rowd are removed (cf. the "Retrieve" in INGRES/QUEL Reference Manual. >Does anyone know whether the unwanted suppression of duplicate >rows is a bug or a feature? If my guess is correct, it's a feature, since it's in the manual, right? :-) >... >Steve Goldfield >College of Engineering >UC Berkeley Ken Lindahl lindahl@violet.berkeley.edu Advanced Technology Planning, Information Systems and Technology University of California at Berkeley
miket@blia.BLI.COM (Mike Tossy) (08/11/90)
In article <26867@pasteur.Berkeley.EDU>, mao@eden (Mike Olson) writes: > i can't comment on what ingres is doing in its heap access method, but > as a general statement, duplicate rows in relational systems are a major > no-no. codd explains why in grueling detail in his books; basically, > it comes down to the fact that different access paths can yield different > results if duplicates are permitted. the fact that most commercial systems > (well, okay, all commercial systems) permit dups is testimony to the > difficulty of detecting and removing them, and not to their desirability. The ANSI Standard specificly allows duplicates. This is probably unfortunate. (Luckily :-> they aren't very useful since that the standard provides no way to interactively update only one of them.) This is one of those areas where QUEL semantics and SQL semantics are incompatable. We (in Britton Lee days) also had a QUEL based engine and this kind of problem was common. I suspect that, like ourselves, the Ingres folks are having to fix their RDBMS engine to not drop duplicates. (I'm not sure I like the ANSI SQL semantics but at least they are consistant; something the QUEL semantics never were.) > you shouldn't be encoding meaning into dups. you can add a "number of > occurrences" column to rows in the table if you really want to know how > many occurrences there are. Absolutely correct! Duplicates in base tables are a poor idea even when permitted. However unless you really pay attention, it is fairly easy to create temporary derived tables (say for decision support) which do have duplicates because you didn't include all the base tables' primary keys. The SQL semantics are much more likely to give you the desiered answer in this case. Teradata Corporation Mike Tossy ShareBase Division miket@blia.bli.com 14600 Wichester Blvd (408) 378-7575 ext2200 Los Gatos, CA 95030 (Formerly: Britton Lee, Inc.) These are only my opinions.
bg0l+@andrew.cmu.edu (Bruce E. Golightly) (09/24/90)
While the definition of RDBMS may preclude duplicate rows, the real world certainly does not! During the original development of the telecommunications data bases for the university (over four years ago!), this fact cost me some gray hairs that I could ill afford. The granularityh of the telephone reporting mechanisms available from Bell permits apparent duplicates. The development of an RDBMS based system to replace COBOL and flat files was held up at one point because there really were duplicates in the data. As far as the phone system was concerned, there were two calls that originated from the same extension to the same external number at the same time and with the same duration. This seemingly impossible phenomenon was the result of two extremely short calls. Other examples can surely b contributed by the community. The ANSI standard allows duplicates because the real world must cope with things that are identical for all intents and purposes. If the model doesn't allow the developer and/or user to get close enough to handling the real world, it must ultimately be discarded.
gordon@meaddata.com (Gordon Edwards) (09/26/90)
In article <UazGSjK00Uh_M1XWMj@andrew.cmu.edu>, bg0l+@andrew.cmu.edu (Bruce E. Golightly) writes: |> While the definition of RDBMS may preclude duplicate rows, the real world |> certainly does not! During the original development of the telecommunications |> data bases for the university (over four years ago!), this fact cost |> me some gray hairs that I could ill afford. |> [example deleted. -gordon] |> |> The ANSI standard allows duplicates because the real world must cope |> with things that are identical for all intents and purposes. If the |> model doesn't allow the developer and/or user to get close enough to |> handling the real world, it must ultimately be discarded. The relational model is based heavily on set theory, which, as you probably know, doesn't allow/consider duplicates. To satisfy the relational model and handle duplicates at the same time, provide a system maintained (or surrogate) key at the time the record is inserted into the table. Then, use the surrogate to perform foreign key mapping. A number of DBMS products (eg. Informix, Ingres) have built-in system maintained keys. -- Gordon (gordon@meaddata.com)
cdm@gem-hy.Berkeley.EDU (Dale Cook) (09/28/90)
In article <UazGSjK00Uh_M1XWMj@andrew.cmu.edu>, bg0l+@andrew.cmu.edu (Bruce E. Golightly) writes: |> While the definition of RDBMS may preclude duplicate rows, the real world |> certainly does not! During the original development of the telecommunications |> data bases for the university (over four years ago!), this fact cost |> me some gray hairs that I could ill afford. |> |> The granularityh of the telephone reporting mechanisms available from |> Bell permits apparent duplicates. The development of an RDBMS based |> system to replace COBOL and flat files was held up at one point because |> there really were duplicates in the data. As far as the phone system was |> concerned, there were two calls that originated from the same extension |> to the same external number at the same time and with the same duration. |> This seemingly impossible |> phenomenon was the result of two extremely short calls. Other examples |> can surely b contributed by the community. |> |> The ANSI standard allows duplicates because the real world must cope |> with things that are identical for all intents and purposes. If the |> model doesn't allow the developer and/or user to get close enough to |> handling the real world, it must ultimately be discarded. Sorry, but I disagree. If you have two identical rows, you have not sufficiently identified your primary key. Data analysis techniques I am familiar with would indicate a dumb sequence number as the best alternative for the situation you describe. Note that there is no requirement that you ever use this primary key, but its existence is a logical requirement, for lack of a better one. Dale Cook <Sorry about the length of the disclaimer which follows> Neither the United States Government nor the Idaho National Engineering Laboratory nor any of their employees, makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial products, process, or service by trade name, trademark manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the Idaho National Engineering Laboratory. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government nor the Idaho National Engineering Laboratory, and shall not be used for advertising or product endorsement purposes.
bg0l+@andrew.cmu.edu (Bruce E. Golightly) (09/28/90)
A number of people have pointed out the obvious - use some form of surrogate key to resolve the duplicate row problem. This is, obviously, exactly what I did. The real point of the original message was that the real world really does have to cope with duplicate items. Any model that does not permit the existance of duplicates has to be "fixed" to allow the handling of real cases. If it cannot be fixed/hacked/monkeyed with. this it should be discarded.
jeff@Unify.com (Jeff Mischkinsky) (10/05/90)
In article <12314@blia.BLI.COM>, miket@blia.BLI.COM (Mike Tossy) writes: > In article <26867@pasteur.Berkeley.EDU>, mao@eden (Mike Olson) writes: > > > ... the fact that most commercial systems > > (well, okay, all commercial systems) permit dups is testimony to the > > difficulty of detecting and removing them, and not to their desirability. Actually I think the reason we don't remove them is beacause our customers demand them. There is nothing particularly hard about detecting and removing duplicates, although it may be costly. If you don't want dups then use DISTINCT, UNIQUE, or whatever keyword your favorite query language supports in you queries, and/or specify NOT NULL, UNIQUE in your schema definitions. > > The ANSI Standard specificly allows duplicates. This is probably unfortunate. > (Luckily :-> they aren't very useful since that the standard provides no > way to interactively update only one of them.) This is one of those areas Although you can update them individually using a cursor. ...deleted... > Absolutely correct! Duplicates in base tables are a poor idea even when > permitted. However unless you really pay attention, it is fairly easy to > create temporary derived tables (say for decision support) which do have > duplicates because you didn't include all the base tables' primary keys. > The SQL semantics are much more likely to give you the desiered answer in > this case. If you really care use DISTINCT to get rid of the dups. -- Jeff Mischkinsky internet: jeff@unify.com Unify Corporation ...!{pyramid,csusac}!unify!jeff 3870 Rosin Court voice: (916) 920-9092 fax: (916) 921-5340 Sacramento, CA 95834 ICBMS: 38 38 40 N / 120 28 10 W
dc9l+@andrew.cmu.edu (David H. Campbell) (10/06/90)
The issue with duplicates for me (personally, not officially) is that they are often needed in order to model reality. In some cases there may be two exactly duplicate transactions which are BOTH important. It is not proper (in the REAL world, not some relational theory) to ARBITRARILY delete or ignore one of them. This type of behavior makes it virtually impossible to use a relational system for financial applications, and even when workarounds are possible the most powerful functions of SQL are not usable. I don't personally advocate having duplicate transactions in some cases, but as a designer I have to work with the situation as it exists. I don't think it's proper for a relational system to IMPOSE a rigid methodology on the world. Computers are supposed to serve people and organizations, not vice versa. David Campbell P.S. These opinions are my own and not necessarily my employer's.