[comp.databases] Duplicated tuples problem

csa18@seq1.keele.ac.uk (R.J. Husmo) (12/21/90)

Q: Is there a way to automatically delete duplicate tuples from a
table/all tables in a db using SQL?

And why do relational dbs allow duplicate tuples, anyway?

davidm@uunet.UU.NET (David S. Masterson) (12/22/90)

>>>>> On 21 Dec 90 10:13:14 GMT, csa18@seq1.keele.ac.uk (R.J. Husmo) said:

R.J.> Q: Is there a way to automatically delete duplicate tuples from a
R.J.> table/all tables in a db using SQL?

The typical way is to put a unique index on the primary key of each table.

R.J.> And why do relational dbs allow duplicate tuples, anyway?

A tradeoff between performance and soundness of model.  Without an index on
the primary key, its typically difficult to determine that a duplicate row has
been entered into the system with sequentially searching the table.  This can
be a rather time-consuming operation for every update to take.  The same
effect can be achieved by a unique index or constraint on the primary key of
the table.
--
====================================================================
David Masterson					Consilium, Inc.
(415) 691-6311					640 Clyde Ct.
uunet!cimshop!davidm				Mtn. View, CA  94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"

ben@vamp.sybase.com (ben ullrich) (12/30/90)

In article <CIMSHOP!DAVIDM.90Dec21095239@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>>>>>> On 21 Dec 90 10:13:14 GMT, csa18@seq1.keele.ac.uk (R.J. Husmo) said:
>
>R.J.> Q: Is there a way to automatically delete duplicate tuples from a
>R.J.> table/all tables in a db using SQL?
>
>The typical way is to put a unique index on the primary key of each table.

i don't think this answers his/her question.  it sounds to me like s/he has
duplicates in his/her tables and needs to get rid of them.  trying to add
a unique index won't do it, since the duplicates have to be gone before the
index may be successfully built.  i agree that using a unique index on a key in
any given table is the best way to go, as it then also conforms to normal form
rules.

in short, there is no *automatic* method to remove duplicates, short of
programming your database engine to silently reject duplicates as they are
inserted, or by letting a unique index raise an exception in the same
instance.  you can, however, run queries against your database to expose and/or
clean out the duplicates as you see fit.

the following article was posted just about a year ago, when this question was
last asked in this forum.  i hope mr. may doesn't mind its reposting without his
permission.

|>Article 3144 of comp.databases:
|>Path: sybase!mtxinu!ucbvax!tut.cis.ohio-state.edu!purdue!decwrl!shlump.nac.dec.com!arkham.enet.dec.com!nmeser!may
|>From: may@28182.dec.com (Patrick May)
|>Newsgroups: comp.databases
|>Subject: Re: SQL query for duplicates
|>Message-ID: <36@arkham.enet.dec.com>
|>Date: 22 Dec 89 17:16:58 GMT
|>Sender: news@arkham.enet.dec.com
|>Organization: Digital Equipment Corporation
|>
|>     I can think of two ways to remove duplicates from an Oracle database
|>table.  The first will work with any SQL-based database:
|>
|>          1.  Create a temporary table with the same structure as the
|>              one containing the duplicates.
|>
|>          2.  INSERT INTO new_tab
|>              SELECT DISTINCT *
|>              FROM dup_tab;
|>
|>          3.  Delete all records from dup_tab and refill it from new_tab.
|>
|>     This works if entire rows are duplicated.  If you just want to find
|>duplicate values in a particular field, the following should be helpful:
|>
|>          SELECT DISTINCT col_name, count(*)
|>          FROM dup_tab
|>          GROUP BY col_name
|>          HAVING count(*) > 1;
|>
|>     This will display all duplicate values for col_name and the number of
|>times that value appears in dup_tab.


..ben
------
ben ullrich			 only i do the talking here -- not my employer.
ben@sybase.com			       {pyramid,pacbell,sun,lll-tis}!sybase!ben
  "why waste time learning when ignorance is instantaneous?"-hobbes on calvin

csa18@seq1.keele.ac.uk (R.J. Husmo) (01/02/91)

In article <12224@sybase.sybase.com> you write:
>In article <CIMSHOP!DAVIDM.90Dec21095239@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>>>>>>> On 21 Dec 90 10:13:14 GMT, csa18@seq1.keele.ac.uk (R.J. Husmo) said:
>>
>>R.J.> Q: Is there a way to automatically delete duplicate tuples from a
>>R.J.> table/all tables in a db using SQL?
>>
>>The typical way is to put a unique index on the primary key of each table.
>
>i don't think this answers his/her question.  it sounds to me like s/he has
>duplicates in his/her tables and needs to get rid of them.
[...]

Indeed. Spot on. Let me describe the scenario, as there may be a clever
way of solving a specific problem which does not have a general solution.

Our DBA, who has only been a DBA for a few months, was given the task of
enlarging the storage space for our DB. Having looked in the DB manuals,
he decided that the easiest/safest way to do this would be to export all
the tables and their associated data, and then replace the old DB with a 
new and larger version.

When setting up the new DB, he included all the system tables and data.
He then imported the old tables and data. This meant that some of the
system tuples were loaded twice. So when doing, for example, a 'help'
each help topic appears twice. At this point our DBA asked for help. He
spent three days going through the process above, and does not wish to do
so again. So I posted the query here, as it seemed like an interesting 
problem.

We are using ORACLE RDBMS V6.0.27.8.2, by the way.
>|>
>|>          1.  Create a temporary table with the same structure as the
>|>              one containing the duplicates.
>|>
>|>          2.  INSERT INTO new_tab
>|>              SELECT DISTINCT *
>|>              FROM dup_tab;
>|>
>|>          3.  Delete all records from dup_tab and refill it from new_tab.
>|>
We thought of doing this, but there are some problems:
	We do not know the format of the tables involved
	Nor do we know the name of all the tables involved, although
		I suspect that we can quite easily find out
	There are rather a lot of tables involved (61 candidates)
We thought of some other ways of doing it, too, some of which included
some rather creative shell script writing, but none seemed to automate 
the process sufficiently (As you pointed out in your posting).
The more I think about it, the more I think I'll convince our DBA to put
off writing his thesis and do it again. Properly, this time.

Reidar.

IAN@SLACVM.SLAC.STANFORD.EDU (01/03/91)

It was suggested that duplicate tables be created, and then an INSERT
statement with SELECT DISTINCT be used.  The problem was that the structure
of the tables involved wasn't known, and there were quite a few of them.  Why
not use a statement akin to

            Create table nodupes
            as select distinct * from table_with_dupes;

This way you do not have to know the structure beforehand.  I also did some
experimenting.
I just tried this...haven't checked it out thoroughly...
Let's say your table with duplicates looks like this

SQL> SELECT * FROM DELTEST;

COL1  COL2
----- -----
ABCDE UVXYZ
ABCDE UVXYZ
ABCDE UVXYZ
ABCDE UVXYZ
FGHIJ QPRST
FGHIJ QPRST
FGHIJ QPRST

and COL1 is the key to the table.  Create a temporary table thus

create table temp(row_id,col11)
as select rowid,col1 from deltest

so you now have


SQL> SELECT * FROM TEMP;

ROW_ID             COL1
------------------ -----
00000940.0000.0006 ABCDE
00000940.0001.0006 ABCDE
00000940.0002.0006 ABCDE
00000940.0003.0006 ABCDE
00000940.0004.0006 FGHIJ
00000940.0005.0006 FGHIJ
00000940.0006.0006 FGHIJ

7 rows selected.

obviously your values for row_id are going to differ.  Now issue the
following command


DELETE FROM DELTEST
WHERE DELTEST.COL1 IN (SELECT COL1 FROM TEMP WHERE
TEMP.ROW_ID < DELTEST.ROWID)
/

checking the deltest table now reveals


SQL> select * from deltest;

COL1  COL2
----- -----
ABCDE UVXYZ
FGHIJ QPRST

I much prefer the first suggestion, as the latter one requires that you
know the "unique" key to the table.

Ian MacGregor  Stanford Linear Accelerator Center

mao@eden.Berkeley.EDU (Mike Olson) (01/03/91)

In <775@keele.keele.ac.uk>, csa18@seq1.keele.ac.uk (R.J. Husmo) writes
(regarding the removal of duplicates from tables under oracle)

> We do not know the format of the tables involved

the query

	SELECT * FROM FOO WHERE 1 = 0

returns the target list for queries on foo; although it doesn't solve your
whole problem, it is a useful query to know when you want to know what a
table contains without looking at any tuples.

					mike olson
					postgres research group
					uc berkeley
					mao@postgres.berkeley.edu

larned@uhccux.uhcc.Hawaii.Edu (Paul Larned) (01/05/91)

>Indeed.  Spot on.  Let me describe the scenario, as there may be a clever
>way of solving a specific problem which does not have a general solution.

>Our DBA, who has only been a DBA for a few months, was given the task of
>enlarging the storage space for our DB.  Having looked in the DB manuals,
>he decided that the easiest/safest way to do this would be to export all
>the tables and their associated data, and then replace the old DB with a
>new and larger version.

>When setting up the new DB, he included all the system tables and data.
>He then imported the old tables and data.  This meant that some of the
>system tuples were loaded twice.  So when doing, for example, a 'help'
>each help topic appears twice.  At this point our DBA asked for help.  He
>spent three days going through the process above, and does not wish to do
>so again.  So I posted the query here, as it seemed like an interesting
>problem.

I have also run into this problem attempting to enlarge the database.  I
found that the problem was due to the order in which they asked you to run
catalog.ora and perform your dataimport.  If you run catalog.ora first, as
they suggest, you will create the system tables and insert the data.  Then,
when you do your data import, those tuples will be appended a second time.
If, however, you import your data first, the system tables will be
established.  Then, when catalog.ora runs, it first deletes the system
table if it finds it, then recreates it.  In this way, you get all the
system tables, but withough duplicate tuples.

I would suggest that you start over and reinitialize.  It's easier than
trying to find all the duplicates.

Paul Larned
HI State Dept. of Health
larned@uhccux.uhcc.hawaii.edu

ded@kps.UUCP (David Edwards /DP) (01/07/91)

In article <775@keele.keele.ac.uk> csa18@seq1.kl.ac.uk (R.J. Husmo) writes:
>In article <12224@sybase.sybase.com> you write:
>>In article <CIMSHOP!DAVIDM.90Dec21095239@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>>>>>>>> On 21 Dec 90 10:13:14 GMT, csa18@seq1.keele.ac.uk (R.J. Husmo) said:
>>>
>>>R.J.> Q: Is there a way to automatically delete duplicate tuples from a
>>>R.J.> table/all tables in a db using SQL?

Why not do the following:

Suppose F1 is the field that has the undesired duplicates in table T1.

1) Lock the table:

	select F1 from T1 X
	where rowid > (select min(rowid) from T1 where F1=X.F1)
	for update

2) delete 'em:

	delete from T1 X
	where rowid > (select min(rowid) from T1
		       where T1=X.T1)

I think that solves the problem. You can put an index on T1 to make it
go faster.  At the risk of stating the obvious put a unique index on it
after the deletion.

David Edwards
Kuwait Petroleum Sweden AB
Stockholm Sweden

ubiquity@cs.utexas.edu (Richard Hoffman) (01/07/91)

> Q: Is there a way to automatically delete duplicate tuples from a
> table/all tables in a db using SQL?

When you say "automatically", do you mean "whenever they occur, without
human intervention"?  If so, the answer is "no", but you can automatically
*prevent* duplicates using any of the following:

+ a UNIQUE index on one or more columns in the table
+ a UNIQUE clause on one or more column definitions
+ a PRIMARY KEY clause on the table definition

The first is widely available; the second is part of the ANS89 standard,
and the third is part of the Integrity Enhancement of ANS89.

If you simply want some SQL to delete duplicate tuples once they have
occured, you will need more than one SQL statement.
Create a duplicate of the table you want to clean up,
and then insert into the new table with a subselect that selects all the
distinct rows in the old table.  The way you construct this subselect
will vary depending on your product.  With OS/2 EE Database Manager,
the simplest way would be:
 
   INSERT INTO NEW ((SELECT * FROM OLD) UNION (SELECT * FROM OLD))

since UNION tosses duplicates.
-- 
Richard Hoffman             IBM Entry Systems Division            (512) 823-1822
1529 Ben Crenshaw Way
Austin, TX 78746         "Life is a gamble at terrible odds; 
(512) 327-9232            if it were a bet you wouldn't take it"  (Tom Stoppard)

ddruker@.com (Daniel Druker) (01/09/91)

I've extracted the following file from Oracle's On-Line Support System.
Among other things, It contains a SQL Statement that will help 
eliminate your duplicate tuples. (I also think the one about finding 
the three highest paid employees is interesting) 

For those of you Oracle Customers out there who have purchased support, 
you can log on to the OLS system yourselves and peruse several hundred 
articles, helpful hints, and release notes at your leisure. All Oracle
supported customers may use this service. Contact your support 
representative for more info.

Regards, 

- Dan

Daniel Druker
Senior Sales Consultant 
Oracle Corporation

------------------
Please insert standard disclaimer here...
------------------

SQL*Plus Tricks
Dana Hausman
VMS Support Group
Oracle Worldwide Support


                           SQL*Plus Tricks
                           ---------------


  Q: How can I pass a parameter to SQL*Plus which contains spaces?

  A: For example, if the SQL script TEST.SQL looks like:

       SELECT '&1'
       FROM dual;

     Then you can use either of the following syntax to pass a value
     which contains spaces as a parameter:

       SQLPLUS scott/tiger @test "'This is a test.'"
       SQLPLUS scott/tiger @test """This is a test"""

     This is a VMS-specific workaround.


  Q: How can I assign unique values to rows in a table?

  A: V5: Create a new column (SEQNO, for example) which is NOT NULL.
         Then use the pseudo-column ROWNUM as follows:

           UPDATE table_name SET seqno = rownum;

     V6:   CREATE SEQUENCE sequence_name START WITH 1 INCREMENT BY 1;
           UPDATE table_name SET seqno = sequence_name.NEXTVAL;

         Finally, a unique index should be created on this column.


  Q: How can I perform a query like "Find the 3 highest paid
     employees?"

  A: Using the EMP table as an example:

       SELECT ename, sal
       FROM emp A
       WHERE 3 > (SELECT count(*) FROM emp B WHERE B.sal > A.sal)
       ORDER BY sal desc;

     This query selects an employee record if there are less than 3
     employees with higher salaries.


  Q: How can I delete duplicate rows from a table?

  A: There are 2 ways to do this:

       1. CREATE TABLE emp2 AS SELECT distinct * FROM emp;
          DROP TABLE emp;
          RENAME emp2 TO emp;

       2. Step #1 below is necessary in ORACLE v5 because to use the
          pseudo-column ROWID to update or delete a row, the row must
          first be selected for update to obtain the required lock. If
          this step is not performed, in ORACLE v5 you will get the
          error ORA-0061: row updated or deleted by ROWID must first
          be read for update.

            Step #1:

              SELECT empno FROM emp A
              WHERE rowid > (SELECT min(rowid) FROM emp B
                             WHERE B.empno = A.empno)
              FOR UPDATE OF empno;

            Step #2:

              DELETE FROM emp A
              WHERE rowid > (SELECT min(rowid) FROM emp B
                              WHERE B.empno = A.empno);


  Q: How can I have NULL values appear first in v6?

  A: In v5 NULL values were sorted lowest on ascending sequences.
     This behavior has been reversed in v6, but can be changed using
     the NVL command by replacing the NULL value with a very low
     value. For example,

       SELECT ename, sal, comm
       FROM emp
       ORDER BY nvl(comm, -999);

----------------------------------------------------------------------
Document #99246.503      OLS Bulletin Board       Revised: 24-MAY-1990