csa18@seq1.keele.ac.uk (R.J. Husmo) (12/21/90)
Q: Is there a way to automatically delete duplicate tuples from a table/all tables in a db using SQL? And why do relational dbs allow duplicate tuples, anyway?
davidm@uunet.UU.NET (David S. Masterson) (12/22/90)
>>>>> On 21 Dec 90 10:13:14 GMT, csa18@seq1.keele.ac.uk (R.J. Husmo) said:
R.J.> Q: Is there a way to automatically delete duplicate tuples from a
R.J.> table/all tables in a db using SQL?
The typical way is to put a unique index on the primary key of each table.
R.J.> And why do relational dbs allow duplicate tuples, anyway?
A tradeoff between performance and soundness of model. Without an index on
the primary key, its typically difficult to determine that a duplicate row has
been entered into the system with sequentially searching the table. This can
be a rather time-consuming operation for every update to take. The same
effect can be achieved by a unique index or constraint on the primary key of
the table.
--
====================================================================
David Masterson Consilium, Inc.
(415) 691-6311 640 Clyde Ct.
uunet!cimshop!davidm Mtn. View, CA 94043
====================================================================
"If someone thinks they know what I said, then I didn't say it!"
ben@vamp.sybase.com (ben ullrich) (12/30/90)
In article <CIMSHOP!DAVIDM.90Dec21095239@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes: >>>>>> On 21 Dec 90 10:13:14 GMT, csa18@seq1.keele.ac.uk (R.J. Husmo) said: > >R.J.> Q: Is there a way to automatically delete duplicate tuples from a >R.J.> table/all tables in a db using SQL? > >The typical way is to put a unique index on the primary key of each table. i don't think this answers his/her question. it sounds to me like s/he has duplicates in his/her tables and needs to get rid of them. trying to add a unique index won't do it, since the duplicates have to be gone before the index may be successfully built. i agree that using a unique index on a key in any given table is the best way to go, as it then also conforms to normal form rules. in short, there is no *automatic* method to remove duplicates, short of programming your database engine to silently reject duplicates as they are inserted, or by letting a unique index raise an exception in the same instance. you can, however, run queries against your database to expose and/or clean out the duplicates as you see fit. the following article was posted just about a year ago, when this question was last asked in this forum. i hope mr. may doesn't mind its reposting without his permission. |>Article 3144 of comp.databases: |>Path: sybase!mtxinu!ucbvax!tut.cis.ohio-state.edu!purdue!decwrl!shlump.nac.dec.com!arkham.enet.dec.com!nmeser!may |>From: may@28182.dec.com (Patrick May) |>Newsgroups: comp.databases |>Subject: Re: SQL query for duplicates |>Message-ID: <36@arkham.enet.dec.com> |>Date: 22 Dec 89 17:16:58 GMT |>Sender: news@arkham.enet.dec.com |>Organization: Digital Equipment Corporation |> |> I can think of two ways to remove duplicates from an Oracle database |>table. The first will work with any SQL-based database: |> |> 1. Create a temporary table with the same structure as the |> one containing the duplicates. |> |> 2. INSERT INTO new_tab |> SELECT DISTINCT * |> FROM dup_tab; |> |> 3. Delete all records from dup_tab and refill it from new_tab. |> |> This works if entire rows are duplicated. If you just want to find |>duplicate values in a particular field, the following should be helpful: |> |> SELECT DISTINCT col_name, count(*) |> FROM dup_tab |> GROUP BY col_name |> HAVING count(*) > 1; |> |> This will display all duplicate values for col_name and the number of |>times that value appears in dup_tab. ..ben ------ ben ullrich only i do the talking here -- not my employer. ben@sybase.com {pyramid,pacbell,sun,lll-tis}!sybase!ben "why waste time learning when ignorance is instantaneous?"-hobbes on calvin
csa18@seq1.keele.ac.uk (R.J. Husmo) (01/02/91)
In article <12224@sybase.sybase.com> you write: >In article <CIMSHOP!DAVIDM.90Dec21095239@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes: >>>>>>> On 21 Dec 90 10:13:14 GMT, csa18@seq1.keele.ac.uk (R.J. Husmo) said: >> >>R.J.> Q: Is there a way to automatically delete duplicate tuples from a >>R.J.> table/all tables in a db using SQL? >> >>The typical way is to put a unique index on the primary key of each table. > >i don't think this answers his/her question. it sounds to me like s/he has >duplicates in his/her tables and needs to get rid of them. [...] Indeed. Spot on. Let me describe the scenario, as there may be a clever way of solving a specific problem which does not have a general solution. Our DBA, who has only been a DBA for a few months, was given the task of enlarging the storage space for our DB. Having looked in the DB manuals, he decided that the easiest/safest way to do this would be to export all the tables and their associated data, and then replace the old DB with a new and larger version. When setting up the new DB, he included all the system tables and data. He then imported the old tables and data. This meant that some of the system tuples were loaded twice. So when doing, for example, a 'help' each help topic appears twice. At this point our DBA asked for help. He spent three days going through the process above, and does not wish to do so again. So I posted the query here, as it seemed like an interesting problem. We are using ORACLE RDBMS V6.0.27.8.2, by the way. >|> >|> 1. Create a temporary table with the same structure as the >|> one containing the duplicates. >|> >|> 2. INSERT INTO new_tab >|> SELECT DISTINCT * >|> FROM dup_tab; >|> >|> 3. Delete all records from dup_tab and refill it from new_tab. >|> We thought of doing this, but there are some problems: We do not know the format of the tables involved Nor do we know the name of all the tables involved, although I suspect that we can quite easily find out There are rather a lot of tables involved (61 candidates) We thought of some other ways of doing it, too, some of which included some rather creative shell script writing, but none seemed to automate the process sufficiently (As you pointed out in your posting). The more I think about it, the more I think I'll convince our DBA to put off writing his thesis and do it again. Properly, this time. Reidar.
IAN@SLACVM.SLAC.STANFORD.EDU (01/03/91)
It was suggested that duplicate tables be created, and then an INSERT statement with SELECT DISTINCT be used. The problem was that the structure of the tables involved wasn't known, and there were quite a few of them. Why not use a statement akin to Create table nodupes as select distinct * from table_with_dupes; This way you do not have to know the structure beforehand. I also did some experimenting. I just tried this...haven't checked it out thoroughly... Let's say your table with duplicates looks like this SQL> SELECT * FROM DELTEST; COL1 COL2 ----- ----- ABCDE UVXYZ ABCDE UVXYZ ABCDE UVXYZ ABCDE UVXYZ FGHIJ QPRST FGHIJ QPRST FGHIJ QPRST and COL1 is the key to the table. Create a temporary table thus create table temp(row_id,col11) as select rowid,col1 from deltest so you now have SQL> SELECT * FROM TEMP; ROW_ID COL1 ------------------ ----- 00000940.0000.0006 ABCDE 00000940.0001.0006 ABCDE 00000940.0002.0006 ABCDE 00000940.0003.0006 ABCDE 00000940.0004.0006 FGHIJ 00000940.0005.0006 FGHIJ 00000940.0006.0006 FGHIJ 7 rows selected. obviously your values for row_id are going to differ. Now issue the following command DELETE FROM DELTEST WHERE DELTEST.COL1 IN (SELECT COL1 FROM TEMP WHERE TEMP.ROW_ID < DELTEST.ROWID) / checking the deltest table now reveals SQL> select * from deltest; COL1 COL2 ----- ----- ABCDE UVXYZ FGHIJ QPRST I much prefer the first suggestion, as the latter one requires that you know the "unique" key to the table. Ian MacGregor Stanford Linear Accelerator Center
mao@eden.Berkeley.EDU (Mike Olson) (01/03/91)
In <775@keele.keele.ac.uk>, csa18@seq1.keele.ac.uk (R.J. Husmo) writes
(regarding the removal of duplicates from tables under oracle)
> We do not know the format of the tables involved
the query
SELECT * FROM FOO WHERE 1 = 0
returns the target list for queries on foo; although it doesn't solve your
whole problem, it is a useful query to know when you want to know what a
table contains without looking at any tuples.
mike olson
postgres research group
uc berkeley
mao@postgres.berkeley.edu
larned@uhccux.uhcc.Hawaii.Edu (Paul Larned) (01/05/91)
>Indeed. Spot on. Let me describe the scenario, as there may be a clever >way of solving a specific problem which does not have a general solution. >Our DBA, who has only been a DBA for a few months, was given the task of >enlarging the storage space for our DB. Having looked in the DB manuals, >he decided that the easiest/safest way to do this would be to export all >the tables and their associated data, and then replace the old DB with a >new and larger version. >When setting up the new DB, he included all the system tables and data. >He then imported the old tables and data. This meant that some of the >system tuples were loaded twice. So when doing, for example, a 'help' >each help topic appears twice. At this point our DBA asked for help. He >spent three days going through the process above, and does not wish to do >so again. So I posted the query here, as it seemed like an interesting >problem. I have also run into this problem attempting to enlarge the database. I found that the problem was due to the order in which they asked you to run catalog.ora and perform your dataimport. If you run catalog.ora first, as they suggest, you will create the system tables and insert the data. Then, when you do your data import, those tuples will be appended a second time. If, however, you import your data first, the system tables will be established. Then, when catalog.ora runs, it first deletes the system table if it finds it, then recreates it. In this way, you get all the system tables, but withough duplicate tuples. I would suggest that you start over and reinitialize. It's easier than trying to find all the duplicates. Paul Larned HI State Dept. of Health larned@uhccux.uhcc.hawaii.edu
ded@kps.UUCP (David Edwards /DP) (01/07/91)
In article <775@keele.keele.ac.uk> csa18@seq1.kl.ac.uk (R.J. Husmo) writes: >In article <12224@sybase.sybase.com> you write: >>In article <CIMSHOP!DAVIDM.90Dec21095239@uunet.UU.NET> cimshop!davidm@uunet.UU.NET (David S. Masterson) writes: >>>>>>>> On 21 Dec 90 10:13:14 GMT, csa18@seq1.keele.ac.uk (R.J. Husmo) said: >>> >>>R.J.> Q: Is there a way to automatically delete duplicate tuples from a >>>R.J.> table/all tables in a db using SQL? Why not do the following: Suppose F1 is the field that has the undesired duplicates in table T1. 1) Lock the table: select F1 from T1 X where rowid > (select min(rowid) from T1 where F1=X.F1) for update 2) delete 'em: delete from T1 X where rowid > (select min(rowid) from T1 where T1=X.T1) I think that solves the problem. You can put an index on T1 to make it go faster. At the risk of stating the obvious put a unique index on it after the deletion. David Edwards Kuwait Petroleum Sweden AB Stockholm Sweden
ubiquity@cs.utexas.edu (Richard Hoffman) (01/07/91)
> Q: Is there a way to automatically delete duplicate tuples from a > table/all tables in a db using SQL? When you say "automatically", do you mean "whenever they occur, without human intervention"? If so, the answer is "no", but you can automatically *prevent* duplicates using any of the following: + a UNIQUE index on one or more columns in the table + a UNIQUE clause on one or more column definitions + a PRIMARY KEY clause on the table definition The first is widely available; the second is part of the ANS89 standard, and the third is part of the Integrity Enhancement of ANS89. If you simply want some SQL to delete duplicate tuples once they have occured, you will need more than one SQL statement. Create a duplicate of the table you want to clean up, and then insert into the new table with a subselect that selects all the distinct rows in the old table. The way you construct this subselect will vary depending on your product. With OS/2 EE Database Manager, the simplest way would be: INSERT INTO NEW ((SELECT * FROM OLD) UNION (SELECT * FROM OLD)) since UNION tosses duplicates. -- Richard Hoffman IBM Entry Systems Division (512) 823-1822 1529 Ben Crenshaw Way Austin, TX 78746 "Life is a gamble at terrible odds; (512) 327-9232 if it were a bet you wouldn't take it" (Tom Stoppard)
ddruker@.com (Daniel Druker) (01/09/91)
I've extracted the following file from Oracle's On-Line Support System. Among other things, It contains a SQL Statement that will help eliminate your duplicate tuples. (I also think the one about finding the three highest paid employees is interesting) For those of you Oracle Customers out there who have purchased support, you can log on to the OLS system yourselves and peruse several hundred articles, helpful hints, and release notes at your leisure. All Oracle supported customers may use this service. Contact your support representative for more info. Regards, - Dan Daniel Druker Senior Sales Consultant Oracle Corporation ------------------ Please insert standard disclaimer here... ------------------ SQL*Plus Tricks Dana Hausman VMS Support Group Oracle Worldwide Support SQL*Plus Tricks --------------- Q: How can I pass a parameter to SQL*Plus which contains spaces? A: For example, if the SQL script TEST.SQL looks like: SELECT '&1' FROM dual; Then you can use either of the following syntax to pass a value which contains spaces as a parameter: SQLPLUS scott/tiger @test "'This is a test.'" SQLPLUS scott/tiger @test """This is a test""" This is a VMS-specific workaround. Q: How can I assign unique values to rows in a table? A: V5: Create a new column (SEQNO, for example) which is NOT NULL. Then use the pseudo-column ROWNUM as follows: UPDATE table_name SET seqno = rownum; V6: CREATE SEQUENCE sequence_name START WITH 1 INCREMENT BY 1; UPDATE table_name SET seqno = sequence_name.NEXTVAL; Finally, a unique index should be created on this column. Q: How can I perform a query like "Find the 3 highest paid employees?" A: Using the EMP table as an example: SELECT ename, sal FROM emp A WHERE 3 > (SELECT count(*) FROM emp B WHERE B.sal > A.sal) ORDER BY sal desc; This query selects an employee record if there are less than 3 employees with higher salaries. Q: How can I delete duplicate rows from a table? A: There are 2 ways to do this: 1. CREATE TABLE emp2 AS SELECT distinct * FROM emp; DROP TABLE emp; RENAME emp2 TO emp; 2. Step #1 below is necessary in ORACLE v5 because to use the pseudo-column ROWID to update or delete a row, the row must first be selected for update to obtain the required lock. If this step is not performed, in ORACLE v5 you will get the error ORA-0061: row updated or deleted by ROWID must first be read for update. Step #1: SELECT empno FROM emp A WHERE rowid > (SELECT min(rowid) FROM emp B WHERE B.empno = A.empno) FOR UPDATE OF empno; Step #2: DELETE FROM emp A WHERE rowid > (SELECT min(rowid) FROM emp B WHERE B.empno = A.empno); Q: How can I have NULL values appear first in v6? A: In v5 NULL values were sorted lowest on ascending sequences. This behavior has been reversed in v6, but can be changed using the NVL command by replacing the NULL value with a very low value. For example, SELECT ename, sal, comm FROM emp ORDER BY nvl(comm, -999); ---------------------------------------------------------------------- Document #99246.503 OLS Bulletin Board Revised: 24-MAY-1990