[comp.sources.wanted] removing duplicate lines from a text file???

stevep@dgp.toronto.edu (Steve Portigal) (03/26/90)

    Is there any simple way to remove duplicate lines from a text file?  If
anyone could suggest even just what command to use, I would appreciate it.
  thanks very much, Steve Portigal
-- 
******************************************************************
I can't be held responsible for anything...I am brain dead from all
this terminal time.
******************************************************************
"Black and Huge" -- GWAR

pbiron@weber.ucsd.edu (Paul Biron) (03/26/90)

In article <1990Mar25.182039.25565@jarvis.csri.toronto.edu> stevep@dgp.toronto.edu (Steve Portigal) writes:
>
>    Is there any simple way to remove duplicate lines from a text file?  If
>anyone could suggest even just what command to use, I would appreciate it.
>  thanks very much, Steve Portigal

The easiest way is
    sort -u orig_file > new_file

The -u flag to sort(1) says "keep only the unique lines".
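
A quick illustration (the file names here are only examples):

    $ cat orig_file
    foo
    bar
    foo
    baz
    $ sort -u orig_file
    bar
    baz
    foo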


Paul Biron     (pbiron@ucsd.edu)       (619) 534-5758
Social Sciences DataBase Project, Central University Library
University of California, San Diego
La Jolla, Ca. 92093

guy@auspex.auspex.com (Guy Harris) (03/27/90)

>>    Is there any simple way to remove duplicate lines from a text file?

	...

>The easiest way is
>    sort -u orig_file > new_file

Assuming, of course, that the order of the lines in the file isn't
important.
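
For example, given a file containing (hypothetical input)

    foo
    bar
    foo

you presumably want "foo" to stay in first position, but sort -u
will put "bar" first.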

moraes@cs.toronto.edu (Mark Moraes) (03/27/90)

>>>    Is there any simple way to remove duplicate lines from a text file?
>>    sort -u orig_file > new_file
>Assuming, of course, that the order of the lines in the file isn't
>important.

In that case, perhaps something like

awk '{printf "%8d %s\n", NR, $0}' | 
	sort -u +1 | 
	sort -n | 
	sed 's/^.........//'

assuming, of course, that none of the lines are longer than the
maximum lengths your awk/sed can handle.
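
Stage by stage, that's the same pipeline annotated (the file names
are just examples):

awk '{printf "%8d %s\n", NR, $0}' <orig_file |	# tag each line with its line number
	sort -u +1 |		# drop duplicates, comparing only the text after the tag
	sort -n |		# re-sort numerically on the tag to restore original order
	sed 's/^.........//' >new_file	# strip the 9-character tag (8 digits + a space)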

mjr@atreus.umiacs.umd.edu (03/27/90)

	read the man page for uniq(1)
mjr.
--
    Noblest lives and noblest dies who makes and keeps his self-made laws.
        - Sir Richard Burton.

tchrist@convex.COM (Tom Christiansen) (03/27/90)

In article <90Mar26.232441est.2199@smoke.cs.toronto.edu> moraes@cs.toronto.edu (Mark Moraes) writes:
>>>>    Is there any simple way to remove duplicate lines from a text file?
>>>    sort -u orig_file > new_file
>>Assuming, of course, that the order of the lines in the file isn't
>>important.
>
>In that case, perhaps something like
>
>awk '{printf "%8d %s\n", NR, $0}' | 
>	sort -u +1 | 
>	sort -n | 
>	sed 's/^.........//'
>
>assuming, of course, that none of the lines are longer than the
>maximum lengths your awk/sed can handle.

1.  This seems like truly massive overkill.
2.  That max line length thing can be a real bitch.

If your duplicate lines are adjacent, just use uniq(1).
If not, this seems much clearer and cheaper:

    perl -ne 'print unless $seen{$_}++;' 

If the duplicate lines ARE adjacent AND you don't have uniq(1) AND
you don't want to chew up as much memory as the previous line, do this:

    perl -ne 'print $last = $_ unless $_ eq $last;'

Perl doesn't have the silly arbitrary line-length restrictions 
of sed and awk, and the code is often much clearer: compare
the logic of the awk/sort/sort/sed example with that of the perl ones.
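
(To run either of these over a file rather than a pipe, just name the
file on the command line, e.g.

    perl -ne 'print unless $seen{$_}++;' orig_file > new_file

where orig_file and new_file stand in for whatever you call yours.)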

Followups either to comp.lang.perl or alt.religion.computers, 
depending on your religion. :-)


--tom
--

    Tom Christiansen                       {uunet,uiucdcs,sun}!convex!tchrist 
    Convex Computer Corporation                            tchrist@convex.COM
		 "EMACS belongs in <sys/errno.h>: Editor too big!"

cf@kcl-cs.UUCP (Andy Whitcroft) (03/29/90)

How about 'uniq' as in ...

	uniq <ifile >ofile

... you may have some other constraint on the data to remove, but if I've
understood the problem correctly, this should work.

Andy.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Andy Whitcroft |  UUCP: ...!kcl-cs!cf            | Dept. of Computing,
        (C.f.) | JANET: cf%kcl-cs.UUCP@uk.ac.ukc |      Kings College London.

tkevans@fallst.UUCP (Tim Evans) (03/29/90)

In article <90Mar26.232441est.2199@smoke.cs.toronto.edu>, moraes@cs.toronto.edu (Mark Moraes) writes:
> >>>    Is there any simple way to remove duplicate lines from a text file?
> >>    sort -u orig_file > new_file
> 
From TFM:

	uniq reads the input file comparing adjacent lines.  In the
	normal case, the second and succeeding copies of repeated
	lines are removed; the remainder is written on the output file.

Note that this says _adjacent_ lines.
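
A quick demonstration (hypothetical input):

	$ cat file
	foo
	bar
	foo
	$ uniq file
	foo
	bar
	foo

The second "foo" survives because it isn't next to the first.  You can
get around that with "sort file | uniq" -- but that's just sort -u
again, and loses the original order.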

-- 
UUCP:		{rutgers|ames|uunet}!mimsy!woodb!fallst!tkevans
INTERNET:	tkevans%fallst@wb3ffv.ampr.org
Tim Evans	2201 Brookhaven Ct, Fallston, MD 21047  (301) 965-3286