stevep@dgp.toronto.edu (Steve Portigal) (03/26/90)
Is there any simple way to remove duplicate lines from a text file?  If
anyone could suggest even what command to use, I would appreciate it.

	thanks very much,
	Steve Portigal
--
******************************************************************
I can't be held responsible for anything...I am brain dead from all
this terminal time.
******************************************************************
"Black and Huge" -- GWAR
pbiron@weber.ucsd.edu (Paul Biron) (03/26/90)
In article <1990Mar25.182039.25565@jarvis.csri.toronto.edu> stevep@dgp.toronto.edu (Steve Portigal) writes:
>Is there any simple way to remove duplicate lines from a text file?  If
>anyone could suggest even what command to use, I would appreciate it.

The easiest way is

	sort -u orig_file > new_file

The -u flag to sort(1) says "keep only the unique lines".

Paul Biron      (pbiron@ucsd.edu)       (619) 534-5758
Social Sciences DataBase Project, Central University Library
University of California, San Diego
La Jolla, Ca. 92093
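Paul's suggestion in action, on a small invented file (the sample data is
for illustration only, not from the thread):

```shell
# A hypothetical file containing duplicate, non-adjacent lines.
printf 'apple\nbanana\napple\ncherry\nbanana\n' > orig_file

# sort -u sorts the lines and keeps one copy of each.
sort -u orig_file > new_file

cat new_file
# apple
# banana
# cherry
```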
guy@auspex.auspex.com (Guy Harris) (03/27/90)
>> Is there any simple way to remove duplicate lines from a text file? ...
>The easiest way is
>	sort -u orig_file > new_file

Assuming, of course, that the order of the lines in the file isn't
important.
moraes@cs.toronto.edu (Mark Moraes) (03/27/90)
>>> Is there any simple way to remove duplicate lines from a text file?
>>	sort -u orig_file > new_file
>Assuming, of course, that the order of the lines in the file isn't
>important.

In that case, perhaps something like

	awk '{printf "%8d %s\n", NR, $0}' |
		sort -u +1 |
		sort -n |
		sed 's/^.........//'

assuming, of course, that none of the lines are longer than the
maximum lengths your awk/sed can handle.
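Mark's pipeline numbers each line, deduplicates on the text field, then
re-sorts by line number to restore the original order. A sketch with
invented sample data; note that a modern sort(1) no longer accepts the
old `+1` field syntax, for which `-k 2` is the equivalent:

```shell
# Prefix each line with its number (8-wide field plus a space),
# keep one copy of each distinct text field, re-sort numerically,
# then strip the 9-character prefix again.
printf 'b\na\nb\nc\na\n' |
    awk '{printf "%8d %s\n", NR, $0}' |
    sort -u -k 2 |
    sort -n |
    sed 's/^.........//'
# b
# a
# c
```

The first occurrence of each line survives, in its original position.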
mjr@atreus.umiacs.umd.edu (03/27/90)
read the man page for uniq(1)

mjr.
--
Noblest lives and noblest dies who makes and keeps his self-made laws.
	- Sir Richard Burton.
tchrist@convex.COM (Tom Christiansen) (03/27/90)
In article <90Mar26.232441est.2199@smoke.cs.toronto.edu> moraes@cs.toronto.edu (Mark Moraes) writes:
>>>>Is there any simple way to remove duplicate lines from a text file?
>>>	sort -u orig_file > new_file
>>Assuming, of course, that the order of the lines in the file isn't
>>important.
>
>In that case, perhaps something like
>
>	awk '{printf "%8d %s\n", NR, $0}' |
>		sort -u +1 |
>		sort -n |
>		sed 's/^.........//'
>
>assuming, of course, that none of the lines are longer than the
>maximum lengths your awk/sed can handle.

1. This seems like truly massive overkill.
2. That max line length thing can be a real bitch.

If your duplicate lines are adjacent, just use uniq(1).  If not, this
seems much clearer and cheaper:

	perl -ne 'print unless $seen{$_}++;'

If the duplicate lines ARE adjacent AND you don't have uniq(1) AND you
don't want to chew up as much memory as the previous line, do this:

	perl -ne 'print $last = $_ unless $_ eq $last;'

Perl doesn't have the silly arbitrary line-length restrictions of sed
and awk, and the code is often much clearer: compare the logic of the
awk/sort/sort/sed example with that of the perl ones.

Followups either to comp.lang.perl or alt.religion.computers, depending
on your religion.  :-)

--tom
--
Tom Christiansen    {uunet,uiucdcs,sun}!convex!tchrist
Convex Computer Corporation    tchrist@convex.COM
"EMACS belongs in <sys/errno.h>: Editor too big!"
cf@kcl-cs.UUCP (Andy Whitcroft) (03/29/90)
How about 'uniq' as in ...

	uniq <ifile >ofile

... you may have some other constraint on the data to remove, but if I
get it right this should work.

Andy.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Andy Whitcroft          | UUCP:  ...!kcl-cs!cf            | Dept. of Computing,
(C.f.)                  | JANET: cf%kcl-cs.UUCP@uk.ac.ukc | Kings College London.
tkevans@fallst.UUCP (Tim Evans) (03/29/90)
In article <90Mar26.232441est.2199@smoke.cs.toronto.edu>, moraes@cs.toronto.edu (Mark Moraes) writes:
>>>>Is there any simple way to remove duplicate lines from a text file?
>>>	sort -u orig_file > new_file

From TFM:

	uniq reads the input file comparing adjacent lines.  In the
	normal case, the second and succeeding copies of repeated
	lines are removed; the remainder is written on the output
	file.

Note that this says _adjacent_ lines.
--
UUCP:		{rutgers|ames|uunet}!mimsy!woodb!fallst!tkevans
INTERNET:	tkevans%fallst@wb3ffv.ampr.org
Tim Evans	2201 Brookhaven Ct, Fallston, MD 21047	(301) 965-3286
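The adjacency caveat is easy to demonstrate with a small invented file:
plain uniq leaves non-adjacent duplicates alone, while sorting first
removes them at the cost of the original line order:

```shell
printf 'a\nb\na\n' > file   # duplicate 'a' lines, not adjacent

uniq file                   # nothing removed: prints a, b, a
sort file | uniq            # duplicates removed: prints a, b

rm file
```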