[comp.unix.questions] Need help ** removing duplicate rows **

c60b-3ac@web.berkeley.edu (Eric Thompson) (10/31/90)

I have a few very long files that contain rows of ASCII data.  Each row
looks something like this (not the actual data here):

a:A:b:c:d:e:f:g:h:i:j:k:l:m
a:B:b:c:d:e:f:g:h:i:j:k:l:m
a:C:b:c:d:e:f:g:h:i:j:k:l:m
a:D:b:c:d:e:f:g:h:i:j:k:l:m
b:A:n:o:p:q:s:t:u:v:w:x:y:z
c:A:x:a:x:b:x:c:d:a:m:l:v:x
d:A:m:l:k:j:i:h:g:f:e:d:c:b
d:B:m:l:k:j:i:h:g:f:e:d:c:b
d:C:m:l:k:j:i:h:g:f:e:d:c:b

It's the second column that's important.  If there are multiple rows that
are exactly the same except for the second column, I want to GET RID of them.
If the row is unique (for example, the ones starting with "b" and "c" above)
then it should stay.  Sounds like what I need is a way to filter out rows
that are duplicate except in the second column.

Any hints?  I'll take anything, really.  Please MAIL your replies, since I
doubt this is of general interest.  Thanks again.

Eric Thompson    c60b-3ac@web.berkeley.edu  ...!ucbvax!web!c60b-3ac

merlyn@iwarp.intel.com (Randal Schwartz) (10/31/90)

In article <1990Oct30.234654.23547@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
| I have a few very long files that contain rows of ASCII data.  Each row
| looks something like this (not the actual data here):
| 
| a:A:b:c:d:e:f:g:h:i:j:k:l:m
| a:B:b:c:d:e:f:g:h:i:j:k:l:m
| a:C:b:c:d:e:f:g:h:i:j:k:l:m
| a:D:b:c:d:e:f:g:h:i:j:k:l:m
| b:A:n:o:p:q:s:t:u:v:w:x:y:z
| c:A:x:a:x:b:x:c:d:a:m:l:v:x
| d:A:m:l:k:j:i:h:g:f:e:d:c:b
| d:B:m:l:k:j:i:h:g:f:e:d:c:b
| d:C:m:l:k:j:i:h:g:f:e:d:c:b
| 
| It's the second column that's important.  If there are multiple rows that
| are exactly the same except for the second column, I want to GET RID of them.
| If the row is unique (for example, the ones starting with "b" and "c" above)
| then it should stay.  Sounds like what I need is a way to filter out rows
| that are duplicate except in the second column.

A one-liner in Perl:

perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'

Fast enough?
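
Spelled out, for anyone puzzled by the subscript: $seen{$a,$c} joins its two
subscripts with Perl's $; separator, so the key is effectively column 1 plus
columns 3 through the end.  A longhand sketch of the same idea:

  perl -ne '
      ($first, $junk, $rest) = split(/:/, $_, 3);  # $junk is column 2, ignored
      print unless $seen{$first, $rest}++;         # print first occurrence of each key
  '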

print "Just another Perl hacker,"
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel put the 'backward' in 'backward compatible'..."=========/

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (10/31/90)

In article <1990Oct31.003627.641@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
: In article <1990Oct30.234654.23547@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
: | I have a few very long files that contain rows of ASCII data.  Each row
: | looks something like this (not the actual data here):
: | 
: | a:A:b:c:d:e:f:g:h:i:j:k:l:m
: | a:B:b:c:d:e:f:g:h:i:j:k:l:m
: | a:C:b:c:d:e:f:g:h:i:j:k:l:m
: | a:D:b:c:d:e:f:g:h:i:j:k:l:m
: | b:A:n:o:p:q:s:t:u:v:w:x:y:z
: | c:A:x:a:x:b:x:c:d:a:m:l:v:x
: | d:A:m:l:k:j:i:h:g:f:e:d:c:b
: | d:B:m:l:k:j:i:h:g:f:e:d:c:b
: | d:C:m:l:k:j:i:h:g:f:e:d:c:b
: | 
: | It's the second column that's important.  If there are multiple rows that
: | are exactly the same except for the second column, I want to GET RID of them.
: | If the row is unique (for example, the ones starting with "b" and "c" above)
: | then it should stay.  Sounds like what I need is a way to filter out rows
: | that are duplicate except in the second column.
: 
: A one-liner in Perl:
: 
: perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
: 
: Fast enough?

Maybe, but he said they were very long files, and that may mean more than
you'd want to store in an associative array, even with virtual memory.
Presuming the files are sorted reasonably, you can get away with this:

perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'
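
The substitution deletes the first ":field" (that is, column 2) from a copy
of the line, so consecutive rows get compared on everything else.  The same
one-liner with comments, as a sketch:

  perl -ne '
      ($this = $_) =~ s/:[^:]*//;   # strip column 2 from a copy of the row
      print if $this ne $that;      # print only when the key changes
      $that = $this;                # remember the key for the next row
  '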

Of course, someone will post a solution using cut and uniq, which will be
fine if you don't mind losing the second field.  Or swapping the first
two fields around.  I'll leave the awk and sed solutions to someone else.
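
For what it's worth, that cut and uniq version would run something like this
("yourfile" is a stand-in, and this assumes a cut that takes open-ended field
ranges; like the Perl above it only catches adjacent duplicates, and column 2
really is gone from the output):

  cut -d: -f1,3- yourfile | uniq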

Larry

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/31/90)

In article <10182@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> In article <1990Oct31.003627.641@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
> : In article <1990Oct30.234654.23547@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
      [ if multiple (consecutive?) rows of colon-separated columns ]
      [ are identical except for the second column, scrap 'em ]
> : perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
> : Fast enough?
  [ as happens with every Perl program posted to the net, Larry points ]
  [ out how inefficient this can be: ]
> Maybe, but he said they were very long files, and that may mean more than
> you'd want to store in an associative array, even with virtual memory.
> Presuming the files are sorted reasonably, you can get away with this:
> perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'

That does look like what Eric was asking for, but what if the file is
not sorted? Is there a fast Perl solution?

> Of course, someone will post a solution using cut and uniq, which will be
> fine if you don't mind losing the second field.  Or swapping the first
> two fields around. 

cut? uniq? Why? There's already a tool perfectly matched to the job:

  sort -u -t: +0 -1 +2

sort already knows how to work in limited memory. If the input is
already sorted,

  sort -m -u -t: +0 -1 +2

should do the trick. Both of these solutions are easy to figure out, easy
to type, very fast even on long files, and quite portable.
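
For reference, +0 -1 restricts the first key to field 1, and +2 keys on
fields 3 through the end.  On a sort that understands POSIX-style -k keys
(an assumption about your sort), the same thing would be spelled:

  sort -u -t: -k1,1 -k3 yourfile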

> I'll leave the awk and sed solutions to someone else.

Yes, I seem to always be defending the classic tools against this
onslaught of Perl code that nobody but you can ever optimize.

---Dan

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (11/01/90)

In article <28220:Oct3105:18:3290@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
: > I'll leave the awk and sed solutions to someone else.
: 
: Yes, I seem to always be defending the classic tools against this
: onslaught of Perl code that nobody but you can ever optimize.

You're obviously too defensive.  :-)

I often post non-Perl solutions if I think they're appropriate.  And I
freely admit that I overlooked sort -u, which is, as you say, perfect for
the job.  The sort I grew up with didn't have -u, so I never seem to think
of it.  Dratted fossilized neurons...

And what's so amazing about me being better than other people with Perl?
I bet you're better than me with auth.  You push auth a little, I push
Perl a little, and the world becomes a better place.  If you consistently
take an antagonistic approach, however, people are going to start
thinking you're from New York.   :-)

Love,
Larry

c60b-3ac@e260-3d.berkeley.edu (Eric Thompson) (11/03/90)

A hearty THANK YOU to everyone who responded.  Just thought I'd let you
know that this is the solution I ended up using:

> From: tslwat!louk (Lou Kates)
> Subject: unique lines except for field 2
> 
> The following command will do it (on some systems you must say nawk
> instead of awk) where x.dat is the data above:
> 
> awk -F: '    { tmp = $0; $2 = ""; store[$0]=tmp; freq[$0]++} 
>          END { for(i in store) if (freq[i]==1) print store[i]}' x.dat

The answers using 'sort' were almost what I needed--but I didn't want to
save ANY occurrences of lines that had duplicate information (probably my
fault for not being clear enough).  Again, thanks.  I appreciate it.
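
For completeness, the same remove-every-copy behavior is possible in Perl
too; a two-pass sketch in the spirit of Larry's substitution (x.dat as in
Lou's example; the file simply gets read twice):

  perl -e '
      $file = shift;
      open(F, $file) || die "cannot open $file\n";
      while (<F>) { ($k = $_) =~ s/:[^:]*//; $freq{$k}++ }   # pass 1: count keys
      close(F);
      open(F, $file) || die "cannot reopen $file\n";
      while (<F>) { ($k = $_) =~ s/:[^:]*//; print if $freq{$k} == 1 }  # pass 2
      close(F);
  ' x.dat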

Eric Thompson               |  et@ocf.Berkeley.EDU
STONE ROSES & A'S BASEBALL  |  ...!ucbvax!ocf!et

weimer@ssd.kodak.com (Gary Weimer) (11/08/90)

In article <10182@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>In article <1990Oct31.003627.641@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
>: In article <1990Oct30.234654.23547@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
>: | Sounds like what I need is a way to filter out rows
>: | that are duplicate except in the second column.
>: 
>: A one-liner in Perl:
>: 
>: perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
>: 
>: Fast enough?
>
>Maybe, but he said they were very long files, and that may mean more than
>you'd want to store in an associative array, even with virtual memory.
>Presuming the files are sorted reasonably, you can get away with this:
>
>perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'
>
>Of course, someone will post a solution using cut and uniq, which will be
>fine if you don't mind losing the second field.  Or swapping the first
>two fields around.  I'll leave the awk and sed solutions to someone else.

Who needs sed?

awk -F: '{cur=$1$3$4$5$6$7$8$9$10$11$12$13$14;if(cur!=prev){prev=cur;print $0}}' InFile > OutFile
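
One caveat with building the key by bare concatenation: with no separator
between the fields, rows such as "ab:X:c" and "a:X:bc" produce the same key.
Keeping the delimiter in the key (and not hard-wiring 14 fields) avoids
that; a possible variant, same adjacent-duplicates assumption as above:

  awk -F: '{cur=$1; for(i=3;i<=NF;i++) cur=cur FS $i; if(cur!=prev){prev=cur; print}}' InFile > OutFile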