[comp.lang.perl] uniq'ing arrays

grady@fx.com (Steven Grady) (11/21/90)

grep is a useful unix utility that has a parallel perl operator.
What about uniq?  I'd like to see uniq (or something) that strips
out the duplicates in an array.  It would be cool if it were
built into perl.

Meanwhile, has anyone written an efficient routine to do this?
Linear time would be nice.  I'd like one that doesn't require the
array to be sorted.  I could write one myself, but I thought
I'd avoid duplicating code..
-- 
	Steven
	grady@fx.com
Birds of prey know they're cool.

chris@utgard.uucp (Chris Anderson) (11/21/90)

In article <1990Nov21.021344.16038@fxgrp.fx.com> grady@postgres.berkeley.edu writes:
>grep is a useful unix utility that has a parallel perl operator.
>What about uniq?  I'd like to see uniq (or something) that strips
>out the duplicates in an array.  It would be cool if it were
>built into perl.

I asked this myself several weeks ago ... here's the answer I
used from the ones presented by the net.

sub uniq {
    local (@in) = @_;
    local (%ary, @out);

    undef %ary;
    @ary{@in} = ();
    @out = keys(%ary);

    @out;
}
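A quick check of the routine (my own addition, not part of Chris's post) — note that keys() returns the entries in no particular order, so the result is unordered:

```perl
# Hash-slice uniq, as posted above.  The hash slice makes each input
# value a key; duplicate keys collapse automatically.
sub uniq {
    local (@in) = @_;
    local (%ary, @out);

    undef %ary;
    @ary{@in} = ();          # keys of %ary are now the unique values
    @out = keys(%ary);       # NB: order of keys() is unpredictable

    @out;
}

# Sort only so the output is stable enough to look at.
@result = sort(&uniq('a', 'b', 'a', 'c', 'b'));
print "@result\n";           # a b c
```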

Hope this helps

Chris

tchrist@convex.COM (Tom Christiansen) (11/21/90)

In article <1990Nov21.021344.16038@fxgrp.fx.com> grady@postgres.berkeley.edu writes:
=grep is a useful unix utility that has a parallel perl operator.
=What about uniq?  I'd like to see uniq (or something) that strips
=out the duplicates in an array.  It would be cool if it were
=built into perl.

=Meanwhile, has anyone written an efficient routine to do this?
=Linear time would be nice.  I'd like one that doesn't require the
=array to be sorted.  I could write one myself, but I thought
=I'd avoid duplicating code..

This is turning into a FAQ, isn't it?

In <9952@jpl-devvax.JPL.NASA.GOV> on 12 Oct 90, Larry wrote a good 
article on this.  He covers these cases:

1) If @in is sorted:

    $prev = 'nonesuch';
    @out = grep($_ ne $prev && (($prev) = $_), @in);

2) If we don't know whether @in is sorted:

    undef %saw;
    @out = grep(!$saw{$_}++, @in);

3) If we don't know whether @in is sorted, nor care whether @out is:

    undef %ary;
    @ary{@in} = ();
    @out = keys(%ary);

(I usually use #3.)

And then he points out that if you know that @in contains
only small positive integers, you can use:

    @out = grep(!$saw[$_]++, @in);

for case 2, and for case 3:
    
    @ary[@in] = @in;
    @out = sort @ary;
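A quick sanity check of case 2 (mine, not from Larry's article) — the %saw hash remembers what's been seen, so the output keeps the first-seen order of the unsorted input:

```perl
# Case 2: input not known to be sorted; output keeps input order.
@in = (3, 1, 3, 2, 1);
undef %saw;
@out = grep(!$saw{$_}++, @in);   # true only the first time a value appears
print "@out\n";                  # 3 1 2
```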


I guess I'll add it to the FAQ.


--tom

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (11/22/90)

In article <1990Nov21.021344.16038@fxgrp.fx.com> grady@postgres.berkeley.edu writes:
: grep is a useful unix utility that has a parallel perl operator.
: What about uniq?  I'd like to see uniq (or something) that strips
: out the duplicates in an array.  It would be cool if it were
: built into perl.
: 
: Meanwhile, has anyone written an efficient routine to do this?
: Linear time would be nice.  I'd like one that doesn't require the
: array to be sorted.  I could write one myself, but I thought
: I'd avoid duplicating code..

sub uniq {
    local(%seen);
    grep(!$seen{$_}++, @_);
}

@out = &uniq(@in);
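To illustrate (my example, not Larry's): because %seen is local() to the sub, each call starts with a fresh hash, and the output preserves first-seen order:

```perl
# Larry's uniq: grep passes through each value only the first time
# %seen records it.  local(%seen) gives every call a clean slate.
sub uniq {
    local(%seen);
    grep(!$seen{$_}++, @_);
}

@first  = &uniq('b', 'a', 'b', 'c');
@second = &uniq('b', 'b');
print "@first\n";    # b a c
print "@second\n";   # b   -- %seen did not leak between calls
```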

It's more-or-less linear, apart from the extra time to bifurcate %seen
occasionally as it's growing.  I don't know whether you want to call it
efficient--it's going to depend on how quickly your machine can calculate
the hash function.  It may be faster to sort and compare adjacent values,
depending...

Larry

jgk@osc.COM (Joe Keane) (11/30/90)

In article <1990Nov21.021344.16038@fxgrp.fx.com> grady@postgres.berkeley.edu
writes:
>grep is a useful unix utility that has a parallel perl operator.
>What about uniq?  I'd like to see uniq (or something) that strips
>out the duplicates in an array.  It would be cool if it were
>built into perl.

In article <10501@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV
(Larry Wall) writes:
>sub uniq {
>    local(%seen);
>    grep(!$seen{$_}++, @_);
>}

Note that the Unix `uniq' utility is defined to remove _adjacent_ duplicate
lines, which is an easier task than removing _all_ duplicate lines.  I don't
know which the original poster wanted.
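For completeness, a sketch of the adjacent-only variant (mine, not Joe's) — the sentinel value for $prev is an assumption, chosen as something unlikely to appear in real data:

```perl
# Adjacent-only uniq, matching the Unix utility's semantics:
# a duplicate survives if it is separated from its twin.
sub uniq_adjacent {
    local($prev) = "\0nonesuch";      # sentinel: assumed absent from input
    # The list assignment (($prev) = $_) is always true in this context
    # (it returns the count of values on its right-hand side).
    grep($_ ne $prev && (($prev) = $_), @_);
}

@out = &uniq_adjacent('a', 'a', 'b', 'a', 'a');
print "@out\n";    # a b a  -- only adjacent duplicates removed
```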

>It's more-or-less linear, apart from the extra time to bifurcate %seen
>occasionally as it's growing.

Why not linear?  You increase the size exponentially, don't you?