[comp.lang.perl] Ordering one array by another

schaefer@ogicse.ogi.edu (Barton E. Schaefer) (03/27/90)

I have a sorted list of keywords, and a file (well, stdin actually, but
no matter) consisting of lines each of which contains at most one of the
keywords.  There may be keywords not in the file, and lines with zero
keywords (which are ignored), but no pair of lines both contain the same
keyword.  What I want to do is reorder the input lines to match the order
in the list of keywords.

So what's the fastest way?  What I have right now goes like this:

    @lines = <STDIN>;				# Whomp it in
    # Never mind how I got the sorted keywords, they're in @keys
    foreach $k (@keys) {			# For every keyword
	$i = $[;				# Reset the index
	foreach $l (@lines) {			# For every line
	    if ($l =~ /\b$k\b/) {		# If it matches
		print splice(@lines,$i,1);	# Remove and print it
		last;				# Done with this pass
	    }
	    $i++;				# Increment index
	}
    }

The idea being that the splice shortens the search for each succeeding
pass through the lines.

Of course Randal would point out that it could be:

    grep(($k = $_, $i = 0, $found = 0,
	grep((!$found && ++$i && /\b$k\b/ &&
		(print(splice(@lines,$i-1,1)), ++$found)),
	    @lines)),
	@keys);

I'm wondering if there isn't some trick to do this faster than either of
these, perhaps by mixing grep and foreach, or using an associative array.
Suggestions?
-- 
Bart Schaefer          "EARTH: Surrender IMMEDIATELY or we PICKLE DAN QUAYLE"

                                                                    "THPPFT!"
schaefer@cse.ogi.edu (used to be cse.ogc.edu)

merlyn@iwarp.intel.com (Randal Schwartz) (03/28/90)

In article <8251@ogicse.ogi.edu>, schaefer@ogicse (Barton E. Schaefer) writes:
| I have a sorted list of keywords, and a file (well, stdin actually, but
| no matter) consisting of lines each of which contains at most one of the
| keywords.  There may be keywords not in the file, and lines with zero
| keywords (which are ignored), but no pair of lines both contain the same
| keyword.  What I want to do is reorder the input lines to match the order
| in the list of keywords.
| 
| So what's the fastest way?  What I have right now goes like this:
| 
|     @lines = <STDIN>;				# Whomp it in
|     # Never mind how I got the sorted keywords, they're in @keys
|     foreach $k (@keys) {			# For every keyword
| 	$i = $[;				# Reset the index
| 	foreach $l (@lines) {			# For every line
| 	    if ($l =~ /\b$k\b/) {		# If it matches
| 		print splice(@lines,$i,1);	# Remove and print it
| 		last;				# Done with this pass
| 	    }
| 	    $i++;				# Increment index
| 	}
|     }

Too much newfangled technology.  Do it easy...

	$_ = "\n" . join("",<STDIN>); # slurrrrpp...
	for $k (@keys) {
		s/(\n)(.*\b$k\b.*\n)/print($2),$1/eg;
			# don't discard the previous newline "$1"
	}
	# at this point, $_ has "\n" followed by anything left...
	s/^\n//; print;

@p=split(//,",rekcah lreP rehtona tsuJ");p:print pop(p);goto p if @p;
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Welcome to Portland, Oregon, home of the California Raisins!"=/

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (03/28/90)

In article <8251@ogicse.ogi.edu> schaefer@ogicse.ogi.edu (Barton E. Schaefer) writes:
: I have a sorted list of keywords, and a file (well, stdin actually, but
: no matter) consisting of lines each of which contains at most one of the
: keywords.  There may be keywords not in the file, and lines with zero
: keywords (which are ignored), but no pair of lines both contain the same
: keyword.  What I want to do is reorder the input lines to match the order
: in the list of keywords.

We really don't have enough info to decide the fastest algorithm, because
we'd need to know the number of keywords, the relative number of lines not
containing any keyword, the usual location of the keyword in each line,
and the frequency distribution of characters in the keys and lines.

However, on the assumption that pattern compilation and matching is going
to dominate the timing, I'd suspect the following of being QUITE fast:

$prog = <<EOF;
				    while (<STDIN>) { study;
EOF

foreach $key (@keys) {
    $prog .= <<EOF;
				        \$found{'$key'} = \$_ if ?\\b$key\\b?;
EOF
}

$prog .= <<EOF;
				    }
EOF

print $prog if $verbose;
eval $prog; die $@ if $@;
print grep($_,@found{@keys});

Several things to note.  First, it turns your algorithm inside out so that
it can study each line.  (This also avoids the "whomp" overhead.)  Second,
it generates a complete loop with the pattern matches so that it avoids
unnecessary evals or runtime pattern compilation inside the loop.  Third,
it uses the ?? search because each key only wants to match once, and there's
no point in looking for the key again once it has matched.

On the other hand, the associative array costs you something in storage and
efficiency.  Try it, and see if it's faster.

Larry

schaefer@ogicse.ogi.edu (Barton E. Schaefer) (03/28/90)

In article <7561@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
} In article <8251@ogicse.ogi.edu> schaefer@ogicse.ogi.edu (Barton E. Schaefer) writes:
} : I have a sorted list of keywords, and a file (well, stdin actually, but
} : no matter) consisting of lines each of which contains at most one of the
} : keywords.  There may be keywords not in the file, and lines with zero
} : keywords (which are ignored), but no pair of lines both contain the same
} : keyword.  What I want to do is reorder the input lines to match the order
} : in the list of keywords.
} 
} We really don't have enough info to decide the fastest algorithm, because
} we'd need to know the number of keywords, the relative number of lines not
} containing any keyword, and the usual location of the keyword in each line,
} and the frequency distribution of characters in the keys and lines.

The keyword is almost always the last word on the line.  Statistics on
a sample keys/lines pair, from wc:

	keys:         228     228    2646
	lines:        231    2076   15744

However, the number of keywords and lines may vary a lot from trial to
trial, and I may need to do this many times for different keys/lines sets
in each run.  I do know that the number of keywords is almost always very
close to the number of lines, but there may be either a few more keys than
lines or a few more lines than keys.

} However, on the assumption that pattern compilation and matching is going
} to dominate the timing, I'd suspect the following of being QUITE fast:

[Algorithm deleted, it's included again below.]

It is indeed very fast (see below), but my tests revealed a possible bug
(see also below).

} On the other hand, the associative array costs you something in storage and
} efficiency.  Try it, and see if it's faster.

Here are some statistics for running Randal's suggested algorithm, my
original one, and Larry's over the keys/lines input sample wc'd above:

	Algorithm			Time in Seconds
	---------			---------------
	Randal's:			 45.85
	Randal's, using grep:		 45.75
	Original:			 40.60
	Original, as nested grep:	100.85
	Larry's:			 11.70
	Larry's, including build:	 11.85

Now, an important note here: these are all averaged over twenty runs,
all done by the program below, except for Larry's algorithm.  To test
Larry's, I had to take out the 1..20 loop and run the whole program
20 times, because any time I tried to clear the associative array used
by Larry's algorithm, successive runs wouldn't print anything!  And
if I didn't clear the array each time, every successive run would be
faster than the one before it, because it doesn't have to reallocate the
array each time.

Any idea why the "undef %found;" breaks this?  (Patchlevel 15.)

Here's the program:

#! /usr/bin/perl

# Whomp everything in ahead of time so tests aren't biased
# Then initialize locals from these values for each test

@keys = split("\n",`cat keys`);		# Read the keys
$init_lines = `cat lines`;		# Read the lines, in Randal's format
@init_lines = split("\n",$init_lines);	# Make an array of lines
$init_lines = "\n".$init_lines;		# Add the extra newline Randal needs

#
# Set up locals and time a single run of the current method.
# Sends the output to /dev/null -- all we want are stats.
# Note that the time does not include initializing the locals,
# nor any of the I/O manipulations.
#
sub Time_it {
    local($lines,@lines) = ($init_lines,@init_lines);
    open(HIDE, "> /dev/null") || die;
    select(HIDE);
    local($time) = time;
    do $method();
    $time = time - $time;
    close(HIDE);
    select(STDOUT);
    $time;
}

#
# Randal's search-and-replace algorithm
#
sub Randal {
    for $k (@keys) { $lines =~ s/(\n)(.*\b$k\b.*\n)/print($2),$1/e; }
}

#
# Randal's algorithm, grep replacing "for $k ..."
#
sub Randal_as_grep {
    grep(($lines =~ s/(\n)(.*\b$_\b.*\n)/print($2),$1/e), @keys);
}

#
# My original splicing algorithm
#
sub Original {
    for $k (@keys) {
	$i = $[;
	for $l (@lines) {
	    if ($l =~ /\b$k\b/) {
		print splice(@lines,$i,1)."\n";
		last;
	    }
	    $i++;
	}
    }
}

#
# The silly double-grep variant of the above, to show what NOT to do
#
sub Original_as_grep {
    grep(($k=$_,$i=0,$j=1,
	grep(($j && ++$i && /\b$k\b/ &&
		print(splice(@lines,$i-1,1)."\n"), --$j),
	    @lines)),
	@keys);
}

#
# Construct the on-the-fly program for Larry's algorithm.
# There's a little extra overhead for tacking back on the "\n" that
# was split() off to form @init_lines, but otherwise this is what
# Larry posted.  The idea is to construct an explicit test for each
# key (unroll the foreach $key loop), and then apply those tests to
# a line that has been prepared for search by the study command.
#
sub build_Larry {
$prog = <<EOF;
    while (\$_ = shift(\@lines)) { \$_ .= "\\n"; study;
EOF
foreach $key (@keys) {
    $prog .= <<EOF;
	\$found{'$key'} = \$_ if ?\\b$key\\b?;
EOF
}
$prog .= <<EOF;
    }
EOF
}

#
# Larry's algorithm -- eval the program and print the associative array
# (why the grep?  Isn't "print @found{@keys};" sufficient?).
# Note the problem with "undef %found" -- if that line is there, only
# the first pass through this algorithm prints anything, even though
# the eval'd program should be reconstructing $found{'key'} each time.
#
sub Larry {
    eval $prog; die $@ if $@;
    print grep($_,@found{@keys});
    # undef %found;			# This breaks 2nd+ passes
    # %found = ();			# This doesn't work either
}

#
# For fairness, we also check the time to rebuild the program each pass.
# 
sub Larry_with_build {
    &build_Larry;
    &Larry;
    undef $prog;
}

for $method (
		'Randal','Randal_as_grep',
		'Original','Original_as_grep',
		'Larry','Larry_with_build'
	    ) {
    $sum = 0;
    if ($method eq 'Larry') { &build_Larry; }
    else { undef $prog; }
    for (1..20) { $sum += &Time_it; }
    print "$method:  " . $sum / 20 . "\n";
}

# END
-- 
Bart Schaefer          "EARTH: Surrender IMMEDIATELY or we PICKLE DAN QUAYLE"

                                                                    "THPPFT!"
schaefer@cse.ogi.edu (used to be cse.ogc.edu)