schaefer@ogicse.ogi.edu (Barton E. Schaefer) (03/27/90)
I have a sorted list of keywords, and a file (well, stdin actually, but no matter) consisting of lines each of which contains at most one of the keywords. There may be keywords not in the file, and lines with zero keywords (which are ignored), but no pair of lines both contain the same keyword. What I want to do is reorder the input lines to match the order in the list of keywords. So what's the fastest way? What I have right now goes like this: @lines = <STDIN>; # Whomp it in # Never mind how I got the sorted keywords, they're in @keys foreach $k (@keys) { # For every keyword $i = $[; # Reset the index foreach $l (@lines) { # For every line if ($l =~ /\b$k\b/) { # If it matches print splice(@lines,$i,1); # Remove and print it last; # Done with this pass } $i++; # Increment index } } The idea being that the splice shortens the search for each succeeding pass through the lines. Of course Randal would point out that it could be: grep(($k = $_, $i = 0, $found = 0, grep((!$found && ++$i && /\b$k\b/ && (print(splice(@lines,$i-1,1)), ++$found)), @lines)), @keys); I'm wondering if there isn't some trick to do this faster than either of these, perhaps by mixing grep and foreach, or using an associative array. Suggestions? -- Bart Schaefer "EARTH: Surrender IMMEDIATELY or we PICKLE DAN QUAYLE" "THPPFT!" schaefer@cse.ogi.edu (used to be cse.ogc.edu)
merlyn@iwarp.intel.com (Randal Schwartz) (03/28/90)
In article <8251@ogicse.ogi.edu>, schaefer@ogicse (Barton E. Schaefer) writes: | I have a sorted list of keywords, and a file (well, stdin actually, but | no matter) consisting of lines each of which contains at most one of the | keywords. There may be keywords not in the file, and lines with zero | keywords (which are ignored), but no pair of lines both contain the same | keyword. What I want to do is reorder the input lines to match the order | in the list of keywords. | | So what's the fastest way? What I have right now goes like this: | | @lines = <STDIN>; # Whomp it in | # Never mind how I got the sorted keywords, they're in @keys | foreach $k (@keys) { # For every keyword | $i = $[; # Reset the index | foreach $l (@lines) { # For every line | if ($l =~ /\b$k\b/) { # If it matches | print splice(@lines,$i,1); # Remove and print it | last; # Done with this pass | } | $i++; # Increment index | } | } Too much newfangled technology. Do it easy... $_ = "\n" . join("",<STDIN>); # slurrrrpp... for $k (@keys) { s/(\n)(.*\b$k\b.*\n)/print($2),$1/eg; # don't discard the previous newline "$1" } # at this point, $_ has "\n" followed by anything left... s/^\n//; print; @p=split(//,",rekcah lreP rehtona tsuJ");p:print pop(p);goto p if @p; -- /=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\ | on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III | | merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn | \=Cute Quote: "Welcome to Portland, Oregon, home of the California Raisins!"=/
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (03/28/90)
In article <8251@ogicse.ogi.edu> schaefer@ogicse.ogi.edu (Barton E. Schaefer) writes: : I have a sorted list of keywords, and a file (well, stdin actually, but : no matter) consisting of lines each of which contains at most one of the : keywords. There may be keywords not in the file, and lines with zero : keywords (which are ignored), but no pair of lines both contain the same : keyword. What I want to do is reorder the input lines to match the order : in the list of keywords. We really don't have enough info to decide the fastest algorithm, because we'd need to know the number of keywords, the relative number of lines not containing any keyword, and the usual location of the keyword in each line, and the frequency distribution of characters in the keys and lines. However, on the assumption that pattern compilation and matching is going to dominate the timing, I'd suspect the following of being QUITE fast: $prog = <<EOF; while (<STDIN>) { study; EOF foreach $key (@keys) { $prog .= <<EOF; \$found{'$key'} = \$_ if ?\\b$key\\b?; EOF } $prog .= <<EOF; } EOF print $prog if $verbose; eval $prog; die $@ if $@; print grep($_,@found{@keys}); Several things to note. First, it turns your algorithm inside out so that it can study each line. (This also avoids the "whomp" overhead.) Second, it generates a complete loop with the pattern matches so that it avoids unnecessary evals or runtime pattern compilation inside the loop. Third, it uses the ?? search because each key only wants to match once, and there's no point in looking for the key again once it has matched. On the other hand, the associative array costs you something in storage and efficiency. Try it, and see if it's faster. Larry
schaefer@ogicse.ogi.edu (Barton E. Schaefer) (03/28/90)
In article <7561@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes: } In article <8251@ogicse.ogi.edu> schaefer@ogicse.ogi.edu (Barton E. Schaefer) writes: } : I have a sorted list of keywords, and a file (well, stdin actually, but } : no matter) consisting of lines each of which contains at most one of the } : keywords. There may be keywords not in the file, and lines with zero } : keywords (which are ignored), but no pair of lines both contain the same } : keyword. What I want to do is reorder the input lines to match the order } : in the list of keywords. } } We really don't have enough info to decide the fastest algorithm, because } we'd need to know the number of keywords, the relative number of lines not } containing any keyword, and the usual location of the keyword in each line, } and the frequency distribution of characters in the keys and lines. The keyword is almost always the last word on the line. Statistics on a sample keys/lines pair, from wc: keys: 228 228 2646 lines: 231 2076 15744 However, I the number of keywords and lines may vary a lot from trial to trial, and I may need to do this many times for different keys/lines sets in each run. I do know that the number of keywords is almost always very close to the number of lines, but there may be either a few more keys than lines or a few more lines than keys. } However, on the assumption that pattern compilation and matching is going } to dominate the timing, I'd suspect the following of being QUITE fast: [Algorithm deleted, it's included again below.] It is indeed very fast (see below), but my tests revealed a possible bug (see also below). } On the other hand, the associative array costs you something in storage and } efficiency. Try it, and see if it's faster. Here are some statistics for running Randal's suggested algorithm, my original one, and Larry's over the keys/lines input sample wc'd above: Algorithm Time in Seconds --------- --------------- Randal's: 45.85 Randal's, using grep: 45.75 Original: 40.60 Original, as nested grep: 100.85 Larry's: 11.70 Larry's, including build: 11.85 Now, an important note here: these are all averaged over twenty runs, all done by the program below, except for Larry's algorthm. To test Larry's, I had to take out the 1..20 loop and run the whole program 20 times, because any time I tried to clear the associative array used by Larry's algorithm, successive runs wouldn't print anything! And if I didn't clear the array each time, every successive run would be faster than the one before it, because it doesn't have to reallocate the array each time. Any idea why the "undef %found;" breaks this? (Patchlevel 15.) Here's the program: #! /usr/bin/perl # Whomp everything in ahead of time so tests aren't biased # Then initialize locals from these values for each test @keys = split("\n",`cat keys`); # Read the keys $init_lines = `cat lines`; # Read the lines, in Randal's format @init_lines = split("\n",$init_lines); # Make an array of lines $init_lines = "\n".$init_lines; # Add the extra newline Randal needs # # Set up locals and time a single run of the current method. # Sends the output to /dev/nul -- all we want are stats. # Note that the time does not include initalizing the locals, # nor any of the I/O manipulations. # sub Time_it { local($lines,@lines) = ($init_lines,@init_lines); open(HIDE, "> /dev/null") || die; select(HIDE); local($time) = time; do $method(); $time = time - $time; close(HIDE); select(STDOUT); $time; } # # Randal's search-and-replace algorithm # sub Randal { for $k (@keys) { $lines =~ s/(\n)(.*\b$k\b.*\n)/print($2),$1/e; } } # # Randal's algorithm, grep replacing "for $k ..." # sub Randal_as_grep { grep(($lines =~ s/(\n)(.*\b$_\b.*\n)/print($2),$1/e), @keys); } # # My original splicing algorithm # sub Original { for $k (@keys) { $i = $[; for $l (@lines) { if ($l =~ /\b$k\b/) { print splice(@lines,$i,1)."\n"; last; } $i++; } } } # # The silly double-grep variant of the above, to show what NOT to do # sub Original_as_grep { grep(($k=$_,$i=0,$j=1, grep(($j && ++$i && /\b$k\b/ && print(splice(@lines,$i-1,1)."\n"), --$j), @lines)), @keys); } # # Construct the on-the-fly program for Larry's algorithm. # There's a little extra overhead for tacking back on the "\n" that # was split() off to form @init_lines, but otherwise this is what # Larry posted. The idea is to construct an explicit test for each # key (unroll the foreach $key loop), and then apply those tests to # a line that has been prepared for search by the study command. # sub build_Larry { $prog = <<EOF; while (\$_ = shift(\@lines)) { \$_ .= "\\n"; study; EOF foreach $key (@keys) { $prog .= <<EOF; \$found{'$key'} = \$_ if ?\\b$key\\b?; EOF } $prog .= <<EOF; } EOF } # # Larry's algorithm -- eval the program and print the associative array # (why the grep? Isn't "print @found{@keys};" sufficient?). # Note the problem with "undef %found" -- if that line is there, only # the first pass through this algorithm prints anything, even though # the eval'd program should be reconstructing $found{'key'} each time. # sub Larry { eval $prog; die $@ if $@; print grep($_,@found{@keys}); # undef %found; # This breaks 2nd+ passes # %found = (); # This doesn't work either } # # For fairness, we also check the time to rebuild the program each pass. # sub Larry_with_build { &build_Larry; &Larry; undef $prog; } for $method ( 'Randal','Randal_as_grep', 'Original','Original_as_grep', 'Larry','Larry_with_build' ) { $sum = 0; if ($method eq 'Larry') { &build_Larry; } else { undef $prog; } for (1..20) { $sum += &Time_it; } print "$method: " . $sum / 20 . "\n"; } # END -- Bart Schaefer "EARTH: Surrender IMMEDIATELY or we PICKLE DAN QUAYLE" "THPPFT!" schaefer@cse.ogi.edu (used to be cse.ogc.edu)