[comp.unix.questions] Fuzzy grep?

rfinch@caldwr.water.ca.gov (Ralph Finch) (11/03/90)

Is there something like grep, except it will (easily) search an entire
file (not just line-by-line) for regexp's near each other? Ideally it
would rank hits by how much or how close they match, e.g.

fzgrep 'abc.*123' filename

would return hits not by line number but by how close abc & 123 are
found together.  Also it wouldn't matter what order the regexp's are.
-- 
Ralph Finch			916-445-0088
rfinch@water.ca.gov		...ucbvax!ucdavis!caldwr!rfinch
Any opinions expressed are my own; they do not represent the DWR

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (11/06/90)

In article <242@locke.water.ca.gov> rfinch@caldwr.water.ca.gov (Ralph Finch) writes:
: Is there something like grep, except it will (easily) search an entire
: file (not just line-by-line) for regexp's near each other? Ideally it
: would rank hits by how much or how close they match, e.g.
: 
: fzgrep 'abc.*123' filename
: 
: would return hits not by line number but by how close abc & 123 are
: found together.  Also it wouldn't matter what order the regexp's are.

I sincerely doubt you're going to find a specialized tool to do that.
But if you just slurp a file into a string in Perl, you can then
start playing with it.  For example, if your search strings are fixed,
you can use index:

	#!/usr/bin/perl
	undef $/;
	while (<>) {	# for each file
	    $posabc = index($_, "abc");
	    next if $posabc < 0;
	    $pos123 = index($_, "123");
	    next if $pos123 < 0;
	    $diff = $posabc - $pos123;
	    $diff = -$diff if $diff < 0;
	    print "$ARGV: $diff\n";
	}

Of course, you'd probably want to make a subroutine of that middle junk.
Or you can say:

	#!/usr/bin/perl
	undef $/;
	while (<>) {	# for each file
	    tr/\n/ /;			# so . matches anything
	    (/(abc.*)123/ || /(123.*)abc/)
		&& print "$ARGV: " . (length($1)-3) . "\n"
	}
	
Those .*'s are going to be expensive, though.  Maybe

	#!/usr/bin/perl
	undef $/;
	while (<>) {	# for each file
	    next unless /abc/;
	    $posabc = length($`);
	    next unless /123/;
	    $pos123 = length($`);
	    $diff = $posabc - $pos123;
	    $diff = -$diff if $diff < 0;
	    print "$ARGV: $diff\n";
	}

Of course, none of these solutions is going to find the closest pair,
necessarily.  To do that, use a nested split, which also works with arbitrary
regular expressions:


	#!/usr/bin/perl
	undef $/;
	while (<>) {	# for each file
	    $min = length($_);
	    @abc = split(/abc/, $_, 999999);
	    next if @abc == 1;		# no match
	    &try(shift(@abc), 0, 1);
	    &try(pop(@abc),   1, 0);
	    foreach $chunk (@abc) {
		&try($chunk, 1, 1);
	    }
	    next if $min == length($_);
	    print "$ARGV: $min\n";
	}

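	sub try {
	    ($hunk, $first, $last) = @_;
	    @pieces = split(/123/, $hunk, 999999);
	    return if @pieces < 2;	# no 123 in this hunk, nothing to measure
	    # first piece: gap from the preceding abc to the first 123
	    if ($first && $min > length($pieces[0])) {
		$min = length($pieces[0]);
	    }
	    # last piece: gap from the last 123 to the following abc
	    if ($last && $min > length($pieces[$#pieces])) {
		$min = length($pieces[$#pieces]);
	    }
	}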

Or something like that...
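
For readers on a current Perl, the closest-pair idea can also be sketched
by recording match offsets instead of splitting. This is only an
illustration under stated assumptions, not part of the original post: it
assumes Perl 5.6 or later (for qr// and the @- and @+ offset arrays, none
of which existed in 1990), the sub name closest_gap is made up for the
example, and abc and 123 are simply the patterns used in the thread. It
also sorts the files by gap, which is the ranking the original question
asked for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the smallest number of
# characters separating any match of $re_a from any match of $re_b in
# $text, in either order. Returns undef if either pattern never matches.
sub closest_gap {
    my ($text, $re_a, $re_b) = @_;
    my (@a, @b);

    # Record the [start, end] offsets of every match of each pattern.
    push @a, [$-[0], $+[0]] while $text =~ /$re_a/g;
    push @b, [$-[0], $+[0]] while $text =~ /$re_b/g;
    return unless @a && @b;

    my $min;
    for my $x (@a) {
        for my $y (@b) {
            my $gap = $y->[0] >= $x->[1] ? $y->[0] - $x->[1]   # a then b
                    : $x->[0] >= $y->[1] ? $x->[0] - $y->[1]   # b then a
                    :                      0;                  # overlap
            $min = $gap if !defined $min || $gap < $min;
        }
    }
    return $min;
}

# Rank the files named on the command line, closest pair first.
my %gap;
for my $file (@ARGV) {
    open(my $fh, '<', $file) or do { warn "$file: $!\n"; next };
    local $/;                       # slurp mode, like "undef $/" above
    my $text = <$fh>;
    close $fh;
    $text = '' unless defined $text;
    my $g = closest_gap($text, qr/abc/, qr/123/);
    $gap{$file} = $g if defined $g;
}
print "$_: $gap{$_}\n"
    for sort { $gap{$a} <=> $gap{$b} || $a cmp $b } keys %gap;
```

Saved under some name of your choosing (say fzgrep.pl, also hypothetical),
it would be run as "perl fzgrep.pl file1 file2 ...". The double loop is
quadratic in the number of matches, which is fine for a sketch; walking
the two sorted offset lists in step would be the obvious refinement for
files with many hits.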

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov

kehoe@scotty.dccs.upenn.edu (Brendan Kehoe) (11/06/90)

In <10240@jpl-devvax.JPL.NASA.GOV>, lwall@jpl-devvax.JPL.NASA.GOV writes:
>I sincerely doubt you're going to find a specialized tool to do that.

  .. tons & tons of Perl code by its dad ..
>
>Or something like that...
>

 Hahahaha. This made my day. [Sad, but true.]



Brendan Kehoe | Soon: brendan@cs.widener.edu [ Today? Could it be? <Ohm...> ]
For now: kehoe@scotty.dccs.upenn.edu | Also: brendan.kehoe@cyber.widener.edu
  "The latest polls indicate you're in danger of losing touch with the
	     common man."   "Oh DEAR ... heaven forfend!"

MarkD@Aus.Sun.COM (11/06/90)

kehoe@scotty.dccs.upenn.edu (Brendan Kehoe) writes:

>In <10240@jpl-devvax.JPL.NASA.GOV>, lwall@jpl-devvax.JPL.NASA.GOV writes:
>>I sincerely doubt you're going to find a specialized tool to do that.

>  .. tons & tons of Perl code by its dad ..
>>
>>Or something like that...
>>

> Hahahaha. This made my day. [Sad, but true.]

Agreed.  But what gets me is the number of different ways he manages to
sneak in these dang Perl lessons! Just when I was about to Beta test my
"Impending Perl lesson" detector, he goes and changes his posting
patterns - sigh, maybe I should re-write my detector in Perl :-)


------------         -----------------     --------------------
Mark Delany          markd@Aus.Sun.COM     ...!sun!sunaus!markd
------------         -----------------     --------------------

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (11/07/90)

In article <markd.657881866@sunchat> MarkD@Aus.Sun.COM writes:
: kehoe@scotty.dccs.upenn.edu (Brendan Kehoe) writes:
: 
: >In <10240@jpl-devvax.JPL.NASA.GOV>, lwall@jpl-devvax.JPL.NASA.GOV writes:
: >>I sincerely doubt you're going to find a specialized tool to do that.
: 
: >  .. tons & tons of Perl code by its dad ..
: >>
: >>Or something like that...
: >>
: 
: > Hahahaha. This made my day. [Sad, but true.]
: 
: Agreed.  But what gets me is the number of different ways he manages to
: sneak in these dang Perl lessons! Just when I was about to Beta test my
: "Impending Perl lesson" detector, he goes and changes his posting
: patterns - sigh, maybe I should re-write my detector in Perl :-)

It'd be fairly trivial:

	#!/usr/bin/perl
	while (<>) {
	    /^:.*[?!]/ && warn "Impending Perl lesson!!!!\n";
	}

:-)

Larry