[comp.unix.wizards] Word-oriented GREP

flee@cs.psu.edu (Felix Lee) (04/15/91)

(from comp.unix.questions)
| When I use the command "grep V\[0-9\]\[0-9\]\[0-9\] fred.c" it returns
| 	#define VERSION "V002"
|   or somesuch.  What I would really like is just the string of characters
|   which matched:
| 	V002

I've wanted an "xgrep" tool for a while.  It would scan an input
stream for a pattern and print any part of the stream that matches.
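
A minimal sketch of the behaviour described above, assuming Python and
its re module rather than a real command-line tool. The name xgrep is
from the post; the function signature is hypothetical, and the input is
taken as a single string for simplicity.

```python
import re

def xgrep(pattern, text):
    """Yield each non-overlapping substring of text that matches pattern."""
    for m in re.finditer(pattern, text):
        yield m.group(0)

if __name__ == "__main__":
    source = '#define VERSION "V002"\n'
    for hit in xgrep(r"V[0-9][0-9][0-9]", source):
        print(hit)    # prints V002
```

Only the matched text is printed, not the line that contains it.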

Randal Schwartz offers a Perl solution, but you can't escape line
boundaries.  Consider the pattern
	^(.*\n){0,3}.*Able.*(\n.*){0,3}$
which means, print up to three lines of context on either side of any
line that contains "Able".  Generalized context grep.  You can write
patterns for any type of simple context.

(You can actually do this in Perl, but it becomes extremely
inefficient for large files, because you can only apply patterns to
strings, not streams.)
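
One way around the strings-versus-streams problem, sketched in Python
under a loudly stated assumption: no match is ever longer than max_len
characters, and the pattern uses no anchors or lookaround. The function
name and its parameters are hypothetical. Only a short tail of the
input is kept between chunks, so the whole file never has to sit in
memory.

```python
import re

def stream_matches(chunks, pattern, max_len):
    """Yield matches of pattern from an iterable of text chunks.

    Assumes no match is longer than max_len characters, so only a short
    tail of the input has to be kept between chunks.
    """
    rx = re.compile(pattern)
    buf = ""
    for chunk in chunks:
        buf += chunk
        pos = 0
        while True:
            m = rx.search(buf, pos)
            if m is None or m.start() + max_len > len(buf):
                # Anything starting in the last max_len - 1 characters
                # might still change once more input arrives; keep it.
                pos = max(pos, len(buf) - max_len + 1)
                break
            yield m.group(0)
            # Step past the match (one extra on an empty match).
            pos = m.end() if m.end() > m.start() else m.end() + 1
        buf = buf[pos:]
    # End of input: whatever is left can be searched as it stands.
    for m in rx.finditer(buf):
        yield m.group(0)
```

A match that straddles a chunk boundary is held back until enough input
has arrived to be sure of it. Patterns with no useful length bound,
such as the context pattern on files with very long lines, do not fit
this scheme.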
--
Felix Lee	flee@cs.psu.edu

lyda@acsu.buffalo.edu (kevin lyda) (04/29/91)

In article <b08Gc5!p1@cs.psu.edu> flee@cs.psu.edu (Felix Lee) writes:
.(from comp.unix.questions)
.| When I use the command "grep V\[0-9\]\[0-9\]\[0-9\] fred.c" it returns
.| 	#define VERSION "V002"
.|   or somesuch.  What I would really like is just the string of characters
.|   which matched:
.| 	V002

.Randal Schwartz offers a Perl solution, but you can't escape line
.boundaries.  Consider the pattern
.	^(.*\n){0,3}.*Able.*(\n.*){0,3}$
.which means, print up to three lines of context on either side of any
.line that contains "Able".  Generalized context grep.  You can write
.patterns for any type of simple context.

.(You can actually do this in Perl, but it becomes extremely
.inefficient for large files, because you can only apply patterns to
.strings, not streams.)

why not cut down your search space by using grep to find the lines with
the matching patterns and then using perl, or some other unix tool to grab
the pattern.... from the previous example you could do:

grep V\[0-9\]\[0-9\]\[0-9\] fred.c | tr ' ' '\012' | grep V\[0-9\]\[0-9\]\[0-9\]

of course that assumes that your field separators are spaces.
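
For comparison, a rough Python equivalent of that pipeline. The name
word_grep is hypothetical and the code is only a sketch of the same
three steps: select matching lines, split them on spaces, select
matching fields.

```python
import re

def word_grep(pattern, lines):
    """Yield space-separated fields that match, from lines that match."""
    rx = re.compile(pattern)
    for line in lines:
        if rx.search(line):                              # first grep
            for word in line.rstrip("\n").split(" "):    # tr: space to newline
                if rx.search(word):                      # second grep
                    yield word

if __name__ == "__main__":
    for word in word_grep(r"V[0-9][0-9][0-9]", ['#define VERSION "V002"\n']):
        print(word)    # prints "V002" with the double quotes still attached
```

Note that the result is the whole field, quotes included, not just the
text that matched.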

	a non-wizard,
		kevin

flee@cs.psu.edu (Felix Lee) (04/29/91)

>why not cut down your search space by using grep to find the lines with
>the matching patterns and then using perl, or some other unix tool to grab
>the pattern.... from the previous example you could do:
>grep V\[0-9\]\[0-9\]\[0-9\] fred.c | tr ' ' '\012' | grep V\[0-9\]\[0-9\]\[0-9\]

Well, yes, you can do that if you want a word-oriented grep.

My point was, I don't want a word-oriented grep.  I don't want a
line-oriented grep either.  I want a character-oriented grep, a grep
that will just grab matching substrings from an arbitrary stream.
And from this tool you can do word-oriented or line-oriented or
whatever-oriented grepping.

With the current line-oriented grep, you cannot search for a pattern
that spans lines.  Say you want to find occurrences of "the dog" in a
file, where the words can be separated by any whitespace, including
newlines.  You cannot do this easily with existing tools.
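
The difference can be shown with a small Python sketch. Both function
names are hypothetical. The first searches the text as one string, the
second searches it a line at a time the way grep does.

```python
import re

PHRASE = re.compile(r"the\s+dog")    # \s+ also matches newlines

def find_phrase(text):
    """Search the whole text at once, so a match may span lines."""
    return PHRASE.findall(text)

def find_phrase_by_line(text):
    """Search one line at a time, as a line-oriented grep would."""
    hits = []
    for line in text.splitlines():
        hits.extend(PHRASE.findall(line))
    return hits

if __name__ == "__main__":
    sample = "I saw the\ndog and the   dog.\n"
    print(len(find_phrase(sample)))            # prints 2
    print(len(find_phrase_by_line(sample)))    # prints 1
```

The line-at-a-time version misses the occurrence that is split across
the newline.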
--
Felix Lee	flee@cs.psu.edu

Tom Christiansen <tchrist@convex.COM> (04/29/91)

From the keyboard of flee@cs.psu.edu (Felix Lee):
:My point was, I don't want a word-oriented grep.  I don't want a
:line-oriented grep either.  I want a character-oriented grep, a grep
:that will just grab matching substrings from an arbitrary stream.
:And from this tool you can do word-oriented or line-oriented or
:whatever-oriented grepping.
:
:With the current line-oriented grep, you cannot search for a pattern
:that spans lines.  Say you want to find occurrences of "the dog" in a
:file, where the words can be separated by any whitespace, including
:newlines.  You cannot do this easily with existing tools.
	    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Depends on your idea of easy, I guess.  If you don't mind loading the
whole file in memory to do your work, you can use perl to slurp the whole
stream into the pattern space and do pattern matches to your heart's
content -- ones that are beyond grep's wildest dreams, too.

For example, here's a quick-and-dirty attempt to grep out function
declarations.  It "knows" that I (and all other reasonable people :-) 
always put the type on a preceding line, kind of like this:

    char *
    funct(arg)
	some_type arg;
    {
	blah
	blah blah 
	blahdy blahdy blah 
    } 

And what I'd like back is this:

    char *
    funct(arg)
	some_type arg;


Here's the code.  It's got some extra foo to get rid of C junk I don't
want to see.  I suppose I could run it through cpp if I were serious.

    #!/usr/bin/perl

    undef $/;		# disable input record separator
    $_ = <>;		# slurp input into pattern space
    $* = 1; 		# multi-line mode: ^ and $ match at every line boundary

    s#/\*#\200#g;	# trim comments first
    s#\*/#\201#g;
    s#\200[^\201]*\201##g;

    s/^#.*//g;		# and cpp directives

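    # now for the real work: print everything from the return type
    # through the argument declarations, stopping at the opening brace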
    s/((\w+[*\s]*){0,2}\n(\w+)\s*\([^)]*\)[^{]*)\{/print $1, "\n"/eg;


You could probably put this into a script that took the guts of the
LHS of the last line and did all the rest for you.  The last line is
really all that matters anyway.
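
For readers without a Perl of that vintage to hand, here is a rough
Python transliteration of the same idea. It is a sketch, not a faithful
port: the name function_headers is hypothetical, and comments are
stripped with a non-greedy pattern instead of the \200/\201
substitution trick. The declaration pattern itself is unchanged apart
from one group being made non-capturing.

```python
import re

# Same shape as the Perl pattern: up to two type words, a newline, the
# function name, the argument list, then anything up to the first "{".
DECL = re.compile(r"((?:\w+[*\s]*){0,2}\n(\w+)\s*\([^)]*\)[^{]*)\{")

def function_headers(source):
    """Return the text of each function header found in C source."""
    # Trim comments first (non-greedy, so each comment ends at its own */).
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    # Then cpp directives, one line at a time.
    source = re.sub(r"^#.*", "", source, flags=re.MULTILINE)
    return [m.group(1) for m in DECL.finditer(source)]

if __name__ == "__main__":
    code = (
        "/* demo */\n"
        "#include <stdio.h>\n"
        "\n"
        "char *\n"
        "funct(arg)\n"
        "\tsome_type arg;\n"
        "{\n"
        "\tblah;\n"
        "}\n"
    )
    for header in function_headers(code):
        print(header)
```

Like the original, it relies on the return type sitting on the line
before the function name.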

One problem with this kind of operation is that if you aren't careful
about your regexps, it can take a LONG time.  In the example given above,
if you change the {0,2} to a mere *, you'll be waiting for quite a while
as it tries all the possibilities.  Limiting function types to 0 to 2
words speeds it up into something tolerable.  You've got the same problem
with that \n there.  If you make it optional, the pattern matcher has 
a lot less to anchor it anywhere, and you get exponential blow-up.
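
A small Python sketch of the effect, using a cut-down version of the
pattern (an assumption made for illustration, not the full pattern
above). The names BOUNDED, UNBOUNDED and timed_search are hypothetical.
The input is one long word that can never match, which forces a
backtracking matcher to try the ways of splitting that word between
repetitions before giving up.

```python
import re
import time

# Cut-down versions of the type-words part of the pattern.
BOUNDED = re.compile(r"(?:\w+[*\s]*){0,2}\n\(")
UNBOUNDED = re.compile(r"(?:\w+[*\s]*)*\n\(")

def timed_search(rx, text):
    """Return (found, seconds) for one search of text with rx."""
    start = time.perf_counter()
    found = rx.search(text) is not None
    return found, time.perf_counter() - start

if __name__ == "__main__":
    text = "x" * 18    # one long word, never followed by a newline and "("
    for name, rx in (("bounded", BOUNDED), ("unbounded", UNBOUNDED)):
        found, seconds = timed_search(rx, text)
        print(name, found, round(seconds, 4))
```

Neither search finds anything. On a backtracking engine the unbounded
form should take noticeably longer, and the gap should widen quickly as
the word grows, but the actual timings depend on the engine and the
machine.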

But most greppers aren't usually worried about this kind of thing.  
I'm not sure how often it would really come up.


--tom