[comp.lang.perl] mg: multi-line grep

utashiro@sra.co.jp (Kazumasa Utashiro) (05/29/91)

When we are editing Japanese text, grep doesn't work enough
because Japanese word can be separated by newline.  That is
primary reason I made this command, but it is also useful to
search English sentence made from multiple words, which may
be folded into multiple lines.

Major features are:

	- matching across the line boundary
	- underline, bold or stand-out matched pattern
	- show lines before/after matched line
	- recursive search with wildcard
	- automatic uncompress

Man page is included but English is not very good, sorry.

Any comments for command itself or perl programming are
welcom. (//g is used correctly?)

---
K. Utashiro
utashiro@sra.co.jp

----------------------------------------------------------------------
#!/usr/local/bin/perl
'di';
'ig00';
;#
;# mg: multi-line grep
;# K. Utashiro <utashiro@sra.co.jp>
;# Original: Mar 29 1991
;# Revised: Apr 1 1991, Apr 3 1991, Apr 4 1991, Apr 15 1991, Apr 19 1991
;#	    Apr 23 1991, Apr 25 1991, May 29 1991
;# Usage:
;#	% mg 'internet control message protocol' *.txt
;#	/*
;#	 * then you will get lines including search pattern, even if
;#	 * the string is splited by newline.
;#	 */
;#
require('getopts.pl');
#require('usage.pl');
$opts = 'ilnhwv:c:eEC:ubqQYas2RP:V:FTLSmMf:p:xzdIN'; @opts = (
	'-:dINY:',
	'i::ignore case',
	'l::list filename only',
	'n::print line number',
	'h::do not display filenames',
	'w::space doesn\'t match null ("a b" matches "ab" by default)',
	'v:pattern:skip the line if matched with exp (w/o -a, -l)',
	'c:n[,n]:print n-lines before/after matched line',
	'e::use pattern as regular expression (space is special)',
	'E::use pattern as regular expression completely',
	'C:chars:continuous characters',
	'u::underline matched string (except JIS)',
	'b::make bold (print twice) matched string (except JIS)',
	'q::quote matched string (except JIS)',
	'Q::stand-out matched string (except JIS)',
	'Y::yield to -Q option even if stdout is not a terminal',
	'a::print whole file (no filename and line number)',
	's::output filename and line number separately',
	'2::duplicate line if multiple matching is occured',
	'R::search recursively',
	'P:pattern:specify search file in wildcard (w/-R)',
	'V:pattern:specify exception file in wildcard (w/-R)',
	'F::follow symbolic link of directory (w/-R)',
	'T::search text file only',
	'L::print formfeed before each matching',
	'S::get filenames from stdin',
	'm::print only splited line',
	'M::print only multiple matched line',
	'f:file:file contains search pattern',
	'p:pattern:specify search pattern',
	'x::print processing filename',
	'z::do not uncompress automatically',
);
sub usage {
    @usage = &Usage($0, $opts, "pattern [ file ... ]", @opts);
    print "usage: mg [ -options ] pattern [ file... ]\n", pop(@usage);
    exit 2;
}
($_ = $ENV{'MGOPTS'}) && do { s/^[^-]/-$&/; unshift(@ARGV, $_); };
&Getopts($opts) || &usage;
exec "nroff -man $0|" .($ENV{'PAGER'}||'more'). " -s" if $opt_N;
$opt_C =~ s/\W/\\$&/g, $opt_C = "[\\s$opt_C]" if $opt_C;
$delim = ($opt_C || '\s') . ($opt_w ? '+' : '*');
($rawdelim = $delim) =~ s/\W/\\$&/g;
$in  = join('|', (@in  = ('\033\$\@', '\033\$B')));
$out = join('|', (@out = ('\033\(J',  '\033\(B')));
$shiftcode   = '(' . join('|', @in, @out) . ')';
$optionalseq = '(' . join('|', length, @in, @out, $opt_C || '\s'). ')*';
($rawoptionalseq = $optionalseq) =~ s/\W/\\$&/g;
sub makepat {
    local($pat, $p) = @_;
    for (split(/$shiftcode/, $pat)) {
	if (/$in/o)  { $jis = 1; $p .= $optionalseq if $p; next; }
	if (/$out/o) { $jis = 0; next; }
	if ($jis)    { s/../&jis($&)/eg; $p .= $_; next; }
	s/([\200-\377])?./length($&) > 1 ? &mb($&) : &asc($&)/eg; $p .= $_;
    }
    $p =~ s/($rawdelim|$rawoptionalseq)+$//;
    length($p) ? $p : undef;
}
$opt_f && do { open(F, $opt_f) || die "$opt_f: $!\n"; chop($opt_f = <F>); };
defined $opt_p || defined($opt_p = $opt_f || shift) || &usage;
($opt_p, $opt_v) = (&makepat($opt_p), &makepat($opt_v));
($opt_P, $opt_V) = (&wildcard($opt_P), &wildcard($opt_V));
grep(eval "\$opt_$_=0,warn\"-$_ is disabled.\n\"if\$opt_$_;", 'q', 'u', 'b')
    if defined($jis);
($pre_c, $post_c) = /,/ ? split(',') : ($_, $_) if ($_ = $opt_c);
($ql, $qr, $qd, $qe) = ('>>', '<<', '', '') if $opt_q;
($ql, $qr, $qd, $qe) = &sose if ($opt_Q && ($opt_Y || -t STDOUT));
($nlqsubs, $nlsubs) = $opt_n ? ('"$qd\n$fn".++$tl.":$qe"', '"\n$fn".++$tl.":"')
			     : ('"$qd\n$fn$qe"', '"\n$fn"');
($effect = $opt_b || $opt_u) && &initsubs;
grep(eval "print \"opt_$_=/\$opt_$_/\n\" if \$opt_$_;", split(//, 'pvPVC'))
    if $opt_d;

$* = 1;
$ig = 'i' if $opt_i;
$showfname = !$opt_h && ($#ARGV > $[ || $opt_R || $opt_S);
$needinfo = $showfname || $opt_n;
$neednlsubs = !$opt_s && $needinfo;
$p_all = !($opt_m || $opt_M);
push(@ARGV, '-') if @ARGV == 0;

while (defined($file = $opt_S ? <> : shift)) {
    chop($file) if $opt_S;
    if (-d $file) { unshift(@ARGV, &getdirent($file)) if $opt_R; next; }
    next if ($opt_T && -B $file);
    open(F, $file) || do { $error = 2; warn "$file: $!\n" if !$opt_I; next; };
    open(F, '-|') || exec('zcat', $file) || die("zcat: $!\n")
	if ($file =~ /\.Z$/ && !$opt_z);
    undef $/; $_ = <F>; $/ = "\n"; close(F);
    print "$file:\n" if $opt_x;
    if ($opt_l) { eval 'print("$file\\n"),$matched++ if /$opt_p/o'.$ig; next; }
    $#x = $[ - 1;
if ($] > 4.003) {
    eval "while(/$opt_p/og$ig){push(\@x,length(\$`),length(\$&));}";
} elsif (($test || (($test = '10') =~ s/0/$`/, $test)) eq '11') {
    eval 's/$opt_p/push(@x,length($`),length($&));$&/ego'.$ig;
} else {
    $m = length;
    eval 's/$opt_p/$t=length($&);push(@x,'.$m.'-length($\')-$t,$t);$&/ego'.$ig;
}
    $fn = "$file:" if $showfname;
    if ($opt_a) {
	for ($op = 0; ($p, $len) = splice(@x, 0, 2); $op = $p + $len) {
	    $matched++;
	    $match = substr($_, $p, $len);
	    $match = &effect($match) if $effect;
	    print substr($_, $op, $p - $op), $ql, $match, $qr;
	}
	print substr($_, $op);
	next;
    }
    for ($line = 1, $op = 0; ($p, $len) = splice(@x, 0, 2); ) {
	$print = $p_all;
	@out = ($fn);
	$line += (substr($_, $op, $p - $op) =~ tr/\n/\n/) if $opt_n;
	$op = $p;
	$pnl = rindex($_, "\n", $p - 1) + 1;
	for ($n = $pre_c, $tl = $line; $n && $pnl > 0; $n--, $tl--) {
	    $pnl = rindex($_, "\n", $pnl - 2) + 1;
	}
	push(@out, $opt_s ? $line : $tl, ':') if $opt_n;
	$left = substr($_, $pnl, $p - $pnl);
	$left =~ s/\n/$nlsubs/gee if $neednlsubs;
	push(@out, "\n") if ($opt_s && $needinfo);
	push(@out, $left);
	for (;; ($p, $len) = splice(@x, 0, 2)) {
	    $match = substr($_, $p, $len);
	    $print = (index($match, "\n") >= $[) if $opt_m;
	    $match = &effect($match) if $effect;
	    $match =~ s/\n/$nlqsubs/gee if $neednlsubs;
	    push(@out, $ql, $match, $qr);
	    $nnl = index($_, "\n", $p += $len);
	    last if ($nnl < $x[$[] || $opt_2 || @x < 2);
	    $opt_M && $print++;
	    push(@out, substr($_, $p, $x[$[] - $p));
	}
	for ($n = $post_c; $n--; $nnl = $tnnl) {
	    last if ($tnnl = index($_, "\n", $nnl+1)) < 0;
	}
	$right = substr($_, $p, $nnl - $p);
	next if ($opt_v ne '' && substr($_, $pnl, $nnl - $pnl) =~ /$opt_v/o);
	$right =~ s/\n/$nlsubs/gee if $neednlsubs;
	push(@out, $right, "\n");
	unshift(@out, "\f\n") if ($opt_L && $matched);
	print @out if ($print && ++$matched);
    }
}
exit($error || !$matched);
sub asc {
    local($_) = @_;
    $opt_E || s/\s/$delim/ || $opt_e || s/\W/\\$&/, $_;
}
sub mb {
    local($_, @_) = ($_[$[], split(//, shift));
    $_ = sprintf("\\%03o\\%03o", ord(shift), ord(shift)) if @_ == 2;
    $opt_E ? $_ : $_.$delim;
}
sub jis {
    $_[$[] =~ s/(\W)/\\$1/g;
    $_[$[] . $optionalseq;
}
sub initsubs {
    @ul[1,2] = ("_\010", "__\010\010"); @bs[1,2] = ("\010", "\010\010");
    $s  = 'sub effect{$_[$[]=~s/[\\200-\\337].|./$l=length($&);';
    $s .= '$ul[$l].' x $opt_u . '$&.$bs[$l].' x $opt_b . "\$&/ge;shift;}\n";
    eval $s; print $s if $opt_d && ($opt_u || $opt_b);
}
sub getdirent {
    local($dir, $_, @ent) = @_;
    print "Entering \"$dir\".\n" if $opt_d;
    return @ent if (-l $dir && !$opt_F);
    opendir(DIR, $dir) || do { warn "$dir: $!\n"; return @ent; };
    push(@ent, $_) while ($_ = readdir(DIR));
    close(DIR);
    @ent = grep(!m#^\.\.?$# && (-d "$dir/$_" || $opt_P eq '' || m/$opt_P/o)
		&& ($opt_V eq '' || !m/$opt_V/o) && s#^#$dir/#, sort @ent);
}
sub wildcard {
    local($_) = @_;
    s#\\?.#$_ = $&; s/\\?([_0-9A-Za-z])/$1/ || /\\./ || s/[*]/.*/ ||
	s/[|]/\$|^/ || tr/?{,}[]\-/.(|)[]\-/ || s/./\\$&/; $_;#ge;
    length($_) ? "^$_\$" : undef;
}
sub sose {
    do 'ioctl.ph' || do 'sys/ioctl.ph';
    require('termcap.pl'); $ospeed = 1 unless $ospeed;
    &Tgetent; $so = &Tputs($TC{'so'}); $se = &Tputs($TC{'se'});
    ($so, $se, $se, $so);
}

;#------------------------------------------------------------
;# usage.pl: make a string for usage line.
;#	by K. Utashiro <utashiro@sra.co.jp> on Sep 7 1990
;#	Revised by utashiro on Mar 16 1991
;#	Revised by utashiro on Mar 20 1991
;#
;# Syntax: &Usage($command, $option, $trailer, @arglist);
;#	$command: command name (maybe $0)
;#	$option:  option string same as &Getopt
;#	$trailer: trailer string (optional)
;#	@arglist: description for options which takes argument (optional)
;#		  format is "option character : argument : description"
;#		  where argument and description are optional.
;#		  special form '-:xyz' hides options -x, -y, -z.
;#
;# &Usage returns list of two strings where 1st string is for usage
;# line and 2nd is for description.
;#
;# Example:
;#	$opts = 'deg:u:s:x'; @arglist = (
;#		'-:x',			# means -x is secret option
;#		'd::debug',
;#		'g:group',
;#		'u:user:user name',
;#	);
;#	unless (&Getopts($opts)) {
;#		print &Usage($0, $opts, 'file ...', @arglist);
;#		exit(1);
;#	}
;#
;# Result:
;#	usage: sample [ -d ] [ -e ] [ -g group ] [ -u user ] [ -s : ] file ...
;#		-d       debug
;#		-u user  user name
;#
sub Usage {
    package usage; reset('a-z');
    local($cmd, $opt, $trailer, @arglist) = @_;
    for (@arglist) {
	($name, $arg, $desc) = split(/:/, $_, 3);
	if ($name eq '-') {
	    grep($hide{$_}++, split('', $arg)); next;
	}
	next if $hide{$name};
	$arg{$name} = $arg;
	$desc{$name} = $desc;
	$w = length($arg) if ($desc && $w < length($arg));
    }
    $cmd =~ s#.*/##;
    push(@usage, 'usage:', $cmd);
    while ($opt =~ /^\s*(.)(:?)/) {
	$opt = $';
	next if $hide{$1};
	push(@opts, $1);
	push(@usage, '[', "-$1");
	push(@usage, $arg{$1} || $2) if $2;
	push(@usage, ']');
    }
    push(@usage, $trailer) if $trailer;
    for (grep($desc{$_}, @opts)) {
	push(@desc, sprintf("\t-%s %-${w}s  %s\n", $_, $arg{$_}, $desc{$_}));
    }
    (join(' ', @usage)."\n", join('', @desc));
}
1;
;#------------------------------------------------------------
.00;			# finish .ig
 
'di			\" finish diversion--previous line must be blank
.nr nl 0-1		\" fake up transition to first page again
.nr % 0			\" start at page 1
';<<'.ex'; #__END__ ############# From here on it's a standard manual page #
.TH MG 1 "May 29, 1991"
.AT 3
.SH NAME
mg \- multi-line grep
.SH SYNOPSIS
.B mg
[
.B options
]
.I pattern
[
.I file
\&... ]
.SH DESCRIPTION
.I Mg
searches the specified pattern from files or standard input
and print lines which contain the pattern.  It has almost
same function as Unix command
.IR grep (1)
but is distinguished from grep because the matching is done
across the line boundaries.
.PP
For example, to find a sentence ``internet control message
protocol'' from many RFC texts, you can say
.nf

	mg\ \-i 'internet control message protocol'\ *.txt

.fi
Match will occur for sequence of words `internet',
`control', `message', `protocol' separated by any number of
whitespace characters including null string (\-w option
avoids null matching).
.PP
.I Mg
is also very useful to find a word from Japanese text
because Japanese words are not separated by whitespaces and
newline character can be inserted at any place of the text.
As a matter of fact,
.I mg
is made for Japanese string search originally.  Any Japanese
codes including JIS, Shift-JIS, EUC can be handled
hopefully, but JIS code disables some features and may make
the command confused when searching a short string.
.PP
If the file has .Z suffix, it is uncompressed before search.
.SH ENVIRONMENT
Environment variable MGOPTS is used as a default options.
.SH OPTIONS
.LP
GREP COMPATIBLE OPTIONS:
.RS
.IP \-i
Ignore case.
.IP \-l
List filename only.
.IP \-n
Show line number.
.IP \-h 
Do not display filenames even if multiple filenames are
specified by command line.
.RE
.LP
SLIGHTLY DIFFERENT OPTIONS:
.RS
.IP \-w 
Word sensitive.  Space character in the pattern matches null
string by default.  So pattern ``a\ b'' matches to string
``ab''.  With this option, space character matches single or
more space.
.IP "\-v \fIpattern\fP"
Skip the line if matched with pattern.  Don't print the
matched line if it matched to the pattern specified with
this option.  This option doesn't have any effect used with
\-a or \-l option.
.RE
.LP
OTHER OPTIONS:
.RS
.IP "\-c \fIn[,n]\fP"
Print n-lines before/after matched line.  Only single line
is displayed by default, but you can specify the lines
before or after matched line to be printed.  If ``\-c\ 3''
options is suplied, three lines before and after matched
line will be displayed together.  You can see only after
three lines by ``\-c 0,3'' option.
.IP \-e 
Use the pattern as a regular expression in 
.IR perl (1)
but space is treated specially.  With this option, you can
use
.I mg
like
.IR egrep (1)
like this:
.nf

	mg \-e 'foo bar|\^goo car|\^hoo dar' ...

.fi
See
.IR perl (1)
for detail of regular expression.
.IP
Note that if you are running
.IR jperl (1)
kanji code can be used in range expression, but you have to
use \-E option to do that.
.IP \-E 
Use the pattern as regular expression in
.I perl
completely.
.IP "\-C \fIchars\fP"
Continuous character.  If you want search sentence continued
by other than white space characters, continuous characters
can be set by this options.  For example, next command
find a sentence even if it is quoted by `>' or `|' mark.
.nf

	mg \-C '>\^|\^' 'ninja shogun fujiyama' `mhpath all`

.fi
To search a pattern in C style comment:
.nf

	mg \-C '/*' 'setuid scripts' perl.c

.fi
.IP
Note that continuous characters don't have to be found only
top of the string.  So ``foo\ bar'' matches a string
``foo>>bar'' on the previous example.
.IP \-u 
Underline matched string.  Makes a matched string underlined
by precede each character by ``_^H''.
.IP \-b 
Make bold matched string.  Makes a matched string
overstriked like ``f_^Hfo_^Hoo_^Ho''.
.IP \-q 
Quote matched string by like ``>>matched<<''.  This options
is disabled when searching JIS file.
.IP \-Q 
Use a stand-out feature of the terminal to quote the matched
string (not for JIS).  This options is ignored when the
standard output is not a terminal.  \-q and \-Q are useful
for long line.  Try
.nf

	echo $path | mg \-Q mh

.fi
.IP \-a 
Print all contents of the file.  This option makes a sense
only if used with options like \-q, \-Q, \-u, \-b, otherwise
behave almost same as 
.IR cat (1).
Filename and lines are not printed with this option.
.IP \-s 
When multiple files are specified, each matched line are
preceded by ``filename:'' string.  With this option, newline
character is inserted between filename and matched line.
This option is useful when the filenames are very long.
.IP \-2 
Usually only one line is displayed even if multiple matching
is occurred for the same line.  With this option, each
matching will be displated in different line.
.IP \-R 
Search recursively.  Only files specified by command line
arguments are searched by default.  When invoked with this
option and arguments contains a directory, it is searched
recursively.  Usually used with \-P option.
.IP "\-P \fIpattern\fP"
Search file pattern.  When directories are searched
recursively, only files which matches the `pattern' is
searched.  A `pattern' is specified in wildcard format same
as shell and `|\^' character can be used for alternative in
addition.  For example, you can find a string ``foobar''
from all C source files and makefiles like this:
.nf

	mg \-RP '*.c|\^[Mm]akefile' foobar /usr/src

.fi
.IP "\-V \fIpattern\fP"
Exception file pattern.  This is a counterpart of \-P.  Only
files which DOES NOT match the pattern will be searched.
.nf

	mg \-RV '*.[oas]' foobar /usr/src

.fi
.IP
Note that \-V option is also applied to a directory name
while \-P option has an effect only for a file.  This means
you can specify a directory name to skip, but can't specify
a directory name to search.
.IP \-F
Follow symbolic link of a directory.  Doesn't follow by
default.
.IP \-T 
Search text file only.  If this options is suplied, only
text files are searched.  Decision is made by perl operator
\-T.
.IP \-L
Print formfeed between each matchings.  Print the formfeed
character before each matched line.  This options is useful
used with \-c option and piped to pager command.
.IP \-S 
Get filenames from standard input.  Read standard input and
use each line as a filename for searching.  You can feed the
output from other command like
.IR find (1)
for
.I mg
with this option.
.IP \-m 
Print matched line only when the pattern is across the line.
.IP \-M 
Print matched line only when multiple matching is occurred
for the same line.
.IP "\-f \fIfile\fP"
Specify the file which contains search pattern.  Only the
first line of the file is used.  This options is useful when
input of KANJI character is difficult from tty driver.
.IP \-p pattern
Specify search pattern.  You don't have to use this option
explicitly because the first argument after options will be
treated as a pattern.
.IP \-x 
Print processing filename.
.IP \-z 
Disables automatic uncompress.
.RE
.SH AUTHOR
Kazumasa Utashiro <utashiro@sra.co.jp>
.br
Software Research Associates, Inc., Japan
.SH "SEE ALSO"
grep(1), perl(1), jperl(1)
.SH BUGS
.PP
Hypheneation is not supported.  Please don't tell me to
support skipping header and footer inserted by nroff...
.PP
Long JIS code pattern can not be processed.
.PP
/pattern/g syntax is not tested yet.
.ex

sherman@unx.sas.com (Chris Sherman) (05/30/91)

In <UTASHIRO.91May29164841@ext12.sra.co.jp> utashiro@sra.co.jp (Kazumasa Utashiro) writes:

>Any comments for command itself or perl programming are
>welcom. (//g is used correctly?)

>---
>K. Utashiro
>utashiro@sra.co.jp

>----------------------------------------------------------------------
>#!/usr/local/bin/perl
>'di';
>'ig00';
>;#
>;# mg: multi-line grep
>;# K. Utashiro <utashiro@sra.co.jp>
>;# Original: Mar 29 1991
>;# Revised: Apr 1 1991, Apr 3 1991, Apr 4 1991, Apr 15 1991, Apr 19 1991
>;#	    Apr 23 1991, Apr 25 1991, May 29 1991

Great program.  But I am having a problem with the recursive descent option.
I'm not sure if it is the option itself, or the mere fact it allows the 
program to run for a longer period of time, but as soon as I start the 
command, my available disk space drops about 7 Megs within the first minute 
or so.  And the more the program runs, the more disk space gets eaten.  After
a while, I get a Filesystem Full error.

This has happened on an HP and Sun4, both running perl v4p3.

The funny thing is that I can't figure out where all the disk space goes...
I stopped (cntl-Z) the program and thoroughly examined the file system, and I 
couldn't find 7 Megs worth of new files.  When I kill (kill -9)  the program,
the space comes back.

Does anyone know what is going on here?  Is the problem caused by perl,
or by this neat program?

Big Thanx,
--
Chris Sherman .................... sherman@unx.sas.com   |
              ,-----------------------------------------'
             /  Q:  How many IBM CPU's does it take to execute a job?
            |   A:  Four; three to hold it down, and one to rip its head off.