[comp.lang.perl] extract -- field-oriented grep and expr

roger@mav.com (Roger Droz) (03/12/91)

Never underestimate the power of self-modifying code!  Take a simple perl
three instruction perl loop, and add a user interface and you have the
following "database extraction" filter, ready to take its place along
side cut, paste, sort, and grep.

The enclosed shar contains "extract", which effectively does an expr(1)
on one or more fields in a file, sending lines the meet the condition to
stdout.

Example:
	extract -t: 3n '>' 700 '&' 3n '<' '900' /etc/passwd

    Extract those lines from /etc/passwd with the 3rd colon-delimited
    field (user-id) between 700 and 900.

____________
               Roger Droz                  Domain: roger@mav.COM           
()       ()    Maverick International      UUCP: uw-beaver!gtisqr!roger
 (_______)     Mukilteo, WA 
  (     )      
   |   |       Disclaimer: "We're all mavericks here: 
   |   |                    Each of us has our own opinions,
   (___)                    and the company has yet different ones!"

--- cut here --- cut here --- cut here --- cut here --- cut here ----
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of shell archive."
# Contents:  extract extract.man
# Wrapped by roger@hindmost on Mon Mar 11 14:58:57 1991
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'extract' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'extract'\"
else
echo shar: Extracting \"'extract'\" \(4983 characters\)
sed "s/^X//" >'extract' <<'END_OF_FILE'
X#!/usr/local/bin/perl -S
Xeval 'exec /usr/local/bin/perl -S $0 "$@"'  # this emulates #! processing on NIH machines.
X	if $running_under_some_shell;         # (remove #! line above if indigestible)
X
X#* Header4 *******************************************************************
X#*	     Copyright (c) Maverick International Inc, 1991		    **
X#*		Author: Roger Droz (roger@mav.COM)			    **
X#*									    **
X#*   This program may be freely shared on a non-profit basis, so long as    **
X#*   this copyright and authorship notice are retained. 		    **
X#*									    **
X#*   This material is provided solely on an "as-is" basis, and neither	    **
X#*   MAVERICK INTERNATIONAL INC nor the author assume any responsibility    **
X#*   for any costs arising from use or misuse of this material, including   **
X#*   all correction and damages, consequential or incidental.		    **
X#*****************************************************************************
X# $Id: extract,v 1.2 91/03/11 14:58:08 roger Exp $
X#
X# DESCRIPTION
X#       extract -- extract records from a file based on arithmatic comparison.
X#
X#       extract [-tchar] expression [file]
X#
X#	where expression:
X#	    field[opt] relop value
X#
X#               field - field number, numbering field from 1.
X#               opt   - n for numeric comparision; i for case-insensitive ASCII
X#                       comparision. Default case-sensitive ASCII.
X#               relop - relational operator >, <, =, >=, etc. as in expr(1).
X#        
X# REVISION HISTORY (Comment each entry.)
X#	$Log:	extract,v $
X# Revision 1.2  91/03/11  14:58:08  roger
X# Added copyright notice.
X# 
X# Revision 1.1  91/03/07  11:34:36  roger
X# Initial revision
X#** EndHeader ***************************************************************
X
X
X# Intialize variables.
X$[ = 1;	# array base at 1
X$| = 1; # flush stdout
X
X@program = (
X'   while (<INFILE>) {     ',
X'       split(/$tab/o, $_, $nsplits); ',
X'       print if (stuff); ',
X'   }               ',
X);
X
X$splitline = 2;
X$printif = 3;
X$tab = '[\s\n]+';
X$nsplits = 1;
X
X%numopts = (
X    '==',   '==',
X    '=',    '==',
X    '>',    '>',
X    '<',    '<',
X    '>=',    '>=',
X    '<=',   '<=',
X    '!=',   '!='
X);
X
X%stringopts = (
X    '==',   'eq',
X    '=',    'eq',
X    '>',    'gt',
X    '<',    'lt',
X    '>=',    'ge',
X    '<=',   'le',
X    '!=',   'ne',
X    ':',    '=~'
X);    
X
X$infile = '-';
X@compline = ();
X
Xif ($#ARGV < 3) {
X    $_ = shift(@ARGV);
X    while (s/^\s*('([^']*)'|"([^"]*)"|([^'"]\S*))//) {
X        push(@opts, $+);
X    }
X    if ($#ARGV >= $[) {
X        push(@opts, @ARGV[$[ .. $#ARGV]);
X    } 
X} else {
X    @opts = @ARGV;
X}
X
Xwhile ($#opts >= $[) {
X    $opt = shift(@opts);
X
X    if ($opt =~ /^-[td]/) {
X        $tab = substr($opt,3,1);
X        if ($tab =~ /[A-Z]|[a-z]/) {
X            $tab = sprintf('[%s\\n]', $tab);
X        } else {
X            $tab = sprintf('[\\%s\\n]', $tab);
X        }
X    } elsif ($opt =~ /^[\(\)\!]/) {
X        push (@compline, $opt);
X    } elsif ($opt =~ /[\|\&]/) {
X        push (@compline, (substr($opt,1,1) x 2));
X    } elsif ($opt =~ /^\s*[+-]*\d+/) {
X        ($field, $numflag) = ($opt =~ /^[+-]*(\d+)([ni])*/);
X        $nsplits = $field if ($field > $nsplits);
X        $realop = shift(@opts);
X        if ($numflag eq 'n') {
X            $relop = $numopts{$realop};
X        } else {
X            $relop = $stringopts{$realop};
X        }
X        $relop = $realop unless ($relop);
X        if ($relop eq 'eq' && $numflag eq 'i') {
X            $relop = '=~';
X            $regex = '^' . shift(@opts) . '$';
X        } elsif ($relop eq '=~') {
X            $regex = shift(@opts);
X        }
X        if ($relop ne '=~') {
X            push (@compline, sprintf('$_[%d] %s \'%s\'', $field, $relop, shift(@opts)));
X        } else {
X            push (@compline, sprintf('$_[%d] %s m/%s/%s', $field, $relop, $regex, $numflag));
X        }
X    } elsif ($opt !~ /^-/) {
X        if ($infile eq '-') {
X            $infile = $opt;
X        } else {
X            die "Second file argument \'$opt\' encountered";
X        }
X    } elsif ($opt ne '-') {
X        $0 =~ s/.*\///;
X        print stderr "Usage:
X$0 [-tc] field_number relop value [[&||] field relop value] [file]
X    -tc     where c is tab character (field separator). (Default = whitespace.)
X    field_number 
X            field number, counting from one as in cut(1).  Suffix field
X            number with n for numeric comparison, or i for case-insensitve
X            ASCII comparison.
X    relop
X            =, <, >, >=, <=, !=, : -- as in expr(1)
X            Also parentheses and | &
X
X    file    filename (default stdin).
X";
X    exit 1;
X    }            
X}
X
X$nsplits;
X$program[$printif] = sprintf("\tprint if (%s);", join(' ', @compline));
X$program[$splitline] = sprintf("\tsplit(/%s/, %s, %d);", $tab, '$_',  $nsplits+1);
Xdie 'No relations specified' if ($#compline < $[);
X
Xopen (INFILE, $infile) || die "Cannot open file $infile";
X 
Xeval join("\n", @program);
Xif ($@) {
X    print stderr join("\n", @program, "\n");
X    die $@;
X}
END_OF_FILE
if test 4983 -ne `wc -c <'extract'`; then
    echo shar: \"'extract'\" unpacked with wrong size!
fi
chmod +x 'extract'
# end of 'extract'
fi
if test -f 'extract.man' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'extract.man'\"
else
echo shar: Extracting \"'extract.man'\" \(3527 characters\)
sed "s/^X//" >'extract.man' <<'END_OF_FILE'
X.\" $Id: extract.man,v 1.2 91/03/11 14:24:13 roger Exp $
X.TH EXTRACT L "11 March 1991" "Maverick International"
X.\" Full space in nroff; half space in troff
X.de SP
X.if n .sp
X.if t .sp .5
X..
X.de EX
X.in +1i
X.nf
X..
X.de EY
X.in -1i
X.fi
X..
X.SH "NAME"
X\fBextract\fP - Perform relational database
X.I select
Xoperation on a file whose fields are delimited by a character or
Xwhitespace.
X.SP
X.SH "SYNOPSIS"
X\fBextract\fP [\fB-t\fIchar\fR] [\fIfield relop value\fP] [\fIfile\]
X.SH "DESCRIPTION"
X.B extract
Xselects lines from the input file that satisfy a relational expression.
XThe output is akin to
X.BR grep (1)
Xbecause each line matching the condition is output in its entirety.  The
Xexpression syntax is similar to
X.BR expr (1),
Xwith fields specified in a manner similar to
X.BR cut (1).
X.B extract
Xis written in perl, and is therefore slower than 
X.BR grep (1),
Xbut
X.B extract
Xcan perform relational as well as regular expression matches and isolate
Xcomparisons easily to the nth field of each record.
X.SH "OPTIONS"
X.IP \fB-d\fIchar\fR
X.IP \fB-t\fIchar\fR
XDefine the field delimiter character to be the single character,
X\fIchar\fR.  By default, fields are assumed to be delimited by one or
Xmore whitespace characters.
X.TP
X.I field
XSpecifies the field number to compared, numbering fields from 1 as in
X.BR cut (1).
XBy default, alphabetic comparison is performed.  Numeric comparison
Xmay be specified by suffixing the field number with 
X.BR n ,
Xas in 2n.
XCase insensitive alphabetic comparison may be specified by suffixing
Xthe field number with
X.BR i ,
Xbut case insensitive comparison is only supported with the =
Xand : operators.
X.TP
X.I relop
XA relational operator, as in 
X.BR expr (1):
X.nf
X    =   equal
X    >   greater than
X    <   less than
X    >=  greater than or equal
X    <=  less than or equal
X    !=  not equal
X    :   matches regular expression
X.fi
X.TP
X.I value
XA value or regular expression, as appropriate for the relational
Xoperator.
X.SP
XRegular expressions are
X.I not
Xautomatically anchored to the beginning of the field as in
X.BR expr (1).
XThe characters
X.BR "^" " and " "$"
Xmay be used to refer to the beginning and end of the
X.I field.
XThe regular expression language is that of
X.BR perl (1),
Xwhich is more like
X.BR egrep (1)
Xthan
X.BR grep (1).
X.TP
X.I file
XThe input file.  If not specified, the standard input is used.
X.SP
XSeveral relational expressions may be given, joined by the logical
Xoperators & and | (AND and OR, respectively).  Parentheses and ! for NOT
Xare also supported.
X.SP
XBecause so many of the relational operators must be quoted from the
Xshell, all of the command line arguments may be given as a single
Xargument:
X.SP
X.nf
X    extract "-t: 3n > 700 & 3n < 900" /etc/passwd
X        as opposed to
X    extract -t: 3n '>' 700 '&' 3n '<' 900 /etc/passwd
X.fi
X.SH "EXAMPLES"
X.TP
X.B extract "-t: 3n > 700 & 3n < 900" /etc/passwd
X.br
XExtract those lines from /etc/passwd with the 3rd colon-delimited field (user-id)
Xbetween 700 and 900.
X.TP
X.B extract "-t: 5i : control" /etc/passwd
XExtract those lines from /etc/passwd where the 5th colon-delimited field
Xcontains the word 'control'.  Pattern match is case insensitive.
X.SH "CAVEATS"
XThis program is called extract because select is already a reserved word in the korn shell.
X.SH "BUGS"
XThe error handling and usage aren't the greatest.  Errors like unmatched
Xparentheses are reported by dumping a perl script to stderr.
X.SP
XThe command line parser is crude.  It thinks most anything it doesn't
Xunderstand is a filename.
X.SH "AUTHOR"
XRoger Droz (roger@mav.COM)
END_OF_FILE
if test 3527 -ne `wc -c <'extract.man'`; then
    echo shar: \"'extract.man'\" unpacked with wrong size!
fi
# end of 'extract.man'
fi
echo shar: End of shell archive.
exit 0

tchrist@convex.COM (Tom Christiansen) (03/14/91)

From the keyboard of roger@mav.com (Roger Droz):
:Never underestimate the power of self-modifying code!  Take a simple perl
:three instruction perl loop, and add a user interface and you have the
:following "database extraction" filter, ready to take its place along
:side cut, paste, sort, and grep.
:
:The enclosed shar contains "extract", which effectively does an expr(1)
:on one or more fields in a file, sending lines the meet the condition to
:stdout.
:
:Example:
:	extract -t: 3n '>' 700 '&' 3n '<' '900' /etc/passwd
:
:    Extract those lines from /etc/passwd with the 3rd colon-delimited
:    field (user-id) between 700 and 900.

I truly hate to be the one to say this, but how is this better than awk?

--tom