[net.sources] Yet another grep-like utility

nz@wucs.UUCP (Neal Ziring) (09/09/85)

	Below is the shell archive for a new utility developed here
at Washington University by a team of crack programmers (myself :-).
It is 
	revgrep \- search a file for a pattern backwards, provide tails

For more information, skip past the included shell archive.


-----------------clip here ------------------
#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create the files:
#	revgrep/Makefile
#	revgrep/revgrep.c
#	revgrep/revgrep.1
# This archive created: Mon Sep  9 14:28:49 1985
export PATH; PATH=/bin:$PATH
if test -d 'revgrep'
then
	echo shar: putting files into existing directory "'revgrep'"
else
	mkdir 'revgrep'
fi
if test -f 'revgrep/Makefile'
then
	echo shar: will not over-write existing file "'revgrep/Makefile'"
else
cat << \SHAR_EOF > 'revgrep/Makefile'
#
# revgrep: search files for a pattern, backwards
#
#     Revgrep used to be composed of three source files, 
#   revgrep.c, revio.c, and revgrep.h.  Since none of the
#   three were particularly big, they were combined into 
#   one source file, revgrep.c.
#
SRCS=  revgrep.c
OBJS= revgrep.o 
CFLAGS= -O
DESTDIR= /usr/local/bin

all: revgrep

install: revgrep
	install -s -m 755 revgrep ${DESTDIR}

revgrep: ${OBJS}
	cc ${CFLAGS} -o revgrep ${OBJS}
SHAR_EOF
fi # end of overwriting check
if test -f 'revgrep/revgrep.c'
then
	echo shar: will not over-write existing file "'revgrep/revgrep.c'"
else
cat << \SHAR_EOF > 'revgrep/revgrep.c'

/* **************************************************************
 * revgrep: search files for a pattern backwards
 *
 *        revgrep is a variant of the grep(1) family, designed
 *    to behave like grep, but do all line input in reverse.
 *    (regular expression are matched left-to-right, as usual.)
 *    the library routines re_comp and re_exec are the basis for
 *    the regular-expression matching.
 *
 *    in addition to the functionality of grep, revgrep provides
 *    the functionality of tail(1), but with the option to base
 *    the length of the tail on a match of a regular expression
 *    as well as on the line count.  further, revgrep does not
 *    have tail's buffer-size limitations (at the price of being
 *    a little slower).
 *
 *    the basic calling syntax for revgrep is as follows:
 * 
 *            % revgrep [ options ] [ expression ] [ file ... ]
 *
 *    if no files are given, revgrep reads the std. input.
 *    meaningful options are:
 *
 *           -v		print only lines that don't match
 *           -c         print only a count of matching lines
 *           -l         print only filenames of files with matches
 *           -n         print relative line number with matches
 *           -b		print block number in file with matches
 *           -s         no output, only status
 *           -h         don't print filenames with matches
 *           -y         fold upper case to lower on input
 *           -e         next argument is the expression
 *
 *           -t         text from first match (backwards) to 
 *                      end is printed in correct order.
 *           -nnn       forego matches, do a -t on this many lines
 */

#include <stdio.h>
#include <strings.h>
#include <ctype.h>
#include <signal.h>
#include <sys/param.h>
#include <sys/stat.h>
#include <sys/file.h>

#define RBUF 1024

int rev_fd = -1; 
int sty    =  0;
int fflag  =  1;
extern int errno;
int exitval = 1;
#define Is_nonseek(FD,SB) \
  ( (SB.st_mode == 0) || (SB.st_mode & S_IFMT == S_IFSOCK) || (isatty(FD)) )


char dont_match = 0;
char print_count= 0;
char print_names= (char) 4;
char print_blocks=0;
char print_lnum = 0;
char print_matches= 1;
char fold_upper = 0;
char use_stdin  = 1;
char do_tail    = 0;
char silent_mode= 0;
int  tail_cnt   = 0;
char *tmp_name   = NULL;

 main(argc,argv)
     int argc;
     char *argv[];
{
    char *s1, *s2;
    char *expr = NULL;
    int i;
    struct stat sb;
    
    /* first try to process options */
    for(s1 = *++argv, argc--; argc > 0 && s1 != NULL; s1 = *++argv, argc--) {
	if (*s1 != '-') break;
	for(s2=s1; *++s2 != '\0'; ) {

	    switch (*s2) {
	      case 'v':
		dont_match++;
		break;
	      case 'c':
		print_count++;
		print_matches=0;
		break;
	      case 'n':
		print_lnum = 1;
		break;
	      case 'l':
		print_names = 1;
		print_matches=0;
		break;
	      case 'b':
		print_blocks++;
		break;
	      case 's':
		print_names = 0;
		print_matches=0;
		print_blocks = 0;
		print_count = 0;
		do_tail     = 0;
		silent_mode = 1;
		break;
	      case 'h':
		print_names = 0;
		break;
	      case 'i':
	      case 'y':
		fold_upper = 1;
		break;
	      case 't':
		do_tail = 1;
		print_matches = 0;
		break;
	      case 'e':
		expr = *++argv;
		break;
	      default:
		if ( isdigit(*s2) ) {
		    do_tail = 1;
		    print_matches = 0;
		    tail_cnt = atoi(s2);
		    for( ; *s2 != '\0' ; s2++);
		    s2--;
		}
		else {
		    if (silent_mode) exit(2);
		    fprintf(stderr,"revgrep: Unknown flag, '%c'.\n",*s2);
		    fprintf(stderr,
		       " usage: revgrep [ -slbchnyet# ] expr [ file ] ...\n");
		    exit(2);
		}
	    }
	}
    }

    if (s1 != NULL && expr == NULL && tail_cnt == 0) {
	expr = s1;
	argv++; argc--;
    }
    if (expr == NULL && tail_cnt == 0) exit(2);
    if (expr != NULL) {
	char *re_comp();
	if ( (s1 = re_comp(expr)) != NULL) {
	    if ( ! silent_mode ) fprintf(stderr,"revgrep: %s \n",s1);
	    exit(2);
	}
    }

    if (print_names > 1  &&
	( (*argv == NULL)?(1):(*(argv+1) == NULL))) print_names = 0;

    if (*argv == NULL) {
	rev_fd = 0;
	sty = 0;
	if ( fstat(0,&sb) >= 0 ) {
	    if ( Is_nonseek(0,sb) )
	      {
		if (  tmp_setup(0) < 0 ) {
		    if ( ! silent_mode ) perror("revgrep: tmp_setup");
		    exit(2);
		}
	      }
	}
	else {
	    exit(2);
	}
    }
    i = rgrep(argv);
    
    if ( tmp_name != NULL )
      unlink(tmp_name);

    exit(i);
}
		

/* **************************************************************
 * tail: print a tail, given a file position
 *
 *       This routine copies to stdout all the data after a 
 *   certain file position.  It checks various global flags
 *   before starting the copy, and prints a message if
 *   appropriate.  The file of which a tail is to be done
 *   is assumed to be open on rev_fd.
 */
 tail(fpos,fname)
     int fpos;
     char *fname;
{
    int len;
    char buf[RBUF];

    if (print_names) printf("%s:",fname);
    if (print_blocks) printf("starting in %d",fpos/512);
    if (print_names || print_blocks) putchar('\n');

    if (lseek(rev_fd,(fpos==0)?(0):(fpos+1),0) < 0) {
	perror("revgrep: tail");
	return(-1);
    }
    while( (len=read(rev_fd,buf,RBUF)) > 0) 
	fwrite(buf,sizeof(char),len,stdout);

    return(0);
}



/* **************************************************************
 * rgrep: backwards grep a set of files, or stdin
 *
 *      Rgrep performs a backwards grep on an input vector
 *   of files.  If the input vector is Null, then the stdin
 *   is assumed to be the input file.
 *   This routine uses rline() to get the next line backwards,
 *   and uses doline() to do pattern-matching and output on
 *   the line received.  
 */
 rgrep(files)
     char *files[];
{
    char *s1, *s2;
    char *lbuf;
    int  pos, lcnt, match, match_cnt;
    char didone = 0;

    for(didone = 0;
	didone == 0 || *files != NULL;
	files += ((*files==NULL)?0:1))
      {
	char *f, *rline();
	f = *files;
	for(lcnt=0, match_cnt=0;
	    ( lbuf = rline(&pos,f) ) != NULL;
	    f=NULL, lcnt++, didone=1)
	  {
	      if (match = doline(lbuf,pos,lcnt,*files)) {
		  match_cnt++;
		  exitval=0;
		  if (do_tail) {
		      tail(pos,*files);
		      didone = 1;
		      break;
		  }
		  if (print_names && print_matches == 0 && print_count == 0) {
		      printf("%s\n",*files);
		      break;
		  }
	      }
	  }
	if (print_count) {
	    if (print_names) printf("%s:%d\n",*files,match_cnt);
	    else printf("%d\n",match_cnt);
	}
    }
    return(exitval);
}

/* **************************************************************
 * doline: match and handle a single line of input
 *
 *       This line receives a single text line, and performs
 *    pattern matches on it if necessary.  Depending on the
 *    various options, the line may be output with a message
 *    two.
 *       When a match of any kind occurs, 1 is returned.
 *    Otherwise 0 is returned.
 */
 doline(lbuf,pos,lin,fname)
     char *lbuf;
     int pos, lin;
     char *fname;
{
    register char *s1, *s2;
    int i,j,m;

    if (tail_cnt > 0) {
	    if (lin >= tail_cnt || pos <= 1) return(1);
	    else return(0);
    }

    if (fold_upper) {
	    char cmpbuf[RBUF*2];
	    for(s1=lbuf, s2=cmpbuf;
		*s1 != '\0' && (s2-cmpbuf) < (RBUF*2); s1++, s2++ )
	      *s2 = (isupper(*s1))?(tolower(*s1)):(*s1);
	    *s2 = '\0';
	    m = re_exec(cmpbuf);
    }
    else  m = re_exec(lbuf);

    m = (dont_match)?(! m):(m);
    if (! m) return(0);
 
    if (print_matches) {
	if (print_names) printf("%s:",fname);
	if (print_lnum) printf("-%d:",lin);
	if (print_blocks) printf("%d:",pos/512);
	printf("%s\n",lbuf);
    }
    return(1);
}


/* **************************************************************
 * tmp_setup: read a non-seekable channel into a temp file
 *
 *       This routine reads the contents of a file descriptor
 *  into a temporary file, and then passes back the name of the
 *  file in a NULL-terminated vector.
 *
 *       A signal catcher is set up to delete the file if
 *  the user hits ^C or ^\
 */
 tmp_setup(ifd)
     int ifd;
{
    static char nambuf[60];
    int ofd, len;
    int sig_exit();
    char dbuf[RBUF];

    sprintf(nambuf,"/tmp/rvg.a%d",getpid());
    if ( (ofd = open(nambuf,O_RDWR | O_CREAT,0600)) < 0) return(-1);

    if ( signal(SIGINT,sig_exit) == SIG_IGN) signal(SIGINT,SIG_IGN);
    if ( signal(SIGQUIT,sig_exit) == SIG_IGN) signal(SIGQUIT,SIG_IGN);
    if ( signal(SIGHUP,sig_exit) == SIG_IGN) signal(SIGHUP,SIG_IGN);
    
    for( ; (len = read(ifd,dbuf,RBUF)) > 0; write(ofd,dbuf,len));
    tmp_name = nambuf;
    rev_fd = ofd;
    return (ofd);
}

sig_exit()
{
    close(0); close(1); close(2); close(3); close(4); close(5); close(6);
    close(rev_fd);
    if (tmp_name != NULL) unlink(tmp_name);
    exit(2);
}



/* **************************************************************
 * rline: get next line going backwards, return pertinent info
 *
 *        This routine returns the next line going backwards from
 *   the end of a file.  It also will open a new file, if 
 *   neccesary, and if given a file name as a parameter.
 *
 *   The line is written into a static buffer, and a pointer to
 *   that small buffer is returned.  Also, the position in the
 *   file of the start of the line is returned.
 */
 char *rline(posp, fname)
     int *posp;
     char *fname;
{
	register char  *s1, *s2;
	static char *bufp;
	static char bf[RBUF];
	static char lbf[RBUF * 2], rlbf[RBUF * 2];
	static int  bf_len, fpos;

	if (fname != NULL || fflag) {
		if ( (bf_len = rblock(&fpos, bf, fname)) < 0) return(NULL);
		bufp = bf + (bf_len - 1);
		fflag = 0;
	}
	
	/* by now we have a block, so search backwards for newline */
	for(s1=lbf, s2=s1 + (2*RBUF - 1), *s2-- = '\0';
	    *bufp != '\n' &&   s2 >= s1;
	    bufp--, fpos--)
	  {
		*s2-- = *bufp;
		if (bufp == bf) {
			if ((bf_len = rblock(&fpos,bf,NULL)) < 0) {
				fflag = 1;
				break;
			}
			bufp = bf + bf_len;
			fpos++;
		}
	}
	fpos--;
	if ( --bufp < bf ) fflag = 1;
	*posp = fpos;

	return(++s2);
}

/* **************************************************************
 * rblock: return next block going backward in a file
 *
 *       This routine reads the next-last block in a file,
 *    or starts a new file, given a name.
 *    The flag  sty  tells whether the current file descriptor
 *    has been stat-ed yet.  If not, it is implicit that the
 *    blocks will be taken from the end of the file.
 *
 *       Note, if this routine fails with errors 29 or 5 then
 *    a message is printed and the whole program aborts.
 *
 *      The length of the block read-and-returned is returned.
 */
 rblock(fposp, buf, fname)
     int *fposp;
     char *buf, *fname;
{
	static int nbp;
	int    len;

	if (fname != NULL) {
		if ( rev_fd > -1 ) close(rev_fd);
		if ( (rev_fd=open(fname,O_RDONLY,0)) < 0) {
			fprintf(stderr,"revgrep: ");
			perror(fname);
			return(-1);
		}
		sty = 0;
	}

	if (! sty) {
		struct stat sb;
		if (fstat(rev_fd, &sb) < 0) return(-1);
		nbp = (sb.st_size  / RBUF ) * RBUF;
		sty = 1;
	}

	if (nbp < 0) return(-1);
	if ( lseek(rev_fd,nbp,0) < 0) {
		if (errno == 29  ||  errno == 5) {
			perror("revgrep: rblock");
			exit(1);
		}
		else return(-1);
	}

	if ( (len = read(rev_fd,buf,RBUF)) < 0) return(-1);
	*fposp = nbp + len;
	nbp = nbp - RBUF;
	return( len );
}
		
	
	
	
SHAR_EOF
fi # end of overwriting check
if test -f 'revgrep/revgrep.1'
then
	echo shar: will not over-write existing file "'revgrep/revgrep.1'"
else
cat << \SHAR_EOF > 'revgrep/revgrep.1'
.TH REVGREP 1 "7 September 1985"
.UC 4
.SH NAME
revgrep \- search a file for a pattern backwards, provide tails
.SH SYNOPSIS
.B revgrep
[ option ] ...
expression [ file ] ...
.SH DESCRIPTION
.I Revgrep
is a variant of the 
.I grep
family.
The named
.I files
(standard input default)
are searched from the end towards the beginning
for occurences of the specified pattern.
The patterns recognized are limited regular expressions
of the type described the manual entry for
.IR ed (1).
The library routine
.IR regex (3)
is used to do pattern matching.
.PP
.I Revgrep 
can also be used to print a trailing portion of a file, either based on 
the occurence of a pattern, or on a number of lines.  In this function it
is subsumed by 
.IR sed (1)
and
.IR tail (1), 
and is much faster than 
.I sed(1)
for long files.
.PP
The following options are recognized.
.TP
.B \-v
All lines but those matching are printed.
.TP
.B \-c
Only a count of matching lines is printed.
.TP
.B \-l
The names of files with matching lines are listed (once) separated by newlines.
.TP
.B \-i
Upper case is folded to lower case for comparisons.
.TP
.B \-n
Each line is preceded by its relative line number from the end of the file.
.TP
.B \-b
Each line is preceded by the block number on which it was found.
This is sometimes useful in locating disk block numbers by context.
.TP
.B \-s
Silent mode.  Nothing is printed (except error messages).
This is useful for checking the error status.
.TP
.BI \-e " expression"
Same as a simple
.I expression 
argument, but useful when the
.I expression
begins with a \-.
.TP
.B \-y
Same as 
.B \-i.
.TP
.B \-t
Give a file tail whose first line matches the given pattern.
There is no limit on the length of the tail.
.TP
.I \-cnt
Give a file tail \fIcnt\fP lines long.
This performs the same function as
.IR tail (1)
but is much less efficient for operations on pipes.  There is no
limit on the length of the file tail (unlike 
.I tail,
which limits the total length of the file tail to 4096 characters).
.LP
In all cases the file name is shown if there is more than one input file.
Care should be taken when using the characters $ * [ ^ | ( ) and \\ in the
.I expression
as they are also meaningful to the Shell.  It is safest to enclose the entire
.I expression
argument in single quotes \' \'.
.LP
In the following description of regular expression
syntax, `character' excludes newline:
.IP
A \e followed by a single character other than newline matches that character.
.IP
The character ^ matches the beginning of a line.
.IP
The character $ matches the end of a line.
.IP
A 
.B .
(period) matches any character.
.IP
A single character not otherwise endowed with special
meaning matches that character.
.IP
A string enclosed in brackets [\|] matches any single character from the string.
Ranges of ASCII character codes may be abbreviated as in `a\-z0\-9'.
A ]
may occur only as the first character of the string.
A literal \- must be placed where it can't be mistaken as a range indicator.
.IP
A regular expression followed by an * (asterisk) matches a sequence of 0
or more matches of the regular expression.
A regular expression followed by a + (plus) matches a sequence of 1 or more
matches of the regular expression.
A regular expression followed by a ? (question mark) matches a sequence of
0 or 1 matches of the regular expression.
.IP
Two regular expressions concatenated match a match of the first followed
by a match of the second.
.IP
A regular expression enclosed in quoted parentheses
matches a match for the regular expression (e.g. \e( \fIx\fP \e)).
.IP
A number \fIn\fP preceded by a backslash \e matches the same string
as the \fIn\fPth
regular expression enclosed in parentheses.
.LP
The order of precedence of operators at the same parenthesis level
is [\|] then *+? then concatentation.
.SH "SEE ALSO"
ed(1),
sed(1),
sh(1),
grep(1),
bm(1),
tail(1),
body(1)
.SH DIAGNOSTICS
Exit status is 0 if any matches are found,
1 if none, 2 for syntax errors or inaccessible files.
.SH BUGS
Lines are limited to 2047 characters; longer lines are split into
chunks at most 2047 characters long.
.PP
Data from non-seekable descriptors (pipes, sockets, and ttys)
is copied to a temporary file, ``/tmp/rvg.*''.
This file will not be removed if the
program is killed with a SIGKILL or a SIGTERM.
SHAR_EOF
fi # end of overwriting check
#	End of shell archive
exit 0

----------------------------------------------

The revgrep
program operates like grep, has all the same options, and matches a class
of regular expressions with roughly the same generality as grep(1).
All searching is done from the end of files, towards the beginning
of the files.  Argument lists are processed left-to-right, which I
suppose is an inconsistency :-).   Revgrep is about 5-10% slower than
grep(1) on simple patterns, but if you need the last of two hundred
occurences of a pattern, it can be a blessing.

	Revgrep also subsumes the function of tail(1) but with the
proviso that the tail can start only at a line offset from the end
of the file, or at the first occurence of a match from the end of the
file.  Revgrep has NO LIMIT on the number of lines in the tail output.
The program was developed for the purpose of perusing the most recent
N hours or M days of huge time-stamped logfiles.  Revgrep begins output
of the required chunks of data in about 1-3 seconds, while sed(1)
used to take 30-90 seconds.

On operations on ordinary files and devices, revgrep seems to be about
10-30% slower than tail(1).  On operations on pipes and sockets revgrep
is many times slower than tail(1) because the incoming data stream is 
copied to temporary file before it is searched (hence the lack of a 
tail size restriction).

	Complaints about revgrep may be sent to nz@wucs.UUCP, where I
will read them carefully before deleting them.  Helpful comments would
be appreciated.
-- 
========
...nz (ECL - we're here to provide superior computing)
	Washington University Engineering Computer Laboratory

    "Now we'll see some proper action..." 

	old style:	... ihnp4!wucs!nz
	new style:	nz@wucs.UUCP