[comp.sources.misc] v17i069: regex - REGEX Globber

johnk@wrq.com (John Kercheval) (03/26/91)
Submitted-by: John Kercheval <johnk@wrq.com>
Posting-number: Volume 17, Issue 69
Archive-name: regex/part01

Here is the shar archive of V1.10 of REGEX Globber.
This is a *IX wildcard globber I butchered, hacked and cajoled together
after seeing and hearing about and becoming disgusted with several similar
routines which had one or more of the following attributes:  slow, buggy,
required large levels of recursion on matches, required grotesque levels
of recursion on failing matches using '*', full of caveats about usability
or copyrights.

These routines are fairly well tested and reasonably fast.  I have made 
an effort to fail on all bad patterns and to quickly determine failing '*' 
patterns.  This parser will also do quite a bit of the '*' matching via 
quick linear loops versus the standard blind recursive descent.

This parser has been submitted to profilers at various stages of development
and has come through quite well.  If the last millisecond is important to
you then some time can be shaved by using stack allocated variables in
place of many of the pointer follows (which may be done fairly often) found
in regex_match and regex_match_after_star (ie *p, *t).

No attempt is made to provide general [pat,pat] comparisons.  The specific
subcases supplied by these routines is [pat,text] which is sufficient
for the large majority of cases (should you care).

Since regex_match may return one of three different values depending upon
the pattern and text I have made a simple shell for convenience (match()).
Also included is an is_pattern routine to quickly check a potential pattern
for regex special characters.  I even placed this all in a header file for
you lazy folks!

Having said all that, here is my own reinvention of the wheel.  Please
enjoy it's use and I hope it is of some help to those with need ....

                                jbk
----
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then feed it
# into a shell via "sh file" or similar.  To overwrite existing files,
# type "sh file -c".
# The tool that generated this appeared in the comp.sources.unix newsgroup;
# send mail to comp-sources-unix@uunet.uu.net if you want that tool.
# Contents:  match.! match.c match.doc match.h matchmak matchtst.bat
#   readme.doc
# Wrapped by kent@sparky on Mon Mar 25 14:33:28 1991
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
echo If this archive is complete, you will see the following message:
echo '          "shar: End of archive 1 (of 1)."'
if test -f 'match.!' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'match.!'\"
else
  echo shar: Extracting \"'match.!'\" \(950 characters\)
  sed "s/^X//" >'match.!' <<'END_OF_FILE'
X
X..............................................................................
X..                                                                          ..
X.                    REGEX Globber (Wild Card Matching)                      .
X.                                                                            .
X.               A *IX SH style pattern matcher written in C                  .
X.                   V1.10 Dedicated to the Public Domain                     .
X.                                                                            .
X.                                March  12, 1991                             .
X.                                 J. Kercheval                               .
X.                         [72450,3702] -- johnk@wrq.com                      .
X..                                                                          ..
X..............................................................................
X
END_OF_FILE
  if test 950 -ne `wc -c <'match.!'`; then
    echo shar: \"'match.!'\" unpacked with wrong size!
  fi
  # end of 'match.!'
fi
if test -f 'match.c' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'match.c'\"
else
  echo shar: Extracting \"'match.c'\" \(16844 characters\)
  sed "s/^X//" >'match.c' <<'END_OF_FILE'
X/*
X EPSHeader
X
X   File: match.c
X   Author: J. Kercheval
X   Created: Sat, 01/05/1991  22:21:49
X*/
X/*
X EPSRevision History
X
X   J. Kercheval  Wed, 02/20/1991  22:29:01  Released to Public Domain
X   J. Kercheval  Fri, 02/22/1991  15:29:01  fix '\' bugs (two :( of them)
X   J. Kercheval  Sun, 03/10/1991  19:31:29  add error return to matche()
X   J. Kercheval  Sun, 03/10/1991  20:11:11  add is_valid_pattern code
X   J. Kercheval  Sun, 03/10/1991  20:37:11  beef up main()
X   J. Kercheval  Tue, 03/12/1991  22:25:10  Released as V1.1 to Public Domain
X*/
X
X/*
X   Wildcard Pattern Matching
X*/
X
X
X#include "match.h"
X
Xint matche_after_star (register char *pattern, register char *text);
Xint fast_match_after_star (register char *pattern, register char *text);
X
X/*----------------------------------------------------------------------------
X*
X* Return TRUE if PATTERN has any special wildcard characters
X*
X----------------------------------------------------------------------------*/
X
XBOOLEAN is_pattern (char *p)
X{
X    while ( *p ) {
X        switch ( *p++ ) {
X            case '?':
X            case '*':
X            case '[':
X            case '\\':
X                return TRUE;
X        }
X    }
X    return FALSE;
X}
X
X
X/*----------------------------------------------------------------------------
X*
X* Return TRUE if PATTERN has is a well formed regular expression according
X* to the above syntax
X*
X* error_type is a return code based on the type of pattern error.  Zero is
X* returned in error_type if the pattern is a valid one.  error_type return
X* values are as follows:
X*
X*   PATTERN_VALID - pattern is well formed
X*   PATTERN_ESC   - pattern has invalid escape ('\' at end of pattern)
X*   PATTERN_RANGE - [..] construct has a no end range in a '-' pair (ie [a-])
X*   PATTERN_CLOSE - [..] construct has no end bracket (ie [abc-g )
X*   PATTERN_EMPTY - [..] construct is empty (ie [])
X*
X----------------------------------------------------------------------------*/
X
XBOOLEAN is_valid_pattern (char *p, int *error_type)
X{
X    
X    /* init error_type */
X    *error_type = PATTERN_VALID;
X    
X    /* loop through pattern to EOS */
X    while( *p ) {
X
X        /* determine pattern type */
X        switch( *p ) {
X
X            /* check literal escape, it cannot be at end of pattern */
X            case '\\':
X                if( !*++p ) {
X                    *error_type = PATTERN_ESC;
X                    return FALSE;
X                }
X                p++;
X                break;
X
X            /* the [..] construct must be well formed */
X            case '[':
X                p++;
X
X                /* if the next character is ']' then bad pattern */
X                if ( *p == ']' ) {
X                    *error_type = PATTERN_EMPTY;
X                    return FALSE;
X                }
X                
X                /* if end of pattern here then bad pattern */
X                if ( !*p ) {
X                    *error_type = PATTERN_CLOSE;
X                    return FALSE;
X                }
X
X                /* loop to end of [..] construct */
X                while( *p != ']' ) {
X
X                    /* check for literal escape */
X                    if( *p == '\\' ) {
X                        p++;
X
X                        /* if end of pattern here then bad pattern */
X                        if ( !*p++ ) {
X                            *error_type = PATTERN_ESC;
X                            return FALSE;
X                        }
X                    }
X                    else
X                        p++;
X
X                    /* if end of pattern here then bad pattern */
X                    if ( !*p ) {
X                        *error_type = PATTERN_CLOSE;
X                        return FALSE;
X                    }
X
X                    /* if this a range */
X                    if( *p == '-' ) {
X
X                        /* we must have an end of range */
X                        if ( !*++p || *p == ']' ) {
X                            *error_type = PATTERN_RANGE;
X                            return FALSE;
X                        }
X                        else {
X
X                            /* check for literal escape */
X                            if( *p == '\\' )
X                                p++;
X
X                            /* if end of pattern here then bad pattern */
X                            if ( !*p++ ) {
X                                *error_type = PATTERN_ESC;
X                                return FALSE;
X                            }
X                        }
X                    }
X                }
X                break;
X
X            /* all other characters are valid pattern elements */
X            case '*':
X            case '?':
X            default:
X                p++;                              /* "normal" character */
X                break;
X         }
X     }
X
X     return TRUE;
X}
X
X
X/*----------------------------------------------------------------------------
X*
X*  Match the pattern PATTERN against the string TEXT;
X*
X*  returns MATCH_VALID if pattern matches, or an errorcode as follows
X*  otherwise:
X*
X*            MATCH_PATTERN  - bad pattern
X*            MATCH_LITERAL  - match failure on literal mismatch
X*            MATCH_RANGE    - match failure on [..] construct
X*            MATCH_ABORT    - premature end of text string
X*            MATCH_END      - premature end of pattern string
X*            MATCH_VALID    - valid match
X*
X*
X*  A match means the entire string TEXT is used up in matching.
X*
X*  In the pattern string:
X*       `*' matches any sequence of characters (zero or more)
X*       `?' matches any character
X*       [SET] matches any character in the specified set,
X*       [!SET] or [^SET] matches any character not in the specified set.
X*
X*  A set is composed of characters or ranges; a range looks like
X*  character hyphen character (as in 0-9 or A-Z).  [0-9a-zA-Z_] is the
X*  minimal set of characters allowed in the [..] pattern construct.
X*  Other characters are allowed (ie. 8 bit characters) if your system
X*  will support them.
X*
X*  To suppress the special syntactic significance of any of `[]*?!^-\',
X*  and match the character exactly, precede it with a `\'.
X*
X----------------------------------------------------------------------------*/
X
Xint matche ( register char *p, register char *t )
X{
X    register char range_start, range_end;  /* start and end in range */
X
X    BOOLEAN invert;             /* is this [..] or [!..] */
X    BOOLEAN member_match;       /* have I matched the [..] construct? */
X    BOOLEAN loop;               /* should I terminate? */
X
X    for ( ; *p; p++, t++ ) {
X
X        /* if this is the end of the text then this is the end of the match */
X        if (!*t) {
X            return ( *p == '*' && *++p == '\0' ) ? MATCH_VALID : MATCH_ABORT;
X        }
X
X        /* determine and react to pattern type */
X        switch ( *p ) {
X
X            /* single any character match */
X            case '?':
X                break;
X
X            /* multiple any character match */
X            case '*':
X                return matche_after_star (p, t);
X
X            /* [..] construct, single member/exclusion character match */
X            case '[': {
X
X                /* move to beginning of range */
X                p++;
X
X                /* check if this is a member match or exclusion match */
X                invert = FALSE;
X                if ( *p == '!' || *p == '^') {
X                    invert = TRUE;
X                    p++;
X                }
X
X                /* if closing bracket here or at range start then we have a
X                   malformed pattern */
X                if ( *p == ']' ) {
X                    return MATCH_PATTERN;
X                }
X
X                member_match = FALSE;
X                loop = TRUE;
X
X                while ( loop ) {
X
X                    /* if end of construct then loop is done */
X                    if (*p == ']') {
X                        loop = FALSE;
X                        continue;
X                    }
X
X                    /* matching a '!', '^', '-', '\' or a ']' */
X                    if ( *p == '\\' ) {
X                        range_start = range_end = *++p;
X                    }
X                    else {
X                        range_start = range_end = *p;
X                    }
X
X                    /* if end of pattern then bad pattern (Missing ']') */
X                    if (!*p)
X                        return MATCH_PATTERN;
X
X                    /* check for range bar */
X                    if (*++p == '-') {
X
X                        /* get the range end */
X                        range_end = *++p;
X
X                        /* if end of pattern or construct then bad pattern */
X                        if (range_end == '\0' || range_end == ']')
X                            return MATCH_PATTERN;
X
X                        /* special character range end */
X                        if (range_end == '\\') {
X                            range_end = *++p;
X
X                            /* if end of text then we have a bad pattern */
X                            if (!range_end)
X                                return MATCH_PATTERN;
X                        }
X
X                        /* move just beyond this range */
X                        p++;
X                    }
X
X                    /* if the text character is in range then match found.
X                       make sure the range letters have the proper
X                       relationship to one another before comparison */
X                    if ( range_start < range_end  ) {
X                        if (*t >= range_start && *t <= range_end) {
X                            member_match = TRUE;
X                            loop = FALSE;
X                        }
X                    }
X                    else {
X                        if (*t >= range_end && *t <= range_start) {
X                            member_match = TRUE;
X                            loop = FALSE;
X                        }
X                    }
X                }
X
X                /* if there was a match in an exclusion set then no match */
X                /* if there was no match in a member set then no match */
X                if ((invert && member_match) ||
X                   !(invert || member_match))
X                    return MATCH_RANGE;
X
X                /* if this is not an exclusion then skip the rest of the [...]
X                    construct that already matched. */
X                if (member_match) {
X                    while (*p != ']') {
X
X                        /* bad pattern (Missing ']') */
X                        if (!*p)
X                            return MATCH_PATTERN;
X
X                        /* skip exact match */
X                        if (*p == '\\') {
X                            p++;
X
X                            /* if end of text then we have a bad pattern */
X                            if (!*p)
X                                return MATCH_PATTERN;
X                        }
X
X                        /* move to next pattern char */
X                        p++;
X                    }
X                }
X
X                break;
X            }
X
X            /* next character is quoted and must match exactly */
X            case '\\':
X
X                /* move pattern pointer to quoted char and fall through */
X                p++;
X
X                /* if end of text then we have a bad pattern */
X                if (!*p)
X                    return MATCH_PATTERN;
X
X            /* must match this character exactly */
X            default:
X                if (*p != *t)
X                    return MATCH_LITERAL;
X        }
X    }
X
X    /* if end of text not reached then the pattern fails */
X    if ( *t )
X        return MATCH_END;
X    else
X        return MATCH_VALID;
X}
X
X
X/*----------------------------------------------------------------------------
X*
X* recursively call matche() with final segment of PATTERN and of TEXT.
X*
X----------------------------------------------------------------------------*/
X
Xint matche_after_star (register char *p, register char *t)
X{
X    register int match = 0;
X    register nextp;
X
X    /* pass over existing ? and * in pattern */
X    while ( *p == '?' || *p == '*' ) {
X
X        /* take one char for each ? and + */
X        if ( *p == '?' ) {
X
X            /* if end of text then no match */
X            if ( !*t++ ) {
X                return MATCH_ABORT;
X            }
X        }
X
X        /* move to next char in pattern */
X        p++;
X    }
X
X    /* if end of pattern we have matched regardless of text left */
X    if ( !*p ) {
X        return MATCH_VALID;
X    }
X
X    /* get the next character to match which must be a literal or '[' */
X    nextp = *p;
X    if ( nextp == '\\' ) {
X        nextp = p[1];
X
X        /* if end of text then we have a bad pattern */
X        if (!nextp)
X            return MATCH_PATTERN;
X    }
X
X    /* Continue until we run out of text or definite result seen */
X    do {
X
X        /* a precondition for matching is that the next character
X           in the pattern match the next character in the text or that
X           the next pattern char is the beginning of a range.  Increment
X           text pointer as we go here */
X        if ( nextp == *t || nextp == '[' ) {
X            match = matche(p, t);
X        }
X
X        /* if the end of text is reached then no match */
X        if ( !*t++ ) match = MATCH_ABORT;
X
X    } while ( match != MATCH_VALID && 
X              match != MATCH_ABORT &&
X              match != MATCH_PATTERN);
X
X    /* return result */
X    return match;
X}
X
X
X/*----------------------------------------------------------------------------
X*
X* match() is a shell to matche() to return only BOOLEAN values.
X*
X----------------------------------------------------------------------------*/
X
XBOOLEAN match( char *p, char *t )
X{
X    int error_type;
X    error_type = matche(p,t);
X    return (error_type == MATCH_VALID ) ? TRUE : FALSE;
X}
X
X
X#ifdef TEST
X
X    /*
X    * This test main expects as first arg the pattern and as second arg
X    * the match string.  Output is yaeh or nay on match.  If nay on
X    * match then the error code is parsed and written.
X    */
X
X    #include <stdio.h>
X
X    int main(int argc, char *argv[])
X    {
X        int error;
X        int is_valid_error;
X
X        if (argc != 3) {
X            printf("Usage:  MATCH Pattern Text\n");
X        }
X        else {
X            printf("Pattern: %s\n", argv[1]);
X            printf("Text   : %s\n", argv[2]);
X            
X            if (!is_pattern(argv[1])) {
X                printf("    First Argument Is Not A Pattern\n");
X            }
X            else {
X                error = matche(argv[1],argv[2]);
X                is_valid_pattern(argv[1],&is_valid_error);
X
X                switch ( error ) {
X                    case MATCH_VALID:
X                        printf("    Match Successful");
X                        if (is_valid_error != PATTERN_VALID)
X                            printf(" -- is_valid_pattern() is complaining\n");
X                        else
X                            printf("\n");
X                        break;
X                    case MATCH_LITERAL:
X                        printf("    Match Failed on Literal\n");
X                        break;
X                    case MATCH_RANGE:
X                        printf("    Match Failed on [..]\n");
X                        break;
X                    case MATCH_ABORT:
X                        printf("    Match Failed on Early Text Termination\n");
X                        break;
X                    case MATCH_END:
X                        printf("    Match Failed on Early Pattern Termination\n");
X                        break;
X                    case MATCH_PATTERN:
X                        switch ( is_valid_error ) {
X                            case PATTERN_VALID:
X                                printf("    Internal Disagreement On Pattern\n");
X                                break;
X                            case PATTERN_ESC:
X                                printf("    Literal Escape at End of Pattern\n");
X                                break;
X                            case PATTERN_RANGE:
X                                printf("    No End of Range in [..] Construct\n");
X                                break;
X                            case PATTERN_CLOSE:
X                                printf("    [..] Construct is Open\n");
X                                break;
X                            case PATTERN_EMPTY:
X                                printf("    [..] Construct is Empty\n");
X                                break;
X                            default:
X                                printf("    Internal Error in is_valid_pattern()\n");
X                        }
X                        break;
X                    default:
X                        printf("    Internal Error in matche()\n");
X                        break;
X                }
X            }
X
X        }
X        return(0);
X    }
X
X#endif
END_OF_FILE
  if test 16844 -ne `wc -c <'match.c'`; then
    echo shar: \"'match.c'\" unpacked with wrong size!
  fi
  # end of 'match.c'
fi
if test -f 'match.doc' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'match.doc'\"
else
  echo shar: Extracting \"'match.doc'\" \(5288 characters\)
  sed "s/^X//" >'match.doc' <<'END_OF_FILE'
X
X                    REGEX Globber (Wild Card Matching)
X
X               A *IX SH style pattern matcher written in C
X                   V1.10 Dedicated to the Public Domain
X
X                                March  12, 1991
X                                 J. Kercheval
X                         [72450,3702] -- johnk@wrq.com
X
X
X
X
X*IX SH style Regular Expressions
X================================ 
X 
XThe *IX command SH is a working shell similar in feel to the MSDOS
Xshell COMMAND.COM.  In point of fact much of what we see in our
Xfamiliar DOS PROMPT was gleaned from the early UNIX shells available
Xfor many of machines the people involved in the computing arena had
Xat the time of the development of DOS and it's much maligned
Xprecursor CP/M (although the UNIX shells were and are much more
Xflexible and powerful then those on the current flock of micro
Xmachines).  The designers of DOS and CP/M did some fairly strange
Xthings with their command processor and OS.  One of those things was
Xto only selectively adopt the regular expressions allowed within the
X*IX shells.  Only '?' and '*' were allowed in filenames and even with
Xthese the '*' was allowed only at the end of a pattern and in fact
Xwhen used to specify the filename the '*' did not apply to extension.
XThis gave rise to the all too common expression "*.*".
X
XREGEX Globber is a SH pattern matcher.  This allows such
Xspecifications as *75.zip or * (equivelant to *.* in DOS lingo).
XExpressions such as [a-e]*t would fit the name "apple.crt" or
X"catspaw.bat" or "elegant".  This allows considerably wider
Xflexibility in file specification, general parsing or any other
Xcircumstance in which this type of pattern matching is wanted. 
X
XA match would mean that the entire string TEXT is used up in matching
Xthe PATTERN and conversely the matched TEXT uses up the entire
XPATTERN. 
X
XIn the specified pattern string:
X     `*' matches any sequence of characters (zero or more)
X     `?' matches any character
X     `\' suppresses syntactic significance of a special character
X     [SET] matches any character in the specified set,
X     [!SET] or [^SET] matches any character not in the specified set.
X
XA set is composed of characters or ranges; a range looks like
X'character hyphen character' (as in 0-9 or A-Z).  [0-9a-zA-Z_] is the
Xminimal set of characters allowed in the [..] pattern construct.
XOther characters are allowed (ie. 8 bit characters) if your system
Xwill support them (it almost certainly will).
X
XTo suppress the special syntactic significance of any of `[]*?!^-\',
Xand match the character exactly, precede it with a `\'.
X 
XTo view several examples of good and bad patterns and text see the
Xoutput of MATCHTST.BAT
X
X
X 
XMATCH() and MATCHE()
X====================
X
XThe match module as written has two parsing routines, one is matche()
Xand the other is match().  Since match() is a call to matche() which
Xsimply has its output mapped to a BOOLEAN value (ie TRUE if pattern
Xmatches or FALSE otherwise), I will concentrate my explanations here
Xon matche().
X
XThe purpose of matche() is to match a pattern against a string of
Xtext (usually a file name or specification).  The match routine has
Xextensive pattern validity checking built into it as part of the
Xparser and allows for a robust pattern match.
X
XThe parser gives an error code on return of type int.  The error code
Xwill be one of the the following defined values (defined in match.h):
X 
X    MATCH_PATTERN  - bad pattern or misformed pattern
X    MATCH_LITERAL  - match failed on character match (standard
X                     character)
X    MATCH_RANGE    - match failure on character range ([..] construct)
X    MATCH_ABORT    - premature end of text string (pattern longer
X                     than text string)
X    MATCH_END      - premature end of pattern string (text longer
X                     than pattern called for)
X    MATCH_VALID    - valid match using pattern
X
XThe functions are declared as follows:
X
X    BOOLEAN match (char *pattern, char *text);
X
X    int     matche(register char *pattern, register char *text);
X
X
X
XIS_VALID_PATTERN() and IS_PATTERN()
X===================================
X
XThere are two routines for determining properties of a pattern
Xstring.  The first, is_pattern(), is designed simply to determine if
Xsome character exists within the text which is consistent with a SH
Xregular expression (this function returns TRUE if so and FALSE if
Xnot).  The second, is_valid_pattern() is designed to check the
Xvalidity of a given pattern string (TRUE return if valid, FALSE if
Xnot).  By 'validity', I mean well formed or syntactically correct.
X
XIn addition, is_valid_pattern() has as one of it's parameters a
Xreturn code for determining the type of error found in the pattern if
Xone exists.  The error codes are as follows and defined in match.h:
X
X    PATTERN_VALID - pattern is well formed
X    PATTERN_ESC   - pattern has invalid literal escape ('\' at end of
X                    pattern)
X    PATTERN_RANGE - [..] construct has a no end range in a '-' pair
X                    (ie [a-])
X    PATTERN_CLOSE - [..] construct has no end bracket (ie [abc-g )
X    PATTERN_EMPTY - [..] construct is empty (ie [])
X
XThe functions are declared as follows:
X
X    BOOLEAN is_valid_pattern (char *pattern, int *error_type);
X
X    BOOLEAN is_pattern (char *pattern);
END_OF_FILE
  if test 5288 -ne `wc -c <'match.doc'`; then
    echo shar: \"'match.doc'\" unpacked with wrong size!
  fi
  # end of 'match.doc'
fi
if test -f 'match.h' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'match.h'\"
else
  echo shar: Extracting \"'match.h'\" \(3963 characters\)
  sed "s/^X//" >'match.h' <<'END_OF_FILE'
X/*
X EPSHeader
X
X   File: match.h
X   Author: J. Kercheval
X   Created: Sat, 01/05/1991  22:27:18
X*/
X/*
X EPSRevision History
X
X   J. Kercheval  Wed, 02/20/1991  22:28:37  Released to Public Domain
X   J. Kercheval  Sun, 03/10/1991  18:02:56  add is_valid_pattern
X   J. Kercheval  Sun, 03/10/1991  18:25:48  add error_type in is_valid_pattern
X   J. Kercheval  Sun, 03/10/1991  18:47:47  error return from matche()
X   J. Kercheval  Tue, 03/12/1991  22:24:49  Released as V1.1 to Public Domain
X*/
X
X/*
X   Wildcard Pattern Matching
X*/
X
X#ifndef BOOLEAN
X# define BOOLEAN int
X# define TRUE 1
X# define FALSE 0
X#endif
X
X/* match defines */
X#define MATCH_PATTERN  6    /* bad pattern */
X#define MATCH_LITERAL  5    /* match failure on literal match */
X#define MATCH_RANGE    4    /* match failure on [..] construct */
X#define MATCH_ABORT    3    /* premature end of text string */
X#define MATCH_END      2    /* premature end of pattern string */
X#define MATCH_VALID    1    /* valid match */
X
X/* pattern defines */
X#define PATTERN_VALID  0    /* valid pattern */
X#define PATTERN_ESC   -1    /* literal escape at end of pattern */
X#define PATTERN_RANGE -2    /* malformed range in [..] construct */
X#define PATTERN_CLOSE -3    /* no end bracket in [..] construct */
X#define PATTERN_EMPTY -4    /* [..] contstruct is empty */
X
X/*----------------------------------------------------------------------------
X*
X*  Match the pattern PATTERN against the string TEXT;
X*
X*       match() returns TRUE if pattern matches, FALSE otherwise.
X*       matche() returns MATCH_VALID if pattern matches, or an errorcode
X*           as follows otherwise:
X*
X*            MATCH_PATTERN  - bad pattern
X*            MATCH_LITERAL  - match failure on literal mismatch
X*            MATCH_RANGE    - match failure on [..] construct
X*            MATCH_ABORT    - premature end of text string
X*            MATCH_END      - premature end of pattern string
X*            MATCH_VALID    - valid match
X*
X*
X*  A match means the entire string TEXT is used up in matching.
X*
X*  In the pattern string:
X*       `*' matches any sequence of characters (zero or more)
X*       `?' matches any character
X*       [SET] matches any character in the specified set,
X*       [!SET] or [^SET] matches any character not in the specified set.
X*
X*  A set is composed of characters or ranges; a range looks like
X*  character hyphen character (as in 0-9 or A-Z).  [0-9a-zA-Z_] is the
X*  minimal set of characters allowed in the [..] pattern construct.
X*  Other characters are allowed (ie. 8 bit characters) if your system
X*  will support them.
X*
X*  To suppress the special syntactic significance of any of `[]*?!^-\',
X*  and match the character exactly, precede it with a `\'.
X*
X----------------------------------------------------------------------------*/
X
XBOOLEAN match (char *pattern, char *text);
X
Xint     matche(register char *pattern, register char *text);
X
X/*----------------------------------------------------------------------------
X*
X* Return TRUE if PATTERN has any special wildcard characters
X*
X----------------------------------------------------------------------------*/
X
XBOOLEAN is_pattern (char *pattern);
X
X/*----------------------------------------------------------------------------
X*
X* Return TRUE if PATTERN has is a well formed regular expression according
X* to the above syntax
X*
X* error_type is a return code based on the type of pattern error.  Zero is
X* returned in error_type if the pattern is a valid one.  error_type return
X* values are as follows:
X*
X*   PATTERN_VALID - pattern is well formed
X*   PATTERN_ESC   - pattern has invalid escape ('\' at end of pattern)
X*   PATTERN_RANGE - [..] construct has a no end range in a '-' pair (ie [a-])
X*   PATTERN_CLOSE - [..] construct has no end bracket (ie [abc-g )
X*   PATTERN_EMPTY - [..] construct is empty (ie [])
X*
X----------------------------------------------------------------------------*/
X
XBOOLEAN is_valid_pattern (char *pattern, int *error_type);
END_OF_FILE
  if test 3963 -ne `wc -c <'match.h'`; then
    echo shar: \"'match.h'\" unpacked with wrong size!
  fi
  # end of 'match.h'
fi
if test -f 'matchmak' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'matchmak'\"
else
  echo shar: Extracting \"'matchmak'\" \(396 characters\)
  sed "s/^X//" >'matchmak' <<'END_OF_FILE'
X#
X#
X# Makefile for match.c
X#
X# Created 01-20-91 JBK
X# Last Modified 02-13-91 JBK
X#
X#
X
XCC = cl
X
X#
X# This is FLAGS for optimized version
X#FLAGS = /c /AL /G2 /Ox /W4
X
X# 
X# This is FLAGS for optimized version with main
XFLAGS = /D TEST /AL /G2 /Ox /W4
X
X#
X# This is FLAGS for debugging versions with main
X#FLAGS = /D TEST /AL /G2 /Od /W4 /Zi /qc
X
X
Xmatch.exe: match.c match.h
X    $(CC) $(FLAGS) match.c
END_OF_FILE
  if test 396 -ne `wc -c <'matchmak'`; then
    echo shar: \"'matchmak'\" unpacked with wrong size!
  fi
  # end of 'matchmak'
fi
if test -f 'matchtst.bat' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'matchtst.bat'\"
else
  echo shar: Extracting \"'matchtst.bat'\" \(3776 characters\)
  sed "s/^X//" >'matchtst.bat' <<'END_OF_FILE'
X@echo off
X
Xecho .
Xecho Beginning MATCHTST
Xecho .
Xecho Creating file TEST.OUT
Xecho .
X
XREM The following tests should match
X
Xmatch test?         testy                           > test.out
Xmatch test*         test                            >> test.out
Xmatch tes*t         test                            >> test.out
Xmatch *test         test                            >> test.out
Xmatch t*s*t         test                            >> test.out
Xmatch t*s*t         tesseract                       >> test.out
Xmatch t?s?          test                            >> test.out
Xmatch ?s*t          psyot                           >> test.out
Xmatch [a-z]s*t      asset                           >> test.out
Xmatch s[!gh]t       set                             >> test.out
Xmatch t[a-ce]st     test                            >> test.out
Xmatch tea[ea-c]up   teacup                          >> test.out
Xmatch [a-fh-z]*     jack                            >> test.out
Xmatch \i\**         i*hello                         >> test.out
Xmatch [\[-\]]       [                               >> test.out
Xmatch [a-z\\]       \                               >> test.out
Xmatch [a-z%_]       b                               >> test.out
Xmatch [\]]          ]                               >> test.out
Xmatch \i?*          itch                            >> test.out
Xmatch \i?*          it                              >> test.out
Xmatch ?*?*?t        test                            >> test.out
Xmatch ?*?*?*?*      test                            >> test.out
Xmatch *\]*\**\?*\[  ]this*is?atest[                 >> test.out
Xmatch [a-\\]*       at                              >> test.out
Xmatch [a-d\\-/]     c                               >> test.out
Xmatch *t?l*his      bright-land-high-and-his        >> test.out
X
X
XREM The following tests should fail
X
Xmatch test          test                            >> test.out
Xmatch \             test                            >> test.out
Xmatch tes\          test                            >> test.out
Xmatch t*s*t         texxeract                       >> test.out
Xmatch t?st          tst                             >> test.out
Xmatch test?         test                            >> test.out
Xmatch s[!e]t        set                             >> test.out
Xmatch []            ]                               >> test.out
Xmatch [             [                               >> test.out
Xmatch [\[-\]        [                               >> test.out
Xmatch [a            atest                           >> test.out
Xmatch [a-           atest                           >> test.out
Xmatch [a-z          atest                           >> test.out
Xmatch [a-]*         atest                           >> test.out
Xmatch [a-fh-z       jack                            >> test.out
Xmatch [a-fh-z\]     jack                            >> test.out
Xmatch [a-fh-z]      jack                            >> test.out
Xmatch ?*?*?t*?      test                            >> test.out
Xmatch *?????        test                            >> test.out
Xmatch *\            test                            >> test.out
Xmatch [a-e\         atest                           >> test.out
Xmatch [a-\          atest                           >> test.out
Xmatch [a-bd-\       atest                           >> test.out
Xmatch *?*?t?        test                            >> test.out
Xmatch t?            t                               >> test.out
Xmatch ??*t          step                            >> test.out
Xmatch [a-e]*[!t     eel                             >> test.out
Xmatch [\            test                            >> test.out
Xmatch \t[est        test                            >> test.out
Xmatch ?*[]          hello                           >> test.out
X
Xecho MATCHTST Complete
Xecho .
END_OF_FILE
  if test 3776 -ne `wc -c <'matchtst.bat'`; then
    echo shar: \"'matchtst.bat'\" unpacked with wrong size!
  fi
  # end of 'matchtst.bat'
fi
if test -f 'readme.doc' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'readme.doc'\"
else
  echo shar: Extracting \"'readme.doc'\" \(6018 characters\)
  sed "s/^X//" >'readme.doc' <<'END_OF_FILE'
X03-12-91
X
XThis is V1.1 of REGEX Globber.
X
X
X03-12-91
X
XI have made a few changes to the match module which do several
Xthings.  The first change is an increase in bad pattern detection
Xduring a match.  It was possible, in some very unlikely cases, to
Xcook up a pattern which should result in an early bad match but which
Xwould actually cause problems for the parser.  In particular, the
Xsubcase where the literal escape '\' within an open [..] construct at
Xthe end of a pattern would end up with incorrect results.  I
Xproceeded to create some of these patterns, added them to my test
Xbattery and dove straight in.
X
XIn the interim I came across a posting to CompuServe (SMATCH by Stan
XAderman) which attempted to create a completely non-recursive
Ximplementation of match (I am not sure this is possible without
Xexplicitly creating your own stack or it's equivelant, like a binary
Xtree :-{ ).  While the code could not correctly handle multiple '*'
Xcharacters in the pattern, there was a few interesting ideas in the
Xposting.  On some occasions, running match over and over would be
Xcounter productive, especially and in particular when you have a bad
Xpattern.  I have added a fast routine, is_valid_pattern(), to
Xdetermine if the current pattern is well formed which should address
Xthis situation.
X
XOne other idea which I unceremoniously lifted from SMATCH was (in
Xhindsight a pretty obvious feature) the return of a meaningful error
Xcode from both the pattern validity routine and from match() (which I
Xrenamed to matche()).
X
XI also took some time to experiment with some ways to cut some time
Xoff the routine.  Since this is a SH pattern matcher whose intent is
Xprimarily for shell functions, the changes could not be algorithmic
Xchanges which relied on speedup over large input.  The differences in
Xexecution time were not very significant, but I did manage to gain
Xapproximately 5%-10% speedup when I removed the literal escape ('\')
Xparsing and pattern error checking.  For those of you who want to use
Xthis for filename wildcard usage, I would recommend doing this since
Xyou should use is_valid_pattern and is_pattern before going out and
Xfinding filenames and the dos path delimiter defaults to the
Xcharacter used for the literal escape ('\') anyway (Note: I will be
Xsoon be releasing a *IX style file parser in the FINDFILE, FINDNEXT
Xflavor soon to a Public Domain archive near you :-) ).
X
XI also briefly toyed with adding a non-SH regex character '+' to this
Xmodule but removed it again.  It was a performance hit of a few
Xpercent and would be mostly unused in any event.  For those
Xinterested in such a feature, the changes are truly minimal.  The
Xrequired extra work is: 
X
X   1) One case statement each in is_pattern() and is_valid_pattern() 
X   2) One case statement in matche()
X   3) One addition to a while conditional in matche_after_star()
X   4) One addition to an if conditional in matche_after_star()
X
XHint:  The case statements are all "case '+'" and the conditionals
X       have "|| *p == '+' " added to them.
X
XI have also included a file (MATCH.DOC) which describes matches use and
Xbackground as well as a little about regular expressions.
X
X                                jbk
X
X02-24-91
X
XThis is V1.01 of REGEX Globber.
X
X
X02-22-91 Seattle, WA
X
XHmm. Choke. (Foot in mouth). After griping about buggy routines and
Xliterally seconds after posting this code the first time,  I received
Xa wonderful new test evaluation tool which allows you to perform
Xcoverage analysis during testing.  Sure enough I found that about
X25% of the paths in the program were never traversed in my current
Xtest battery.  After swallowing my (overly large) pride and coming
Xup with a test battery which covered the entire path of the program
XI found a couple of minor logic bugs involving literal escapes (\)
Xwithin other patterns (ie [..] and * sequences).  I have repackaged
Xthese routines and included also the makefile I use and the test
Xbattery I use to make things a bit easier.
X
X                                jbk
X
X02-20-91 Seattle, WA
X
XHere is a *IX wildcard globber I butchered, hacked and cajoled together
Xafter seeing and hearing about and becoming disgusted with several similar
Xroutines which had one or more of the following attributes:  slow, buggy,
Xrequired large levels of recursion on matches, required grotesque levels
Xof recursion on failing matches using '*', full of caveats about usability
Xor copyrights.
X
XI submit this without copyright and with the clear understanding that
Xthis code may be used by anyone, for any reason, with any modifications
Xand without any guarantees, warrantee or statements of usability of any
Xsort.
X
XHaving gotten those cow chips out of the way, these routines are fairly
Xwell tested and reasonably fast.  I have made an effort to fail on all
Xbad patterns and to quickly determine failing '*' patterns.  This parser
Xwill also do quite a bit of the '*' matching via quick linear loops versus
Xthe standard blind recursive descent.
X
XThis parser has been submitted to profilers at various stages of development
Xand has come through quite well.  If the last millisecond is important to
Xyou then some time can be shaved by using stack allocated variables in
Xplace of many of the pointer follows (which may be done fairly often) found
Xin regex_match and regex_match_after_star (ie *p, *t).
X
XNo attempt is made to provide general [pat,pat] comparisons.  The specific
Xsubcases supplied by these routines is [pat,text] which is sufficient
Xfor the large majority of cases (should you care).
X
XSince regex_match may return one of three different values depending upon
Xthe pattern and text I have made a simple shell for convenience (match()).
XAlso included is an is_pattern routine to quickly check a potential pattern
Xfor regex special characters.  I even placed this all in a header file for
Xyou lazy folks!
X
XHaving said all that, here is my own reinvention of the wheel.  Please
Xenjoy it's use and I hope it is of some help to those with need ....
X
X
X                                jbk
X
END_OF_FILE
  if test 6018 -ne `wc -c <'readme.doc'`; then
    echo shar: \"'readme.doc'\" unpacked with wrong size!
  fi
  # end of 'readme.doc'
fi
echo shar: End of archive 1 \(of 1\).
cp /dev/null ark1isdone
MISSING=""
for I in 1 ; do
    if test ! -f ark${I}isdone ; then
	MISSING="${MISSING} ${I}"
    fi
done
if test "${MISSING}" = "" ; then
    echo You have the archive.
    rm -f ark[1-9]isdone
else
    echo You still must unpack the following archives:
    echo "        " ${MISSING}
fi
exit 0
exit 0 # Just in case...
-- 
Kent Landfield                   INTERNET: kent@sparky.IMD.Sterling.COM
Sterling Software, IMD           UUCP:     uunet!sparky!kent
Phone:    (402) 291-8300         FAX:      (402) 291-4362
Please send comp.sources.misc-related mail to kent@uunet.uu.net.