[comp.sources.wanted] Pattern matching

mhb@hcx.uucp (MARK H BOTNER) (05/18/89)

Can anybody help me with pattern matching?  I'm trying to match
a string against a pattern such as '*' or '*.*' in the same manner that
the 'ls' command uses.  I've tried regex and regexp and I've even downloaded 
regex.Z from uunet.uu.net, but none of these work like I want.  Actually
I probably only need the wildcard characters '*' and '?', but it wouldn't
hurt to have the full expression syntax that the 'ls' command uses. Thanks!
 
P.S. if you have any answer or a subroutine for me, please mail it to me.
My address is:

     mhb@cseg.uucp

gwyn@smoke.BRL.MIL (Doug Gwyn) (05/18/89)

In article <2414@cveg.uucp> mhb@hcx.uucp (MARK H BOTNER) writes:
-Can anybody help me with pattern matching?  I'm trying to match
-a string against a pattern such as '*' or '*.*' in the same manner that
-the 'ls' command uses.

That should be extremely easy, as "ls" does no pattern matching.

-I've tried regex and regexp and I've even downloaded regex.Z from
-uunet.uu.net, but none of these work like I want.

Gee, they ought to.

abcscnge@csuna.csun.edu (Scott "The Pseudo-Hacker" Neugroschl) (05/18/89)

In article <2414@cveg.uucp> mhb@hcx.uucp (MARK H BOTNER) writes:
}
}Can anybody help me with pattern matching?  I'm trying to match
}a string against a pattern such as '*' or '*.*' in the same manner that
}the 'ls' command uses.  I've tried regex and regexp and I've even downloaded 
}regex.Z from uunet.uu.net, but none of these work like I want.  Actually
}I probably only need the wildcard characters '*' and '?', but it wouldn't
}hurt to have the full expression syntax that the 'ls' command uses. Thanks!
} 

Your problem is that reg. exprs and the shell wildcarding are different.

Regular expressions:	.	-- match any single character
			*	-- 0 or more of the preceding regex.

Shell Wildcarding	?	-- match any single character
			*	-- match 0 or more characters.

The wildcarding that "ls uses" is not done by ls.  It is performed by
the shell prior to executing the ls command.  If I were in your shoes,
I would take the shell wildcard, and translate it to regex form:

	? --> .
	* --> .*

and then use the regex stuff.

-- 
Scott "The Pseudo-Hacker" Neugroschl
UUCP:  ...!sm.unisys.com!csun!csuna.csun.edu!abcscnge
-- Beat me, Whip me, make me code in Ada
-- Disclaimers?  We don't need no stinking disclaimers!!!

pokey@well.UUCP (Jef Poskanzer) (05/19/89)

In the referenced message, mhb@hcx.uucp (MARK H BOTNER) wrote:
}Can anybody help me with pattern matching?
}P.S. if you have any answer or a subroutine for me, please mail it to me.
}My address is:
}     mhb@cseg.uucp

That address is useless, so I'm posting this.  This is free software, no
copyright or trade secret restrictions whatsoever.
---
Jef

            Jef Poskanzer   jef@helios.ee.lbl.gov   ...well!pokey

/*
** Simple pattern matcher.  Does ?, *, and [], much like various shells.
**
** by Jef Poskanzer
*/

amatch( s, p )
register char *s, *p;
    {
    for ( ; *p != '\0'; p++, s++ )
	{
	if ( *p == '*' )
	    {
	    if ( *++p == '\0' )
		return 1;
	    for ( ; *s != '\0'; s++ )
		if ( amatch( s, p ) )
		    return 1;
	    return 0;
	    }
	if ( *s == '\0' )
	    return 0;
	if ( *p == '?' )
	    continue;
	if ( *p == '[' )
	    {
	    int negcc = 0, ccmatch = 0;
	    char prevc;
	    if ( *++p == '!' )
		{
		negcc = 1;
		p++;
		}
	    for ( ; *p != ']'; p++ )
		{
		if ( *p == '\0' )
		    {
		    fprintf( stderr, "amatch: missing ']'\n" );
		    return 0;
		    }
		if ( *p == '-' )
		    {
		    if ( prevc <= *s && *++p >= *s )
			ccmatch = 1;
		    }
		else if ( *p == *s )
		    ccmatch = 1;
		prevc = *p;
		}
	    if ( ( ccmatch && ! negcc ) || ( negcc && ! ccmatch ) )
		continue;
	    return 0;
	    }
	if ( *p != *s )
	    return 0;
	}
    return *s == '\0';
    }

doug@xdos.UUCP (Doug Merritt) (05/19/89)

In article <10288@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <2414@cveg.uucp> mhb@hcx.uucp (MARK H BOTNER) writes:
>>Can anybody help me with pattern matching?  I'm trying to match
>>a string against a pattern such as '*' or '*.*' in the same manner that
>>the 'ls' command uses.

>That should be extremely easy, as "ls" does no pattern matching.

Nitpick. Where's your ":-)" ?
Substitute "sh" for "ls". Besides, maybe he's using MS-DOS, in which
case it might really be "ls" that's doing the matching he's used to.

>>I've tried regex and regexp and I've even downloaded regex.Z from
>>uunet.uu.net, but none of these work like I want.

>Gee, they ought to.

Not really...as I recall, they implement ed/ex/vi/sed-syntax patterns
such as "." and ".*", where he wants "?" and "*".

Though you'd think it'd be easy to modify that regexp.Z source to use
the desired syntax.

I have a C routine that does pattern matching, but it, too, is
not strictly compatible with "sh" so that probably wouldn't help.
	Doug
-- 
Doug Merritt		{pyramid,apple}!xdos!doug
Member, Crusaders for a Better Tomorrow		Professional Wildeyed Visionary

koblas@mips.COM (David Koblas) (05/19/89)

In article <11733@well.UUCP> Jef Poskanzer <jef@helios.ee.lbl.gov> writes:
>In the referenced message, mhb@hcx.uucp (MARK H BOTNER) wrote:
>}Can anybody help me with pattern matching?
>}P.S. if you have any answer or a subroutine for me, please mail it to me.
>}My address is:
>}     mhb@cseg.uucp
>
>That address is useless, so I'm posting this.  This is free software, no
>copyright or trade secret restrictions whatsoever.

Similar to Jef Posaknzer's but understands '{' ... '}' syntax.

#! /bin/sh
# This is a shell archive.  Remove anything before this line, then feed it
# into a shell via "sh file" or similar.  To overwrite existing files,
# type "sh file -c".
# The tool that generated this appeared in the comp.sources.unix newsgroup;
# send mail to comp-sources-unix@uunet.uu.net if you want that tool.
# If this archive is complete, you will see the following message at the end:
#		"End of shell archive."
# Contents:  glob.c
# Wrapped by koblas@yoyodyne.mips.com on Fri May 19 09:04:13 1989
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'glob.c' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'glob.c'\"
else
echo shar: Extracting \"'glob.c'\" \(3831 characters\)
sed "s/^X//" >'glob.c' <<'END_OF_FILE'
X/*
X *  input: "str" string will attempted to be matched
X *
X *         "pattern" string with wildcards that will match against "str".
X *
X *   wild:
X *			'*'          = match 0 or more occurances of anything
X *			"[abc]"  	 = match anyof "abc" (ranges supported)
X *		    "{xx,yy,zz}" = match anyof "xx", "yy", or "zz"
X *          '?'          = match any character
X */
X
X#define	FALSE	0
X#define	TRUE	1
X
X#ifdef TEST
X
X#define TESTGLOB(str1,str2) \
X	printf("%s %s = %s\n",str1,str2,glob(str1,str2)?"TRUE":"FALSE")
X
Xmain()
X{
X	TESTGLOB("abcdefg","abcdefg");
X	TESTGLOB("abcdefg","a?cd?fg");
X	TESTGLOB("abcdefg","ab[cde]defg");
X	TESTGLOB("abcdefg","ab[a-z]defg");
X	TESTGLOB("abcdefg","ab[a-z]defg");
X	TESTGLOB("ab]defg","ab[a]c]defg");
X	TESTGLOB("ab]defg","ab[a\\]c]defg");
X	TESTGLOB("abcdefg","ab*fg");
X	TESTGLOB("./bc/def/gh/ij","*de*");
X	TESTGLOB("./der/den/deq/der/","*deq*");
X	TESTGLOB("./bc/def/gh/ij","*ij");
X	TESTGLOB("./bc/def/gh/ij","./*");
X	TESTGLOB("abcdef","*def");
X	TESTGLOB("abcdef","*abcdef");
X	TESTGLOB("abcdef","abc*");
X	TESTGLOB("abcdef","abcdef*");
X	TESTGLOB("abcdef","*?*{xxx,,yy}");
X	TESTGLOB("abcdef","abcde{f}");
X	TESTGLOB("abcdef","abcdef{xxx,,yyy}");
X	TESTGLOB("abcdef","abc{def,qwrx}");
X	TESTGLOB("abcdef","abc{ab,def,qwrx}");
X	TESTGLOB("abcdef","{naqrwer,fuwnwer,as,abc,a}{ab,def,qwrx}");
X}
X
X#endif
X
Xglob(str,pattern)
Xchar	*str,*pattern;
X{
X	char	c,*cp;
X	int		done=FALSE,ret_code,ok;
X
X	while ((*pattern != '\0') && (!done) && (((*str=='\0') &&
X			((*pattern=='{') || (*pattern=='*'))) || (*str!='\0'))) {
X		switch (*pattern) {
X		case '\\':
X			pattern++;
X			if (*pattern != '\0')
X				pattern++;
X			break;
X		case '*':
X			pattern++;
X			ret_code=FALSE;
X			while ((*str != '\0') && (!(ret_code=glob(str++,pattern))));
X			if (ret_code) {
X				while (*str != '\0') str++;
X				while (*pattern != '\0') pattern++;
X			}
X			break;
X		case '[':
X			pattern++;
Xrepeat:
X			if ((*pattern == '\0') || (*pattern == ']')) {
X				done=TRUE;
X				break;
X			} 
X			if (*pattern == '\\') {
X				pattern++;
X				if (*pattern == '\0') {
X					done=TRUE;
X					break;
X				}
X			}
X			if (*(pattern+1) == '-') {
X				c = *pattern;
X				pattern+=2;
X				if (*pattern == ']') {
X					done=TRUE;
X					break;
X				}
X				if (*pattern == '\\') {
X					pattern++;
X					if (*pattern == '\0') {
X						done=TRUE;
X						break;
X					}
X				}
X				if ((*str < c) || (*str > *pattern)) {
X					pattern++;
X					goto repeat;
X				} 
X			} else if (*pattern != *str) {
X				pattern++;
X				goto repeat;
X			}
X			pattern++;
X			while ((*pattern != ']') && (*pattern != '\0')) {
X				if ((*pattern == '\\') && (*(pattern+1) != '\0'))
X					pattern++;
X				pattern++;
X			}
X			if (*pattern != '\0') {
X				pattern++;
X				str++;
X			}
X			break;
X		case '?':
X			pattern++;
X			str++;
X			break;
X		case '{':	/*}*/
X			pattern++;
X/*{*/		while ((*pattern != '}') && (*pattern!='\0')) {
X				cp = str;
X				ok = TRUE;
X				while (ok && (*cp != '\0') && (*pattern!='\0') &&
X/*{*/				   (*pattern!=',') && (*pattern!='}')) {
X					if (*pattern == '\\')
X						pattern++;
X					ok=(*pattern == *cp);
X					cp++;
X					pattern++;
X				}
X				if (*pattern=='\0') {
X					ok=FALSE;
X					done=TRUE;
X					break;
X				} else if (ok) {
X					str=cp;
X/*{*/				while ((*pattern!='}') && (*pattern!='\0')) {
X						pattern++;
X						if (*pattern=='\\') {
X							pattern++;
X/*{*/						if (*pattern=='}')
X								pattern++;
X						}
X					}
X				} else {
X/*{*/				while ((*pattern!='}') && (*pattern!=',') &&
X						   (*pattern!='\0')) {
X						pattern++;
X						if (*pattern=='\\') {
X							pattern++;
X/*{*/						if ((*pattern=='}') || (*pattern==','))
X								pattern++;
X						}
X					}
X				}
X				if (*pattern!='\0')
X					pattern++;
X			}
X			break;
X		default:
X			if (*str == *pattern) {
X				str++;
X				pattern++;
X			} else {
X				done=TRUE;
X			}
X		}
X	}
X	while (*pattern == '*') pattern++;
X	return ((*str == '\0') && (*pattern == '\0'));
X}
END_OF_FILE
if test 3831 -ne `wc -c <'glob.c'`; then
    echo shar: \"'glob.c'\" unpacked with wrong size!
fi
# end of 'glob.c'
fi
echo shar: End of shell archive.
exit 0
-- 
name : David Koblas                  uucp  : {ames,decwrl}!mips!koblas 
place: MIPS Computers Systems        domain: koblas@mips.com
quote: "Time has little to do with infinity and jelly donuts."

faulkner@jmullins.harvard.edu (Don Faulkner) (05/20/89)

Any one thinking about regular expression handlers should really
take a look at the one from GNU EMACS --- it is VERY powerful, fast,
and very flexible (you can set it up with a translation table
to ignore case ...)  Works very well stand alone --- need only two
files: "regex.c" and "regex.h"

Best part:  syntax and capabilities are identical to those from GNU
EMACS (golly!)

--

 Don Faulkner                                       
 Building 1, Room 803
 Harvard University, School of Public Health
 665 Huntington Avenue
 Boston, MA  02115

 ARPA:      faulkner%jmullins@harvard.harvard.edu                
 BITNET:    faulkner@harvard
 Telephone: (617) 732-2297

ksb@j.cc.purdue.edu (Kevin Braunsdorf) (05/20/89)

In <mhb@hcx.uucp> (MARK H BOTNER) wrote:
>Can anybody help me with pattern matching?

In <11733@well.UUCP> jef@helios.ee.lbl.gov (Jef Poskanzer) wrote:
<some C source>

In <20021@yoyodyne.mips.com.mips.COM> koblas@mips.COM (David Koblas) wrote:
>Similar to Jef Posaknzer's but understands '{' ... '}' syntax.
<more C sourse>


Sadly neither of these two routines really emulate the shell filename
globbing.  Here is a list of (some) failed tests.

# pat		file		ans	comment
????             /foo             no   ? doesn't match a /
?*               /foo             no   ? doesn't match a /
/???????         /usr/foo         no   ? doesn't match a /
?                /                no   ? doesn't match a /
.?               ./               no   ? doesn't match a non-leading /
?                .                no   ? doesn't match a leading .
*                /                no   * doesn't match a /
*?               /foo             no   * doesn't match a /
/u/*             /u/a/foo         no   * doesn't match a /
/u*              /u/foo           no   * doesn't match a /
*                .                no   * doesn't match a .
[/,]foo          /foo             no   ranges do not match /
*                .ksb             no   * doesn't match leading .
/*b              /.ksb            no   * doesn't match leading .
?ksb             .ksb             no   ? doesn't match leading .
/?ksb            /.ksb            no   ? doesn't match leading .

# and one questionable rule....
//ksb             /ksb            yes  any # glob /'s matches a /
/a//b             /a/b            yes  any # glob /'s matches a /


For exmple:
	glob("/bin/mail", "*")
would return true under both of these; (IMHO) this is wrong.

[ We need at least:
	glob("/bin/mail", "/*/*")
  to really match this file.
]


BTW: (noting the ! lines below,) does anyone know why the shell does this?

	$ touch ga g. .a
	$ ls -a
	./	../	.a	g.	ga
	$ echo g?
	g. ga
	$ echo "g"?
	g. ga
	$ echo .?
	.. .a
!	$ echo "."?
!	.?
	$ exit

The `?' is not quoted in the last case... ?

I will post my glob routines (which I think work) soon.
--
"Time, time, time, see what's become of me..."
kayessbee, Kevin Braunsdorf, ksb@j.cc.purdue.edu, pur-ee!ksb, purdue!ksb

lance@hermix.UUCP (Lance Ellinghouse) (05/20/89)

In article <10288@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
> In article <2414@cveg.uucp> mhb@hcx.uucp (MARK H BOTNER) writes:
> -Can anybody help me with pattern matching?  I'm trying to match
> -a string against a pattern such as '*' or '*.*' in the same manner that
> -the 'ls' command uses.
> 
> That should be extremely easy, as "ls" does no pattern matching.
> 
This is true. The shell does the expansion of wild card characters.
the program just gets a list.

-- 
Lance Ellinghouse
Mark V Systems, Ltd.
UUCP: ...!hermix!lance
ARPA: ucla-an!hermix!lance@ee.UCLA.EDU

dws@cseg.uucp (David W. Summers) (05/20/89)

In article <11733@well.UUCP>, pokey@well.UUCP (Jef Poskanzer) writes:
> In the referenced message, mhb@hcx.uucp (MARK H BOTNER) wrote:
> }My address is:
> }     mhb@cseg.uucp
> 
> That address is useless, so I'm posting this.  This is free software, no
> copyright or trade secret restrictions whatsoever.
> ---
> Jef
> 
>             Jef Poskanzer   jef@helios.ee.lbl.gov   ...well!pokey

Thank you for replying to Mark Botner.  He and I are working on some routines
to allow a BBS to do file matching on user entered patterns.

My question is:  Why is this addresses useless?  I noticed that you had the
the same return address in your 'From: ' header (specifically,
"pokey@well.uucp").  This is the same *type* of address as ours.  Even the
"well!pokey" is the same type of address, although your other address in your
signature is obviously a domain name type which is different.  I don't think
our address is really useless as I can mail to ANYONE with that type of 
address (as long as the routing maps are correct).

- David Summers
(dws@cseg.uucp) - The "curious about E-Mail" and "wanting to get it right" guy.

khera@juliet.cs.duke.edu (Vick Khera) (05/20/89)

In article <2428@cveg.uucp> dws@cseg.uucp (David W. Summers) writes:
>
>In article <11733@well.UUCP>, pokey@well.UUCP (Jef Poskanzer) writes:
>> In the referenced message, mhb@hcx.uucp (MARK H BOTNER) wrote:
>> }My address is:
>> }     mhb@cseg.uucp
>> 
>> That address is useless, so I'm posting this.  This is free software, no
>> ...
>> Jef
> ...
>My question is:  Why is this addresses useless?  I noticed that you had the
>...  I don't think
>our address is really useless as I can mail to ANYONE with that type of 
>address (as long as the routing maps are correct).
>
>- David Summers

well, if you try sending a reply to somebody with mail address of
linus!user, then where will it go? will it go to some machine in princeton
or at the university of maryland?  with a uucp address of just the machine
name, there will likely be duplicates of names such as linus, juliet, and
other literary names. for instance, i am on juliet.cs.duke.edu, and to get
uucp traffic here, it needs to be qualified with something like
mcnc!duke!juliet!khera since mcnc is a well known system.  mail to just
juliet!khera could end up on a different machine depending on where it is
mailed from. 

							vick.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
ARPA:	khera@cs.duke.edu		Department of Computer Science
CSNET:	khera@duke			Duke University
UUCP:	{mcnc,decvax}!duke!khera	Durham, NC 27706

880716a@aucs.uucp (Dave Astels) (05/29/90)

I am about to start developing pattern matching software using C++.
I have done something similar to what I plan a while ago using LISP. I
would like an algorithm that would be more efficient than the heavily
recursive LISP implementation.  It is to be used in a real-time situation,
to analyse user input.

Input is seen as a list of strings, each being one word.  The patterns are
lists of either strings (requiring a literal match) or sub-lists (commands).
Commands include:
	match any word (0 or one occurance)
	match any words (0 or more) and (1 or more)
	match the 3 cases above, but to match, input must be in a specified list.
	etc.

I would like any information, including pointers to references.
Example code would be appreciated as well.



-- 
- Dave

Internet: 880716a@AcadiaU.CA
Bitnet:   880716a@Acadia