[comp.lang.c] wildcard matching

Tony.Bielobockie@urchin.fidonet.org (Tony Bielobockie) (12/29/90)

To: Alex Martelli
 
AM> Optimizing this, and in particular eliminating the recursive call,
AM> is, on the other hand, a rather interesting exercise, but one I'd
AM> undertake only if wildcard-matching was proven-by-profiling to be
AM> a bottleneck in my application... which is rather unlikely, I think.
 
I think that the recursive call is unnessicary.  Consider that anything
trailing the '*' character is ignored anyway, at least in MS-DOS, and
OS/2. What need is there for further checking if a '*' character is
encountered?  Does that hold true in the UNIX would too?
 
The following function will return the same results, with the exceptional
case being when the '*' character is not the last character in the 
wildcard
string.  Plus it is a little speedier, due to the fact that no data is
copied from the original string.
 
int wildmatch( char *pat, char *str )
        {
        while ( (*pat != '*') && (*pat != 0) && (*str != 0) )
                {
                if ( (*pat != '?') && (*str != *pat) )
                        {
                        return 0;
                        }
                pat++;
                str++;
                if ( (!*pat && *str) || (!*str && *pat) )
                        {
                        return 0;
                        }
                }
        return( 1 );
        }
 

mat@mole-end.UUCP (Mark A Terribile) (01/01/91)

(On the subject of wildcard patterns whose match is pathalogical in time ...)

> I think that the recursive call is unnessicary.  Consider that anything
> trailing the '*' character is ignored anyway, at least in MS-DOS, and
> OS/2. What need is there for further checking if a '*' character is
> encountered?  Does that hold true in the UNIX would too?

In UNIX, the `*' is NOT limited to matching a whole name component, but
instead any sequence of characters, anywhere.  Thus, one can write

	a*b*c

If you want to guarantee efficient matching, then I think you have to generate
a DFSA.  If you want to come close, you can put some pathology checks in your
code (like `*' after `*').  And there are publicly available regex-like
matchers, most of which will want .* for a closure on an arbitrary character.
Some of them might not be too hard to change.
-- 

 (This man's opinions are his own.)
 From mole-end				Mark Terribile

staff@cadlab.sublink.ORG (Alex Martelli) (01/02/91)

Tony.Bielobockie@urchin.fidonet.org (Tony Bielobockie) writes:

>To: Alex Martelli
> 
>AM> Optimizing this, and in particular eliminating the recursive call,
>AM> is, on the other hand, a rather interesting exercise, but one I'd
>AM> undertake only if wildcard-matching was proven-by-profiling to be
>AM> a bottleneck in my application... which is rather unlikely, I think.
> 
>I think that the recursive call is unnessicary.  Consider that anything
>trailing the '*' character is ignored anyway, at least in MS-DOS, and
>OS/2. What need is there for further checking if a '*' character is
>encountered?  Does that hold true in the UNIX would too?

I know that MSDOS and OS/2 consider that, when typing "a*b" or "a*c",
the user was just being funny, and he _actually_ meant just "a*" in
either case... presumably the extra letter, or letters, having been
typed just to keep his or her fingers in exercise?-)

I'm also happy to relate, though, that _other_ operating systems DO
have a modicum of common sense in their wildcars semantics!
That is, one can use, for example, "a*b" to stand for all files with
names beginning with a and ending with b, with anything or nothing in
between.  This is true at least in Dec's VMS and IBM's VM/SP (CMS), as
well, of course, as in Unix.

If you write your own subroutines for matching with wildcards, you
can of course offer whatever semantics strike your fancy.  I can
hardly believe anybody would care to defend the absurd semantics
of MessyDOS in this respect, though!
-- 
Alex Martelli - CAD.LAB s.p.a., v. Stalingrado 53, Bologna, Italia
Email: (work:) staff@cadlab.sublink.org, (home:) alex@am.sublink.org
Phone: (work:) ++39 (51) 371099, (home:) ++39 (51) 250434; 
Fax: ++39 (51) 366964 (work only), Fidonet: 332/401.3 (home only).

scs@adam.mit.edu (Steve Summit) (01/02/91)

In article <4739.277BA2FB@urchin.fidonet.org> Tony.Bielobockie@urchin.fidonet.org (Tony Bielobockie) writes:
>I think that the recursive call is unnessicary.  Consider that anything
>trailing the '*' character is ignored anyway, at least in MS-DOS, and
>OS/2. What need is there for further checking if a '*' character is
>encountered?  Does that hold true in the UNIX would too?

Oh, dear.  While I don't believe that computer "science" is a
hard enough science that those without formal training in it
should be deprecated, it is nevertheless the case that those
whose only exposure to it is through MS-DOS or other Microsoft
"operating systems" should realize that those "systems" are so
mind-bogglingly behind the state-of-the-art that extrapolation
based on them is bound to be embarrassingly futile.  For a long
time I had assumed that DOS's notion that "anything trailing the
'*' character is ignored anyway" was an outright bug in its
implementation, because most DOS manuals imply that a * matches
anything anywhere, not just at the end.

I'll keep a firm rein on myself and not let this turn into an
all-out DOS flame.  Its wildcard matching "algorithm" is only one
of a bewildering number of appalling misfeatures.  I have never
even tried to figure out exactly how DOS fails to implement
wildcards correctly.  (Obviously whoever implemented them didn't
really know what they were good for, and was limiting himself to
assembly language programming on a spineless processor, where a
correct implementation might have been somewhat more difficult.)

In classic regular expressions, and in every computer system I
have ever used (needless to say this includes Unix), a pattern
can certainly continue past a '*'; the utility of doing so would
only be mysterious to someone with the misfortune to have first
encountered wildcards under a "system" with as trivialized and
impotent an implementation as our beloved DOS.

>The following function will return the same results, with the exceptional
>case being when the '*' character is not the last character in the
>wildcard string.

Please, don't _ever_ implement a wildcard matcher with this
"exception."  It is not an "exceptional case" for the '*' not
to be the last character in the pattern; it is an absolute
requirement, and to leave it out should never have been
contemplated.

                                            Steve Summit
                                            scs@adam.mit.edu

slh@wolf.cs.washington.edu (Scott Heyano) (01/03/91)

In article <579@cadlab.sublink.ORG> staff@cadlab.sublink.ORG (Alex Martelli) writes:
|Tony.Bielobockie@urchin.fidonet.org (Tony Bielobockie) writes:
|
[...]
|>trailing the '*' character is ignored anyway, at least in MS-DOS, and
|>OS/2. What need is there for further checking if a '*' character is
|>encountered?  Does that hold true in the UNIX would too?
|
|I know that MSDOS and OS/2 consider that, when typing "a*b" or "a*c",
|the user was just being funny, and he _actually_ meant just "a*" in
|either case... presumably the extra letter, or letters, having been
|typed just to keep his or her fingers in exercise?-)
|
	This is not true of OS/2.

staff@cadlab.sublink.ORG (Alex Martelli) (01/03/91)

slh@wolf.cs.washington.edu (Scott Heyano) writes:
:In article <579@cadlab.sublink.ORG> staff@cadlab.sublink.ORG (Alex Martelli) writes:
:|Tony.Bielobockie@urchin.fidonet.org (Tony Bielobockie) writes:
:|
:[...]
:|>trailing the '*' character is ignored anyway, at least in MS-DOS, and
:|>OS/2. What need is there for further checking if a '*' character is
:|>encountered?  Does that hold true in the UNIX would too?
:|
:|I know that MSDOS and OS/2 consider that, when typing "a*b" or "a*c",
:|the user was just being funny, and he _actually_ meant just "a*" in
:|either case... presumably the extra letter, or letters, having been
:|typed just to keep his or her fingers in exercise?-)
:|
:	This is not true of OS/2.

Oops, sorry, Scott is right, Tony was wrong, and I was silly for
accepting Tony's "and OS/2" bit at face value instead of checking!  I
*did* check, now, and at least CMD.EXE gets it right - DIR T*P, in
particular, only lists files whose names begin with T ***AND*** END WITH
P.  One more reson to do wildmatch() right, then!
-- 
Alex Martelli - CAD.LAB s.p.a., v. Stalingrado 53, Bologna, Italia
Email: (work:) staff@cadlab.sublink.org, (home:) alex@am.sublink.org
Phone: (work:) ++39 (51) 371099, (home:) ++39 (51) 250434; 
Fax: ++39 (51) 366964 (work only), Fidonet: 332/401.3 (home only).

Tony.Bielobockie@urchin.fidonet.org (Tony Bielobockie) (01/04/91)

Steve Summit writes:

 >Oh, dear.  While I don't believe that computer "science" is a
 >hard enough science that those without formal training in it
 >should be deprecated, it is nevertheless the case that those
 >whose only exposure to it is through MS-DOS or other Microsoft
 >"operating systems" should realize that those "systems" are so
 >mind-bogglingly behind the state-of-the-art that extrapolation
 >based on them is bound to be embarrassingly futile.  For a long
 >time I had assumed that DOS's notion that "anything trailing the
 >'*' character is ignored anyway" was an outright bug in its
 >implementation, because most DOS manuals imply that a * matches
 >anything anywhere, not just at the end.

 The only persons that I discount are those that have reached the level
 of pomposity that you have.  MS-DOS does not implement wildcard
 matching properly (noticed in retrospect), OS/2 does.

 I can see your distaste for MS-DOS, which is more of a file loader
 than an operating system.  What is your beef with OS/2?  Are you
 perhaps just an anti-MicroSoft bigot?

 Finally, why are you so rude?  This is to be a discussion group
 where an exchange of ideas and knowledge takes place.  It is not
 your personal message arena where you are free to insult, and attempt
 to intimidate your fellow man.  Intimidation may have worked for you
 in the past, it however, will not work on me, it only goes further in
 proving what a egotistical jackass that you are.

domo@tsa.co.uk (Dominic Dunlop) (01/04/91)

In article <579@cadlab.sublink.ORG> staff@cadlab.sublink.ORG (Alex Martelli)
writes:
> I know that MSDOS and OS/2 consider that, when typing "a*b" or "a*c",
> the user was just being funny, and he _actually_ meant just "a*" in
> either case... presumably the extra letter, or letters, having been
> typed just to keep his or her fingers in exercise?-)
> 
> I'm also happy to relate, though, that _other_ operating systems DO
> have a modicum of common sense in their wildcars semantics!
> That is, one can use, for example, "a*b" to stand for all files with
> names beginning with a and ending with b, with anything or nothing in
> between.  This is true at least in Dec's VMS and IBM's VM/SP (CMS), as
> well, of course, as in Unix.

For what it's worth (all contributions gratefully received), the
forthcoming POSIX 1003.2 shell and tools standard codifies UNIX-style
filename ``globbing'', including the System V-ish ``*[!c]'' to match, in
this example, any filename not ending in ``c''.  The standard will
specify two functions, glob() and globfree() which can be called from
the C language, ending today's unsatisfactory situation, where one must
either recode globbing or spawn a shell to do it.

How much all this helps users of other operating systems is open to
question...
-- 
Dominic Dunlop

karl@ima.isc.com (Karl Heuer) (01/05/91)

Ins <584@cadlab.sublink.ORG> staff@cadlab.sublink.ORG (Alex Martelli) writes:
>One more reson to do wildmatch() right, then!

But first, check whether it needs to be done at all; and if so, implement it
using the standard interface.  Many libraries, and at least one influential
standard, already contain "int gmatch(char *str, char *pat)".  It handles not
only "*" and "?" but also "[xyz]" and "[!xyz]" (yes, that's a bang, not a
circumflex; glob vs regexp strikes again).

Karl W. Z. Heuer (karl@ima.isc.com or uunet!ima!karl), The Walking Lint

mcdonald@aries.scs.uiuc.edu (Doug McDonald) (01/05/91)

>In article <579@cadlab.sublink.ORG> staff@cadlab.sublink.ORG (Alex Martelli)
>writes:
>> I know that MSDOS and OS/2 consider that, when typing "a*b" or "a*c",
>> the user was just being funny, and he _actually_ meant just "a*" in
>> either case... presumably the extra letter, or letters, having been
>> typed just to keep his or her fingers in exercise?-)
>> 
>> I'm also happy to relate, though, that _other_ operating systems DO
>> have a modicum of common sense in their wildcars semantics!
>> That is, one can use, for example, "a*b" to stand for all files with
>> names beginning with a and ending with b, with anything or nothing in
>> between.  This is true at least in Dec's VMS and IBM's VM/SP (CMS), as
>> well, of course, as in Unix.
>

Apparently the above author does not understand MS-DOS wildcards. MS-DOS
 DOES have more than a "modicum" of common sense in its wild card system.
It is very simple:

A filename has the form abcdefgh.ijk , i.e. one to 8 characters followed
by an optional dot and one to three characters. Wildcards operate
separately in the first and second parts. A "*" is a wildcard that
means "any character in that and any succeeding position". A "?" is a
wildcard that means "any character in just that one position". That
is what the manual says, and that is what MS-DOS itself does. 

For example, a?b means a file with three letters, no extension, and 
a and b in the first and third positions. a?b*.* means a file with a and
b in the first and third places, anything in the second place, and anything
at all in the fourth through 8th positions, and any extension, or no 
extension at all. 

This of course has nothing to do with the C language, except that
some people apparently can't write C programs for MS-DOS that do 
wildcards as they are defined for it.

Doug McDonald

scs@adam.mit.edu (Steve Summit) (01/05/91)

Flames, of course, beget flames, and I expected someone might
take issue with my recent bout of gratuitous DOS-bashing.
However, someone apparently thought I was flaming _him_, and
ragged on me rather publicly, so I suppose I sorta have to
respond publicly, too.  Sorry about this.

From article <5012.2783BE37@urchin.fidonet.org>:
> ...why are you so rude?  This is to be a discussion group
> where an exchange of ideas and knowledge takes place.  It is not
> your personal message arena where you are free to insult, and attempt
> to intimidate your fellow man.

Yes, it is a discussion group, and a fairly sophisticated and
literate one at that, so sometimes one has to read carefully to
avoid misunderstanding.  What I wrote was

> While I don't believe that computer "science" is a
> hard enough science that those without formal training in it
> should be deprecated, it is nevertheless the case that those
> whose only exposure to it is through MS-DOS or other Microsoft
> "operating systems" should realize that those "systems" are so
> mind-bogglingly behind the state-of-the-art that extrapolation
> based on them is bound to be embarrassingly futile.

That sentence is a bit on the long side, but it is not meant to
disparage anyone (outside of Microsoft).  I realized when I wrote
it that it might be read as a put-down of "those without formal
training," and I see now that I should have worried a bit more.

I _don't_ look down on people who use computers without having
had formal training.  In fact, the quotes around "science" are a
clue to my real attitude: since formal training in computer
"science" is no prerequisite for success, it is the "science"
which I look down on.  All I was saying was that if someone's
entire exposure to computers is through a PC, there is a lot that
person is missing out on.  This is not to demean that person; I
would be naive if I did not realize that there are orders of
magnitude more people whose only exposure to computers is through
PC's than there are people who are familiar with the "real"
computers I happen to prefer.

("Embarrassingly futile" was admittedly unnecessarily harsh.
However: if you try to extrapolate, based on MS-DOS, how a "real"
operating system might do something, you can easily embarrass
yourself, such as by suggesting "simplified" wildcard matching
algorithms patterned after DOS's simpleminded semantics.  I'm not
disparaging someone who makes such a mistake; the words
"embarrassingly futile" were, believe it or not, intended to be
basically sympathetic.)

> MS-DOS does not implement wildcard
> matching properly (noticed in retrospect), OS/2 does.
> I can see your distaste for MS-DOS, which is more of a file loader
> than an operating system.  What is your beef with OS/2?

I have never used OS/2 and know little about it.  I implied that
it shared DOS's crippled wildcard algorithm only because the
complainant originally said it did (in <4739.277BA2FB@urchin.fidonet.org>:
"Consider that anything trailing the '*' character is ignored
anyway, at least in MS-DOS, and OS/2").

> Are you perhaps just an anti-MicroSoft bigot?

That much I'll admit to.  I'd expound on this, but it'd be more
flamage having nothing to do with C.

To everyone else: I have been appalled at how easily silly,
pointless flames seem to be cropping up lately, in all manner of
newsgroups, and I apologize for having been part of one here.

                                            Steve Summit
                                            scs@adam.mit.edu