[comp.editors] pattern matches

day@grand.UUCP (Dave Yost) (07/02/88)

I'm working on a spec for industrial-strength
(regex in a drum?) pattern matching
expressions way beyond what egrep can do, and
I'd like to hear about any editors or other
programs out there that have pattern matching
search and replacement capabilities beyond
the regular expressions found in the various
unix programs.

Thanks.

 --dave

liberte@m.cs.uiuc.edu (07/02/88)

I would like to have a more powerful negation operator.
Currently, regexps allow negation of a signal set of chars.
I would like negation of a whole subexpression.

e.g. abc(def)^ would match a string that starts with
abc but is NOT followed by def.

I think "^" would be unambiguous as a postfix operator.
"^" as beginning-of-line would only be
valid as the first char of a subexpression.
There is possible confusion with [^abc] as negation of a char set,
but such use could be phased out since [abc]^ is equivalent.

Since negation of a regular expression is a regular expression
and all you have to do to a finite state machine is flip
accepting and non-accepting states, this should be trivial.
Why wasnt it done before?

Dan LaLiberte
uiucdcs!liberte
liberte@cs.uiuc.edu
liberte%a.cs.uiuc.edu@uiucvmd.bitnet

smking@lion.waterloo.edu (Scott M. King) (07/04/88)

In article <37200009@m.cs.uiuc.edu> liberte@m.cs.uiuc.edu writes:
>e.g. abc(def)^ would match a string that starts with
>abc but is NOT followed by def.

If I have the pattern "abc(d.*e)^ghi", and the line "abcdxxxghi", then
what gets matched, if anything? The abc part is easy. Then, the pattern
matcher tries to match a d.*e but it doesn't find one.
That means the (d.*e)^ is successful, but since it didn't find anything,
it has to start looking for the ghi at "dxxxghi", and doesn't find it.
Thus the match fails.

So, your ^ operator must act like a *test* for anything but the pattern
rather than a *match* for anything but the pattern.
Ie., test for the pattern, but don't eat up any characters while testing.
Note that this behaviour would mean that with the line "abcecf",
the pattern "ab(cd)^c." would match "abce", not "abcecf".

>There is possible confusion with [^abc] as negation of a char set,
>but such use could be phased out since [abc]^ is equivalent.

But, [abc]^ would really just test for anything but an a, b or c.
It wouldn't also match it.

Of course, being able to test for a subpattern without actually
matching it would probably be handy on occasion. However,
it would be much better to use another special character
like ! to indicate that the pattern should only be tested.
"(rexp)" would match "rexp", and "(rexp)!" would test for "rexp".
I guess you could still use your ^ operator in conjunction with this
to test for anything but a pattern. However, using (rexp)^ (without !)
would either be undefined, or mean the same thing as (rexp)!^
--

Scott M. King

stuart@cs.rochester.edu (Stuart Friedberg) (07/05/88)

In article <7618@watdragon.waterloo.edu> Scott M. King writes:
>If I have the pattern "abc(d.*e)^ghi", and the line "abcdxxxghi", then
>what gets matched, if anything? The abc part is easy. Then, the pattern
>matcher tries to match a d.*e but it doesn't find one.
>That means the (d.*e)^ is successful, but since it didn't find anything,
>it has to start looking for the ghi at "dxxxghi", and doesn't find it.
>Thus the match fails.

Perhaps the original suggestion wasn't absolutely clear.  I can't
decide between a couple of interpretations, but I don't agree with King
when he says "That means the (d.*e)^ is successful, but since it didn't
find anything, ..."  The pattern DID find something.  It found a string
that was not a 'd' followed by anything followed by an 'e'.  The hard
part is deciding just *which* string it matched.

In at least one interpretation, this pattern *does* match the input,
because you can, in fact, find an "abc", followed by something that
is not a "(d.*e)", followed by a "ghi".

  abcdxxxghi
  ^^^		matched by pattern "abc"
     ^^^^	matched by pattern "(d.*e)^"
         ^^^	matched by pattern "ghi"

However, the conventional (but certainly not necessary) interpretation
of regular expressions, at least in the Unix tool world, is that they
match as much as possible.  Under that interpretation, the pattern
doesn't completely match the input, because the "(d.*e)^" matches
too much, leaving nothing for the "ghi".

  abcdxxxghi
  ^^^		matched by pattern "abc"
     ^^^^^^^	matched by pattern "(d.*e)^"

This is a problem.  Accepting anything *but* a pattern in almost all
cases accepts too much, under the usual interpretation.  Negated
patterns should *not* match the longest possible string for the reason
given above, but neither should they match the shortest possible
string, because "(d.*e)^" would match any two character string not
equal to "de".

Well, we could treat negated patterns "(d.*e)^" by looking for the next
positive pattern "ghi", then testing the stuff we skip over for the
undesirable pattern.  That seems reasonable at first, but what do we
do for input like "abcdxxxghixxxghi"?  Do we have:

  abcdxxxghixxxghi
  ^^^			matched by "abc"
     ^^^^		matched by "(d.*e)^"
         ^^^		matched by "ghi"

or, perhaps:

  abcdxxxghixxxghi
  ^^^			matched by "abc"
     ^^^^^^^^^^		matched by "(d.*e)^"
               ^^^	matched by "ghi"

Do we find the first occurrence of "ghi"?  The last?  The first/last
such that the preceding negated pattern succeeds?  What do we do when a
negated pattern is followed by another negated pattern?  This starts to
smack too much of non-determinism, which is fine in its place, but
conventions enforcing determinism are quite useful in practice.

Having said all this, I think that negated patterns could be quite
useful.  More than one, I have wanted to search for occurrences of
"foo(XXX)", where "XXX" is not "bar".  The proper *practical*
convention for negated patterns should be driven how people intend to
use them, and I suggest that people submit proposed applications.  If
we can't come up with more complicated examples, perhaps the shortest
span such that the following pattern succeeds, with adjacent negated
patterns illegal, is satisfactory.

Stu Friedberg  stuart@cs.rochester.edu  {ames,cmcl2,rutgers}!rochester!edu

blandy@marduk.cs.cornell.edu (Jim Blandy) (07/05/88)

I think you guys are missing something important about regular expressions
and the way they're matched;  The rule is as follows:

	1) If it is AT ALL possible for the pattern to match, it will.
	   This means that a pattern like e*eeee will always match a
	   string consisting of at least four e's; there's no danger of
	   the * operator "overstepping its bounds."
	2) If one is searching for a pattern match in a large string,
	   (like, for example, a text buffer), then we are faced with
	   the question "HOW MUCH did it match?"  (This question is
	   especially important in a search-and-replace situation;
	   what are we to replace?)  That's where the "longest possible"
	   rule comes into play.

So the "longest possible" rule will never mess up a match that would 
otherwise be successful...

I do think negation is well-defined; using the proposed syntax, (pat)^
matches any string pat would not.  Since the set of strings matched by
pat is (presumably) well-defined, the set for (pat)^ is too.

About the claim that "negation should be trivial, since it only entails
flipping the accept/reject-ingness of the states in the automaton...":

If you're using a DFA (deterministic finite automaton) to match your
patterns, this is true.  No problem.  But that trick does not work
for NFA's (non-deterministic FA).  An NFA could be in any one
of several states after a certain input, some accepting, some rejecting;
the rule is, if any of the possible states are accepting states, the
NFA accepts its input, i.e. matches the pattern.  Okay, so now flip
all your states;  the NFA could be in some rejecting states and some
accepting states; you're not excluding everything you matched before.
You want to say "anything this matches, I don't," and that's not what happens.

Well, so what?  Well, most pattern-matching functions in editors simulate
an NFA.  All (I think?) the Unix editors do this.  So negation is not
a trivial thing.  I can't think of any simple way to do it off the bat,
but that's just me; maybe someone else?
--
Jim Blandy - blandy@crnlcs.bitnet
"insects were insects when man was just a burbling whatisit."  - archie

blm@cxsea.UUCP (Brian Matthews) (07/05/88)

Stuart Friedberg (stuart@cs.rochester.edu) writes:

[ discussion about matching the string abcdxxxghi against the pattern
abc(d.*e)^ghi, where the ^ means the preceding subexpression should match
strings that the subexpression alone wouldn't match ]

|In at least one interpretation, this pattern *does* match the input,
|because you can, in fact, find an "abc", followed by something that
|is not a "(d.*e)", followed by a "ghi".
|
|  abcdxxxghi
|  ^^^		matched by pattern "abc"
|     ^^^^	matched by pattern "(d.*e)^"
|         ^^^	matched by pattern "ghi"
|
|However, the conventional (but certainly not necessary) interpretation
|of regular expressions, at least in the Unix tool world, is that they
|match as much as possible.

Actually, this is a problem with normal (without the added ^) expressions
also.  Consider matching abcdef by abc.*def.  You can have:

    abcdef
    ^^^		matched by pattern "abc"
       ^^^	matched by pattern ".*"
		"def" doesn't match, so match fails.

or:

    abcdef
    ^^^		matched by pattern "abc"
		".*" matches the null string
       ^^^	matched by pattern "def"

In Unix, this ambiguity is overcome by specifying that in addition to
saying that a subexpression should match as much as possible, it must
allow following subexpressions to match as well, meaning that the second
case above is what happens, as well as things like:

    abcdefdef
    ^^^		matched by pattern "abc"
       ^^^	matched by pattern ".*"
          ^^^	matched by pattern "def"

|Under that interpretation, the pattern
|doesn't completely match the input, because the "(d.*e)^" matches
|too much, leaving nothing for the "ghi".
|
|  abcdxxxghi
|  ^^^		matched by pattern "abc"
|     ^^^^^^^	matched by pattern "(d.*e)^"

The additional clause mean the (d.*e)^ matches the longest string it can
while still allowing ghi to match, namely dxxx.

|  abcdxxxghixxxghi
|  ^^^			matched by "abc"
|     ^^^^		matched by "(d.*e)^"
|         ^^^		matched by "ghi"
|
|or, perhaps:
|
|  abcdxxxghixxxghi
|  ^^^			matched by "abc"
|     ^^^^^^^^^^		matched by "(d.*e)^"
|               ^^^	matched by "ghi"

The second case would be correct under the above convention, as
dxxxxghixxx is the longest string (d.*e)^ matches, while still allowing
ghi to match.

|Having said all this, I think that negated patterns could be quite
|useful.

I do also, and have been thinking about how it should be properly done.
A couple of other things I've been thinking about are being able to do a
case insensitive match on a subexpression, and treating a subexpression
as a string instead of a pattern.

-- 
Brian L. Matthews  blm@cxsea.UUCP   ...{mnetor,uw-beaver!ssc-vax}!cxsea!blm
+1 206 251 6811    Computer X Inc. - a division of Motorola New Enterprises

liberte@m.cs.uiuc.edu (07/06/88)

Since I didnt specify what is matched by the negation of a regular expression,
I then got trapped by my own lack of clarity.
(def)^ is *not* the same as [^def] for two reasons.
Thinking they are the same was a dumb mistake.

It seems the most reasonable match for the negation of a regular expression
is nothing.  To match something, you must use a positive regexp
concatenated to the negation.  Therefore, if you want to match
"abc....ghi" but not "abcd..eghi" you could use "abc(d.*e)^.*ghi".

So the interpretation is "the negation of a regular expression fails
if the regular expression is found and is successful if the regular
expression is not found; nothing is matched in either case."

Any other interpretation seems to have difficulties, as Stu Friedberg outlined.
Consider the negation of a regular expression that contains alternatives.
Which of the alternatives should be "matched" if they all must fail
in order for the pattern to succeed?

Dan LaLiberte
uiucdcs!liberte
liberte@cs.uiuc.edu
liberte%a.cs.uiuc.edu@uiucvmd.bitnet

smking@lion.waterloo.edu (Scott M. King) (07/07/88)

In article <18838@cornell.UUCP> blandy@cs.cornell.edu (Jim Blandy) writes:
>I do think negation is well-defined; using the proposed syntax, (pat)^
>matches any string pat would not.  Since the set of strings matched by
>pat is (presumably) well-defined, the set for (pat)^ is too.

Wrong. You have already seen many different interpretations of how (pat)^ 
could be defined, some clearly wrong. It is not obvious what some patterns
(containing negation) should match on some strings. 

A definition for *what* gets matched by a normal regular expression (not
*how* it gets matched) could be stated as follows.
Given an arbitrary pattern and a line to search, there could
be many *possible* matches of the pattern. Ie., there may be many substrings
in the line that could be matched by the pattern. To choose what the ultimate
match is, first determine the first character in the line that is part of a
possible match. Then, discard any possible matches that do not start on this
character. Finally, choose the longest substring of the possible matches
that remain. This is the ultimate match.

Now, what happens if we apply this algorithm to negated patterns?
Take the pattern "(a.*b)^cd", and the line "awxybzcd". Well, the
possible matches are "awxybzcd", "wxybzcd", "xybzcd", "ybzcd", "bzcd", "zcd"
or "cd". The possible match that starts at the earliest character position
is the entire line. This matches because "awxybz" is a possible match
for "(a.*b)^". Ie., it is not "an a followed by anything followed by a b".
Using your wording, (a.*b)^ matches any string (a.*b) would not, and
(a.*b) would clearly not match all of "awxybz", so (a.*b)^ does match it.

I think the most desirable result from the previous example would be
no match. This follows if the definition of negated patterns
is made in terms of the pattern being negated: In (rexp)^, match rexp
as usual. Then, if rexp was found, cause (rexp)^ to fail. Otherwise,
if rexp was not found, cause (rexp)^ to succeed.
So, applying this definition, the pattern matcher would look for a.*b
as usual, and since it finds "awxyb", (a.*b)^ fails, and so this
branch of the state machine dies off. Then, a.*b can not
be found starting at the second character in the line, so (a.*b)^
succeeds starting at "wxybzcd". Back to the question of *what*
this success of (a.*b)^ actually matches. I still maintain that the
only reasonable and consistent definition would cause
such a success to match zero characters. Any other arbitrary rule
defining what non-null set of characters gets matched is sure to
be wrong in many cases. And besides, if the negated pattern matches
zero characters, you can always explicitly say what to match.
Eg: using my definition, I can write the pattern "abc((d*e)^...|(.*b)^d..e)fg",
which means 'match abc. Then, if this is found, look for "d*e". If this fails,
match any three characters. If "(d*e)^..." was not found, then
look for ".*b". If this fails, match "d..e". If anything was matched
after the "abc", then match "fg". Else fail.
From my first example, "(a.*b)^cd", the "look for the "cd", and then test
what comes in between" rule that someone suggested can be represented
with my definition with the pattern "(a.*b)^.*cd".

Anyway, if (rexp)^ only tests for rexp and inverts the result, then
there should also be a pattern construct to test
for a positive pattern. (I previously thought that the "!" in (rexp)!
could mean "test for rexp". A slightly more mnemonic character is "?",
so "(rexp)?" would test for rexp rather that match it.)
--

Scott M. King

ge@hobbit.sci.kun.nl (Ge' Weijers) (07/07/88)

In article <18838@cornell.UUCP>, blandy@marduk.cs.cornell.edu (Jim Blandy) writes:
> I do think negation is well-defined; using the proposed syntax, (pat)^
> matches any string pat would not.  Since the set of strings matched by
> pat is (presumably) well-defined, the set for (pat)^ is too.
> 
> About the claim that "negation should be trivial, since it only entails
> flipping the accept/reject-ingness of the states in the automaton...":
> 

The complement of a regular expression is a regular expression, although the
derivation is not quite trivial. Matching a (xyz)^ is possible in theory
and in practice. So if someone can whip up some code to generate
the complement.....
-- 
Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands
UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge

blandy@awamore.cs.cornell.edu (Jim Blandy) (07/20/88)

Thinking about pattern complements:

In defining this, we should respect the automata theory behind regular
expressions, since lots of the algorithms in use today make use of the nice
properties of regular sets.

A regular expression denotes a set of strings (a string matches the RE if
it's in the set),  so the complement of a RE is the RE denoting the
complement of the original set.

Definition:  If R is a regular expression, (R)^ is a regular expression
         matching the complement of R;  a string matches (R)^ iff
         that string does NOT match R.

This doesn't work the way one might think it does - for example,
the RE "d" denotes the set of strings { "d" }.  The complement of
this is NOT the set of one-character strings <> "d".  It includes
things like the null string.  So the regexp "abc(d)^.*" (meaning
"abc", not "d", any string) DOES match the string "abcd" :  the
subexpression "(d)^" matches the null string, and the ".*" matches
the "d" at the end of "abcd" .

However, this definition IS useful.  I'm looking at a Technical
Report describing the ANSI C grammar, and they've got a lex scanner
here.  The pattern they give to match comments is:

    "/*"([^*]*[*]+[^\*\/])*(^*]*[*]+\/)

It's a real monstrosity, but necessary because a comment cannot
include the string "*/" .   Under the longest-match rule, the
line

    /* a comment */  some C code /* another comment */

would scan as one long comment if they hadn't gone to the trouble
of excluding it and used the more obvious pattern

    "/*" .* "*/"        (spaces for clarity)

Here's where negation comes in:  if I've got it right, that monstrosity
can be rewritten with the negation operator as

    "/*" (.* "*/" .*)^ "*/"

which is a LOT easier to understand:  the pattern

    .* "*/" .*

matches any string containing "*/".  So, its complement matches any
string NOT containing "*/".  So my comment pattern reads as "a comment
introducer ("/*"), followed by a string not containing a comment end
("*/"), followed by a comment end."  A big win for clarity.

It just seemed to me that the issue needed a clear discussion about
what the complement of a regular expression really is, and some
examples of how it should work.  I think it's a good idea.  I think one
could extend the Thompson construction (turning an RE into an NFA)
to handle this, but it won't fit in without some modification...
--
Jim Blandy - blandy@crnlcs.bitnet
"insects were insects when man was just a burbling whatisit."  - archy

liberte@m.cs.uiuc.edu (07/22/88)

Mike Haertel has written an e?grep for FSF.  I mailed the
comp.editors discussion on regexp negation to him.  He asked me to
post his response.


Dan LaLiberte
uiucdcs!liberte
liberte@cs.uiuc.edu
liberte%a.cs.uiuc.edu@uiucvmd.bitnet
-------------------------------------

From: mike@wheaties.ai.mit.edu (Mike Haertel)

Negation is a good idea if (1) we can decide exactly what it should
mean and (2) how to implement it.

Several weeks ago, just after I finished GNU e?grep, I gave some
consideration to the problem.  In your summary of the net traffic,
there is some obvious dissension as to what precisely the semantics of
negation would be.

The only clear definition I could come up for a complement operator is
as follows:  if foo describes some regular set (of strings) R, then
!(foo) describes the complement set ~R (with respect to the universe of
all strings).  This is well-defined, since the set of regular sets is
closed under union, intersection, and complement.

This definition is unlikely to do what people would (naively) expect in
the simple cases, because of the typical semantics of "find the first
and longest match, considering all possible substrings of the text."
For example, in an editor, "!bar" would most likely match from the
current position to the end of the buffer (unless the text from the
current position to the end of the buffer were exactly "bar", in which
case it would match "ba").  On the other hand, constructs such as
"foo(!bar)baz" would be useful, but would usually require scanning to
the end of the buffer (unless matching were restricted to one line at a
time).

Negation has been studied in the theoretical setting, along with
intersection (ever wanted an "&" operator?).  It can lead to bizarre
consequences; in fact, the problem of building a complete DFA from a
regexp including "|", "&", and "!" turns out to be one of the most
intractable problems imaginable.  "The Design and Analysis of Computer
Algorithms," (by Aho, Hopcroft, and Ullman) contains a discussion of
this in chapter 11, titled "Some Provably Intractable Problems."

Nobody's suggestions about how to implement negation have been right.
Right now I see an obvious way to do it, but (1) I might be wrong, and
(2) it would require complete replacement of the routines I already
have for building a DFA.  Since negation doesn't actually expand the
set of strings that can be described, I think other considerations
(such as figuring out how to do Boyer-Moore style searches based on
regexps and not just fixed strings) should be given higher priority.
Right now I am working on diff and don't have any time for grep other
than bug fixes.  What all this adds up to is that I'm not going to do
anything about this just now.  After the summer is over and I'm back in
school, or maybe when I go to graduate school, I hope to look further
into the problem if somebody else hasn't already.

GNU e?grep is available on prep.ai.mit.edu via anonymous ftp, in the
directory /u/emacs/grep.  jaw@ames.arc.nasa.gov as a version of it
available in ~ftp/pub on that host that has some simple boyer-moore
enhancements added.  My original version tends to run slightly more
than twice as fast as Unix egrep (the fastest of the Unix greps).  With
his enhancements, it can beat Vax microcode searching for fixed strings.

You may wish to forward this to comp.editors as I do not have reliable
access to news here.  Also, I would appreciate it if you could forward
copies of any future netnews traffic about this to me
(mike@wheaties.ai.mit.edu).

peter@sugar.uu.net (Peter da Silva) (07/24/88)

In article <37200011@m.cs.uiuc.edu>, liberte@m.cs.uiuc.edu writes:
> The only clear definition I could come up for a complement operator is
> as follows:  if foo describes some regular set (of strings) R, then
> !(foo) describes the complement set ~R (with respect to the universe of
> all strings).  This is well-defined, since the set of regular sets is
> closed under union, intersection, and complement.

This doesn't seem intuitive to me. More useful would be the set of strings
not containing R. I can't see any case why using your definition of !
you would ever use anything but "!(.*foo.*)".

This could also be implemented by trying to match "A!BC" first to
"A(.*)C", then checking that the intervening stuff didn't match
".*B.*". Of course, this (a) probably doesn't help you build a DFA
and (b) could equally well be used for your setup.

> This definition is unlikely to do what people would (naively) expect in
> the simple cases, because of the typical semantics of "find the first
> and longest match, considering all possible substrings of the text."

Right, it fails the principle of least astonishment.

Are there any general purpose tools that allow you to match more than a single
line?
-- 
		Peter da Silva  `-_-'  peter@sugar.uu.net
 		 Have you hugged  U  your wolf today?