day@grand.UUCP (Dave Yost) (07/02/88)
I'm working on a spec for industrial-strength (regex in a drum?) pattern matching expressions way beyond what egrep can do, and I'd like to hear about any editors or other programs out there that have pattern matching search and replacement capabilities beyond the regular expressions found in the various unix programs. Thanks. --dave
liberte@m.cs.uiuc.edu (07/02/88)
I would like to have a more powerful negation operator. Currently, regexps allow negation of a signal set of chars. I would like negation of a whole subexpression. e.g. abc(def)^ would match a string that starts with abc but is NOT followed by def. I think "^" would be unambiguous as a postfix operator. "^" as beginning-of-line would only be valid as the first char of a subexpression. There is possible confusion with [^abc] as negation of a char set, but such use could be phased out since [abc]^ is equivalent. Since negation of a regular expression is a regular expression and all you have to do to a finite state machine is flip accepting and non-accepting states, this should be trivial. Why wasnt it done before? Dan LaLiberte uiucdcs!liberte liberte@cs.uiuc.edu liberte%a.cs.uiuc.edu@uiucvmd.bitnet
smking@lion.waterloo.edu (Scott M. King) (07/04/88)
In article <37200009@m.cs.uiuc.edu> liberte@m.cs.uiuc.edu writes: >e.g. abc(def)^ would match a string that starts with >abc but is NOT followed by def. If I have the pattern "abc(d.*e)^ghi", and the line "abcdxxxghi", then what gets matched, if anything? The abc part is easy. Then, the pattern matcher tries to match a d.*e but it doesn't find one. That means the (d.*e)^ is successful, but since it didn't find anything, it has to start looking for the ghi at "dxxxghi", and doesn't find it. Thus the match fails. So, your ^ operator must act like a *test* for anything but the pattern rather than a *match* for anything but the pattern. Ie., test for the pattern, but don't eat up any characters while testing. Note that this behaviour would mean that with the line "abcecf", the pattern "ab(cd)^c." would match "abce", not "abcecf". >There is possible confusion with [^abc] as negation of a char set, >but such use could be phased out since [abc]^ is equivalent. But, [abc]^ would really just test for anything but an a, b or c. It wouldn't also match it. Of course, being able to test for a subpattern without actually matching it would probably be handy on occasion. However, it would be much better to use another special character like ! to indicate that the pattern should only be tested. "(rexp)" would match "rexp", and "(rexp)!" would test for "rexp". I guess you could still use your ^ operator in conjunction with this to test for anything but a pattern. However, using (rexp)^ (without !) would either be undefined, or mean the same thing as (rexp)!^ -- Scott M. King
stuart@cs.rochester.edu (Stuart Friedberg) (07/05/88)
In article <7618@watdragon.waterloo.edu> Scott M. King writes: >If I have the pattern "abc(d.*e)^ghi", and the line "abcdxxxghi", then >what gets matched, if anything? The abc part is easy. Then, the pattern >matcher tries to match a d.*e but it doesn't find one. >That means the (d.*e)^ is successful, but since it didn't find anything, >it has to start looking for the ghi at "dxxxghi", and doesn't find it. >Thus the match fails. Perhaps the original suggestion wasn't absolutely clear. I can't decide between a couple of interpretations, but I don't agree with King when he says "That means the (d.*e)^ is successful, but since it didn't find anything, ..." The pattern DID find something. It found a string that was not a 'd' followed by anything followed by an 'e'. The hard part is deciding just *which* string it matched. In at least one interpretation, this pattern *does* match the input, because you can, in fact, find an "abc", followed by something that is not a "(d.*e)", followed by a "ghi". abcdxxxghi ^^^ matched by pattern "abc" ^^^^ matched by pattern "(d.*e)^" ^^^ matched by pattern "ghi" However, the conventional (but certainly not necessary) interpretation of regular expressions, at least in the Unix tool world, is that they match as much as possible. Under that interpretation, the pattern doesn't completely match the input, because the "(d.*e)^" matches too much, leaving nothing for the "ghi". abcdxxxghi ^^^ matched by pattern "abc" ^^^^^^^ matched by pattern "(d.*e)^" This is a problem. Accepting anything *but* a pattern in almost all cases accepts too much, under the usual interpretation. Negated patterns should *not* match the longest possible string for the reason given above, but neither should they match the shortest possible string, because "(d.*e)^" would match any two character string not equal to "de". Well, we could treat negated patterns "(d.*e)^" by looking for the next positive pattern "ghi", then testing the stuff we skip over for the undesirable pattern. That seems reasonable at first, but what do we do for input like "abcdxxxghixxxghi"? Do we have: abcdxxxghixxxghi ^^^ matched by "abc" ^^^^ matched by "(d.*e)^" ^^^ matched by "ghi" or, perhaps: abcdxxxghixxxghi ^^^ matched by "abc" ^^^^^^^^^^ matched by "(d.*e)^" ^^^ matched by "ghi" Do we find the first occurrence of "ghi"? The last? The first/last such that the preceding negated pattern succeeds? What do we do when a negated pattern is followed by another negated pattern? This starts to smack too much of non-determinism, which is fine in its place, but conventions enforcing determinism are quite useful in practice. Having said all this, I think that negated patterns could be quite useful. More than one, I have wanted to search for occurrences of "foo(XXX)", where "XXX" is not "bar". The proper *practical* convention for negated patterns should be driven how people intend to use them, and I suggest that people submit proposed applications. If we can't come up with more complicated examples, perhaps the shortest span such that the following pattern succeeds, with adjacent negated patterns illegal, is satisfactory. Stu Friedberg stuart@cs.rochester.edu {ames,cmcl2,rutgers}!rochester!edu
blandy@marduk.cs.cornell.edu (Jim Blandy) (07/05/88)
I think you guys are missing something important about regular expressions and the way they're matched; The rule is as follows: 1) If it is AT ALL possible for the pattern to match, it will. This means that a pattern like e*eeee will always match a string consisting of at least four e's; there's no danger of the * operator "overstepping its bounds." 2) If one is searching for a pattern match in a large string, (like, for example, a text buffer), then we are faced with the question "HOW MUCH did it match?" (This question is especially important in a search-and-replace situation; what are we to replace?) That's where the "longest possible" rule comes into play. So the "longest possible" rule will never mess up a match that would otherwise be successful... I do think negation is well-defined; using the proposed syntax, (pat)^ matches any string pat would not. Since the set of strings matched by pat is (presumably) well-defined, the set for (pat)^ is too. About the claim that "negation should be trivial, since it only entails flipping the accept/reject-ingness of the states in the automaton...": If you're using a DFA (deterministic finite automaton) to match your patterns, this is true. No problem. But that trick does not work for NFA's (non-deterministic FA). An NFA could be in any one of several states after a certain input, some accepting, some rejecting; the rule is, if any of the possible states are accepting states, the NFA accepts its input, i.e. matches the pattern. Okay, so now flip all your states; the NFA could be in some rejecting states and some accepting states; you're not excluding everything you matched before. You want to say "anything this matches, I don't," and that's not what happens. Well, so what? Well, most pattern-matching functions in editors simulate an NFA. All (I think?) the Unix editors do this. So negation is not a trivial thing. I can't think of any simple way to do it off the bat, but that's just me; maybe someone else? -- Jim Blandy - blandy@crnlcs.bitnet "insects were insects when man was just a burbling whatisit." - archie
blm@cxsea.UUCP (Brian Matthews) (07/05/88)
Stuart Friedberg (stuart@cs.rochester.edu) writes: [ discussion about matching the string abcdxxxghi against the pattern abc(d.*e)^ghi, where the ^ means the preceding subexpression should match strings that the subexpression alone wouldn't match ] |In at least one interpretation, this pattern *does* match the input, |because you can, in fact, find an "abc", followed by something that |is not a "(d.*e)", followed by a "ghi". | | abcdxxxghi | ^^^ matched by pattern "abc" | ^^^^ matched by pattern "(d.*e)^" | ^^^ matched by pattern "ghi" | |However, the conventional (but certainly not necessary) interpretation |of regular expressions, at least in the Unix tool world, is that they |match as much as possible. Actually, this is a problem with normal (without the added ^) expressions also. Consider matching abcdef by abc.*def. You can have: abcdef ^^^ matched by pattern "abc" ^^^ matched by pattern ".*" "def" doesn't match, so match fails. or: abcdef ^^^ matched by pattern "abc" ".*" matches the null string ^^^ matched by pattern "def" In Unix, this ambiguity is overcome by specifying that in addition to saying that a subexpression should match as much as possible, it must allow following subexpressions to match as well, meaning that the second case above is what happens, as well as things like: abcdefdef ^^^ matched by pattern "abc" ^^^ matched by pattern ".*" ^^^ matched by pattern "def" |Under that interpretation, the pattern |doesn't completely match the input, because the "(d.*e)^" matches |too much, leaving nothing for the "ghi". | | abcdxxxghi | ^^^ matched by pattern "abc" | ^^^^^^^ matched by pattern "(d.*e)^" The additional clause mean the (d.*e)^ matches the longest string it can while still allowing ghi to match, namely dxxx. | abcdxxxghixxxghi | ^^^ matched by "abc" | ^^^^ matched by "(d.*e)^" | ^^^ matched by "ghi" | |or, perhaps: | | abcdxxxghixxxghi | ^^^ matched by "abc" | ^^^^^^^^^^ matched by "(d.*e)^" | ^^^ matched by "ghi" The second case would be correct under the above convention, as dxxxxghixxx is the longest string (d.*e)^ matches, while still allowing ghi to match. |Having said all this, I think that negated patterns could be quite |useful. I do also, and have been thinking about how it should be properly done. A couple of other things I've been thinking about are being able to do a case insensitive match on a subexpression, and treating a subexpression as a string instead of a pattern. -- Brian L. Matthews blm@cxsea.UUCP ...{mnetor,uw-beaver!ssc-vax}!cxsea!blm +1 206 251 6811 Computer X Inc. - a division of Motorola New Enterprises
liberte@m.cs.uiuc.edu (07/06/88)
Since I didnt specify what is matched by the negation of a regular expression, I then got trapped by my own lack of clarity. (def)^ is *not* the same as [^def] for two reasons. Thinking they are the same was a dumb mistake. It seems the most reasonable match for the negation of a regular expression is nothing. To match something, you must use a positive regexp concatenated to the negation. Therefore, if you want to match "abc....ghi" but not "abcd..eghi" you could use "abc(d.*e)^.*ghi". So the interpretation is "the negation of a regular expression fails if the regular expression is found and is successful if the regular expression is not found; nothing is matched in either case." Any other interpretation seems to have difficulties, as Stu Friedberg outlined. Consider the negation of a regular expression that contains alternatives. Which of the alternatives should be "matched" if they all must fail in order for the pattern to succeed? Dan LaLiberte uiucdcs!liberte liberte@cs.uiuc.edu liberte%a.cs.uiuc.edu@uiucvmd.bitnet
smking@lion.waterloo.edu (Scott M. King) (07/07/88)
In article <18838@cornell.UUCP> blandy@cs.cornell.edu (Jim Blandy) writes: >I do think negation is well-defined; using the proposed syntax, (pat)^ >matches any string pat would not. Since the set of strings matched by >pat is (presumably) well-defined, the set for (pat)^ is too. Wrong. You have already seen many different interpretations of how (pat)^ could be defined, some clearly wrong. It is not obvious what some patterns (containing negation) should match on some strings. A definition for *what* gets matched by a normal regular expression (not *how* it gets matched) could be stated as follows. Given an arbitrary pattern and a line to search, there could be many *possible* matches of the pattern. Ie., there may be many substrings in the line that could be matched by the pattern. To choose what the ultimate match is, first determine the first character in the line that is part of a possible match. Then, discard any possible matches that do not start on this character. Finally, choose the longest substring of the possible matches that remain. This is the ultimate match. Now, what happens if we apply this algorithm to negated patterns? Take the pattern "(a.*b)^cd", and the line "awxybzcd". Well, the possible matches are "awxybzcd", "wxybzcd", "xybzcd", "ybzcd", "bzcd", "zcd" or "cd". The possible match that starts at the earliest character position is the entire line. This matches because "awxybz" is a possible match for "(a.*b)^". Ie., it is not "an a followed by anything followed by a b". Using your wording, (a.*b)^ matches any string (a.*b) would not, and (a.*b) would clearly not match all of "awxybz", so (a.*b)^ does match it. I think the most desirable result from the previous example would be no match. This follows if the definition of negated patterns is made in terms of the pattern being negated: In (rexp)^, match rexp as usual. Then, if rexp was found, cause (rexp)^ to fail. Otherwise, if rexp was not found, cause (rexp)^ to succeed. So, applying this definition, the pattern matcher would look for a.*b as usual, and since it finds "awxyb", (a.*b)^ fails, and so this branch of the state machine dies off. Then, a.*b can not be found starting at the second character in the line, so (a.*b)^ succeeds starting at "wxybzcd". Back to the question of *what* this success of (a.*b)^ actually matches. I still maintain that the only reasonable and consistent definition would cause such a success to match zero characters. Any other arbitrary rule defining what non-null set of characters gets matched is sure to be wrong in many cases. And besides, if the negated pattern matches zero characters, you can always explicitly say what to match. Eg: using my definition, I can write the pattern "abc((d*e)^...|(.*b)^d..e)fg", which means 'match abc. Then, if this is found, look for "d*e". If this fails, match any three characters. If "(d*e)^..." was not found, then look for ".*b". If this fails, match "d..e". If anything was matched after the "abc", then match "fg". Else fail. From my first example, "(a.*b)^cd", the "look for the "cd", and then test what comes in between" rule that someone suggested can be represented with my definition with the pattern "(a.*b)^.*cd". Anyway, if (rexp)^ only tests for rexp and inverts the result, then there should also be a pattern construct to test for a positive pattern. (I previously thought that the "!" in (rexp)! could mean "test for rexp". A slightly more mnemonic character is "?", so "(rexp)?" would test for rexp rather that match it.) -- Scott M. King
ge@hobbit.sci.kun.nl (Ge' Weijers) (07/07/88)
In article <18838@cornell.UUCP>, blandy@marduk.cs.cornell.edu (Jim Blandy) writes: > I do think negation is well-defined; using the proposed syntax, (pat)^ > matches any string pat would not. Since the set of strings matched by > pat is (presumably) well-defined, the set for (pat)^ is too. > > About the claim that "negation should be trivial, since it only entails > flipping the accept/reject-ingness of the states in the automaton...": > The complement of a regular expression is a regular expression, although the derivation is not quite trivial. Matching a (xyz)^ is possible in theory and in practice. So if someone can whip up some code to generate the complement..... -- Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge
blandy@awamore.cs.cornell.edu (Jim Blandy) (07/20/88)
Thinking about pattern complements: In defining this, we should respect the automata theory behind regular expressions, since lots of the algorithms in use today make use of the nice properties of regular sets. A regular expression denotes a set of strings (a string matches the RE if it's in the set), so the complement of a RE is the RE denoting the complement of the original set. Definition: If R is a regular expression, (R)^ is a regular expression matching the complement of R; a string matches (R)^ iff that string does NOT match R. This doesn't work the way one might think it does - for example, the RE "d" denotes the set of strings { "d" }. The complement of this is NOT the set of one-character strings <> "d". It includes things like the null string. So the regexp "abc(d)^.*" (meaning "abc", not "d", any string) DOES match the string "abcd" : the subexpression "(d)^" matches the null string, and the ".*" matches the "d" at the end of "abcd" . However, this definition IS useful. I'm looking at a Technical Report describing the ANSI C grammar, and they've got a lex scanner here. The pattern they give to match comments is: "/*"([^*]*[*]+[^\*\/])*(^*]*[*]+\/) It's a real monstrosity, but necessary because a comment cannot include the string "*/" . Under the longest-match rule, the line /* a comment */ some C code /* another comment */ would scan as one long comment if they hadn't gone to the trouble of excluding it and used the more obvious pattern "/*" .* "*/" (spaces for clarity) Here's where negation comes in: if I've got it right, that monstrosity can be rewritten with the negation operator as "/*" (.* "*/" .*)^ "*/" which is a LOT easier to understand: the pattern .* "*/" .* matches any string containing "*/". So, its complement matches any string NOT containing "*/". So my comment pattern reads as "a comment introducer ("/*"), followed by a string not containing a comment end ("*/"), followed by a comment end." A big win for clarity. It just seemed to me that the issue needed a clear discussion about what the complement of a regular expression really is, and some examples of how it should work. I think it's a good idea. I think one could extend the Thompson construction (turning an RE into an NFA) to handle this, but it won't fit in without some modification... -- Jim Blandy - blandy@crnlcs.bitnet "insects were insects when man was just a burbling whatisit." - archy
liberte@m.cs.uiuc.edu (07/22/88)
Mike Haertel has written an e?grep for FSF. I mailed the comp.editors discussion on regexp negation to him. He asked me to post his response. Dan LaLiberte uiucdcs!liberte liberte@cs.uiuc.edu liberte%a.cs.uiuc.edu@uiucvmd.bitnet ------------------------------------- From: mike@wheaties.ai.mit.edu (Mike Haertel) Negation is a good idea if (1) we can decide exactly what it should mean and (2) how to implement it. Several weeks ago, just after I finished GNU e?grep, I gave some consideration to the problem. In your summary of the net traffic, there is some obvious dissension as to what precisely the semantics of negation would be. The only clear definition I could come up for a complement operator is as follows: if foo describes some regular set (of strings) R, then !(foo) describes the complement set ~R (with respect to the universe of all strings). This is well-defined, since the set of regular sets is closed under union, intersection, and complement. This definition is unlikely to do what people would (naively) expect in the simple cases, because of the typical semantics of "find the first and longest match, considering all possible substrings of the text." For example, in an editor, "!bar" would most likely match from the current position to the end of the buffer (unless the text from the current position to the end of the buffer were exactly "bar", in which case it would match "ba"). On the other hand, constructs such as "foo(!bar)baz" would be useful, but would usually require scanning to the end of the buffer (unless matching were restricted to one line at a time). Negation has been studied in the theoretical setting, along with intersection (ever wanted an "&" operator?). It can lead to bizarre consequences; in fact, the problem of building a complete DFA from a regexp including "|", "&", and "!" turns out to be one of the most intractable problems imaginable. "The Design and Analysis of Computer Algorithms," (by Aho, Hopcroft, and Ullman) contains a discussion of this in chapter 11, titled "Some Provably Intractable Problems." Nobody's suggestions about how to implement negation have been right. Right now I see an obvious way to do it, but (1) I might be wrong, and (2) it would require complete replacement of the routines I already have for building a DFA. Since negation doesn't actually expand the set of strings that can be described, I think other considerations (such as figuring out how to do Boyer-Moore style searches based on regexps and not just fixed strings) should be given higher priority. Right now I am working on diff and don't have any time for grep other than bug fixes. What all this adds up to is that I'm not going to do anything about this just now. After the summer is over and I'm back in school, or maybe when I go to graduate school, I hope to look further into the problem if somebody else hasn't already. GNU e?grep is available on prep.ai.mit.edu via anonymous ftp, in the directory /u/emacs/grep. jaw@ames.arc.nasa.gov as a version of it available in ~ftp/pub on that host that has some simple boyer-moore enhancements added. My original version tends to run slightly more than twice as fast as Unix egrep (the fastest of the Unix greps). With his enhancements, it can beat Vax microcode searching for fixed strings. You may wish to forward this to comp.editors as I do not have reliable access to news here. Also, I would appreciate it if you could forward copies of any future netnews traffic about this to me (mike@wheaties.ai.mit.edu).
peter@sugar.uu.net (Peter da Silva) (07/24/88)
In article <37200011@m.cs.uiuc.edu>, liberte@m.cs.uiuc.edu writes: > The only clear definition I could come up for a complement operator is > as follows: if foo describes some regular set (of strings) R, then > !(foo) describes the complement set ~R (with respect to the universe of > all strings). This is well-defined, since the set of regular sets is > closed under union, intersection, and complement. This doesn't seem intuitive to me. More useful would be the set of strings not containing R. I can't see any case why using your definition of ! you would ever use anything but "!(.*foo.*)". This could also be implemented by trying to match "A!BC" first to "A(.*)C", then checking that the intervening stuff didn't match ".*B.*". Of course, this (a) probably doesn't help you build a DFA and (b) could equally well be used for your setup. > This definition is unlikely to do what people would (naively) expect in > the simple cases, because of the typical semantics of "find the first > and longest match, considering all possible substrings of the text." Right, it fails the principle of least astonishment. Are there any general purpose tools that allow you to match more than a single line? -- Peter da Silva `-_-' peter@sugar.uu.net Have you hugged U your wolf today?