hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) (06/05/91)
I would like to talk with and and all regular expression gurus. It seems that the regular expressions that are supported by perl (and unix) do not accept any and all regular languages. If this is not true, I would love to learn how to do some of the things that I am trying to do. For example, how would you write a regular expression that accepts a string with "foo" iff "bar" is not also in the string. -- Karl Hakimian hakimian@yoda.eecs.wsu.edu
merlyn@iWarp.intel.com (Randal L. Schwartz) (06/05/91)
In article <1991Jun4.175023.6509@serval.net.wsu.edu>, hakimian@tek4 (Karl Hakimian - staff) writes: | For example, how would you write a regular expression that accepts a string | with "foo" iff "bar" is not also in the string. There's no trivial way to do it, unless you count if (/foo/ && !/bar/) { # yes! } else { # no (sigh) } as your (literal) solution. Regexs have a hard time saying "not this thing here", unless the "thing" is a single character. print "Just another Perl hacker," unless /foo/ && !/bar/ -- /=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\ | on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III | | merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn | \=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/
Tom Christiansen <tchrist@convex.COM> (06/05/91)
From the keyboard of hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff): :For example, how would you write a regular expression that accepts a string :with "foo" iff "bar" is not also in the string. I'd probably write it as two pattern-matches, one negated: if (!/bar/ && /foo/) { or $ok = /foo/ && !/bar/; --tom -- Tom Christiansen tchrist@convex.com convex!tchrist "Perl is to sed as C is to assembly language." -me
wicklund@intellistor.com (Tom Wicklund) (06/06/91)
In <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes: >For example, how would you write a regular expression that accepts a string >with "foo" iff "bar" is not also in the string. Something like (/foo/ && !/bar/) is simplest. A regular expression meeting this would be something like the following. I've tested it and can't find anyplace it doesn't work. If I missed a case, it can be added to the conditions, but it will get messier. Needless to say, the two pattern form above is MUCH more readable. /^([^b]|b+[^ab]|b+a+[^barf]|b+aa+r)*(foo|b+foo|b+a+foo)([^b]|b+[^ab]|b+a+[^bar]|b+aa+r)*$/
connolly@convex.convex.COM (Dan Connolly) (06/06/91)
In article <1991Jun5.183443.11524@intellistor.com>, wicklund@intellistor.com (Tom Wicklund) writes: |> In <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes: |> |> >For example, how would you write a regular expression that accepts a string |> >with "foo" iff "bar" is not also in the string. |> |> What definition of "regular language" are you [Karl Hakimian] working from? The definition of regular language I've seen is: For an alphabet A, the regular languages over A are the empty language, empty-string language, the single-character-string languages, and all finite applications of union, concatenation, and kleene-star of other regualar languages. In unix-speak, the alphabet is the 127 ascii characters (1-128). The empty language is tough to come up with. What's a regexp that matches nothing? The empty-string language is /^$/. The single-character-string languages are /a/, /b/, /c/, /\\/, etc. The union of two languages A and B is /A|B/. The concatenation of two languages A and B is /AB/. The kleene star of A is /A*/. Aside from some details about ()'s, that would seem to me to fit the bill. Dan
marc@athena.mit.edu (Marc Horowitz) (06/06/91)
|> I would like to talk with and and all regular expression gurus. It |> seems that the regular expressions that are supported by perl (and |> unix) do not accept any and all regular languages. If this is not |> true, I would love to learn how to do some of the things that I am |> trying to do. According to "Elements of the Theory of Computation," by Lewis and Papadimitriou (the textbook for 6.045 at MIT), regular languages can be completely described by the following things: 1) the empty string 2) any character in the alphabet 3) the concatenation of two regular expressions 4) the union of two regular expressions 5) the Kleene star of a regular expression (the * in x* is a Kleene star. There, now you probably learned something :-) Regular expressions are always defined over a finite alphabet. Under unix, this alphabet is \000-\377. perl regexps (and all other unix regexps I know) are all just ways of stating the above rules for the convenience of the programmer. Constant strings are concatenations of characters. Specials like \s and \w are unions of charaters. [abc0-9] is a union of 13 characters. (re)? is the union of the empty string and re. (re){m,n} can be written as (re) concatenated m times, concatenated with (re)? concatenated (n-m) times. Every other regexp expression in unix can similarly be written as some combination of the five operations listed above. L&D also points out that regular expressions are closed under intersection and complementation. So, your example of a regular expression matching all strings not containing "bar" is possible, but it could be really hairy. So, the regexp you proposed would be the concatenation of not bar, foo, and not bar. This will be even hairier. Tom Wicklund has posted such a regular expression. It is easy to show that a regular expression exists; it is often not easy to actually construct that regular expression. In short, "if (/foo/ && !/bar/)" is the easiest (and fastest, and clearest) way to say what you want to say. Marc
wicklund@intellistor.com (Tom Wicklund) (06/06/91)
As I might have expected, the regexp I posted wasn't correct. The following seems closer. /^([^b]|b+[^ab]|b+a+[^barf]|(b|aa+|ab)+r)*(foo|[ba]+foo)([^b]|b+[^ab]|[ba]+[^bar]|(b|aa+|ab)+r)*[ba]*$/ Note that checking for not of a 2 character string has the form (x|y)z, while checking not a 3 character string above has the form (a|b|c|d)e. I imagine longer strings will grow exponentially, so in practice the "/foo/ && !/bar/" form is much more practical. I imagine the regexp definition could be expanded to include a not function (and it might not be too hard to implement from what I remember of looking at the PERL/GNU/Henry Spenser regular expression implementations). It's probably also possible to derive a means of inverting a regular expression, but I wouldn't want to try it.
jm36+@andrew.cmu.edu (John Gardiner Myers) (06/06/91)
In article <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes: >For example, how would you write a regular expression that accepts a string >with "foo" iff "bar" is not also in the string. /^([^b]|b[^a]|ba[^r])*b?a?foo([^b]|b[^a]|ba[^r])*$/
tchrist@convex.COM (Tom Christiansen) (06/06/91)
From the keyboard of wicklund@intellistor.com (Tom Wicklund): :the PERL/GNU/Henry Spenser regular expression implementations). It's s/ser/cer/ even though I doubt whether Henry reads this forum (or at least will admit to doing so :-). --tom -- Tom Christiansen tchrist@convex.com convex!tchrist "Perl is to sed as C is to assembly language." -me
merlyn@iWarp.intel.com (Randal L. Schwartz) (06/07/91)
In article <1991Jun6.162325.18139@cs.cmu.edu>, jm36@hogtown (John Gardiner Myers) writes: | In article <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes: | >For example, how would you write a regular expression that accepts a string | >with "foo" iff "bar" is not also in the string. | | /^([^b]|b[^a]|ba[^r])*b?a?foo([^b]|b[^a]|ba[^r])*$/ $_ = "bbar foo"; print "wrong" if /^([^b]|b[^a]|ba[^r])*b?a?foo([^b]|b[^a]|ba[^r])*$/; Writing "doesn't match" regexs is a very tricky art. No, I'm not trying to start a contest. You just have to remember that any part of your "not classes" could match something else of some significance in your string. It took me 10 seconds to come up with the string above. $_="Could I be Just another guy who takes Perl and hacks it?"; /(J\w+ \w+).*(P\w+).*(h\w*)s/; print "$1 $2 $3er,"; -- /=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\ | on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III | | merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn | \=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/
worley@compass.com (Dale Worley) (06/07/91)
From: hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) I would like to talk with and and all regular expression gurus. It seems that the regular expressions that are supported by perl (and unix) do not accept any and all regular languages. If this is not true, I would love to learn how to do some of the things that I am trying to do. Perl's regular expressions *do* accept all regular languages. It contains single characters, concatenation, *, and |, which is enough. The problem you're having is that there is no explicit "doesn't contain" operator in Perl. However, that's not needed -- you can always rephrase "doesn't contain" in terms of the other operators. The transformation is messy, though. See any book that discusses regular languages and DFAs. For example, how would you write a regular expression that accepts a string with "foo" iff "bar" is not also in the string. (I assume that you mean "accepts the strings that contain 'foo' and do not contain 'bar'".) Personally, I'd construct the DFAs that accept "^.*foo.*$" and "^.*bar.*$". Then I'd combine them using the standard "A intersect (complement B)" construction to get a DFA that accepted strings containing foo but not bar. Then I'd do the horrible transformation to turn that DFA into a regular expression. But there's probably a better way. Dale Worley Compass, Inc. worley@compass.com -- Take nothing by force that you don't want.
lbr@holos0.uucp (Len Reed) (06/07/91)
In article <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes: >I would like to talk with and and all regular expression gurus. It seems that >the regular expressions that are supported by perl (and unix) do not accept >any and all regular languages. >For example, how would you write a regular expression that accepts a string >with "foo" iff "bar" is not also in the string. It's not meaningful to talk of the regular expressions supported by Unix since different Unix tools support different concepts of regular expressions. Grep does not support a full set of regular expressions; perl and egrep do. A regular is defined recursively as follows: 1) A single character from a given set. or 2) Concatenation of two regular expressions. or 3) Kleene closure of a regular expression. (The * operator, usually written in superscript in textbook descriptions.) or 4) Disjunction of two regular expressions. (The | operator.) Syntactically there must be a method of grouping, of course, to show the difference between a(bc)* and abc*. Since perl supports all of these, it is complete. Grep, ex, ed, and several others lack the disjunction operator. Just because you *can* do something doesn't make it easy, natural, or efficient, though. A bare-bones implementation of regular expressions will accept more languages than grep, but grep allows nice syntactic shortcuts like [a-z] and [^xyz]. Perl has these in spades. You sure don't want to have to say a|b|c|d|e|f .... Since perl's expressions *are* complete, you *can* express your problem if it's regular. Your problem didn't seem regular to me off hand, since it can't be easily expressed as a regular expression. But it makes for a simple finite automaton (my DFA solution had 9 states), so it is regular. In perl, you want /foo/ && !/bar/ Coercing this into one regular expression, while provably do-able, is ill advised. I lost interest real quick in hand converting my DFA to a regular expression. -- Len Reed Holos Software, Inc. Voice: (404) 496-1358 UUCP: ...!gatech!holos0!lbr
allbery@NCoast.ORG (Brandon S. Allbery KF8NH) (06/08/91)
As quoted from <1991Jun4.175023.6509@serval.net.wsu.edu> by hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff): +--------------- | I would like to talk with and and all regular expression gurus. It seems that | the regular expressions that are supported by perl (and unix) do not accept | any and all regular languages. If this is not true, I would love to learn how | to do some of the things that I am trying to do. +--------------- It's true, which is why Perl will never replace, say, Icon for *all* applications. +--------------- | For example, how would you write a regular expression that accepts a string | with "foo" iff "bar" is not also in the string. +--------------- You don't. Regexps are most capable in combination with other constructs; so, while you can't write a regexp that matches /foo/ iff it does not also match /bar/, you *can* say (in Perl): (($str !~ /bar/) && ($str =~ /foo/)). (I could argue that Perl is nothing more than an extended regexp parser, if I were feeling whimsical....) ++Brandon -- Me: Brandon S. Allbery KF8NH: DC to LIGHT! [44.70.4.88] Internet: allbery@NCoast.ORG Delphi: ALLBERY uunet!usenet.ins.cwru.edu!ncoast!allbery
schwartz@groucho.cs.psu.edu (Scott Schwartz) (06/10/91)
connolly@convex.convex.COM (Dan Connolly) writes: | What definition of "regular language" are you [Karl Hakimian] working from? | The definition of regular language I've seen is: It is a theorem that regular languages are closed under intersection and negation, which is what he was after. Unix regexps don't generally provide a syntax for those things, but that's just happenstance, especially considering that Thompson's 1968 paper talks providing those features.
colin@array.UUCP (Colin Plumb) (06/11/91)
In article <1991Jun8.145419.22522@NCoast.ORG> allbery@ncoast.ORG (Brandon S. Allbery KF8NH) writes: >Regexps are most capable in combination with other constructs; so, while you >can't write a regexp that matches /foo/ iff it does not also match /bar/, you >*can* say (in Perl): (($str !~ /bar/) && ($str =~ /foo/)). (I could argue >that Perl is nothing more than an extended regexp parser, if I were feeling >whimsical....) Sorry, Brandon, you need to re-read your formal languages textbook. The set of formal languages is closed under complemenbt and intersection as well as union (the | operator), kleene closure (the * operator) and concatenation (the null operator). The difference between two regular languages is itself a regular language. In other words, given any two, arbitrarily complex regular expressions A and B, there is a regular expression C that matches iff A matches and B does not. I've played with finite automaton tools that implement the difference operator. It just so happens that deterministic finite automata, nondeterministic FA's, regular expressions involving difference, intersection, etc., and regular expressions using only "", | and * are equivalent in power (there are algorithms for translating among them), so there's no theoretical need to provide more than one means of expressing what you want. It's sure convenient, though. -- -Colin