[comp.lang.perl] regular expressions

hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) (06/05/91)

I would like to talk with and and all regular expression gurus. It seems that
the regular expressions that are supported by perl (and unix) do not accept
any and all regular languages. If this is not true, I would love to learn how
to do some of the things that I am trying to do.

For example, how would you write a regular expression that accepts a string
with "foo" iff "bar" is not also in the string.

-- 
Karl Hakimian
hakimian@yoda.eecs.wsu.edu

merlyn@iWarp.intel.com (Randal L. Schwartz) (06/05/91)

In article <1991Jun4.175023.6509@serval.net.wsu.edu>, hakimian@tek4 (Karl Hakimian - staff) writes:
| For example, how would you write a regular expression that accepts a string
| with "foo" iff "bar" is not also in the string.

There's no trivial way to do it, unless you count

	if (/foo/ && !/bar/) {
		# yes!
	} else {
		# no (sigh)
	}

as your (literal) solution.  Regexs have a hard time saying "not this
thing here", unless the "thing" is a single character.

print "Just another Perl hacker," unless /foo/ && !/bar/
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

Tom Christiansen <tchrist@convex.COM> (06/05/91)

From the keyboard of hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff):
:For example, how would you write a regular expression that accepts a string
:with "foo" iff "bar" is not also in the string.

I'd probably write it as two pattern-matches, one negated:

    if (!/bar/ && /foo/) {

or 

    $ok = /foo/ && !/bar/;

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
	    "Perl is to sed as C is to assembly language."  -me

wicklund@intellistor.com (Tom Wicklund) (06/06/91)

In <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes:

>For example, how would you write a regular expression that accepts a string
>with "foo" iff "bar" is not also in the string.


Something like (/foo/ && !/bar/) is simplest.


A regular expression meeting this would be something like the
following.  I've tested it and can't find anyplace it doesn't work.
If I missed a case, it can be added to the conditions, but it will get
messier.  Needless to say, the two pattern form above is MUCH more readable.

/^([^b]|b+[^ab]|b+a+[^barf]|b+aa+r)*(foo|b+foo|b+a+foo)([^b]|b+[^ab]|b+a+[^bar]|b+aa+r)*$/

connolly@convex.convex.COM (Dan Connolly) (06/06/91)

In article <1991Jun5.183443.11524@intellistor.com>, wicklund@intellistor.com (Tom Wicklund) writes:
|> In <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes:
|> 
|> >For example, how would you write a regular expression that accepts a string
|> >with "foo" iff "bar" is not also in the string.
|> 
|> 

What definition of "regular language" are you [Karl Hakimian] working from?

The definition of regular language I've seen is:
For an alphabet A, the regular languages over A are
the empty language, empty-string language, the single-character-string languages, and all finite applications of union, concatenation, and kleene-star
of other regualar languages.

In unix-speak, the alphabet is the 127 ascii characters (1-128).
The empty language is tough to come up with. What's a regexp that matches nothing?
The empty-string language is /^$/.
The single-character-string languages are /a/, /b/, /c/, /\\/, etc.
The union of two languages A and B is /A|B/.
The concatenation of two languages A and B is /AB/.
The kleene star of A is /A*/.
Aside from some details about ()'s, that would seem to me to fit the bill.

Dan

marc@athena.mit.edu (Marc Horowitz) (06/06/91)

|> I would like to talk with and and all regular expression gurus. It
|> seems that the regular expressions that are supported by perl (and
|> unix) do not accept any and all regular languages. If this is not
|> true, I would love to learn how to do some of the things that I am
|> trying to do.

According to "Elements of the Theory of Computation," by Lewis and
Papadimitriou (the textbook for 6.045 at MIT), regular languages can
be completely described by the following things:

1) the empty string
2) any character in the alphabet
3) the concatenation of two regular expressions
4) the union of two regular expressions
5) the Kleene star of a regular expression (the * in x* is a Kleene
   star.  There, now you probably learned something :-)

Regular expressions are always defined over a finite alphabet.  Under
unix, this alphabet is \000-\377.

perl regexps (and all other unix regexps I know) are all just ways of
stating the above rules for the convenience of the programmer.
Constant strings are concatenations of characters.  Specials like \s
and \w are unions of charaters.  [abc0-9] is a union of 13 characters.
(re)? is the union of the empty string and re.  (re){m,n} can be
written as (re) concatenated m times, concatenated with (re)?
concatenated (n-m) times.  Every other regexp expression in unix can
similarly be written as some combination of the five operations listed
above.

L&D also points out that regular expressions are closed under
intersection and complementation.  So, your example of a regular
expression matching all strings not containing "bar" is possible, but
it could be really hairy.  So, the regexp you proposed would be the
concatenation of not bar, foo, and not bar.  This will be even
hairier.  Tom Wicklund has posted such a regular expression.

It is easy to show that a regular expression exists; it is often not
easy to actually construct that regular expression.  In short,
"if (/foo/ && !/bar/)" is the easiest (and fastest, and clearest) way
to say what you want to say.

		Marc

wicklund@intellistor.com (Tom Wicklund) (06/06/91)

As I might have expected, the regexp I posted wasn't correct.  The
following seems closer.

/^([^b]|b+[^ab]|b+a+[^barf]|(b|aa+|ab)+r)*(foo|[ba]+foo)([^b]|b+[^ab]|[ba]+[^bar]|(b|aa+|ab)+r)*[ba]*$/

Note that checking for not of a 2 character string has the form
(x|y)z, while checking not a 3 character string above has the form
(a|b|c|d)e.  I imagine longer strings will grow exponentially, so in
practice the "/foo/ && !/bar/" form is much more practical.  I imagine
the regexp definition could be expanded to include a not function (and
it might not be too hard to implement from what I remember of looking at
the PERL/GNU/Henry Spenser regular expression implementations).  It's
probably also possible to derive a means of inverting a regular
expression, but I wouldn't want to try it.

jm36+@andrew.cmu.edu (John Gardiner Myers) (06/06/91)

In article <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes:
>For example, how would you write a regular expression that accepts a string
>with "foo" iff "bar" is not also in the string.

 /^([^b]|b[^a]|ba[^r])*b?a?foo([^b]|b[^a]|ba[^r])*$/

tchrist@convex.COM (Tom Christiansen) (06/06/91)

From the keyboard of wicklund@intellistor.com (Tom Wicklund):
:the PERL/GNU/Henry Spenser regular expression implementations).  It's

    s/ser/cer/

even though I doubt whether Henry reads this forum (or at 
least will admit to doing so :-).

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
	    "Perl is to sed as C is to assembly language."  -me

merlyn@iWarp.intel.com (Randal L. Schwartz) (06/07/91)

In article <1991Jun6.162325.18139@cs.cmu.edu>, jm36@hogtown (John Gardiner Myers) writes:
| In article <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes:
| >For example, how would you write a regular expression that accepts a string
| >with "foo" iff "bar" is not also in the string.
| 
|  /^([^b]|b[^a]|ba[^r])*b?a?foo([^b]|b[^a]|ba[^r])*$/

$_ = "bbar foo";

print "wrong" if /^([^b]|b[^a]|ba[^r])*b?a?foo([^b]|b[^a]|ba[^r])*$/;

Writing "doesn't match" regexs is a very tricky art.  No, I'm not
trying to start a contest.  You just have to remember that any part of
your "not classes" could match something else of some significance in
your string.  It took me 10 seconds to come up with the string above.

$_="Could I be Just another guy who takes Perl and hacks it?"; /(J\w+ \w+).*(P\w+).*(h\w*)s/; print "$1 $2 $3er,";
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

worley@compass.com (Dale Worley) (06/07/91)

   From: hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff)

   I would like to talk with and and all regular expression gurus. It seems that
   the regular expressions that are supported by perl (and unix) do not accept
   any and all regular languages. If this is not true, I would love to learn how
   to do some of the things that I am trying to do.

Perl's regular expressions *do* accept all regular languages.  It
contains single characters, concatenation, *, and |, which is enough.
The problem you're having is that there is no explicit "doesn't
contain" operator in Perl.  However, that's not needed -- you can
always rephrase "doesn't contain" in terms of the other operators.
The transformation is messy, though.  See any book that discusses
regular languages and DFAs.

   For example, how would you write a regular expression that accepts a string
   with "foo" iff "bar" is not also in the string.

(I assume that you mean "accepts the strings that contain 'foo' and do
not contain 'bar'".)

Personally, I'd construct the DFAs that accept "^.*foo.*$" and
"^.*bar.*$".  Then I'd combine them using the standard "A intersect
(complement B)" construction to get a DFA that accepted strings
containing foo but not bar.  Then I'd do the horrible transformation
to turn that DFA into a regular expression.

But there's probably a better way.

Dale Worley		Compass, Inc.			worley@compass.com
--
Take nothing by force that you don't want.

lbr@holos0.uucp (Len Reed) (06/07/91)

In article <1991Jun4.175023.6509@serval.net.wsu.edu> hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff) writes:
>I would like to talk with and and all regular expression gurus. It seems that
>the regular expressions that are supported by perl (and unix) do not accept
>any and all regular languages.

>For example, how would you write a regular expression that accepts a string
>with "foo" iff "bar" is not also in the string.

It's not meaningful to talk of the regular expressions supported by Unix
since different Unix tools support different concepts of regular expressions.
Grep does not support a full set of regular expressions; perl and egrep do.

A regular is defined recursively as follows:

1) A single character from a given set.
	or
2) Concatenation of two regular expressions.
	or
3) Kleene closure of a regular expression.  (The * operator, usually written
   in superscript in textbook descriptions.)

	or
4) Disjunction of two regular expressions.  (The | operator.)

Syntactically there must be a method of grouping, of course, to show
the difference between a(bc)* and abc*.

Since perl supports all of these, it is complete.  Grep, ex, ed, and
several others lack the disjunction operator.

Just because you *can* do something doesn't make it easy, natural, or
efficient, though.  A bare-bones implementation of regular expressions
will accept more languages than grep, but grep allows nice syntactic
shortcuts like [a-z] and [^xyz].  Perl has these in spades.  You sure
don't want to have to say a|b|c|d|e|f ....

Since perl's expressions *are* complete, you *can* express your problem
if it's regular.  Your problem didn't seem regular to me off hand, since it
can't be easily expressed as a regular expression.  But it makes for
a simple finite automaton (my DFA solution had 9 states), so it is regular.

In perl, you want
   /foo/ && !/bar/

Coercing this into one regular expression, while provably do-able, is
ill advised.  I lost interest real quick in hand converting my
DFA to a regular expression.
-- 
Len Reed
Holos Software, Inc.
Voice: (404) 496-1358
UUCP: ...!gatech!holos0!lbr

allbery@NCoast.ORG (Brandon S. Allbery KF8NH) (06/08/91)

As quoted from <1991Jun4.175023.6509@serval.net.wsu.edu> by hakimian@tek4.eecs.wsu.edu (Karl Hakimian - staff):
+---------------
| I would like to talk with and and all regular expression gurus. It seems that
| the regular expressions that are supported by perl (and unix) do not accept
| any and all regular languages. If this is not true, I would love to learn how
| to do some of the things that I am trying to do.
+---------------

It's true, which is why Perl will never replace, say, Icon for *all*
applications.

+---------------
| For example, how would you write a regular expression that accepts a string
| with "foo" iff "bar" is not also in the string.
+---------------

You don't.

Regexps are most capable in combination with other constructs; so, while you
can't write a regexp that matches /foo/ iff it does not also match /bar/, you
*can* say (in Perl):  (($str !~ /bar/) && ($str =~ /foo/)).  (I could argue
that Perl is nothing more than an extended regexp parser, if I were feeling
whimsical....)

++Brandon
-- 
Me: Brandon S. Allbery			 KF8NH: DC to LIGHT!  [44.70.4.88]
Internet: allbery@NCoast.ORG		 Delphi: ALLBERY
uunet!usenet.ins.cwru.edu!ncoast!allbery

schwartz@groucho.cs.psu.edu (Scott Schwartz) (06/10/91)

connolly@convex.convex.COM (Dan Connolly) writes:
| What definition of "regular language" are you [Karl Hakimian] working from?
| The definition of regular language I've seen is:

It is a theorem that regular languages are closed under intersection
and negation, which is what he was after.  Unix regexps don't
generally provide a syntax for those things, but that's just
happenstance, especially considering that Thompson's 1968 paper talks
providing those features.

colin@array.UUCP (Colin Plumb) (06/11/91)

In article <1991Jun8.145419.22522@NCoast.ORG> allbery@ncoast.ORG (Brandon S. Allbery KF8NH) writes:
>Regexps are most capable in combination with other constructs; so, while you
>can't write a regexp that matches /foo/ iff it does not also match /bar/, you
>*can* say (in Perl):  (($str !~ /bar/) && ($str =~ /foo/)).  (I could argue
>that Perl is nothing more than an extended regexp parser, if I were feeling
>whimsical....)

Sorry, Brandon, you need to re-read your formal languages textbook.
The set of formal languages is closed under complemenbt and
intersection as well as union (the | operator), kleene closure (the *
operator) and concatenation (the null operator).  The difference
between two regular languages is itself a regular language.  In other
words, given any two, arbitrarily complex regular expressions A and B,
there is a regular expression C that matches iff A matches and B does
not.

I've played with finite automaton tools that implement the difference
operator.

It just so happens that deterministic finite automata, nondeterministic
FA's, regular expressions involving difference, intersection, etc., and
regular expressions using only "", | and * are equivalent in power
(there are algorithms for translating among them), so there's no
theoretical need to provide more than one means of expressing what you
want.  It's sure convenient, though.
-- 	
	-Colin