[comp.unix.questions] Regular Expression tool

wdm@icc.com (Bill Mulert) (06/09/90)

Consider the following statements containing regular expressions:

echo "`expr \"$1\" : \"^[^=]*=\(.*\)\"`"

df_usr=`df | sed -n '/^\/usr[   ]/s/[^)]*):[    ]*\([^  ]*\).*/\1/p'`

sed	-e 's/\([!:]\)\([0-9]\)/\1 \2/' \
	-e '/!/s/^\([^ 	][^ 	]*\).*[,-][,-]*\([0-9][0-9]*\)$/\1 1-\2/' \
	< .newsrc.old > .newsrc

sed 's/^\([^:! 	]*\).*$/\1/' $ACTIVE | sort > $TMPFILE.1

Do you have a headache, now? I do. I find any but the simplest regular
expressions to be "write only". They are rather like C's declarations
that so often cause even veteran programmers to look askance.
Fortunately, we have cdecl to help create and decode the C declarations.

I wish there were something similar for regular expressions. I would
like to have a tool, call it regex, that would allow me to say:

regex ' "^[^=]*=\(.*\)\" '
and have regex say, in plain language, what the expression means.

Is there anything like that in existence? Any ideas on how large
a project like that might be?
-- 
I'm tired of hearin' songs about cheatin',|Bill Mulert   wdm@icc.com
     marguaritas, & drivin' trucks.       |Intercomputer Communications Corp.
      I'm going back to Beethoven,        |Cincinnati, Ohio  45236
      'cause country music sucks.         |513-745-0500

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (06/12/90)

In article <1990Jun8.174056.15313@icc.com> wdm@icc.com (Bill Mulert) writes:
: Consider the following statements containing regular expressions:
: 
: echo "`expr \"$1\" : \"^[^=]*=\(.*\)\"`"
: 
: df_usr=`df | sed -n '/^\/usr[   ]/s/[^)]*):[    ]*\([^  ]*\).*/\1/p'`
: 
: sed	-e 's/\([!:]\)\([0-9]\)/\1 \2/' \
: 	-e '/!/s/^\([^ 	][^ 	]*\).*[,-][,-]*\([0-9][0-9]*\)$/\1 1-\2/' \
: 	< .newsrc.old > .newsrc
: 
: sed 's/^\([^:! 	]*\).*$/\1/' $ACTIVE | sort > $TMPFILE.1
: 
: Do you have a headache, now? I do. I find any but the simplest regular
: expressions to be "write only". They are rather like C's declarations
: that so often cause even veteran programmers to look askance.
: Fortunately, we have cdecl to help create and decode the C declarations.
: 
: I wish there were something similar for regular expressions. I would
: like to have a tool, call it regex, that would allow me to say:
: 
: regex ' "^[^=]*=\(.*\)\" '
: and have regex say, in plain language, what the expression means.
: 
: Is there anything like that in existence? Any ideas on how large
: a project like that might be?

It's not likely to be too practical, for a couple of reasons.

First, there are a number of different standards out there.  For instance,
sed and expr use \( ... \) to indicate grouping, while egrep and perl
use ( ... ) for grouping, and \( and \) to indicate real parens.  (I'm
of course prejudiced in favor of the latter, but I think it's more readable
on the whole, since you do grouping a lot more often than you match real
parens.)  On top of that, when are ?, +, |, { and } metacharacters?  They
are in some programs, and aren't in others.  Are you going to have a
switch?

	regex -sed   ' "^[^=]*=\(.*\)\" '
	regex -expr  ' "^[^=]*=\(.*\)\" '
	regex -egrep ' "^[^=]*=\(.*\)\" '
	regex -perl  ' "^[^=]*=\(.*\)\" '
	regex -ed    ' "^[^=]*=\(.*\)\" '
	regex -emacs ' "^[^=]*=\(.*\)\" '
	regex -vi    ' "^[^=]*=\(.*\)\" '
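
The dialect difference is easy to see with grep, whose basic and extended
modes disagree about what a bare parenthesis means. A small sketch,
assuming a POSIX grep that accepts -E; the sample string and the two
helper names are made up for illustration:

```shell
# Sample text containing real parentheses (illustrative only).
sample='f(x)'

# Basic syntax (sed, expr, plain grep): bare parens are literal characters.
basic_count() {
    printf '%s\n' "$sample" | grep -c -e "$1"
}

# Extended syntax (egrep, grep -E): bare parens group, \( and \) are literal.
extended_count() {
    printf '%s\n' "$sample" | grep -E -c -e "$1"
}
```

With these definitions, basic_count 'f(x)' prints 1, extended_count 'f(x)'
prints 0 (it is looking for "fx"), and extended_count 'f\(x\)' prints 1.
The same pattern text means different things depending on the program
that reads it, which is why an explainer would need the switch.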

Second, your big problem is not so much the regular expressions themselves
as it is all the quoting you have to put around them because of the paucity of
quoting mechanisms.  Take your first example:

    echo "`expr \"$1\" : \"^[^=]*=\(.*\)\"`"

If we blame the gobbledygookiness on the backslashes, we see that half
the problem is that we are quoting three deep, so we have to use \", and
the other half of the problem is that \( ... \) are the grouping
metacharacters.  I think the following is more readable simply because
of the absence of \, which is simply too heavily overloaded in Unix:

    perl -e 'print shift =~ /^[^=]*=(.*)/' "$1"
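
For readers who want to stay in the shell, two features of a POSIX shell
remove most of the same quoting. This is a minimal sketch, not something
proposed in the thread, and the function names are invented here. $(...)
nests without backslashed quotes, and parameter expansion avoids the
regular expression altogether. Because expr anchors its pattern at the
start of the string, the leading ^ is dropped.

```shell
# Hypothetical helper: print whatever follows the first "=" in $1.
# The X prefix keeps expr from mistaking $1 for one of its own operators.
value_expr() {
    expr "X$1" : 'X[^=]*=\(.*\)'
}

# Same idea with no regular expression: strip the shortest prefix
# that ends in "=".
value_param() {
    printf '%s\n' "${1#*=}"
}
```

The two differ when $1 contains no "=" at all: value_expr prints an empty
line (and expr returns a non-zero status), while value_param prints $1
unchanged.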

Using /PATTERN/ to search for filenames forces you to backslash all the slashes
in the pattern:

    df_usr=`df | sed -n '/^\/usr[   ]/s/[^)]*):[    ]*\([^  ]*\).*/\1/p'`
			   ^^

It helps to have an alternate pattern delimiting method.  sed lets you have
an alternate delimiter on substitutions, but not on pattern matches.  (Perl
gives you both.)  Even in sed, you could write the above as:

    df_usr=`df | sed -n 's#^/usr[   ][^)]*):[    ]*\([^  ]*\).*#\1#p'`

That gets rid of one backslash, anyway.  Other filename patterns will
benefit more.  Filename patterns are the primary reason I added m#PATTERN#
to perl, where # can be any delimiter.
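
A note for readers on current systems: sed as POSIX specifies it today
accepts an alternate delimiter on an address too, provided the first
delimiter is introduced with a backslash. Whether the seds of 1990 did is
not something this thread establishes. The sketch below assumes a POSIX
sed and the old System V style of df output that the pattern above
expects; the function name and the sample line are invented for
illustration.

```shell
# Illustrative line in the df format the thread's pattern was written for.
sample='/usr     (/dev/dsk/0s3 ):    12345 blocks   678 i-nodes'

# Hypothetical helper: read df output on stdin, print the block count
# for /usr.
usr_free() {
    # \#...# delimits the address, so the slash in /usr needs no backslash.
    # [[:space:]] replaces the literal space-and-tab bracket expressions.
    sed -n '\#^/usr[[:space:]]#s#[^)]*):[[:space:]]*\([^[:space:]]*\).*#\1#p'
}
```

Running printf '%s\n' "$sample" | usr_free prints 12345. A df that prints
a different layout would need a different substitution.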

Similarly, we see a lot of cruft is there simply because of the overly
minimalistic implementations of some regexps.  Such as having to repeat
character classes because there's no +, or having to use uninterpretable
whitespace because there's no alternate way to specify spaces and tabs.

Compare

: sed	-e 's/\([!:]\)\([0-9]\)/\1 \2/' \
: 	-e '/!/s/^\([^ 	][^ 	]*\).*[,-][,-]*\([0-9][0-9]*\)$/\1 1-\2/' \
: 	< .newsrc.old > .newsrc

to

perl -p	-e 's/([!:])([0-9])/$1 $2/' \
	-e '/!/ && s/^(\S+).*[,-]+([0-9]+)$/$1 1-$2/' \
	< .newsrc.old > .newsrc

Actually, I'd probably write that as

perl -pe 's/:\s*/: /;  s/!.*\D(\d+)$/! 1-$1/;' .newsrc.old >.newsrc
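
Much of the same cleanup is available without leaving sed on systems
whose sed accepts -E for extended regular expressions. That is an
assumption about the local sed, not something the thread mentions, and
the function name is invented. It mirrors the two-expression version
above, using + instead of repeated character classes and [[:space:]]
instead of a literal space and tab.

```shell
# Hypothetical helper: filter .newsrc lines from stdin to stdout.
fix_newsrc() {
    sed -E \
        -e 's/([!:])([0-9])/\1 \2/' \
        -e '/!/s/^([^[:space:]]+).*[,-]+([0-9]+)$/\1 1-\2/'
}
```

For example, the line comp.unix.questions!1-100,105-200 comes out as
comp.unix.questions! 1-200, and comp.lang.c:1-50 comes out as
comp.lang.c: 1-50. It would be used as fix_newsrc < .newsrc.old > .newsrc.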

Whatever.  For the most part, I don't think the problem with understanding
regular expressions is the regular expressions themselves, but all the
claptrap surrounding them.  And that will be very difficult to write
a decoder for.

Unix is not a simple language.

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov

thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) (06/13/90)

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>In article <1990Jun8.174056.15313@icc.com> wdm@icc.com (Bill Mulert) writes:
>: Consider the following statements containing regular expressions:
>: ...
>: Fortunately, we have cdecl to help create and decode the C declarations.
>: 
>: I wish there were something similar for regular expressions.

>It's not likely to be too practical, for a couple of reasons.

>...

>Second, your big problem is not so much the regular expressions themselves
>as it is all the quoting you have to put around them because of the paucity of
>quoting mechanisms.

What we really need is a shell script explainer. It would know Bourne
shell syntax; when you run a script through it, any shell
single-command which uses more than one level of quoting will be
explained in excruciating detail. It would also know enough about
expr, sed, egrep etc. to recognize regular expressions, and they would
be converted to a standard form (perl's, maybe). (Perl, of course, is
self-explanatory (and much too hard to parse)).
Example of possible output: 

echo "`expr \"$1\" : \"^[^=]*=\(.*\)\"`"
#is taken as: echo "@1"
#where @1 is: `@2`
#where @2 is: expr "$1" : "@3"
#where @3 is: ^[^=]*=(.*)		Literal: "^[^=]*=\\(.*\\)"

df_usr=`df | sed -n '/^\/usr[   ]/s/[^)]*):[    ]*\([^  ]*\).*/\1/p'`
#is taken as: df_usr=`@1`
#where @1 is: df | sed -n '@2'
#where @2 is: /@3/s/@4/@5/p
#where @3 is: ^/usr\s			Literal: "^\\/usr[ \t]"
#where @4 is: [^)]*\):\s*(\S*).*	Literal: "[^)]*):[ \t]*\\([^ \t]*\\).*"
#where @5 is: $1			Literal: "\\1"

The Literal: strings (which I have written as C strings) should be
present whenever an argument to a command contains tabs or control
characters, or when it is converted as a regular expression.
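
The conversion to a standard form can be sketched for the one difference
the examples above exercise, the meaning of parentheses. This is a rough
illustration under stated assumptions, not a complete converter: the
function name is invented, awk does the character scanning, and ?, +, |,
{ and } are left untouched.

```shell
# Hypothetical filter: rewrite sed/expr-style grouping on stdin as
# egrep/perl-style grouping on stdout, one pattern per line.
bre_to_ere() {
    awk '{
        out = ""; n = length($0); i = 1
        while (i <= n) {
            c = substr($0, i, 1)
            if (c == "\\" && i < n) {
                d = substr($0, i + 1, 1)
                # \( and \) are grouping in sed: emit bare parens.
                # Any other escaped pair is copied unchanged.
                out = out ((d == "(" || d == ")") ? d : c d)
                i += 2
            } else if (c == "[") {
                # Copy a bracket expression verbatim; a ] right after
                # [ or [^ is a literal member, not the terminator.
                j = i + 1
                if (substr($0, j, 1) == "^") j++
                if (substr($0, j, 1) == "]") j++
                while (j <= n && substr($0, j, 1) != "]") j++
                out = out substr($0, i, j - i + 1)
                i = j + 1
            } else {
                # Bare parens are literal in sed: escape them.
                out = out ((c == "(" || c == ")") ? "\\" c : c)
                i++
            }
        }
        print out
    }'
}
```

Fed the pattern from the expr example, printf '%s\n' '^[^=]*=\(.*\)' |
bre_to_ere prints ^[^=]*=(.*), which matches the @3 line above.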

The thing doesn't really have to parse shell language: Just cut at
newline, ';', ';;', '|', '||', ... (when unescaped), repeatedly strip
'if', 'for', '{', ... from the beginning of strings, and the
single-commands are left. The ``parsers'' for the regexp commands just
have to find the regexps; they can probably be just as simple. It
could probably be implemented in perl fairly easily.
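
The cutting and stripping steps just described can be sketched in a few
lines. The version below uses awk from the shell rather than perl, and
everything in it is an assumption made for illustration: the function
name, the list of keywords, and the decision to split only on ; and |.
It works one line at a time, so quotes that span lines are not handled,
and for and case headers are passed through as they are.

```shell
# Hypothetical filter: read shell text on stdin, print one
# single-command per line.
single_commands() {
    awk '
    function emit(s) {
        gsub(/^[ \t]+|[ \t]+$/, "", s)
        # Repeatedly strip leading keywords that introduce a command.
        while (match(s, /^(if|then|else|elif|while|until|do|[{])[ \t]+/))
            s = substr(s, RLENGTH + 1)
        # Drop empty pieces and bare closing keywords.
        if (s != "" && s !~ /^(fi|done|esac|[}])$/) print s
    }
    {
        out = ""; q = ""; n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\\" && q != "\047") {
                # Backslash escapes the next character, except inside
                # single quotes (\047 is the single-quote character).
                out = out c substr($0, i + 1, 1); i++
            } else if (q != "") {
                # Inside quotes: copy, and close on the matching quote.
                if (c == q) q = ""
                out = out c
            } else if (c == "\047" || c == "\"" || c == "`") {
                q = c; out = out c
            } else if (c == ";" || c == "|") {
                # Unquoted separator: ;, ;;, | and || all end a command.
                emit(out); out = ""
            } else {
                out = out c
            }
        }
        emit(out)
    }'
}
```

Given the line if test -f x; then echo y; fi, this prints test -f x and
echo y on separate lines. Separators inside quotes are left alone, so
sed -n "s;a;b;p" stays in one piece.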

--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.      thorinn@diku.dk