[comp.unix.wizards] POSIX Regular Expression Funnyness

arnold@apollo.COM (Ken Arnold) (01/27/89)

The POSIX proposal [] has a rework of regular expressions.  In
particular, the character set expressions (things like "[a-z]") have had
a few new things added, but the way they have been added seems passing
strange.  I was wondering if I was alone in thinking the following
suboptimal:

They have added a new set of bracket expressions which stand for
pre-defined sets of characters.  For example, "[:alpha:]" is all
alphabetic characters, "[.ch.]" is the character string ch treated as a
single character (which is useful for sorting in many languages), and
"[=a=]" refers to all variants of a, i.e., a, a with a circumflex, a
with an umlaut, etc.

Well, this sounds fine and dandy.  Being able to express C variables as
"[[:alpha:]_][[:alnum:]_]*" is reasonably descriptive.  Being able to
say "I don't care if the 'o' has any diacritical marks" is also fine.

The problem is that, for some reason, if you want to simply match any
alphabetic character, you *cannot* say "[:alpha:]".  Or, to be more
precise, that expression means exactly what it does now.  If you say

	grep "+[:alnum:]+" file ...

you will print any line which has a "+" followed by one of :, a, l, n,
u, or m, followed by another "+".  If you want to match what it *looks*
like that expression would match, you have to say:

	grep "+[[:alnum:]]+" file ...

In other words, these new bracket expressions only have their new
meaning inside outer brackets.
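
To make the difference concrete, here is a sketch; a grep that follows
the draft behaves as the comments say, though an implementation could
just as well reject the single-bracket form with an error message:

	# Under the draft, "[:alnum:]" is a plain set holding the six
	# characters : a l n u m, so a digit does not match it...
	echo '+9+' | grep '+[:alnum:]+' || echo 'no match'
	# ...but one of those six characters does:
	echo '+l+' | grep '+[:alnum:]+' && echo 'matched the literal set'
	# Only the doubled form means "any alphanumeric character":
	echo '+9+' | grep '+[[:alnum:]]+' && echo 'matched the class'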

Why?  The only existing expressions you would break if you allowed "top
level" [::] expressions (or [..] or [==] expressions) would be
expressions which currently existed that contained *two* colons (or
dots or equals), one at each end.  Since this is currently pointless
redundancy, I can't believe this is a serious problem.

What seems like a serious problem to me is that the required nesting
makes the new expressions more difficult to use.  Further, misuse of
them in this kind of obvious way leads to silent misbehavior from which
it is difficult to surmise the bug.

Is it just me, or is this wrong?

		Ken

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/27/89)

In article <4118f7b1.ae48@apollo.COM> arnold@apollo.COM (Ken Arnold) writes:
>The POSIX proposal [] has a rework of regular expressions.  ...
>"[.ch.]" is the character string ch treated as a single character
>(which is useful for sorting in many languages), ...

This seems totally wrong to me.  The pattern argument should consist
of what ANSI C terms "multibyte characters", in which case no special
indicators are required to take care of this.  It looks like somebody
wants to pander to existing sick implementations of foreign character
sets instead of moving toward everybody doing it right (or at least,
the same way!).

>What seems like a serious problem to me is that the required nesting
>makes the new expressions more difficult to use.  Further, misuse of
>them in this kind of obvious way leads to silent misbehavior from which
>it is difficult to surmise the bug.

More interestingly, it's still not fully upward compatible, because
existing greps also already assign a meaning to '[[:alpha:]]'.  If an
incompatible change is to be made, best to engineer it carefully rather
than worry about preserving compatibility with existing practice when
that isn't going to be attained anyway.
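
To see why, consider how an existing grep parses that pattern (a
sketch; which behavior you get depends entirely on the grep at hand):

	# An old grep reads "[[:alpha:]]" as the set {[ : a l p h}
	# followed by a literal "]", so the letter b does not match;
	# under the 1003.2 draft it matches any alphabetic character:
	echo b | grep '[[:alpha:]]' && echo 'draft class semantics'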

I hope the 1003.2 guys are looking into Rob Pike's extended regular
expressions as used in "sam", or the ones the "gre" implementors have
come up with.  There is a LOT of existing practical experience that
should be drawn on.

The more pressing question is, why is a standard for this being
attempted if it's premature?

smb@ulysses.homer.nj.att.com (Steven M. Bellovin) (01/27/89)

I suspect that POSIX is trying to deal with non-contiguous alphabets,
that is, alphabets where [a-z] does not include all lower-case letters.
Their syntax is quite ugly, though.

andrew@alice.UUCP (Andrew Hume) (01/27/89)

the new ctype style goo in regular expressions only applies inside
[] expressions in order to ease compatibility problems.

i have been following the regular expression stuff and most of
it seems okay, even if it is cumbersome.

the thing that pisses me off is that they want to make \c where c is
a regular (non-special) character exactly equivalent to c,
rather than reserving it for future use. this is baffling to me;
if we reserve \c in these cases, we have easy backward-compatible ways
of extending the syntax later on (like allowing more than 9 subexpressions).
And i have no idea who they are protecting; people who have patterns
like \t and expect them to match t ??

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/28/89)

In article <11148@ulysses.homer.nj.att.com> smb@ulysses.homer.nj.att.com (Steven M. Bellovin) writes:
>I suspect that POSIX is trying to deal with non-contiguous alphabets,
>that is, alphabets where [a-z] does not include all lower-case letters.

I have no objection to the notion of an "alpha class" specifier etc.
In fact it seems like the only possible way to deal correctly with
general locales.  I do object to the notion of denoting a single
character in a special way just because it's a multibyte sequence.

>Their syntax is quite ugly, though.

Yes!

arnold@apollo.COM (Ken Arnold) (01/29/89)

In article <8843@alice.UUCP> andrew@alice.UUCP (Andrew Hume) writes:
>the new ctype style goo in regular expressions only applies inside
>[] expressions in order to ease compatibility problems.

Andrew, Puh-lease!  In my article I stated why I didn't think this is
a problem, and asked what I'd missed.  What, pray tell, is *worse*
about the compatibility problems with "[:alpha:]" than with "[[:alpha:]]"?
Note that the second expression is also currently legal, although (I
suspect), just as unlikely to appear in practice as the first.

>the thing that pisses me off is that they want to make \c where c is
>a regular (non-special) character exactly equivalent to c, ...

Yeah, I agree with you on that.  This makes little sense.

		Ken Arnold

donn@hpfcdc.HP.COM (Donn Terry) (01/31/89)

Ken Arnold's point about [[:alpha:]] is well taken.  I suspect that if
the proposal had been as he suggests, someone else would be saying that
[:alpha:] must mean :,a,l,p,h,a, with : specified twice, for backwards
compatibility.  Maybe not, but in the standards business it's easy to
get paranoid, because for practically any possibly controversial point
there are at least 2**n (where n is the number of participants)
viewpoints before everything gets settled.  (Well, maybe 2*n :-) ).

Regarding Doug Gwyn's comments about [.ch.], and character classes
generally: these are specified by the natural language involved.  My
Spanish is weak, but the *two characters* ch are treated as a single
symbol with its own place in the collating sequence.  c and h can also
appear independently, but when adjacent they are collated as another
symbol.  This is arguably a kluge, but it antedates the computer
business by a few hundred years and has a few million users, so I doubt
we can change it just for the sake of aesthetics.

Remember, we (native-)speakers of English are awfully spoiled by having a
reasonably regular alphabet.  It's reasonable to ask what things would
have been like had computers had their initial development in, say,
China or Japan, where the alphabet problem is much worse.  I think the
simple model of English may have sped things up initially, but it's now
turning into an impediment for dealing with the rest of the world.
(Oh well, we make up for a simple alphabet with hideously irrational
spelling, even discounting the British/American differences :-) ).

Donn Terry
HP, Ft. Collins.

gwyn@smoke.BRL.MIL (Doug Gwyn ) (02/01/89)

In article <5980041@hpfcdc.HP.COM> donn@hpfcdc.HP.COM (Donn Terry) writes:
>Regarding Doug Gwyn's comments about [.ch.], and character classes
>generally: these are specified by the natural language involved.  My
>Spanish is weak, but the *two characters* ch are treated as a single
>symbol with its own place in the collating sequence.  c and h can also
>appear independently, but when adjacent they are collated as another
>symbol.  This is arguably a kluge, but it antedates the computer
>business by a few hundred years and has a few million users, so I doubt
>we can change it just for the sake of aesthetics.

My Spanish is not too weak and I'm well aware of ch, ll, nn (written
n-tilde), etc.  German also has some interesting features (e.g. ss when
capitalized).  However, we took all this stuff into account when coming
up with the multibyte character specifications in the proposed ANSI C
standard.  The "internationalization" community helped formulate that
approach, and it bothers me more than somewhat to see it being ignored
by 1003.2.  A reasonable implementation of Spanish-language locale
requires that ch etc. be multibyte sequences, not handled as multiple
separate single-byte characters by "grep".

gwc@root.co.uk (Geoff Clare) (02/01/89)

In article <4118f7b1.ae48@apollo.COM> arnold@apollo.COM (Ken Arnold) writes:
>The POSIX proposal [] has a rework of regular expressions.
>(stuff deleted)
>
>They have added a new set of bracket expressions which stand for
>pre-defined sets of characters.  For example, "[:alpha:]" is all
>alphabetic characters, "[.ch.]" is the character string ch treated as a
>single character (which is useful for sorting in many languages), and
>"[=a=]" refers to all variants of a, i.e., a, a with a circumflex, a
>with an umlaut, etc.
>
>(stuff deleted)... these new bracket expressions only have their new
>meaning inside outer brackets.
>
>Why?  The only existing expressions you would break if you allowed "top
>level" [::] expressions (or [..] or [==] expressions) would be
>expressions which currently existed that contained *two* colons (or
>dots or equals), one at each end.  Since this is currently pointless
>redundancy, I can't believe this is a serious problem.

There are more serious problems with the new expressions than just the
obscure syntax.  A short while ago I had to design some verification
tests for these new regular expressions as part of the X/Open verification
suite (the latest X/Open standard incorporates POSIX).  I found some
ambiguity in the area of 2 to 1 character mappings.  For example, if ch
collates between c and d, which of the following REs should match the
string "xchy"?

	x[a-[.ch.]]y
	x[a-[.ch.]]hy

The simple answer would be to create some rule about 2 to 1 character
mappings to eliminate the ambiguity.  However, whichever rule is
decided, there will be many cases where the actual behaviour is
non-intuitive, resulting in users not getting the results they expect.

We have informed X/Open of the problem, and are waiting to see what they
come up with.

Geoff.
-- 

Geoff Clare    UniSoft Limited, Saunderson House, Hayne Street, London EC1A 9HH
gwc@root.co.uk   ...!mcvax!ukc!root44!gwc   +44-1-606-7799  FAX: +44-1-726-2750

consult@osiris.UUCP (Unix Consultation Mailbox ) (02/01/89)

In article <5980041@hpfcdc.HP.COM> donn@hpfcdc.HP.COM (Donn Terry) writes:
>[about the difficulty of collating foreign character sets]
>Remember, we (native-)speakers of English are awfully spoiled by having a
>reasonably regular alphabet....  I think the
>simple model of English may have sped things up initially, but it's now
>turning into an impediment for dealing with the rest of the world.

Then again, English (at least in America) has its own arcane rules
for collating names, one of which (Mc being literally equivalent to
Mac) has only recently been abandoned (about the same time that PCs
took over the business world, I think).  The convention that initials
collate separately from "real" names is still followed by most
companies who print ROC phone books but that's the only weird rule
left so far as I know.

>(Oh well, we make up for a simple alphabet with hideously irrational
>spelling, even discounting the British/American differences :-) ).

The English spelling rules are totally rational.  The problem is that
only about 5% of the language follows the rules.  :-)  Pronunciation
rules are what *really* get me rolling on the floor...


Phil Kos

peter@ficc.uu.net (Peter da Silva) (02/02/89)

Why didn't they add a few backslash-escaped metacharacters, again?

	\:alpha:
	\-ch-
	\=o=

This would certainly be backwards-compatible, and would have the added
advantage of nicer aesthetics.
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Work: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.   `-_-'
Home: bigtex!texbell!sugar!peter, peter@sugar.uu.net.                 'U`
Opinions may not represent the policies of FICC or the Xenix Support group.

gwyn@smoke.BRL.MIL (Doug Gwyn ) (02/02/89)

In article <2962@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>Why didn't they add a few backslash-escaped metacharacters, again?

Whatever you do, DON'T use \ for any more escapes!  It is bad enough
that the terminal handler, shell, and troff all use the SAME escape
character (which has occasionally forced me to type sixteen \s in a row).
It would have been far better for utilities to try to pick escapes that
would be unlikely to be used by others that might be in effect at the
same time (shell, editor, terminal handler, grep).

peter@ficc.uu.net (Peter da Silva) (02/03/89)

In article <9563@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) writes:
> In article <2962@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
> >Why didn't they add a few backslash-escaped metacharacters, again?

> Whatever you do, DON'T use \ for any more escapes!

I'd agree in principle that this is a good idea for new programs, but
since the existing regular expression syntax already uses \ for escapes,
adding a new syntax for metacharacters [:...:] is no help (you still
need \(...\), for example), and adds gratuitous inconsistencies.
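
For instance, grouping in the basic syntax still requires backslashes,
so the new bracket notation ends up interleaved with backslash escapes
anyway (a sketch that should work with any grep supporting BRE groups,
backreferences, and the new classes):

	# \( \) groups and a \1 backreference mixed with class syntax;
	# matches a pair of lower-case letters repeated immediately:
	echo abab | grep '\([[:lower:]][[:lower:]]\)\1' && echo 'repeated pair'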
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Work: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.   `-_-'
Home: bigtex!texbell!sugar!peter, peter@sugar.uu.net.                 'U`
Opinions may not represent the policies of FICC or the Xenix Support group.

wk@hpirs.HP.COM (Wayne Krone) (02/07/89)

I was involved in the design of the internationalization extensions to
regular expressions as a member of both the X/Open and /usr/group
Internationalization working group/committees.  I'm not the official
spokesperson for either group but I can probably answer most of your
questions.  In addition, many of the points raised are discussed by the
rationale in the P1003.2 draft (section 2.9.1, pages 58-64, of draft 8).

Why "[[:alpha:]]" instead of "[:alpha:]" ?

   The committee was very concerned about the acceptability of any
   extensions with the folks not actively involved in internationalization
   due to the possibility of breaking existing regular expressions.
   One of the ways we decided to minimize the risk was to require
   double delimiters on all the new syntax.  For example:

			[[:lower:][:digit:]]
   rather than
			[:lower::digit:]

   Other reasons were to reduce ambiguity problems and to have delimiters
   which visually indicated left/right closure of the new syntax.  More
   details are in the draft.
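
   As a quick illustration of the combined form (a sketch assuming a
   grep that implements the new classes):

	# One bracket expression holding two classes: matches a line
	# consisting of a single lower-case letter or digit:
	printf '%s\n' a 7 Q | grep '^[[:lower:][:digit:]]$'
	# prints a and 7, but not Q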

Why allow "[[.ch.]]" instead of requiring the Spanish ch to be an
ANSI C multibyte character?

   First, because no existing implementation does it that way (that we
   know of).  Second, and more importantly, "c", "h" and "ch" as matched
   by the RE "[a-z]" are collating elements, not characters.  Only "c"
   and "h" are also characters and thus are represented in a single or
   multibyte character code set.  As someone else noted, "Mac" and "Mc"
   could be supported as collating elements but it is very unlikely a
   code set would ever support them as single characters.

> What seems like a serious problem to me is that the required nesting
> makes the new expressions more difficult to use.  Further, misuse of
> them in this kind of obvious way leads to silent misbehavior from which
> it is difficult to surmise the bug.

That is, a user might do:

			[:alpha:]
but intended:
			[[:alpha:]]

and get no error message?  Well, the comment above is true, but if the
syntax were as suggested:

			[:alpha:]
but the user typed:
			[:alphz:]

the same silent misbehavior results.  I suppose it's a matter of guessing
which errors will be most common and optimizing the syntax for that set.

> the thing that pisses me off is that they want to make \c where c is
> a regular (non-special) character exactly equivalent to c,
> rather than reserving it for future use. this is baffling to me;
> if we reserve \c in these cases, we have easy backward-compatible ways
> of extending the syntax later on (like allowing more than 9 subexpressions).
> And i have no idea who they are protecting; people who have patterns
> like \t and expect them to match t ??

This was just a matter of documenting the behavior of the existing
implementations (as we know them :-).  Much of the regular expression
syntax/behavior is left unspecified by the traditional definition on
the ed man page and we found ourselves in the position of having to
write down something.  If your implementation differs or you just feel
this is an area worth improving, submit a proposal to P1003.2.

> There are more serious problems with the new expressions than just the
> obscure syntax.  A short while ago I had to design some verification
> tests for these new regular expressions as part of the X/Open verification
> suite (the latest X/Open standard incorporates POSIX).  I found some
> ambiguity in the area of 2 to 1 character mappings.  For example, if ch
> collates between c and d, which of the following REs should match the
> string "xchy"?
> 
> 	x[a-[.ch.]]y
> 	x[a-[.ch.]]hy
>
> The simple answer would be to create some rule about 2 to 1 character
> mappings to eliminate the ambiguity.  However, whichever rule is

The rule which applies is the "longest leftmost match" rule which
is documented in XPG3 for the "RE*" syntax but unfortunately missing
from the square bracket rules.

So the answer for the examples above is "both":

	x[a-[.ch.]]y	matches    x ch y
	x[a-[.ch.]]hy	matches    x c h y

> We have informed X/Open of the problem, and are waiting to see what they
> come up with.

That's interesting--I haven't seen any query posted to the
Internationalization Working Group.

Wayne Krone