arnold@apollo.COM (Ken Arnold) (01/27/89)
The POSIX proposal [] has a rework of regular expressions. In particular, the character set expressions (things like "[a-z]") have had a few new things added, but the way they have been added seems passing strange. I was wondering if I was alone in thinking the following suboptimal:

They have added a new set of bracket expressions which stand for pre-defined sets of characters. For example, "[:alpha:]" is all alphabetic characters, "[.ch.]" is the character string ch treated as a single character (which is useful for sorting in many languages), and "[=a=]" refers to all variants of a, i.e., a, a with a circumflex, a with an umlaut, etc.

Well, this sounds fine and dandy. Being able to express C variables as "[[:alpha:]_][[:alnum:]_]*" is reasonably descriptive. Being able to say "I don't care if the 'o' has any diacritical marks" is also fine.

The problem is that, for some reason, if you want to simply match any alphabetic character, you *cannot* say "[:alpha:]". Or, to be more precise, that expression means exactly what it does now. If you say

	grep "+[:alnum:]+" file ...

you will print any line which has a "+" followed by one of :, a, l, n, u, or m, followed by another "+". If you want to match what it *looks* like that expression would match, you have to say

	grep "+[[:alnum:]]+" file ...

In other words, these new bracket expressions only have their new meaning inside outer brackets.

Why? The only existing expressions you would break if you allowed "top level" [::] expressions (or [..] or [==] expressions) would be existing expressions that contained *two* colons (or dots or equals), one on either side. Since this is currently pointless redundancy, I can't believe this is a serious problem.

What seems like a serious problem to me is that the required nesting makes the new expressions more difficult to use. Further, misuse of them in this kind of obvious way leads to silent misbehavior from which it is difficult to surmise the bug.

Is it just me, or is this wrong?

		Ken
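[A minimal sketch of the C-identifier pattern above, using the regcomp()/regexec() interface that POSIX.2 eventually standardized (not yet available when this was posted). Compile with any modern C library to see the doubled-bracket form in action:

	#include <regex.h>
	#include <stdio.h>

	int main(void)
	{
	    regex_t re;
	    const char *tests[] = { "foo_bar1", "_tmp", "9lives" };
	    size_t i;

	    /* Anchored version of "[[:alpha:]_][[:alnum:]_]*": a C identifier.
	     * Flags of 0 select basic REs, as in grep. */
	    if (regcomp(&re, "^[[:alpha:]_][[:alnum:]_]*$", 0) != 0)
	        return 1;

	    for (i = 0; i < sizeof tests / sizeof tests[0]; i++)
	        printf("%-10s %s\n", tests[i],
	               regexec(&re, tests[i], 0, NULL, 0) == 0
	                   ? "identifier" : "not an identifier");

	    regfree(&re);
	    return 0;
	}

The bare "[:alpha:]" spelling is not demonstrated here, since its behavior is exactly the point in dispute and varies across implementations.]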
gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/27/89)
In article <4118f7b1.ae48@apollo.COM> arnold@apollo.COM (Ken Arnold) writes:
>The POSIX proposal [] has a rework of regular expressions. ...
>"[.ch.]" is the character string ch treated as a single character
>(which is useful for sorting in many languages), ...

This seems totally wrong to me. The pattern argument should consist of what ANSI C terms "multibyte characters", in which case no special indicators are required to take care of this. It looks like somebody wants to pander to existing sick implementations of foreign character sets instead of moving toward everybody doing it right (or at least, the same way!).

>What seems like a serious problem to me is that the required nesting
>makes the new expressions more difficult to use. Further, misuse of
>them in this kind of obvious way leads to silent misbehavior from which
>it is difficult to surmise the bug.

More interestingly, it's still not fully upward compatible, because existing greps also already assign a meaning to '[[:alpha:]]'. If an incompatible change is to be made, best to engineer it carefully rather than worry about preserving compatibility with existing practice when that isn't going to be attained anyway.

I hope the 1003.2 guys are looking into Rob Pike's extended regular expressions as used in "sam", or the ones the "gre" implementors have come up with. There is a LOT of existing practical experience that should be drawn on. The more pressing question is: why is a standard for this being attempted if it's premature?
smb@ulysses.homer.nj.att.com (Steven M. Bellovin) (01/27/89)
I suspect that POSIX is trying to deal with non-contiguous alphabets, that is, alphabets where [a-z] does not include all lower-case letters. Their syntax is quite ugly, though.
andrew@alice.UUCP (Andrew Hume) (01/27/89)
the new ctype style goo in regular expressions only applies inside [] expressions in order to ease compatibility problems. i have been following the regular expression stuff and most of it seems okay, even if it is cumbersome.

the thing that pisses me off is that they want to make \c, where c is a regular (non-special) character, exactly equivalent to c, rather than reserving it for future use. this is baffling to me; if we reserve \c in these cases, we have easy backward-compatible ways of extending the syntax later on (like allowing more than 9 subexpressions). and i have no idea who they are protecting; people who have patterns like \t and expect them to match t??
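[For reference, a minimal sketch of the 9-subexpression ceiling mentioned above: BRE backreferences are spelled \1 through \9, leaving no spelling for a tenth without new syntax. This again uses the later-standardized regcomp() interface; note that every backslash is doubled in C source:

	#include <regex.h>
	#include <stdio.h>

	int main(void)
	{
	    regex_t re;

	    /* \(ab\) captures "ab"; the backreference \1 must match it again. */
	    if (regcomp(&re, "\\(ab\\)\\1", 0) != 0)   /* flags 0 = basic REs */
	        return 1;

	    printf("abab: %s\n",
	           regexec(&re, "abab", 0, NULL, 0) == 0 ? "match" : "no match");
	    printf("abxy: %s\n",
	           regexec(&re, "abxy", 0, NULL, 0) == 0 ? "match" : "no match");

	    regfree(&re);
	    return 0;
	}

Reserving \c for unspecified characters c would have left room to grow this numbering scheme compatibly.]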
gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/28/89)
In article <11148@ulysses.homer.nj.att.com> smb@ulysses.homer.nj.att.com (Steven M. Bellovin) writes:
>I suspect that POSIX is trying to deal with non-contiguous alphabets,
>that is, alphabets where [a-z] does not include all lower-case letters.

I have no objection to the notion of an "alpha class" specifier etc. In fact it seems like the only possible way to deal correctly with general locales. I do object to the notion of denoting a single character in a special way just because it's a multibyte sequence.

>Their syntax is quite ugly, though.

Yes!
arnold@apollo.COM (Ken Arnold) (01/29/89)
In article <8843@alice.UUCP> andrew@alice.UUCP (Andrew Hume) writes:
>the new ctype style goo in regular expressions only applies inside
>[] expressions in order to ease compatibility problems.

Andrew, Puh-lease! In my article I stated why I didn't think this is a problem, and asked what I'd missed. What, pray tell, is *worse* about the compatibility problems with "[:alpha:]" than with "[[:alpha:]]"? Note that the second expression is also currently legal, although (I suspect) just as unlikely to appear in practice as the first.

>the thing that pisses me off is that they want to make \c where c is
>a regular (non-special) character exactly equivalent to c, ...

Yeah, I agree with you on that. This makes little sense.

		Ken Arnold
donn@hpfcdc.HP.COM (Donn Terry) (01/31/89)
Ken Arnold's point about [[:alpha:]] is well taken. I suspect that if the proposal had been as he suggests, someone else would now be saying that [:alpha:] must mean :,a,l,p,h,a (with : specified twice) for backwards compatibility. Maybe not, but in the standards business it's easy to get paranoid, because for practically any possibly controversial point there are at least 2**n (where n is the number of participants) viewpoints before everything gets settled. (Well, maybe 2*n :-) ).

On Doug Gwyn's comments about [.ch.]: character classes like this are specified by the natural language involved. My Spanish is weak, but the *two characters* ch are treated as a single symbol with its own place in the collating sequence. c and h can also appear independently, but when adjacent they are collated as another symbol. This is arguably a kluge, but it antedates the computer business by a few hundred years, and a few million users, so I doubt we can change it just for the sake of aesthetics.

Remember, we (native-)speakers of English are awfully spoiled by having a reasonably regular alphabet. It's reasonable to ask what things would have been like had computers had their initial development in, say, China or Japan, where the alphabet problem is much worse. I think the simple model of English may have sped things up initially, but it's now turning into an impediment for dealing with the rest of the world. (Oh well, we make up for a simple alphabet with hideously irrational spelling, even discounting the British/American differences :-) ).

Donn Terry
HP, Ft. Collins.
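[A minimal sketch of the locale-driven collation Donn describes, using the ANSI C setlocale()/strcoll() interfaces. The locale name "es_ES" is an assumption (names vary by system), and whether a given locale actually implements the traditional ch digraph rule also varies; the point is only that strcoll() can order strings differently from raw byte comparison:

	#include <locale.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
	    /* "es_ES" is a guess at a Spanish locale name; adjust for your system. */
	    if (setlocale(LC_COLLATE, "es_ES") == NULL) {
	        fprintf(stderr, "Spanish locale not available here\n");
	        return 1;
	    }

	    /* Byte-wise, "chico" < "cota" since 'h' < 'o'.  Under the
	     * traditional Spanish rule, ch is a single unit sorting after c,
	     * so a locale implementing it would report "chico" > "cota". */
	    printf("strcmp : %d\n", strcmp("chico", "cota"));
	    printf("strcoll: %d\n", strcoll("chico", "cota"));
	    return 0;
	}

]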
gwyn@smoke.BRL.MIL (Doug Gwyn ) (02/01/89)
In article <5980041@hpfcdc.HP.COM> donn@hpfcdc.HP.COM (Donn Terry) writes:
>On Doug Gwyn's comments about [.ch.]: character classes like this are
>specified by the natural language involved. My Spanish is weak, but the
>*two characters* ch are treated as a single symbol with its own place
>in the collating sequence. c and h can also appear independently, but
>when adjacent they are collated as another symbol. This is arguably a
>kluge, but it antedates the computer business by a few hundred years,
>and a few million users, so I doubt we can change it just for the sake
>of aesthetics.

My Spanish is not too weak and I'm well aware of ch, ll, nn (written n-tilde), etc. German also has some interesting features (e.g. ss when capitalized). However, we took all this stuff into account when coming up with the multibyte character specifications in the proposed ANSI C standard. The "internationalization" community helped formulate that approach, and it bothers me more than somewhat to see it being ignored by 1003.2. A reasonable implementation of a Spanish-language locale requires that ch etc. be multibyte sequences, not handled as multiple separate single-byte characters by "grep".
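[A minimal sketch of the ANSI C multibyte machinery Doug is referring to: mbtowc() consumes a byte string one (possibly multibyte) character at a time, under the encoding selected by setlocale(). This only illustrates the mechanism; as noted later in this thread, no known implementation of the era actually encoded ch as a single multibyte character, which is part of the dispute:

	#include <locale.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
	    const char *s = "abc";       /* stand-in for locale-encoded text */
	    wchar_t wc;
	    int n;

	    setlocale(LC_CTYPE, "");     /* take the encoding from the environment */

	    while (*s != '\0') {
	        n = mbtowc(&wc, s, MB_CUR_MAX);  /* returns bytes consumed, or -1 */
	        if (n <= 0)
	            break;
	        printf("character 0x%lx used %d byte(s)\n", (unsigned long)wc, n);
	        s += n;
	    }
	    return 0;
	}

]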
gwc@root.co.uk (Geoff Clare) (02/01/89)
In article <4118f7b1.ae48@apollo.COM> arnold@apollo.COM (Ken Arnold) writes:
>The POSIX proposal [] has a rework of regular expressions.
>(stuff deleted)
>
>They have added a new set of bracket expressions which stand for
>pre-defined sets of characters. For example, "[:alpha:]" is all
>alphabetic characters, "[.ch.]" is the character string ch treated as a
>single character (which is useful for sorting in many languages), and
>"[=a=]" refers to all variants of a, i.e., a, a with a circumflex, a
>with an umlaut, etc.
>
>(stuff deleted)... these new bracket expressions only have their new
>meaning inside outer brackets.
>
>Why? The only existing expressions you would break if you allowed "top
>level" [::] expressions (or [..] or [==] expressions) would be existing
>expressions that contained *two* colons (or dots or equals), one on
>either side. Since this is currently pointless redundancy, I can't
>believe this is a serious problem.

There are more serious problems with the new expressions than just the obscure syntax. A short while ago I had to design some verification tests for these new regular expressions as part of the X/Open verification suite (the latest X/Open standard incorporates POSIX). I found some ambiguity in the area of 2-to-1 character mappings. For example, if ch collates between c and d, which of the following REs should match the string "xchy"?

	x[a-[.ch.]]y
	x[a-[.ch.]]hy

The simple answer would be to create some rule about 2-to-1 character mappings to eliminate the ambiguity. However, whichever rule is decided, there will be many cases where the actual behaviour is non-intuitive, resulting in users not getting the results they expect.

We have informed X/Open of the problem, and are waiting to see what they come up with.

Geoff.
-- 
Geoff Clare    UniSoft Limited, Saunderson House, Hayne Street, London EC1A 9HH
gwc@root.co.uk    ...!mcvax!ukc!root44!gwc    +44-1-606-7799    FAX: +44-1-726-2750
consult@osiris.UUCP (Unix Consultation Mailbox ) (02/01/89)
In article <5980041@hpfcdc.HP.COM> donn@hpfcdc.HP.COM (Donn Terry) writes:
>[about the difficulty of collating foreign character sets]
>Remember, we (native-)speakers of English are awfully spoiled by having
>a reasonably regular alphabet.... I think the simple model of English
>may have sped things up initially, but it's now turning into an
>impediment for dealing with the rest of the world.

Then again, English (at least in America) has its own arcane rules for collating names, one of which (Mc being literally equivalent to Mac) has only recently been abandoned (about the same time that PCs took over the business world, I think). The convention that initials collate separately from "real" names is still followed by most companies who print ROC phone books, but that's the only weird rule left so far as I know.

>(Oh well, we make up for a simple alphabet with hideously irrational
>spelling, even discounting the British/American differences :-) ).

The English spelling rules are totally rational. The problem is that only about 5% of the language follows the rules. :-) Pronunciation rules are what *really* get me rolling on the floor...

Phil Kos
peter@ficc.uu.net (Peter da Silva) (02/02/89)
Why didn't they add a few backslash-escaped metacharacters, again?

	\:alpha:
	\-ch-
	\=o=

This would certainly be backwards-compatible, and would have the added advantage of nicer aesthetics.
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Work: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.   `-_-'
Home: bigtex!texbell!sugar!peter, peter@sugar.uu.net.                'U`
Opinions may not represent the policies of FICC or the Xenix Support group.
gwyn@smoke.BRL.MIL (Doug Gwyn ) (02/02/89)
In article <2962@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>Why didn't they add a few backslash-escaped metacharacters, again?

Whatever you do, DON'T use \ for any more escapes! It is bad enough that the terminal handler, shell, and troff all use the SAME escape character (which has occasionally forced me to type sixteen \s in a row). It would have been far better for utilities to try to pick escapes that would be unlikely to be used by others that might be in effect at the same time (shell, editor, terminal handler, grep).
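[A sketch of the escape-stacking arithmetic behind "sixteen \s": each interpreting layer that treats \ as an escape doubles the number you must type. Even with no shell or troff in the way, one literal backslash in a regular expression costs four in C source, two for the regex engine and doubled again by the C compiler; stack shell, troff, and the terminal handler on top and the count reaches sixteen:

	#include <regex.h>
	#include <stdio.h>

	int main(void)
	{
	    regex_t re;

	    /* The regex engine wants the two characters \\ to match one
	     * literal backslash; the C compiler wants each of those written
	     * as \\ in source -- hence four backslashes below. */
	    if (regcomp(&re, "\\\\", 0) != 0)
	        return 1;

	    printf("a\\b: %s\n",
	           regexec(&re, "a\\b", 0, NULL, 0) == 0 ? "match" : "no match");

	    regfree(&re);
	    return 0;
	}

]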
peter@ficc.uu.net (Peter da Silva) (02/03/89)
In article <9563@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) writes:
> In article <2962@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
> >Why didn't they add a few backslash-escaped metacharacters, again?
> Whatever you do, DON'T use \ for any more escapes!

I'd agree in principle that this is a good idea for new programs, but since the existing regular expression syntax already uses \ for escapes, adding a new syntax for metacharacters [:...:] is no help (you still need \(...\), for example), and adds gratuitous inconsistencies.
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Work: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.   `-_-'
Home: bigtex!texbell!sugar!peter, peter@sugar.uu.net.                'U`
Opinions may not represent the policies of FICC or the Xenix Support group.
wk@hpirs.HP.COM (Wayne Krone) (02/07/89)
I was involved in the design of the internationalization extensions to regular expressions as a member of both the X/Open and /usr/group Internationalization working groups/committees. I'm not the official spokesperson for either group, but I can probably answer most of your questions. In addition, many of the points raised are discussed by the rationale in the P1003.2 draft (section 2.9.1, pages 58-64, of draft 8).

Why "[[:alpha:]]" instead of "[:alpha:]"?

The committee was very concerned about the acceptability of any extensions to the folks not actively involved in internationalization, due to the possibility of breaking existing regular expressions. One of the ways we decided to minimize the risk was to require double delimiters on all the new syntax. For example:

	[[:lower:][:digit:]]    rather than    [:lower::digit:]

Other reasons were to reduce ambiguity problems and to have delimiters which visually indicate left/right closure of the new syntax. More details are in the draft.

Why allow "[[.ch.]]" instead of requiring the Spanish ch to be an ANSI C multibyte character?

First, because no existing implementation does it that way (that we know of). Second, and more importantly, "c", "h" and "ch" as matched by the RE "[a-z]" are collating elements, not characters. Only "c" and "h" are also characters and thus are represented in a single- or multibyte character code set. As someone else noted, "Mac" and "Mc" could be supported as collating elements, but it is very unlikely a code set would ever support them as single characters.

> What seems like a serious problem to me is that the required nesting
> makes the new expressions more difficult to use. Further, misuse of
> them in this kind of obvious way leads to silent misbehavior from which
> it is difficult to surmise the bug.

That is, a user might type

	[:alpha:]

but intend

	[[:alpha:]]

and get no error message? The comment above is true, but if the syntax were as suggested and the user typed [:alphz:] instead of [:alpha:], the same silent misbehavior would result. I suppose it's a matter of guessing which errors will be most common and optimizing the syntax for that set.

> the thing that pisses me off is that they want to make \c where c is
> a regular (non-special) character exactly equivalent to c,
> rather than reserving it for future use. this is baffling to me;
> if we reserve \c in these cases, we have easy backward-compatible ways
> of extending the syntax later on (like allowing more than 9 subexpressions).
> and i have no idea who they are protecting; people who have patterns
> like \t and expect them to match t??

This was just a matter of documenting the behavior of the existing implementations (as we know them :-). Much of the regular expression syntax/behavior is left unspecified by the traditional definition on the ed man page, and we found ourselves in the position of having to write down something. If your implementation differs or you just feel this is an area worth improving, submit a proposal to P1003.2.

> There are more serious problems with the new expressions than just the
> obscure syntax. A short while ago I had to design some verification
> tests for these new regular expressions as part of the X/Open verification
> suite (the latest X/Open standard incorporates POSIX). I found some
> ambiguity in the area of 2-to-1 character mappings. For example, if ch
> collates between c and d, which of the following REs should match the
> string "xchy"?
>
> 	x[a-[.ch.]]y
> 	x[a-[.ch.]]hy
>
> The simple answer would be to create some rule about 2-to-1 character
> mappings to eliminate the ambiguity. However, whichever rule is
> decided, there will be many cases where the actual behaviour is
> non-intuitive, resulting in users not getting the results they expect.

The rule which applies is the "longest leftmost match" rule, which is documented in XPG3 for the "RE*" syntax but unfortunately missing from the square bracket rules. So the answer for the examples above is "both":

	x[a-[.ch.]]y	matches	x ch y
	x[a-[.ch.]]hy	matches	x c h y

> We have informed X/Open of the problem, and are waiting to see what they
> come up with.

That's interesting--I haven't seen any query posted to the Internationalization Working Group.

Wayne Krone
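[A minimal sketch of the doubled-delimiter syntax Wayne defends, again using the regcomp()/regexec() interface POSIX.2 later standardized: "[[:lower:][:digit:]]" is one bracket expression containing two character classes. The collating-element range "[a-[.ch.]]" is not shown, since support for collating symbols as range endpoints cannot be assumed in any particular implementation:

	#include <regex.h>
	#include <stdio.h>

	int main(void)
	{
	    regex_t re;
	    const char *tests[] = { "a", "7", "Q", "+" };
	    size_t i;

	    /* One outer bracket, two classes: lower-case letters OR digits. */
	    if (regcomp(&re, "^[[:lower:][:digit:]]$", 0) != 0)
	        return 1;

	    for (i = 0; i < sizeof tests / sizeof tests[0]; i++)
	        printf("%-2s %s\n", tests[i],
	               regexec(&re, tests[i], 0, NULL, 0) == 0
	                   ? "match" : "no match");

	    regfree(&re);
	    return 0;
	}

]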