karl@IMA.ISC.COM (Karl Heuer) (09/06/90)
From: karl@IMA.ISC.COM (Karl Heuer) In an environment where the digraph "ch" collates as a single element, what happens if an attempt is made to match the subject string "chi" with the pattern "[c[.ch.]]i" or "[c[.ch.]]hi"? Is the implementation required to report a successful match in both cases? If so, it would seem necessary to use a nondeterministic finite automaton or equivalent, thus making simple regexp matching and filename globbing as complex as egrep pattern matching. If you have an answer that's based on something other than your own intuition, please specify which (draft) standard you're referencing. Karl W. Z. Heuer (karl@kelp.ima.isc.com or ima!kelp!karl), The Walking Lint Volume-Number: Volume 21, Number 82
henry@zoo.toronto.edu (Henry Spencer) (09/09/90)
From: henry@zoo.toronto.edu (Henry Spencer) In article <487@usenix.ORG> karl@IMA.ISC.COM (Karl Heuer) writes: >In an environment where the digraph "ch" collates as a single element, what >happens if an attempt is made to match the subject string "chi" with the >pattern "[c[.ch.]]i" or "[c[.ch.]]hi"? Is the implementation required to >report a successful match in both cases? If so, it would seem necessary to >use a nondeterministic finite automaton or equivalent, thus making simple >regexp matching and filename globbing as complex as egrep pattern matching. Looking at draft 10, I don't think there is much doubt that they both must match, assuming those are legal regular expressions. (If "c" is not a collating element or noncollating character, they're not.) If both "c" and "ch" are valid collating elements, the bracket expression must be prepared to match either. The wording could stand improving. As for the implementation aspects, yes, this is a pain. However, there is basically no such thing as "simple" regexp matching. :-) The extra complexity added by multicharacter collating elements, while annoying, is not that big a deal. I think Karl is confused. *Any* non-trivial regexp matching ends up using either nondeterministic or deterministic automata, sometimes behind clever plastic disguises. The very simplest forms, like globbing, sometimes can get away without having to compile the regexp string into an internal form, by running a nondeterministic automaton directly from the regexp string. That will get a bit harder because of the greater complexity of 1003.2 regexps. However, there is no way that even "simple" regexp matching (I assume Karl is thinking of things like ed) is viable without a compilation step. Given that 1003.2 defines -- finally! -- library functions for regexp work of various kinds, including globbing, the complexity will, in any case, be localized to library functions. The programmer won't have to worry about it unless he's actually writing those library functions. (*That* won't be fun.) -- TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology OSI: handling yesterday's loads someday| henry@zoo.toronto.edu utzoo!henry Volume-Number: Volume 21, Number 95