merlyn@iwarp.intel.com (Randal Schwartz) (07/04/90)
In article <1990Jul3.144552.5407@uvaarpa.Virginia.EDU>, worley@compass (Dale Worley) writes:
| Also, this illustrates one thing I don't like about regexps -- people
| write code which depends on the order in which the alternatives are
| matched.

Regexps are well-defined and extremely predictable about their
"leftmost wildcard matches the most possible iterations" behavior.
It's no more silly than presuming that "a" really matches "a", is it?

| For instance, in the regexp above, the case where [^\0]?
| matches the null string can always match, so it implicitly depends on
| the fact that the non-null match is tried first.  Ugh.

It does this *by* *definition*.  No assumption necessary.

| On the other hand,
| it's hard (impossible?) to write a regexp which matches in only the
| right way without some way to specify context for the match (shades of
| \: and \;!!!).

It's probably durn near impossible, and an unnecessary burden on the
part of the programmer.  For example, what is the context for matching
/ab.*cd/ in the string "aaababfoocdcdce"?

I frequently run up against * and + matching a bit too much, and want
to "back it off" a bit, but have found the problem without general
solution.

For example, matching the first two-digit number in a line, discarding
all text before it.  I want to write:

    s/.*(\d\d)/\1/;

but instead am forced to do something like:

    /\d\d.*/; $_ = $&;

The first expression matches the *last* occurrence of two digits.  (I
know... the second one discards the newline... gimme a break.)

Regexps... your best friend... your worst enemy.  You decide. :-)

s//Just another Perl hacker,/; print
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Welcome to Portland, Oregon, home of the California Raisins!"=/
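The greedy behavior described in this post can be reproduced with a small sketch. It uses Python's `re` module as a stand-in, on the assumption that it treats these particular constructs the way Perl's backtracking matcher does; the sample line is made up for illustration. The non-greedy `.*?` form is shown only because Python has it, and it should not be read as something available in the Perl of this thread.

```python
import re

# Hypothetical sample line; any text with two separate digit pairs works.
line = "ab12cd34ef\n"

# Like s/.*(\d\d)/\1/ : greedy .* swallows the line, then backs off only
# until \d\d can match, so the capture is the LAST two-digit run.
greedy = re.sub(r".*(\d\d)", r"\1", line, count=1)   # "34ef\n"

# Like /\d\d.*/; $_ = $&; : the search is leftmost, so it starts at the
# FIRST two digits. "." does not match "\n", so the newline is dropped.
m = re.search(r"\d\d.*", line)
first_onward = m.group(0) if m else line             # "12cd34ef"

# Non-greedy .*? stops at the first place \d\d can match.
lazy = re.sub(r".*?(\d\d)", r"\1", line, count=1)    # "12cd34ef\n"

# The /ab.*cd/ question: leftmost "ab", then as much as possible.
span = re.search(r"ab.*cd", "aaababfoocdcdce").group(0)  # "ababfoocdcd"

print(repr(greedy), repr(first_onward), repr(lazy), repr(span))
```

The `/ab.*cd/` result shows both rules at once: the match starts at the earliest "ab" and runs to the latest "cd".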
tneff@bfmny0.BFM.COM (Tom Neff) (07/04/90)
In article <1990Jul3.184351.3820@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
>I frequently run up against * and + matching a bit too much, and want
>to "back it off" a bit, but have found the problem without general
>solution.
>
>For example, matching the first two-digit number in a line, discarding
>all text before it.  I want to write:
>
>    s/.*(\d\d)/\1/;
>
>but instead am forced to do something like:
>
>    /\d\d.*/; $_ = $&;

Try

    /\d\d.*\n?$/ && ($_ = $&);

or my fave

    /\d{2}/ && (substr($_,0,length($`)) = '');

This doesn't lose the newline and you don't care what comes after.

(I wish Larry'd make $` and $' lvalues, hehe.)

-- 
"My God, Thiokol, when do you     \\ Tom Neff
want me to launch?  Next April?"   \\ tneff@bfmny0.BFM.COM
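Both suggestions keep the newline by leaving the tail of the line untouched. A rough Python equivalent follows, with hypothetical function names and `m.start()` standing in for `length($`)`; it is a sketch of the idea, not a translation of any particular Perl version's behavior.

```python
import re

def from_first_two_digits(line: str) -> str:
    # Drop everything before the first two-digit run; keep the rest intact.
    m = re.search(r"\d{2}", line)
    if m is None:
        return line          # no match: leave the line alone, as && would
    # m.start() is the length of the text before the match.
    return line[m.start():]

def via_whole_match(line: str) -> str:
    # Same result from the matched text itself: the optional \n is
    # consumed explicitly because "." will not match it.
    m = re.search(r"\d\d.*\n?$", line)
    return m.group(0) if m else line

print(repr(from_first_two_digits("ab12cd34ef\n")))   # '12cd34ef\n'
print(repr(via_whole_match("ab12cd34ef\n")))         # '12cd34ef\n'
```

The first form only needs to know where the match begins, which is why it does not care what follows the two digits.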
worley@compass.com (Dale Worley) (07/05/90)
   From: merlyn@iwarp.intel.com (Randal Schwartz)

   Regexps are well-defined and extremely predictable about their
   "leftmost wildcard matches the most possible iterations" behavior.

OK, where is it written down?  Also, you forgot to mention that |
attempts to match the left alternative before the right.  (I mention
that to show that it's not so easy to completely define dynamic
behavior.)

   It's no more silly than presuming that "a" really matches "a", is it?

Well, every exposition of "regular expressions" I've ever seen states
that "a" matches only "a", but I've never seen *in print* anyone
stating that regexps match the most possible first.

Also, I have a general preference for declarative definitions of
things over dynamic definitions.

Dale Worley             Compass, Inc.                   worley@compass.com
--
It was peculiarly satisfying to watch the reactions at the truck stops
along the way to this bunch of men with hippie long hair, biker
leather jackets, and a nose ring who nonetheless were warm,
intelligent, friendly and polite, and paid with credit cards - we blew
all their possible-stereotype fuses.
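The "left alternative before the right" rule is easy to observe even where it is not written down. The sketch below again uses Python's `re` as a stand-in, assuming its ordered alternation and greedy `?` behave like the Perl matcher under discussion; the test strings are invented for illustration.

```python
import re

# Alternation is ordered: the left branch is tried first and, if it lets
# the overall match succeed, the right branch is never attempted.
left_first = re.match(r"a|ab", "ab").group(0)        # "a"
right_longer = re.match(r"ab|a", "ab").group(0)      # "ab"

# Greedy "?" tries the one-character match before the empty one.
optional = re.match(r"[^\0]?", "x").group(0)         # "x"

# Order only decides which match is preferred; when the rest of the
# pattern fails, the engine backtracks into the next alternative.
backtracked = re.match(r"(?:a|ab)c", "abc").group(0) # "abc"

print(left_first, right_longer, optional, backtracked)
```

Swapping the branches of `a|ab` changes the result, which is exactly the order dependence being objected to.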
merlyn@iwarp.intel.com (Randal Schwartz) (07/06/90)
In article <1990Jul5.135434.11673@uvaarpa.Virginia.EDU>, worley@compass (Dale Worley) writes:
|    Regexps are well-defined and extremely predictable about their
|    "leftmost wildcard matches the most possible iterations" behavior.
|
| OK, where is it written down?  Also, you forgot to mention that |
| attempts to match the left alternative before the right.  (I mention
| that to show that it's not so easy to completely define dynamic
| behavior.)

Say "man egrep" on any sane machine.  Or "man ed" for an even older
reference.

I admit, I don't see any "left alternative before right" stuff in
there.  I guess I've been reading the code too much (the *ultimate*
spec, as any UNIX hacker knows...).

Just another RE hacker,
worley@compass.com (Dale Worley) (07/06/90)
   From: merlyn@iwarp.intel.com (Randal Schwartz)

   In article <1990Jul5.135434.11673@uvaarpa.Virginia.EDU>, worley@compass (Dale Worley) writes:
   | OK, where is it written down?

   Say "man egrep" on any sane machine.  Or "man ed" for an even older
   reference.

Well, so it does.  However, that's new with SunOS 4.0.  On 3.5 "man
egrep" said there was no such topic -- you had to say "man grep".
(Whether you consider SunOS "sane" is a matter of debate...)

I will also note that the Perl manual page nowhere mentions the egrep
man page -- it mentions the "version 8 regexp routines", which I don't
have, whatever they are.  ("man ed" is useless, because ed's regular
expressions are only a tiny subset of Perl's.)

But how about putting this stuff into the manual page?  I'm forever
trying to figure out exactly what the rules are for regexps in program
foo, because they're rarely exactly documented, or given as "exactly
like program bar except when there's a new moon and you're wearing a
yellow necktie..."  It'd be nice if Perl did it right.

Gripe, gripe,

Dale Worley             Compass, Inc.                   worley@compass.com
--
Be nice.  If you can't be nice, be good.  If you can't be good, be
careful.  If you can't be careful, name it after me.