pmoore@cix.compulink.co.uk (Paul Moore) (05/14/91)
This is one of those problems which I am convinced ought to have a simple (probably one-line) solution in perl, but I sure can't find it... I have a string, which contains a piece of text. I also have a regular expression. I want to count the number of times the RE appears in the string. I am aware that obnoxious REs, such as ones which match the empty string, and ones which overlap themselves, can make even *defining* the idea of "the number of times this RE appears in this string" difficult, but for straightforward cases the intention is clear. As an example (this is the task which first made me want to do this), I have a file, which has been copied from an MS-DOS box to my (non-MS-DOS) machine. So the lines in the file are delimited by "\r\n", and not just "\n". I have slurped the file into a string, in order to do some processing, and I need to count the number of lines. So what I want to do is count the number of occurrences of the string "\r\n" in the string. IE, open(DOS,"Ms-dos-file"); undef $/; $str = <DOS>; # Slurp .... processing on $str ... $lines = &count($str, "\r\n"); # Somehow... .... more processing ... The only way I can see, which works for a general RE, is $count = ($str =~ s/RE/$&/g); but the idea of doing global substitution, and using $&, strikes me as a bit inefficient... Another example, which shows why a general RE is better than just a string, is if I am trying to write a wc clone. So we have open(FILE, $ARGV[1]); undef $/; $str = <FILE>; $chars = length($str); # Don't worry about funny line terminators this time, and note # that we can use the return value of tr/// for single character # counts... $lines = ($str =~ tr/\n//); It seems to me that a nice way of counting words would be to count the occurrences of the pattern /\b/, and divide by 2. With perl's blindingly efficient pattern matching, this may be a very fast method. Obviously, in most individual cases, there are alternative ways of doing what I want. However, counting REs strikes me as a very "perl-ish" sort of activity, and I would have expected it to be built in, somehow. Perhaps as the return value of m// (which specifically isn't the case). Comments, anyone? Gustav. PS Sorry if this has already appeared, but I don't think it made it out of my system... E-Mail: pmoore%cix@ukc.ac.uk or: gustav@tharr.UUCP
tchrist@convex.COM (Tom Christiansen) (05/14/91)
From the keyboard of Paul Moore <pmoore@cix.compulink.co.uk>:
:This is one of those problems which I am convinced ought to have a simple
:(probably one-line) solution in perl, but I sure can't find it...
:I have a string, which contains a piece of text. I also have a regular
:expression. I want to count the number of times the RE appears in the
:string.
:As an example (this is the task which first made me want to do this), I
:have a file, which has been copied from an MS-DOS box to my (non-MS-DOS)
:machine. So the lines in the file are delimited by "\r\n", and not just
:"\n". I have slurped the file into a string, in order to do some processing,
:and I need to count the number of lines. So what I want to do is count the
:number of occurrences of the string "\r\n" in the string.
: open(DOS,"Ms-dos-file");
: undef $/;
: $str = <DOS>; # Slurp
: .... processing on $str ...
: $lines = &count($str, "\r\n"); # Somehow...
: .... more processing ...
I don't know what else you're doing, but I would think that slurping
is a pretty inefficient way. I usually try to avoid it. It sure does
make some things easier, though.
:The only way I can see, which works for a general RE, is
: $count = ($str =~ s/RE/$&/g);
:but the idea of doing global substitution, and using $&, strikes me as
:a bit inefficient...
You could make it faster if you could throw out the $&, but
that's not good for a general routine.
:Another example, which shows why a general RE is better than just a string,
:is if I am trying to write a wc clone. So we have
: open(FILE, $ARGV[1]);
: undef $/;
: $str = <FILE>;
: $chars = length($str);
: # Don't worry about funny line terminators this time, and note
: # that we can use the return value of tr/// for single character
: # counts...
: $lines = ($str =~ tr/\n//);
:It seems to me that a nice way of counting words would be to count the
:occurrences of the pattern /\b/, and divide by 2. With perl's blindingly
:efficient pattern matching, this may be a very fast method.
:Obviously, in most individual cases, there are alternative ways of doing
:what I want. However, counting REs strikes me as a very "perl-ish" sort
:of activity, and I would have expected it to be built in, somehow.
:Perhaps as the return value of m// (which specifically isn't the case).
:Comments, anyone?
Larry has posted musings about adding a /g switch to the m// operator, or
making a g// operator. There are two things this could do:
$count = ($str =~ /pat/g);
would get what you want. Another possibility is to keep some state
around, as in
while ($str =~ /pat/g) {
$len += length $`;
do munge($&);
}
I'm not sure that these two uses are compatible. To overload the
two uses would (to my mind) mean Larry would have to have it know
whether it's in a loop, which is even more context-sensitivity
in a language where folks are already shooting themselves in the
foot with context anyway.
The first use would easier to implement, I think, and more useful at least
in that I believe it would get used more.
For the 2nd use, we could use /i for an incremental match, but
no, that's taken. How about /p? No, folks'll expect that to
print the thing, as in sed. Other ideas?
On wc, here's a wc clone I once wrote. I don't slurp for speed's sake.
#!/usr/bin/perl -n
$lines++;
$chars += length;
$words += s/\S+//g;
next unless eof;
printf " %7d %7d %7d %s\n", $lines, $words, $chars, ($ARGV eq '-'?'':$ARGV);
$tlines += $lines;
$twords += $words;
$tchars += $chars;
$chars = $words = $lines = 0;
printf " %7d %7d %7d %s\n", $tlines, $twords, $tchars, "total"
if $files++ && eof();
It's a lot slower than the C version. Probably the s///g is what's
slowing it down. If that line could be changed to
$words += /\S+/g;
and Larry were to implement this in any reasonably efficient manner,
it would probably run much faster.
But hey, at least it gets 'wc /vmunix' right. :-)
I don't use $. and close(ARGV) because it confuses the program.
Do you all see that code up there just YEARNING for
($tlines, $twords, $tchars) += ($lines, $words, $chars);
I know, I know... along that road lies APL and madness.
--tom
--
Tom Christiansen tchrist@convex.com convex!tchrist
"So much mail, so little time."
rbj@uunet.uu.net (Root Boy Jim) (05/16/91)
tchrist@convex.COM (Tom Christiansen) writes: >From the keyboard of Paul Moore <pmoore@cix.compulink.co.uk>: >:This is one of those problems which I am convinced ought to have a simple >:(probably one-line) solution in perl, but I sure can't find it... Truly. >:I have a string, which contains a piece of text. I also have a regular >:expression. I want to count the number of times the RE appears in the >:string. Quite simply, the answer is: split(/RE/,exp) - 1; I don't know why Tom missed the easy answer after giving the hard ones. >:As an example (this is the task which first made me want to do this), I >:have a file, which has been copied from an MS-DOS box to my (non-MS-DOS) >:machine. So the lines in the file are delimited by "\r\n", and not just >:"\n". I have slurped the file into a string, in order to do some processing, >:and I need to count the number of lines. So what I want to do is count the >:number of occurrences of the string "\r\n" in the string. > > I don't know what else you're doing, but I would think that slurping > is a pretty inefficient way. I usually try to avoid it. It sure does > make some things easier, though. You don't have to slurp the whole file. Just set $/ to say, a space, or anything other than \r or \n. With a bit of memory you could also use read or sysread. Remember to paste trailing \r's onto the beginning of the next block. >I know, I know... along that road lies APL and madness. Too late. Perl is already weirder than APL. Uglier too. APL is mathematically pure. Perl is engineering and computer science at warp speed. APL handles arrays of arbitrary dimensions. Perl's objects are only one dimension, but may be associative. APL has no operator precedence (I consider this a plus), but is weak on control flow. Both have numeric and character data types, but Perl has regexps and common string operators builtin. Perl also interfaces to the operating system better, most likely because it was designed on a reasonable one. The fact that both languages inspire one-liners (and thos who write them :-) is perhaps their greatest common feature. -- [rbj@uunet 1] stty sane unknown mode: sane
tchrist@convex.COM (Tom Christiansen) (05/17/91)
From the keyboard of rbj@uunet.uu.net (Root Boy Jim): :>:I have a string, which contains a piece of text. I also have a regular :>:expression. I want to count the number of times the RE appears in the :>:string. : :Quite simply, the answer is: split(/RE/,exp) - 1; : :I don't know why Tom missed the easy answer after giving the hard ones. Funny you should mention that. Believe it or not, I just came back from thinking about all this a bunch, and was about the post the split() solution, and here you'd gone and beaten me to it. There are a couple of problems in using split for this. I think it has more overhead than it needs to have. If all you want is the count of the exprs, there's no reason to go making all those @_ values that you'll be creating as a side-effect of the split. If you could say: $count = /regexp/g; it would not need to create all those values, and seems more intuitive. Of course, if you said: @array = /stuff (regexp)/g; this is effectively the same as @array = grep($i++%2, split(/stuff (regexp)/)); except that once again, it's not utterly intuitive and will go making more tmp values than it really needs to -- although only twice as many. I've also been thinking more about while (/foo/) { and somehow making that an iterator that starts the match from where it left off. I think a decent way to do this would be to use a /n flag indicating "next match". Thus the syntax would be //n, or m//n, not n//. It's really still a match, just with a special variation, so doesn't particularly merit an entirely new operator. Perl would keep a pointer into the string being matched against, advancing it with each match until it ran out. while ($foo =~ /bar/n) { Is certainly one possibility, but another possible use would be: if (/foo/ && /bar/n && /baz/n) which might be faster than if (/foo.*bar.*baz/) A question is what you do on failure. For example, does this make sense: if (/foo/) { if (/bar/n) { } elsif (/baz/n) { } } If the /bar/n failed, could the /baz/n search start from the same place as the /bar/n started? Another question is when to reset your state. Do you have to know when the variable you're matching against has been written? Do you reset everytime the variable is matched against without the /n switch? On further contemplation, I think for efficiency you'd want to make the user put in a /n if he ever wanted to do a next match. Otherwise it'd be too much overhead. That makes the above fragment like this: if (/foo/n) { if (/bar/n) { } elsif (/baz/n) { } } I still don't know when to reset the state. And does /n make sense for the s/// operator? I think this /n switch needs a bit more thought and discussion, maybe from some of you who've done more complex pattern operations in other languages. The /g switch, on the other hand, seems much more straight-forward and could work just as I've described it above without shocking anyone. Larry, what's your take on all this? :>I know, I know... along that road lies APL and madness. : :Too late. Perl is already weirder than APL. Uglier too. Oh good, does that mean we'll get @a += @b; @c = @a + @b; one of these days then? :-) Speaking of array operations, consider this. You have an array of colors and values, as from the perl man page: %map = ('red', 0x00f, 'blue', 0x0f0, 'green', 0xf00); So that $map{'red'} == 0x00f and so on. What if you want to invert the array so you can compute $map{0x00f}? Well, certainly you can do this semi-awkishly: for $color (keys %map) { $nmap{$map{$color} = $color; } or even this in a lispy, semi-mapcar kind of way: grep( $nmap{$map{$_}} = $_, keys %map ); but even grep is too much work. I think the true perl idiom is @nmap{values %map} = keys %map; which works just fine, is quite obvious about what it's doing, and seems more in line with the Perlian Way. I don't believe I've ever seen anyone do that before. --tom -- Tom Christiansen tchrist@convex.com convex!tchrist "So much mail, so little time."
phillips@cs.ubc.ca (George Phillips) (05/21/91)
In article <1991May17.132403.12104@convex.com> tchrist@convex.COM (Tom Christiansen) writes: >From the keyboard of rbj@uunet.uu.net (Root Boy Jim): >:>I know, I know... along that road lies APL and madness. >: >:Too late. Perl is already weirder than APL. Uglier too. > >Oh good, does that mean we'll get > > @a += @b; > @c = @a + @b; > >one of these days then? :-) Nah, too conventional. Go for an extension to the array to associative array assignment. We can already do this: %a = ("bip", "bop", "boop", "beep" ); So why not: %a .= ("more", "less", "more", "even less" ); Which should give $a{"more"} eq "lesseven less". Now round it out to support "+=", "*=" and all the other op assignment operators. Not only would this be cute, but it could save as much as 3 lines of code in every hundredth perl script you write. %p .= (Just," another ",Just,Perl,Just," hacker,");print %p."\n"; -- George Phillips phillips@cs.ubc.ca {alberta,uw-beaver,uunet}!ubc-cs!phillips
rbj@uunet.uu.net (Root Boy Jim) (05/22/91)
tchrist@convex.COM (Tom Christiansen) writes: ?From the keyboard of rbj@uunet.uu.net (Root Boy Jim): ?: ?:Quite simply, the answer is: split(/RE/,exp) - 1; ?: ?There are a couple of problems in using split for this. I think it ?has more overhead than it needs to have. If all you want is the ?count of the exprs, there's no reason to go making all those @_ ?values that you'll be creating as a side-effect of the split. Yes. It's merely the most conceptually simplest. ? $count = /regexp/g; ? ?it would not need to create all those values, and seems more intuitive. I second the notion. It says what it means. ?I've also been thinking more about ? ? while (/foo/) { ? ?and somehow making that an iterator that starts the match from where it ?left off. I think a decent way to do this would be to use a /n flag ?indicating "next match". Thus the syntax would be //n, or m//n. I am leery of these operators with embedded state. It's just another thing that has to be cleaned up. ? while ($foo =~ /bar/n) { ? ?Is certainly one possibility, but another possible use would be: ? ? if (/foo/ && /bar/n && /baz/n) This really bothers me. It is one thing for each textual operator to save its own state, quite another to refer to someplace different in the program. Yes, I can see that they match the same variable. Consider the following program: while ($a=<*>) { if (++$c & 1) { $b=<*>; } else { $b=''; } print "$a\t$b\n"; } You can see that each operator retains its own state. A closure if you will. In the m//n case, the remembered position would have to be stored with the variable perhaps. ?which might be faster than ? ? if (/foo.*bar.*baz/) But speed isn't everything. ?A question is what you do on failure. For example, does this make sense: ? ? if (/foo/) { ? if (/bar/n) { } ? elsif (/baz/n) { } ? } ? ?If the /bar/n failed, could the /baz/n search start from the same place ?as the /bar/n started? Yes, but it takes awhile to figger that out. Advance the pointer only on successful matches. ?I think this /n switch needs a bit more thought and discussion, maybe from ?some of you who've done more complex pattern operations in other languages. I think it should be killed right here. I think we would need an explicit position argument. Such a beast almost exists: index. If only it did RE's. Then the code would be something like: for ($cnt=$pos=0; $pos=index($string,$RE,$pos); $pos+=length($&)) { $cnt++; } ?The /g switch, on the other hand, seems much more straight-forward and ?could work just as I've described it above without shocking anyone. ?Larry, what's your take on all this? ? ?:>I know, I know... along that road lies APL and madness. ?: ?:Too late. Perl is already weirder than APL. Uglier too. ? ?Oh good, does that mean we'll get ? ? @a += @b; ? @c = @a + @b; ? ?one of these days then? :-) Not to mention @a += $b; ?or even this in a lispy, semi-mapcar kind of way: ? ? grep( $nmap{$map{$_}} = $_, keys %map ); ? ?but even grep is too much work. I think the true perl idiom is ? ? @nmap{values %map} = keys %map; ? ?which works just fine, is quite obvious about what it's doing, and seems ?more in line with the Perlian Way. I don't believe I've ever seen anyone ?do that before. LISP allows you to search an alist either way. This is obviously better than two separate structures. And I believe APL allows you to do the equivalent of "@a[1,3,5] = (1,9,25)". However, APL doesn't have associative arrays. -- [rbj@uunet 1] stty sane unknown mode: sane
jmm@eci386.uucp (John Macdonald) (05/23/91)
In article <1991May21.184545.26905@uunet.uu.net> rbj@uunet.uu.net (Root Boy Jim) writes: |tchrist@convex.COM (Tom Christiansen) writes: |?I've also been thinking more about |? |? while (/foo/) { |? |?and somehow making that an iterator that starts the match from where it |?left off. I think a decent way to do this would be to use a /n flag |?indicating "next match". Thus the syntax would be //n, or m//n. | |I am leery of these operators with embedded state. It's just |another thing that has to be cleaned up. If this sort of thing is going to be added, then perhaps it would be appropriate to add iterators as a formal object within the language, merging the concepts of globbing, ARGV handle scanning using <>, numerical and string iterators ('aa'..'zz'), and the (as proposed) some sort of RE iterators. Iterators as a class could have the operations of rewind, check if done, start next iteration, terminate the iterator, and any others that seem appropriate. Most of these operations fit in well with embedding the iterator in a for or while loop and using the loop control statements (next, last, first[a new one to rewind the iterator back to its beginning]). Perl already has picked up concept from lots of other languages, maybe its time to get some from ICON. -- sendmail - as easy to operate and as painless as using | John Macdonald manually powered dental tools on yourself - John R. MacMillan | jmm@eci386