[comp.lang.perl] Counting RE occurrences

pmoore@cix.compulink.co.uk (Paul Moore) (05/14/91)

This is one of those problems which I am convinced ought to have a simple
(probably one-line) solution in perl, but I sure can't find it...

I have a string, which contains a piece of text. I also have a regular
expression. I want to count the number of times the RE appears in the
string. I am aware that obnoxious REs, such as ones which match the empty
string, and ones which overlap themselves, can make even *defining* the
idea of "the number of times this RE appears in this string" difficult,
but for straightforward cases the intention is clear.

As an example (this is the task which first made me want to do this), I
have a file, which has been copied from an MS-DOS box to my (non-MS-DOS)
machine. So the lines in the file are delimited by "\r\n", and not just
"\n". I have slurped the file into a string, in order to do some processing,
and I need to count the number of lines. So what I want to do is count the
number of occurrences of the string "\r\n" in the string.

IE,

        open(DOS,"Ms-dos-file");
        undef $/;
        $str = <DOS>;                      # Slurp
        .... processing on $str ...
        $lines = &count($str, "\r\n");     # Somehow...
        .... more processing ...

The only way I can see, which works for a general RE, is

        $count = ($str =~ s/RE/$&/g);

but the idea of doing global substitution, and using $&, strikes me as
a bit inefficient...
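
As an illustration only, the substitution trick can be wrapped into the
kind of &count routine wished for above. This is a minimal sketch that
assumes Perl 5 syntax; the name count_matches and the sample text are
made up for the example and do not come from the post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count non-overlapping
# matches of $re in $str by replacing each match with itself.
sub count_matches {
    my ($str, $re) = @_;           # $str is a copy, so the caller's string is untouched
    my $n = ($str =~ s/$re/$&/g);  # s///g returns the number of substitutions made
    return $n || 0;                # a failed s/// returns an empty string; normalise to 0
}

my $dos_text = "line one\r\nline two\r\nline three\r\n";
my $lines    = count_matches($dos_text, "\r\n");
print "lines: $lines\n";
```

The second argument is interpolated as a pattern, so a caller who wants
a literal string containing regexp metacharacters would need to escape
it first.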

Another example, which shows why a general RE is better than just a string,
is if I am trying to write a wc clone. So we have

        open(FILE, $ARGV[0]);
        undef $/;
        $str = <FILE>;

        $chars = length($str);

        # Don't worry about funny line terminators this time, and note
        # that we can use the return value of tr/// for single character
        # counts...
        $lines = ($str =~ tr/\n//);

It seems to me that a nice way of counting words would be to count the
occurrences of the pattern /\b/, and divide by 2. With perl's blindingly
efficient pattern matching, this may be a very fast method.
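
A rough sketch of the /\b/ idea, again using the return value of s///g
as the counter and assuming Perl 5 syntax. The helper names are invented
for the example. Note that \b sits between word and non-word characters,
so the result can differ from a whitespace-based count whenever the text
contains punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper: count word boundaries and halve the result.
sub boundary_words {
    my ($str) = @_;                      # work on a copy
    my $edges = ($str =~ s/\b//g) || 0;  # one zero-width match per boundary
    return $edges / 2;
}

# Hypothetical helper: count runs of non-whitespace, closer to what wc reports.
sub whitespace_words {
    my ($str) = @_;                      # work on a copy
    return ($str =~ s/\S+//g) || 0;
}

for my $text ("hello world", "don't stop") {
    printf "%-12s boundary=%d whitespace=%d\n",
        $text, boundary_words($text), whitespace_words($text);
}
```

For plain words the two agree, but the apostrophe in "don't" adds an
extra pair of boundaries, so the \b count comes out one higher there.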

Obviously, in most individual cases, there are alternative ways of doing
what I want. However, counting REs strikes me as a very "perl-ish" sort
of activity, and I would have expected it to be built in, somehow.
Perhaps as the return value of m// (which specifically isn't the case).

Comments, anyone?

Gustav.

PS Sorry if this has already appeared, but I don't think it made it out of
   my system...

E-Mail: pmoore%cix@ukc.ac.uk
    or: gustav@tharr.UUCP

tchrist@convex.COM (Tom Christiansen) (05/14/91)

From the keyboard of Paul Moore <pmoore@cix.compulink.co.uk>:
:This is one of those problems which I am convinced ought to have a simple
:(probably one-line) solution in perl, but I sure can't find it...

:I have a string, which contains a piece of text. I also have a regular
:expression. I want to count the number of times the RE appears in the
:string. 

:As an example (this is the task which first made me want to do this), I
:have a file, which has been copied from an MS-DOS box to my (non-MS-DOS)
:machine. So the lines in the file are delimited by "\r\n", and not just
:"\n". I have slurped the file into a string, in order to do some processing,
:and I need to count the number of lines. So what I want to do is count the
:number of occurrences of the string "\r\n" in the string.

:        open(DOS,"Ms-dos-file");
:        undef $/;
:        $str = <DOS>;                      # Slurp
:        .... processing on $str ...
:        $lines = &count($str, "\r\n");     # Somehow...
:        .... more processing ...

    I don't know what else you're doing, but I would think that slurping
    is a pretty inefficient way.  I usually try to avoid it.  It sure does
    make some things easier, though.

:The only way I can see, which works for a general RE, is
:        $count = ($str =~ s/RE/$&/g);

:but the idea of doing global substitution, and using $&, strikes me as
:a bit inefficient...

    You could make it faster if you could throw out the $&, but
    that's not good for a general routine.


:Another example, which shows why a general RE is better than just a string,
:is if I am trying to write a wc clone. So we have

:        open(FILE, $ARGV[0]);
:        undef $/;
:        $str = <FILE>;
:        $chars = length($str);
:        # Don't worry about funny line terminators this time, and note
:        # that we can use the return value of tr/// for single character
:        # counts...
:        $lines = ($str =~ tr/\n//);

:It seems to me that a nice way of counting words would be to count the
:occurrences of the pattern /\b/, and divide by 2. With perl's blindingly
:efficient pattern matching, this may be a very fast method.

:Obviously, in most individual cases, there are alternative ways of doing
:what I want. However, counting REs strikes me as a very "perl-ish" sort
:of activity, and I would have expected it to be built in, somehow.
:Perhaps as the return value of m// (which specifically isn't the case).

:Comments, anyone?


Larry has posted musings about adding a /g switch to the m// operator, or
making a g// operator.  There are two things this could do:

       $count = ($str =~ /pat/g);

would get what you want.  Another possibility is to keep some state
around, as in 
    
	while ($str =~ /pat/g) {
	    $len += length $`;
	    do munge($&);
	}

I'm not sure that these two uses are compatible.  To overload the 
two uses would (to my mind) mean Larry would have to have it know
whether it's in a loop, which is even more context-sensitivity
in a language where folks are already shooting themselves in the
foot with context anyway.
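
For comparison, here is how the two uses look in Perl 5, where m//g
exists and the choice is made by list versus scalar context rather than
by detecting a loop. This is a sketch of that later behaviour, not of
anything available when the above was written; the sample string is
made up.

```perl
use strict;
use warnings;

my $str = "a1b22c333";

# List context: m//g returns every match. Assigning to the empty list and
# reading that assignment in scalar context gives the number of matches.
my $count = () = $str =~ /\d+/g;

# Scalar context: each evaluation resumes where the previous match ended.
my $total_len = 0;
my @starts;
while ($str =~ /\d+/g) {
    $total_len += length $&;
    push @starts, pos($str) - length $&;   # pos() is the offset just past the match
}

print "$count matches, $total_len digits, starting at @starts\n";
```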

The first use would be easier to implement, I think, and more useful at least
in that I believe it would get used more.

For the 2nd use, we could use /i for an incremental match, but
no, that's taken.  How about /p?  No, folks'll expect that to 
print the thing, as in sed.  Other ideas?


On wc, here's a wc clone I once wrote.  I don't slurp for speed's sake.

    #!/usr/bin/perl -n
    $lines++;
    $chars += length;
    $words += s/\S+//g;
    next unless eof;
    printf " %7d %7d %7d %s\n", $lines, $words, $chars, ($ARGV eq '-'?'':$ARGV);
    $tlines += $lines; 
    $twords += $words; 
    $tchars += $chars; 
    $chars = $words = $lines = 0;
    printf " %7d %7d %7d %s\n", $tlines, $twords, $tchars, "total" 
	if $files++ && eof();

It's a lot slower than the C version.  Probably the s///g is what's 
slowing it down.  If that line could be changed to 
    
    $words += /\S+/g;

and Larry were to implement this in any reasonably efficient manner,
it would probably run much faster.

But hey, at least it gets 'wc /vmunix' right. :-)

I don't use $. and close(ARGV) because it confuses the program.

Do you all see that code up there just YEARNING for 
    
    ($tlines, $twords, $tchars) += ($lines, $words, $chars);

I know, I know... along that road lies APL and madness.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
		"So much mail, so little time."

rbj@uunet.uu.net (Root Boy Jim) (05/16/91)

tchrist@convex.COM (Tom Christiansen) writes:
>From the keyboard of Paul Moore <pmoore@cix.compulink.co.uk>:
>:This is one of those problems which I am convinced ought to have a simple
>:(probably one-line) solution in perl, but I sure can't find it...

Truly.

>:I have a string, which contains a piece of text. I also have a regular
>:expression. I want to count the number of times the RE appears in the
>:string. 

Quite simply, the answer is: split(/RE/,exp) - 1;

I don't know why Tom missed the easy answer after giving the hard ones.
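
A small sketch of the split approach, with two caveats that are easy to
trip over: split drops trailing empty fields unless given a negative
limit, and an empty string splits into no fields at all. The helper name
count_by_split is invented for the example, and Perl 5 syntax is
assumed. A pattern containing capturing parentheses would also add
extra fields and throw the count off.

```perl
use strict;
use warnings;

# Hypothetical helper: count separators by counting the fields around them.
sub count_by_split {
    my ($str, $re) = @_;
    my @fields = split(/$re/, $str, -1);   # -1 keeps trailing empty fields
    return @fields ? @fields - 1 : 0;      # an empty string yields no fields
}

my $text  = "one\r\ntwo\r\nthree\r\n";
my $naive = split(/\r\n/, $text) - 1;      # trailing empty field is dropped
my $fixed = count_by_split($text, "\r\n");
print "naive: $naive, with limit -1: $fixed\n";
```

With the text ending in "\r\n", the plain form reports one separator
fewer than are actually present.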

>:As an example (this is the task which first made me want to do this), I
>:have a file, which has been copied from an MS-DOS box to my (non-MS-DOS)
>:machine. So the lines in the file are delimited by "\r\n", and not just
>:"\n". I have slurped the file into a string, in order to do some processing,
>:and I need to count the number of lines. So what I want to do is count the
>:number of occurrences of the string "\r\n" in the string.
>
>    I don't know what else you're doing, but I would think that slurping
>    is a pretty inefficient way.  I usually try to avoid it.  It sure does
>    make some things easier, though.

You don't have to slurp the whole file. Just set $/ to, say, a space,
or anything other than \r or \n. With a bit of memory you could
also use read or sysread. Remember to paste trailing \r's onto
the beginning of the next block.
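
A sketch of the read-in-blocks idea, assuming Perl 5 (the in-memory
filehandle is used only to keep the example self-contained). The name
count_crlf and the tiny block size are made up; a real program would
use a much larger block.

```perl
use strict;
use warnings;

# Hypothetical helper: count "\r\n" pairs while reading fixed-size blocks.
sub count_crlf {
    my ($fh, $blocksize) = @_;
    my ($count, $carry) = (0, '');
    my $buf;
    while (read($fh, $buf, $blocksize)) {
        $buf   = $carry . $buf;                    # paste the held-back \r in front
        $carry = ($buf =~ s/\r\z//) ? "\r" : '';   # hold back a trailing \r for next time
        $count += () = $buf =~ /\r\n/g;            # count complete pairs in this block
    }
    return $count;
}

my $data = "ab\r\ncd\r\n\r\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $pairs = count_crlf($fh, 3);    # block size 3 forces pairs to straddle blocks
close($fh);
print "CRLF pairs: $pairs\n";
```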

>I know, I know... along that road lies APL and madness.

Too late. Perl is already weirder than APL. Uglier too.

APL is mathematically pure.
Perl is engineering and computer science at warp speed.

APL handles arrays of arbitrary dimensions.
Perl's objects are only one dimension, but may be associative.

APL has no operator precedence (I consider this a plus),
but is weak on control flow.

Both have numeric and character data types, but Perl has regexps
and common string operators builtin. Perl also interfaces to the
operating system better, most likely because it was designed
on a reasonable one.

The fact that both languages inspire one-liners (and those who
write them :-) is perhaps their greatest common feature.

-- 
		[rbj@uunet 1] stty sane
		unknown mode: sane

tchrist@convex.COM (Tom Christiansen) (05/17/91)

From the keyboard of rbj@uunet.uu.net (Root Boy Jim):
:>:I have a string, which contains a piece of text. I also have a regular
:>:expression. I want to count the number of times the RE appears in the
:>:string. 
:
:Quite simply, the answer is: split(/RE/,exp) - 1;
:
:I don't know why Tom missed the easy answer after giving the hard ones.

Funny you should mention that.  Believe it or not, I just came back from
thinking about all this a bunch, and was about to post the split()
solution, and here you'd gone and beaten me to it.

There are a couple of problems in using split for this.  I think it
has more overhead than it needs to have.  If all you want is the
count of the exprs, there's no reason to go making all those @_ 
values that you'll be creating as a side-effect of the split.

If you could say:

    $count = /regexp/g;

it would not need to create all those values, and seems more intuitive.  

Of course, if you said:

    @array = /stuff (regexp)/g;

this is effectively the same as

    @array = grep($i++%2, split(/stuff (regexp)/));

except that once again, it's not utterly intuitive and will go making
more tmp values than it really needs to -- although only twice as many.

I've also been thinking more about 

    while (/foo/) {

and somehow making that an iterator that starts the match from where it
left off.  I think a decent way to do this would be to use a /n flag 
indicating "next match".  Thus the syntax would be //n, or m//n, not
n//.  It's really still a match, just with a special variation, so doesn't
particularly merit an entirely new operator.  Perl would keep a pointer
into the string being matched against, advancing it with each match
until it ran out.

    while ($foo =~ /bar/n) {

Is certainly one possibility, but another possible use would be:

    if (/foo/ && /bar/n && /baz/n)

which might be faster than 

    if (/foo.*bar.*baz/)


A question is what you do on failure.  For example, does this make sense:

    if (/foo/) {
	if (/bar/n) { } 
	elsif (/baz/n) { } 
    } 

If the /bar/n failed, could the /baz/n search start from the same place
as the /bar/n started?

Another question is when to reset your state.  Do you have to know when
the variable you're matching against has been written?  Do you reset
every time the variable is matched against without the /n switch?  On
further contemplation, I think for efficiency you'd want to make the user
put in a /n if he ever wanted to do a next match.  Otherwise it'd be too
much overhead. That makes the above fragment like this:

    if (/foo/n) {
	if (/bar/n) { } 
	elsif (/baz/n) { } 
    } 

I still don't know when to reset the state.  And does /n make sense for
the s/// operator?
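
For what it is worth, Perl 5 ended up with something close to this under
a different spelling: a scalar-context m//g keeps a per-variable
position, readable through pos(), and adding /c stops a failed match
from resetting it. The sketch below shows that later behaviour applied
to the nested-if question; it is not the /n proposal itself, and the
sample string is made up.

```perl
use strict;
use warnings;

my $text = "baz foo baz";
my $hit  = 'none';

if ($text =~ /foo/gc) {                       # pos($text) is now 7, just past "foo"
    if    ($text =~ /bar/gc) { $hit = 'bar' } # fails; /c leaves pos($text) at 7
    elsif ($text =~ /baz/gc) { $hit = 'baz' } # searches from 7, so finds the second "baz"
}
my $end = pos($text);
print "hit=$hit end=$end\n";
```

Without the /c modifier, the failed /bar/g would reset the position and
the /baz/g search would start again from the beginning of the string.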

I think this /n switch needs a bit more thought and discussion, maybe from
some of you who've done more complex pattern operations in other languages.  

The /g switch, on the other hand, seems much more straight-forward and
could work just as I've described it above without shocking anyone.
Larry, what's your take on all this?

:>I know, I know... along that road lies APL and madness.
:
:Too late. Perl is already weirder than APL. Uglier too.

Oh good, does that mean we'll get

    @a += @b;
    @c = @a + @b;

one of these days then? :-)

Speaking of array operations, consider this.  You have an array
of colors and values, as from the perl man page:

    %map = ('red', 0x00f, 'blue', 0x0f0, 'green', 0xf00);

So that $map{'red'} == 0x00f and so on.  What if you want to 
invert the array so you can compute $map{0x00f}?  Well, certainly
you can do this semi-awkishly:

    for $color (keys %map) {
	$nmap{$map{$color}} = $color;
    } 

or even this in a lispy, semi-mapcar kind of way:

    grep( $nmap{$map{$_}} = $_, keys %map ); 

but even grep is too much work.  I think the true perl idiom is 

    @nmap{values %map} = keys %map;

which works just fine, is quite obvious about what it's doing, and seems
more in line with the Perlian Way.  I don't believe I've ever seen anyone 
do that before.
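
A runnable sketch of the slice idiom, assuming Perl 5 syntax. It relies
on keys and values walking the array in the same order, which holds as
long as the array is not modified in between. The variable %flipped is
added here for comparison and is not part of the original.

```perl
use strict;
use warnings;

my %map = ('red', 0x00f, 'blue', 0x0f0, 'green', 0xf00);

# Slice assignment: the values of %map become the keys of %nmap.
my %nmap;
@nmap{values %map} = keys %map;

# For a one-to-one mapping, flipping the key/value list gives the same result.
my %flipped = reverse %map;

print "$nmap{0x00f} $flipped{0x0f0}\n";
```

If two colors shared a value, only one of them would survive in the
inverted array, and which one is not specified.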

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
		"So much mail, so little time."

phillips@cs.ubc.ca (George Phillips) (05/21/91)

In article <1991May17.132403.12104@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>From the keyboard of rbj@uunet.uu.net (Root Boy Jim):
>:>I know, I know... along that road lies APL and madness.
>:
>:Too late. Perl is already weirder than APL. Uglier too.
>
>Oh good, does that mean we'll get
>
>    @a += @b;
>    @c = @a + @b;
>
>one of these days then? :-)

Nah, too conventional.  Go for an extension to the array to associative
array assignment.  We can already do this:

%a = ("bip", "bop", "boop", "beep" );

So why not:

%a .= ("more", "less", "more", "even less" );

Which should give $a{"more"} eq "lesseven less".  Now round it out
to support "+=", "*=" and all the other op assignment operators.

Not only would this be cute, but it could save as much as 3 lines of
code in every hundredth perl script you write.

%p .= (Just," another ",Just,Perl,Just," hacker,");print %p."\n";

--
George Phillips phillips@cs.ubc.ca {alberta,uw-beaver,uunet}!ubc-cs!phillips

rbj@uunet.uu.net (Root Boy Jim) (05/22/91)

tchrist@convex.COM (Tom Christiansen) writes:
?From the keyboard of rbj@uunet.uu.net (Root Boy Jim):
?:
?:Quite simply, the answer is: split(/RE/,exp) - 1;
?:
?There are a couple of problems in using split for this.  I think it
?has more overhead than it needs to have.  If all you want is the
?count of the exprs, there's no reason to go making all those @_ 
?values that you'll be creating as a side-effect of the split.

Yes. It's merely the conceptually simplest.

?    $count = /regexp/g;
?
?it would not need to create all those values, and seems more intuitive.  

I second the notion. It says what it means.

?I've also been thinking more about 
?
?    while (/foo/) {
?
?and somehow making that an iterator that starts the match from where it
?left off.  I think a decent way to do this would be to use a /n flag 
?indicating "next match".  Thus the syntax would be //n, or m//n.

I am leery of these operators with embedded state. It's just
another thing that has to be cleaned up.

?    while ($foo =~ /bar/n) {
?
?Is certainly one possibility, but another possible use would be:
?
?    if (/foo/ && /bar/n && /baz/n)

This really bothers me. It is one thing for each textual operator
to save its own state, quite another to refer to someplace different
in the program. Yes, I can see that they match the same variable.

Consider the following program:

	while ($a=<*>) {
		if (++$c & 1)  {
			$b=<*>;
		} else {
			$b='';
		}
		print "$a\t$b\n";
	}

You can see that each operator retains its own state.
A closure if you will. In the m//n case, the remembered
position would have to be stored with the variable perhaps.

?which might be faster than 
?
?    if (/foo.*bar.*baz/)

But speed isn't everything.

?A question is what you do on failure.  For example, does this make sense:
?
?    if (/foo/) {
?	if (/bar/n) { } 
?	elsif (/baz/n) { } 
?    } 
?
?If the /bar/n failed, could the /baz/n search start from the same place
?as the /bar/n started?

Yes, but it takes awhile to figger that out.
Advance the pointer only on successful matches.

?I think this /n switch needs a bit more thought and discussion, maybe from
?some of you who've done more complex pattern operations in other languages.  

I think it should be killed right here.

I think we would need an explicit position argument.
Such a beast almost exists: index. If only it did RE's.

Then the code would be something like:

	for ($cnt=$pos=0; ($pos=index($string,$RE,$pos)) >= 0; $pos+=length($&))
		{ $cnt++; }

?The /g switch, on the other hand, seems much more straight-forward and
?could work just as I've described it above without shocking anyone.
?Larry, what's your take on all this?
?
?:>I know, I know... along that road lies APL and madness.
?:
?:Too late. Perl is already weirder than APL. Uglier too.
?
?Oh good, does that mean we'll get
?
?    @a += @b;
?    @c = @a + @b;
?
?one of these days then? :-)

Not to mention @a += $b;

?or even this in a lispy, semi-mapcar kind of way:
?
?    grep( $nmap{$map{$_}} = $_, keys %map ); 
?
?but even grep is too much work.  I think the true perl idiom is 
?
?    @nmap{values %map} = keys %map;
?
?which works just fine, is quite obvious about what it's doing, and seems
?more in line with the Perlian Way.  I don't believe I've ever seen anyone 
?do that before.

LISP allows you to search an alist either way.
This is obviously better than two separate structures.

And I believe APL allows you to do the equivalent of "@a[1,3,5] = (1,9,25)".
However, APL doesn't have associative arrays.
-- 
		[rbj@uunet 1] stty sane
		unknown mode: sane

jmm@eci386.uucp (John Macdonald) (05/23/91)

In article <1991May21.184545.26905@uunet.uu.net> rbj@uunet.uu.net (Root Boy Jim) writes:
|tchrist@convex.COM (Tom Christiansen) writes:
|?I've also been thinking more about 
|?
|?    while (/foo/) {
|?
|?and somehow making that an iterator that starts the match from where it
|?left off.  I think a decent way to do this would be to use a /n flag 
|?indicating "next match".  Thus the syntax would be //n, or m//n.
|
|I am leery of these operators with embedded state. It's just
|another thing that has to be cleaned up.

If this sort of thing is going to be added, then perhaps it
would be appropriate to add iterators as a formal object
within the language, merging the concepts of globbing, ARGV
handle scanning using <>, numerical and string iterators
('aa'..'zz'), and (as proposed) some sort of RE iterators.

Iterators as a class could have the operations of rewind, check
if done, start next iteration, terminate the iterator, and any
others that seem appropriate.  Most of these operations fit in
well with embedding the iterator in a for or while loop and using
the loop control statements (next, last, first[a new one to rewind
the iterator back to its beginning]).

Perl already has picked up concepts from lots of other languages,
maybe it's time to get some from ICON.
-- 
sendmail - as easy to operate and as painless as using        | John Macdonald
manually powered dental tools on yourself - John R. MacMillan |   jmm@eci386