[comp.lang.perl] Counting characters

Martin.Ward%durham.ac.uk@pucc.PRINCETON.EDU (Martin Ward) (09/24/90)

evm@math.lsa.umich.edu suggested the following perl for counting the
occurrences of a character in a file:

cat file | perl -ne '$c += tr/A/A/; print "$c\n" if eof();'

as a replacement for:

cat file | tr -cd 'A' | wc -c

(A perl equivalent of "tr -cd 'xyz'" is "s/[^xyz]+//go;", by the way).
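
A quick way to check that equivalence (the input name "file" is just a
placeholder) is to run both forms and compare the counts:

tr -cd 'A' < file | wc -c
perl -pe 's/[^A]+//g;' file | wc -c

Both print the number of A's, since everything that is not an A
(newlines included) is deleted before wc counts what is left.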

Obviously the "cat file" and the test of eof() are not necessary, e.g.:

perl -e 'while (<>) { $c += tr/A/A/; } print "$c\n";' file

However, when treating files as binary information (rather than text
structured in lines) it is often more efficient to read the file in
blocks thus:

#!/usr/bin/perl
open(IN, "file");
for (;;) {
  $len = read(IN, $_, 10240);	# read a 10240-byte block into $_
  $c += tr/A/A/;		# count the A's in this block
  last if ($len < 10240);	# a short read means end of file
}
print "$c\n";

Here the block size is 10240 characters; experiment for the best size
on your machine, which will depend on the particular operation. This is
more portable than reading the whole file in one operation (with
"undef $/; $_ = <>"), which may run out of memory when reading a huge
file on a small machine (perl runs on Messy Dos as well as Sun 3/80's
with 64 Meg of swap space!). I find that block sizes over 100k don't
make much difference (in fact, for the character count problem, 10240
bytes was optimal on my Sun 3/80). This technique also works in
"degenerate" cases such as a file of one million newline characters!
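
To reproduce that degenerate case (the output name "newlines" is my own
choice), a one-liner like this will build the test file:

perl -e 'print "\n" x 1000000;' > newlines

Feeding it to the block-reading counter with tr/\n/\n/ in place of
tr/A/A/ should report exactly 1000000.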

You can even use the technique for "line-based" programs (programs that
treat files as one record per line). Instead of:

while (<>) {
  ... deal with the record in $_ ...
}

Try:

$* = 1;		# allow embedded newlines in records
$block = 2000;	# try different values on a "large" file
open(IN, $#ARGV >= 0 ? $ARGV[0] : "-");
	# open stdin or a file named on the command line
for (;;) {
  $len = read(IN, $_, $block);
  $_ .= <IN> if ($len == $block); # read the rest of the last record
	# the above is needed if you will be searching for words
  study;	# experiment to see if this helps or not
  ... deal with the record in $_ ...
  last if ($len < $block);
}

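As a concrete instance of the skeleton above (the pattern "foo", the
block size, and the split-on-newline approach are my own illustrative
choices, not anything from the original), here is a version that counts
the lines containing "foo":

#!/usr/bin/perl
$block = 10240;
$count = 0;
open(IN, $#ARGV >= 0 ? $ARGV[0] : "-");
for (;;) {
  $len = read(IN, $_, $block);
  $_ .= <IN> if ($len == $block);	# complete the final record
  foreach $line (split(/\n/)) {		# one record per line
    $count++ if ($line =~ /foo/);
  }
  last if ($len < $block);
}
print "$count\n";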

This gives most of the speedup of reading the whole file in one go,
without having to worry about running out of memory when reading a huge
file on a small machine. With one of my programs, which does lots of
substitutions on each line of the file, this version ran over twice as
fast. The perl character counter ran 3 to 4 times faster than the tr/wc
version! Specialised utilities like "tr" are not necessarily more
efficient than general-purpose utilities like perl (especially if the
general-purpose utility is written by a genius like Larry! :-) ).
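
For anyone who wants to repeat the timings (the file and script names
here are placeholders, and the exact ratios will vary with machine and
block size), the shell's time command gives a rough comparison:

time sh -c "tr -cd 'A' < file | wc -c"
time perl count.pl file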

Do others have experience of similar optimisations?

		Martin.

JANET: Martin.Ward@uk.ac.durham    Internet (eg US): Martin.Ward@DURHAM.AC.UK
or if that fails:  Martin.Ward%uk.ac.durham@nfsnet-relay.ac.uk
or even: Martin.Ward%DURHAM.AC.UK@CUNYVM.CUNY.EDU
BITNET: IN%"Martin.Ward@DURHAM.AC.UK" UUCP:...!mcvax!ukc!durham!Martin.Ward