Martin.Ward%durham.ac.uk@pucc.PRINCETON.EDU (Martin Ward) (09/24/90)
evm@math.lsa.umich.edu suggested the following perl for counting the
occurrences of a character in a file:

    cat file | perl -ne '$c += tr/A/A/; print "$c\n" if eof();'

as a replacement for:

    cat file | tr -cd 'A' | wc -c

(A perl equivalent of "tr -cd 'xyz'" is "s/[^xyz]+//go;", by the way.)

Obviously the "cat file" and the test of eof() are not necessary, eg:

    perl -e 'while (<>) { $c += tr/A/A/; } print "$c\n";' file

However, when treating files as binary information (rather than as text
structured in lines) it is often more efficient to read the file in
blocks, thus:

    #!/usr/bin/perl
    open(IN, "file");
    for (;;) {
        $len = read(IN, $_, 10240);
        $c += tr/A/A/;
        last if ($len < 10240);
    }
    print "$c\n";

Here the block size is 10240 characters - experiment to find the best
size on your machine, which will depend on the particular operation.
This is more portable than reading the whole file in one operation
(with "undef $/; $_ = <>"), which may run out of memory when reading a
huge file on a small machine (perl runs on Messy Dos as well as on Sun
3/80's with 64 Meg of swap space!).  I find that block sizes over 100k
don't make much difference (in fact, for the character count problem,
10240 bytes was optimal on my Sun 3/80).  The technique also works in
"degenerate" cases such as a file of one million newline characters!

You can even use the technique for "line-based" programs (programs
that treat a file as one record per line).  Instead of:

    while (<>) {
        ... deal with the record in $_ ...
    }

try:

    $* = 1;        # let patterns match across the newlines embedded in $_
    $block = 2000; # try different values on a "large" file
    open(IN, $#ARGV >= 0 ? $ARGV[0] : "-");  # open stdin or the file named on the command line
    for (;;) {
        $len = read(IN, $_, $block);
        $_ .= <IN> if ($len == $block);  # read to the end of the last record;
                                         # needed if you will be searching for words
        study;     # experiment to see whether this helps or not
        ... deal with the records in $_ ...
        last if ($len < $block);
    }

This gives most of the speedup of reading the whole file in one go,
without the worry of running out of memory when reading a huge file on
a small machine.  One of my programs, which does lots of substitutions
on each line of the file, ran over twice as fast in this version.  The
perl character counter ran 3 to 4 times faster than the tr/wc version!
Specialised utilities like "tr" are not necessarily more efficient than
general-purpose utilities like perl (especially if the general-purpose
utility is written by a genius like Larry! :-) ).

Do others have experience of similar optimisations?

                Martin.

JANET: Martin.Ward@uk.ac.durham    Internet (eg US): Martin.Ward@DURHAM.AC.UK
or if that fails: Martin.Ward%uk.ac.durham@nfsnet-relay.ac.uk
or even:          Martin.Ward%DURHAM.AC.UK@CUNYVM.CUNY.EDU
BITNET: IN%"Martin.Ward@DURHAM.AC.UK"   UUCP: ...!mcvax!ukc!durham!Martin.Ward
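P.S.  To make the last idea concrete, here is a rough sketch of the
block-reading loop applied to an ordinary line-based job - counting
the lines that contain a given word.  The word "foo", the file
handling and the block size are only placeholders chosen to show the
shape of the loop; I have not timed this particular example:

    #!/usr/bin/perl
    # Count the lines containing the word "foo", reading in blocks.
    $block = 10240;     # block size - experiment on your machine
    $count = 0;
    open(IN, $#ARGV >= 0 ? $ARGV[0] : "-");
    for (;;) {
        $len = read(IN, $_, $block);
        $_ .= <IN> if ($len == $block);   # finish off the last line
        foreach $line (split(/\n/, $_)) {
            $count++ if ($line =~ /\bfoo\b/);
        }
        last if ($len < $block);
    }
    print "$count\n";

Splitting each block into lines avoids any worries about $* and
multi-line patterns, at the cost of building a list of lines for
every block.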