[comp.lang.perl] Why is this so slow?

ben@cunixf.cc.columbia.edu (Ben Fried) (12/07/90)

I've written a filter in perl to do printer page counting for DEC ln03
printers.  Right now it seems to be working fine, but it runs abysmally
slowly.  The vast majority of time is taken up in the following loop:

mainloop:
while(read(STDIN, $_, 2048)) {
    foreach $i ($[..length($_)) {
	$num_to_print = $i;
	$oc=$c;
	$c = substr($_, $i, 1);
	if ($tektronixmode) {
	    if (($c eq "\f") && ($oc eq "\033")) {
		$pageno++;
	    }
	} else {
	    if ($ascii_mode) {
		if ($c eq "\n") {
		    $lineno++;
		    if ($lineno == $lines_per_page) {
			$lineno = 0;
			$pageno++;
		    }
		}
	    } else {
		if ($c eq "\f") {
		    $lineno = 0;
		    $pageno++;
		}
	    }
	}

	if ($pageno >= $max_pages) {
	    $debug && print STDERR "aborted at page $pageno\n";
	    print STDOUT substr($_, $[, $num_to_print);
	    last mainloop;
	}

    }

    print STDOUT;
}

Can anyone give me some pointers to speed this up?  I already tried
using ord to cast $c and $oc to numbers, and using integer comparisons
agains the ords of \n, \f, and \033, but that actually ran slower than
the above code.  All that really matters is that I be able to count the
number of pages on the input stream, and be able to break out of the
loop after either (a) EOF on stdin, or (b) $pageno == $max_pages,
whichever comes first.

I was thinking of doing a split(//), and then using grep to count
occurrences, but couldn't think of a good way of stopping when $pageno
hits $max_pages.  Now that I think about it, maybe I should be able to
do this, I just have to come up with a more complicated EXPR for grep.
Perhaps it's time I went through the JAPH collection more closely.
Still, would splitting input into arrays and using grep be any faster?

Ben
--
Benjamin Fried
ben@cunixf.cc.columbia.edu				rutgers!columbia!ben

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (12/07/90)

In article <1990Dec6.194412.9132@cunixf.cc.columbia.edu> ben@cunixf.cc.columbia.edu (Ben Fried) writes:
: I've written a filter in perl to do printer page counting for DEC ln03
: printers.  Right now it seems to be working fine, but it runs abysmally
: slowly.  The vast majority of time is taken up in the following loop:
: 
: mainloop:
: while(read(STDIN, $_, 2048)) {
:     foreach $i ($[..length($_)) {
: 	$num_to_print = $i;
: 	$oc=$c;
: 	$c = substr($_, $i, 1);
: 	if ($tektronixmode) {
: 	    if (($c eq "\f") && ($oc eq "\033")) {
: 		$pageno++;
: 	    }
: 	} else {
: 	    if ($ascii_mode) {
: 		if ($c eq "\n") {
: 		    $lineno++;
: 		    if ($lineno == $lines_per_page) {
: 			$lineno = 0;
: 			$pageno++;
: 		    }
: 		}
: 	    } else {
: 		if ($c eq "\f") {
: 		    $lineno = 0;
: 		    $pageno++;
: 		}
: 	    }
: 	}
: 
: 	if ($pageno >= $max_pages) {
: 	    $debug && print STDERR "aborted at page $pageno\n";
: 	    print STDOUT substr($_, $[, $num_to_print);
: 	    last mainloop;
: 	}
: 
:     }
: 
:     print STDOUT;
: }

That looks like a reasonable C solution, though even in C I'd avoid testing
all those conditionals on every character.  Here's an equivalent in Perl.

if ($tektronixmode) {
    $/ = "\f";
    while (<>) {
	$pageno++ if /\033\f$/;
	&blowup, last if $pageno > $max_pages;
	print;
    }
}
elsif ($ascii_mode) {
    $max_lines = $max_pages * $lines_per_page;
    while (<>) {
	&blowup, last if $. > $max_lines;
	print;
    }
}
else {
    $/ = "\f";
    while (<>) {
	&blowup, last if $. > $max_pages;
    }
}

sub blowup {
    $debug && print STDERR "aborted at page $pageno\n";
}

I'm not sure that $ascii_mode is a valid distinction, however.  Perhaps
you really mean:

if ($tektronixmode) {
    $/ = "\f";
    while (<>) {
	$pageno++ if /\033\f$/;
	&blowup, last if $pageno > $max_pages;
	print;
    }
}
else {
    while (<>) {
	$lineno++;
	$pageno++, $lineno = 0 if /\f/;
	$pageno++, $lineno = 0 if $lineno > $lines_per_page;
	&blowup, last if $pageno > $max_pages;
	print;
    }
}

Or something like that...

Larry

merlyn@iwarp.intel.com (Randal Schwartz) (12/07/90)

In article <1990Dec6.194412.9132@cunixf.cc.columbia.edu>, ben@cunixf (Ben Fried) writes:
| mainloop:
| while(read(STDIN, $_, 2048)) {
|     foreach $i ($[..length($_)) {
| 	$num_to_print = $i;
| 	$oc=$c;
| 	$c = substr($_, $i, 1);
| 	if ($tektronixmode) {
| 	    if (($c eq "\f") && ($oc eq "\033")) {
| 		$pageno++;
| 	    }
| 	} else {
| 	    if ($ascii_mode) {
| 		if ($c eq "\n") {
| 		    $lineno++;
| 		    if ($lineno == $lines_per_page) {
| 			$lineno = 0;
| 			$pageno++;
| 		    }
| 		}
| 	    } else {
| 		if ($c eq "\f") {
| 		    $lineno = 0;
| 		    $pageno++;
| 		}
| 	    }
| 	}
| 
| 	if ($pageno >= $max_pages) {
| 	    $debug && print STDERR "aborted at page $pageno\n";
| 	    print STDOUT substr($_, $[, $num_to_print);
| 	    last mainloop;
| 	}
| 
|     }
| 
|     print STDOUT;
| }

Hmm.  That's probably the right way to do it in some other language
besides Perl.  (It smells vaguely BASIC-like, or maybe C-like.)

Anyway, as a first cut, I'd do something like:

$lines_per_page = 66; $max_pages = 100; # you forgot this
while (<>) {
	$lineno++;
	$pageno++, $lineno = 0 if /\f/ || ($lineno >= $lines_per_page);
	die "Aborted at page $pageno\n" if $pageno >= $max_pages;
	print;
}

Yes?  Maybe?  OK, so I didn't add the tek vs. ascii mode, but that'd
only make it one line longer. :-)

I think you'll find this runs a teensy-tiny bit faster. :-)

print "Just another Perl hacker,"
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

ben@cunixf.cc.columbia.edu (Ben Fried) (12/07/90)

In <1990Dec6.205447.13500@iwarp.intel.com> Randal writes:

[ My source code ]

> Hmm.  That's probably the right way to do it in some other language
> besides Perl.  (It smells vaguely BASIC-like, or maybe C-like.)
> 
> Anyway, as a first cut, I'd do something like:
> 
> $lines_per_page = 66; $max_pages = 100; # you forgot this

They're set elsewhere in the program; sorry for not including that.

> while (<>) {
> 	$lineno++;
> 	$pageno++, $lineno = 0 if /\f/ || ($lineno >= $lines_per_page);
> 	die "Aborted at page $pageno\n" if $pageno >= $max_pages;
> 	print;
> }
> 
> Yes?  Maybe?  OK, so I didn't add the tek vs. ascii mode, but that'd
> only make it one line longer. :-)
> 
> I think you'll find this runs a teensy-tiny bit faster. :-)

I failed to adequately explain my problem: only in about 1/3 of the
cases are there newlines in the input, in the rest they may not appear
at all, or if they do they do not delimit lines, but appear as binary
data.  In those cases (plain ascii files), Randal's code is 100%
correct.  However, in the remainder of the cases, I cannot depend on
newlines or any other line (or record, or whatever you want to call it)
delimiter being present to allow easy use of the while (<>) construct.
In fact, one might see a sequence sort of like:

\n\f\n\n\n\nxxxxweriuweru925349853495\033\fasdfsdfasdf\033\f9r8\f3495\fasasf\n

In this case, the \n's and plain \f's should be ignored - they're data,
not line separators - pages are separated by \033\f, and plain \f's are
just data, and max_pages might very well be reached after the first
\033\f combination there.

In other cases, though, \f by itself is a page-eject; it depends on
what mode the printer is in.

> print "Just another Perl hacker,"
> -- 

Whatever...

Ben
--
Benjamin Fried
ben@cunixf.cc.columbia.edu				rutgers!columbia!ben

worley@compass.com (Dale Worley) (12/08/90)

   From: merlyn@iwarp.intel.com (Randal Schwartz)

   $lines_per_page = 66; $max_pages = 100; # you forgot this
   while (<>) {
	   $lineno++;
	   $pageno++, $lineno = 0 if /\f/ || ($lineno >= $lines_per_page);
	   die "Aborted at page $pageno\n" if $pageno >= $max_pages;
	   print;
   }

Well, to start with, if there are two form-feeds in one line, $pageno
will only get incremented by 1.

I may be wrong, but to find out the number of times a regexp matches a
string, you have to write an explicit loop.

Dale Worley		Compass, Inc.			worley@compass.com
--
In _EE_Times_, 16 October 1989, Intel VP David House says:  "Bill Gates
[software maker Microsoft's chairman] says no matter how much more power we
can supply, he'll develop some really exciting software that will bring the
machine to its knees."

carl@udwarf.tymnet.com (Carl Baltrunas & Cherie Marinelli 0.1.9) (12/08/90)

In article <1990Dec6.194412.9132@cunixf.cc.columbia.edu>, ben@cunixf.cc.columbia.edu (Ben Fried) writes:
> 
> I've written a filter in perl to do printer page counting for DEC ln03
> printers.  Right now it seems to be working fine, but it runs abysmally
> slowly.  The vast majority of time is taken up in the following loop:
>
> [perl script deleted] 
> 
> All that really matters is that I be able to count the
> number of pages on the input stream, and be able to break out of the
> loop after either (a) EOF on stdin, or (b) $pageno == $max_pages,
> whichever comes first.
>                                  ...more stuff deleted...
> Ben
> --
> Benjamin Fried
> ben@cunixf.cc.columbia.edu	rutgers!columbia!ben

I know that Ben didn't mention this, and so it wasn't in Randal's or Larry's
responses, so I thought that I'd bring it up.  (I've had to write filters on
non-unix machines to create microfiche-able output, so I've had to do this).

What about line length?  (Page width or whatever you want to call it) ..and
what about line wrap on long lines.  Some print filters do these things for
you, and this changes the the characteristics of the output from straight \n
and \f counting.

So, can you tell perl that while reading input lines to break lines at the point
you want to make it wrap, or do you have to use something like the following
modifications to randal's response:

$lines_per_page = 66; $max_pages = 100; # you forgot this
while (<>) {
    $lineno += ( (length($_) + $line_width - 1) / $line_width );
    $pageno += ( $lineno + $lines_per_page ) / $lines_per_page,   \
               $lineno = 0   if  /\f/;
    $pageno++, $lineno -= $lines_per_page if ($lineno >= $lines_per_page);
    die "Aborted at page $pageno\n" if $pageno >= $max_pages;
    print;
}

And one thing I left out... what to do about \t's (TABS) in text, since they
push the line width quite a bit depending on position.  (I won't mention all
those [which font?] [what about kerning?] word processing things.  They must
require handling each byte. :-)

PS. I guess an array of flags and point sizes could be built first based on the
    font used and a state machine that uses the byte value to lookup the width
    of each character and flags to add kerning offsets and can we break the line
    here... would make a pretty fast way of processing it.  Randal, can this be
    done easily in perl?  How about a 1 liner?

-Carl


 Carl A Baltrunas - Catalyst Art, Cherie Marinelli - Bijoux
 {sumex, apple}!oliveb!tymix!tardis!udwarf!{carl or cherie}
 {carl or cherie}%udwarf@tardis.tymnet.com
 {carl or cherie}@udwarf.tymnet.com

ben@cunixf.cc.columbia.edu (Ben Fried) (12/10/90)

In article <1990Dec7.212127.8161@uvaarpa.Virginia.EDU> Dale Worley writes:
>    From: merlyn@iwarp.intel.com (Randal Schwartz)
>
>    $lines_per_page = 66; $max_pages = 100; # you forgot this
>    while (<>) {
> 	   $lineno++;
> 	   $pageno++, $lineno = 0 if /\f/ || ($lineno >= $lines_per_page);
> 	   die "Aborted at page $pageno\n" if $pageno >= $max_pages;
> 	   print;
>    }
> 
> Well, to start with, if there are two form-feeds in one line, $pageno
> will only get incremented by 1.
> 
> I may be wrong, but to find out the number of times a regexp matches a
> string, you have to write an explicit loop.

This is a good point - neither Randal's nor Larry's posted code handles
this case.  

Larry's code for non-plaintext files works just great: in those cases,
the _only_ way that the printer can reach end of page is by a \f or an
\033\f, depending on mode; his idea of setting $/ to \f (and in the
second case matching against /\033\f$/) works great.

This made me wish that perl had some operator for determining the count
of matches of a pattern; is that something that sounds generally useful?

> Dale Worley		Compass, Inc.			worley@compass.com

Ben
--
Benjamin Fried
ben@cunixf.cc.columbia.edu				rutgers!columbia!ben

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (12/10/90)

In article <1990Dec10.022814.5349@cunixf.cc.columbia.edu> ben@cunixf.cc.columbia.edu (Ben Fried) writes:
: This made me wish that perl had some operator for determining the count
: of matches of a pattern; is that something that sounds generally useful?

Look up what the return value is for either tr/// or s///g.  Guess what?

Larry