[comp.lang.perl] Fast way to join lines ?

hai@hpfcso.HP.COM (Hai Vo-Ba) (12/01/90)

	I am using perl to join every N lines of a very large file
    together and wonder what is the faster way to do this:

-----------------------------------------------------------------------

#!/usr/local/bin/perl

$\ = "\n";              # set output record separator

while (<>) {
    chop;       # strip record separator
    $line .= $_;
    if (($. % 32) == 0) {
        print $line;
        $line = '';
    }
}

if ($line ne '') { print $line }

-----------------------------------------------------------------------

	Thanks in advance.

Hai  Vo-Ba                              (303)229-3874
IC Business Division                    hai@hpfihvb.fc.hp.com
Hewlett Packard Co.                     MS 72
3404 E. Harmony Rd.                     Fort Collins, Colorado 80525-9599

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (12/02/90)

In article <9830001@hpfcso.HP.COM> hai@hpfcso.HP.COM (Hai Vo-Ba) writes:
> 	I am using perl to join every N lines of a very large file
>     together and wonder what is the faster way to do this:

The faster way is something like this:

#include <stdio.h>
main()
{
 int ch; int t = 33;
 while ((ch = getchar()) != EOF)
  {
   if (ch == '\n') if (--t) continue; else t = 33;
   putchar(ch);
  }
}

---Dan

tchrist@convex.COM (Tom Christiansen) (12/02/90)

In article <2967:Dec122:39:3790@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
:In article <9830001@hpfcso.HP.COM> hai@hpfcso.HP.COM (Hai Vo-Ba) writes:
:> 	I am using perl to join every N lines of a very large file
:>     together and wonder what is the faster way to do this:

(code restored for further reference)
:>
:>      $\ = "\n";              # set output record separator
:>
:>      while (<>) {
:>          chop;       # strip record separator
:>          $line .= $_;
:>          if (($. % 32) == 0) {
:>              print $line;
:>              $line = '';
:>          }
:>      }
:>
:>      if ($line ne '') { print $line; }

Dan then writes:

:The faster way is something like this:
:
:#include <stdio.h>
:main()
:{
: int ch; int t = 33;
: while ((ch = getchar()) != EOF)
:  {
:   if (ch == '\n') if (--t) continue; else t = 33;
:   putchar(ch);
:  }
:}

Well, I'm afraid we've got just a few problems here.  

The first one is that the C code doesn't do what the Perl code does, and
the poster requested a faster way to do the same thing.  The Perl construct
"while (<>)" is not equivalent to "while (<STDIN>)".  The construct used
by the original poster will traverse its command line argument list and
treat it as one continuous input stream, correctly processing any "-"
arguments, and defaulting to stdin if no arguments are given.  Dan's code
only consults stdin, so it's not as functional.
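
For comparison, here is a minimal sketch of what matching the `while (<>)` behavior described above might look like in C. The names join_stream and join_args are made up for illustration and appear nowhere in the thread. The sketch assumes the countdown should carry across files, so that the arguments behave as one continuous stream.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy `in` to `out`, dropping every newline except
   each n-th one.  *t holds the countdown and is carried across calls so
   that several inputs behave as one continuous stream. */
void join_stream(FILE *in, FILE *out, int n, int *t)
{
    int ch;

    while ((ch = getc(in)) != EOF) {
        if (ch == '\n') {
            if (--*t)
                continue;   /* swallow this newline */
            *t = n;         /* n-th newline: keep it, restart the count */
        }
        putc(ch, out);
    }
}

/* Hypothetical driver imitating Perl's <>: no arguments means stdin,
   "-" means stdin, anything else is opened as a file.
   Returns 0 on success, 1 if some file could not be opened. */
int join_args(int argc, char **argv, FILE *out, int n)
{
    int t = n, status = 0, i;

    if (argc < 2) {
        join_stream(stdin, out, n, &t);
        return 0;
    }
    for (i = 1; i < argc; i++) {
        FILE *in = strcmp(argv[i], "-") == 0 ? stdin : fopen(argv[i], "r");

        if (in == NULL) {
            perror(argv[i]);    /* report the failure and move on */
            status = 1;
            continue;
        }
        join_stream(in, out, n, &t);
        if (in != stdin)
            fclose(in);
    }
    return status;
}
```

A main() that simply returns join_args(argc, argv, stdout, 32) would give a filter that can be invoked the same way as the Perl script.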

The second problem is that (as I mentioned before) while it's good to
maintain perspective of using the right tool for the job at hand, this
*IS* comp.lang.perl, and the poster seemed to be clearly searching for a
perlian solution to his problem.  How do you, Dan, know that this wasn't
just a code fragment extracted for demonstration purposes from a larger
program of the poster's?

Look at it this way:  if I hung around comp.lang.c and kept posting Perl
solutions to people's C questions, it would eventually grate on people's
nerves.  A non-productive flame war would start up that would waste
net bandwidth, the readers' time, and just generally rain on everyone's
parade unnecessarily.  We've had a very flame-free, productive little
group here since its inception, so let's keep it that way, OK?

The third problem is that the poster asked for a faster way.  There are
several interpretations of faster, including but not limited to faster
writing time, faster compile time, faster debugging time, and faster run
time.  First let me offer a faster Perl version of the poster's original
code:

    while (<>) {
        chop if $. % 32;
        print;
    }

If this doesn't need to be part of another program, you might as
well just do it this way:

    perl -pe 'chop if $. % 32'

or else

    perl -pe 'chop if $. & 31'
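
In C terms, the two one-liners rest on the fact that, for a non-negative line number, the remainder modulo 32 equals the low five bits. The sketch below uses illustrative function names that are not from the thread. It also shows that the mask shortcut does not carry over to a count that is not a power of two, such as 33.

```c
/* Illustrative names, not from the thread.  Each function returns
   nonzero when the newline ending line `lineno` (counted from 1) should
   be chopped, mirroring `chop if $. % 32` and `chop if $. & 31`. */
int chop_by_mod(unsigned long lineno, unsigned long n)
{
    return lineno % n != 0;
}

/* Correct only when n is a power of two. */
int chop_by_mask(unsigned long lineno, unsigned long n)
{
    return (lineno & (n - 1)) != 0;
}

/* Returns 1 if both rules agree on lines 1..lines, else 0. */
int rules_agree(unsigned long n, unsigned long lines)
{
    unsigned long lineno;

    for (lineno = 1; lineno <= lines; lineno++)
        if (chop_by_mod(lineno, n) != chop_by_mask(lineno, n))
            return 0;
    return 1;
}
```

Here rules_agree(32, ...) holds for any range, while rules_agree(33, ...) already fails on line 1, because 1 & 32 is zero but 1 % 33 is not.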

Now, let's first talk run time here.  On my 2250-line termcap file, Dan's
C program (which you'll recall doesn't do all that the Perl one does) runs
in this much time:

	0.450524 real        0.340401 user        0.054859 sys

whereas my Perl one-liner runs in just this much time:

	0.684193 real        0.450535 user        0.083110 sys

I find that pretty respectable; I don't think we're going to quibble about
a couple seconds, let alone eleven hundredths of a second of user time.

[ I probably shouldn't even mention that if we eat the whitespace, mine
can be reduced to 12 bytes, and Dan's to 130, but I just did anyway. :-/ ]

As far as I'm concerned, and I'll bet you this goes for most of the rest
of the readership of this newsgroup as well, anything that you can express
as a quick one-liner without having to go into an editor (let alone
compile an a.out!) is worth doing that way.  The 0.11 seconds of user
time you lost on the run are more than made up for by how quickly you
wrote and ran the Perl code.  Furthermore, it's a lot more legible
because its complexity is drastically reduced, which means it'll be more
maintainable as well.


--tom

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (12/03/90)

In article <109688@convex.convex.com> tchrist@convex.COM (Tom Christiansen) writes:
> In article <2967:Dec122:39:3790@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> :In article <9830001@hpfcso.HP.COM> hai@hpfcso.HP.COM (Hai Vo-Ba) writes:
> :> 	I am using perl to join every N lines of a very large file
> :>     together and wonder what is the faster way to do this:
     [ some code ]
> Dan then writes:
> :The faster way is something like this:
   [ a similar amount of C code ]
> Well, I'm afraid we've got just a few problems here.  

Before I respond to your comments, here's a very quick essay on the
point of comp.lang.perl.

A couple of weeks ago, someone posted a 100-line program to comp.lang.c
and asked the net to debug it for him. Doug Gwyn gently reminded him
that just because a programming problem happens to be in C doesn't mean
it has anything to do with comp.lang.c. He then answered the question
anyway, pointing out the bugs in the program. Henry Spencer also went
out of his way to say how inappropriate the posting was.

Some months back, several people said in news.groups that the proposed
comp.lang.perl would carry a lot of inappropriate content. Somebody even
suggested comp.sources.perl. They were right. Although Larry, Randal,
and Tom try hard to make it otherwise, this group is flooded with
articles having no more to do with the Perl language than that debugging
problem had to do with C.

There aren't many groups dedicated to programming as an issue in itself.
Sure, there's comp.unix.programmer for UNIX-specific programming, and
rec.games.programmer for games programming, and comp.software-eng for
theoretical crap. But there's nowhere a programmer can turn if he wants
to get advice on a programming problem that doesn't have to do with
UNIX, or games, or whatever. What happens? Joe Shmoe says ``It's in C!
So I'll post to comp.lang.c!'' Or ``It's under UNIX! So I'll post to
comp.unix.programmer!'' Or ``It's in Perl! So I'll try comp.lang.perl!''

So much for the essay. What happened here? Someone posted a problem to
comp.lang.perl. I looked at it and said ``Oh, he's reading in lines, and
taking the line number mod 32. But he's not manipulating the lines. So
why doesn't he just copy the input to the output, leaving off the
newlines except every 32nd? And he should use a rotating counter instead
of taking mods, so he can handle any size file, and doesn't have to
worry about machines with slow division.''

In other words, I treated it as a programming problem. I responded
likewise. At first I just wrote what I had above. Then I decided that a
program would make more effective exposition than English text. Since I
program better in C than in Perl, I stuck to C.

Did that make my article appropriate for comp.lang.c? No. A general
programming problem is never appropriate for a language newsgroup,
unless some language feature greatly affects the coding technique.
Just because something is C code doesn't make it right for comp.lang.c.
And just because something is in Perl doesn't make it right for
comp.lang.perl.

> The first one is that the C code doesn't do what the Perl code does,
  [ the Perl version can read files, my version reads stdin ]

BFD. He can cat the original files, then pipe them through the filter.
Or on any system he can add five lines of argument processing to the
program. (I've been trying to convince Berkeley to add a library for
this job to BSD. If they do, it could be standard by 1997... [grin])

(It is true that the code does something different, btw: it runs t in a
cycle of length 33, to point out that a rotating counter is always fast,
while % or & won't be very fast if the cycle length isn't a power of 2.
Of course, an optimizer with really smart reduction could do this
transformation for itself.)
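
The rotating counter can be set side by side with the line-number remainder in a small sketch. The function names are hypothetical and the cycle length n is a parameter rather than the fixed 33 used in the program. The countdown never grows beyond n, whatever the size of the file, and it needs no division.

```c
/* Decide by division: keep the newline ending line `lineno` (counted
   from 1) when lineno is a multiple of n. */
int keep_by_mod(unsigned long lineno, unsigned long n)
{
    return lineno % n == 0;
}

/* Decide by countdown: *t starts at n; each newline decrements it, and
   the newline that brings it to zero is kept and resets the counter. */
int keep_by_counter(unsigned long *t, unsigned long n)
{
    if (--*t)
        return 0;
    *t = n;
    return 1;
}

/* Hypothetical check: returns the first line number in 1..lines where
   the two rules disagree, or 0 if they agree throughout. */
unsigned long first_disagreement(unsigned long n, unsigned long lines)
{
    unsigned long t = n, lineno;

    for (lineno = 1; lineno <= lines; lineno++)
        if (keep_by_mod(lineno, n) != keep_by_counter(&t, n))
            return lineno;
    return 0;
}
```

For both n = 32 and n = 33 the two rules pick the same newlines, so the choice between them affects only speed, not output.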

> The second problem is that (as I mentioned before) while it's good to
> maintain perspective of using the right tool for the job at hand, this
> *IS* comp.lang.perl, and the poster seemed to be clearly searching for a
> perlian solution to his problem.

Read my essay above. I'm quite sure that if I had used words (``Just
copy characters to the output. Rotate a counter on each newline; only
print the newline if the counter is 0.'') you wouldn't be complaining.
Now you're offended because I decided that C code would illustrate this
more effectively than words?

When someone posts an article in English, some Germans get a translator.
Those who can read English appreciate that the poster had something to
say, and didn't know how to say it as effectively in German as in his
native language. Those who were also brought up in English don't even
think about the choice of language; they just pay attention to the point
at hand. So what if the article was posted to alt.prose?

> Look at it this way:  if I hung around comp.lang.c and kept posting Perl
> solutions to people's C questions, it would eventually grate on people's
> nerves.

No: it grates on people's nerves when people post general programming
problems in comp.lang.c. But once a thread has been established in the
wrong group, it's more polite to stick to that decision than to split
off into another inappropriate group.

> We've had a very flame-free, productive little
> group here since its inception, so let's keep it that way, OK?

You're the one who started flaming. I was just answering a programming
question.

> The third problem is that the poster asked for a faster way.  There are
> several interpretations of faster, including but not limited to faster
> writing time, faster compile time, faster debugging time, and faster run
> time.

That's not a problem with my code; it's a problem with your definition
of ``faster.'' The only objective measure is faster run time, and by
that measure I did answer the question.

Compile time depends on what you mean by ``compile''---I'd say Perl
keeps compiling every time you use the code, while you only need to
compile C when you change it. Writing time and debugging time are quite
subjective---for you, a Perl solution may be faster to write, but for
me, a C solution is faster. Why not stick to the objective terms?

  [ ... ]
>     perl -pe 'chop if $. % 32'
> or else
>     perl -pe 'chop if $. & 31'

Now that's showing people how to use Perl more effectively. But suppose
I had seen your answer first, and wanted to say that (at least in most
languages) it's faster to process characters instead of lines? The
original question was about making the program run faster. Petty Perl
optimizations are cute, but an improved algorithm is more effective.

  [ C:    0.450524 real   0.340401 user   0.054859 sys ]
  [ Perl: 0.684193 real   0.450535 user   0.083110 sys ]

I agree, 50% slower is respectable for a general tool. But it's still
50% slower.

> As far as I'm concerned, and I'll bet you this goes for most of the rest
> of the readership of this newsgroup as well, anything that you can express
> as a quick one-liner without having to go into an editor (let alone
> compile an a.out!) is worth doing that way.

Yes, it's worth doing that way. But it's worth even more to recode these
things in C. It took me thirty seconds from start to finish to write and
compile that code. If the program is used more than 150 times, it's
worth it.
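
The arithmetic behind a break-even figure like this can be sketched as follows. The function names are illustrative, and the thread does not say which of the quoted timings the 150 was derived from, so the inputs in the note after the code are assumptions.

```c
/* Number of runs after which a one-time cost (seconds spent writing and
   compiling) is repaid by the time saved on each run (seconds). */
double break_even_runs(double one_time_cost, double slow_run, double fast_run)
{
    return one_time_cost / (slow_run - fast_run);
}

/* How much slower `slow` is than `fast`, as a percentage. */
double percent_slower(double slow, double fast)
{
    return (slow / fast - 1.0) * 100.0;
}
```

A saving of 0.2 seconds per run gives 30 / 0.2 = 150 runs. With the real-time figures quoted in the thread (0.684193 against 0.450524) the result is about 128 runs. On those same figures the Perl version is about 52% slower in real time and about 32% slower in user time.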

> Furthermore, it's a lot more legible
> because its complexity is drastically reduced, which means it'll be more
> maintainable as well.

Oh? I look at the C program and see ``Process each character. Skip all
but every 33rd newline. Copy to output.'' These C idioms are much more
familiar to me than the mere definition of Perl's ``chop''. So to me,
and probably to lots of other C programmers, the C code is much more
maintainable.

---Dan