[comp.lang.perl] Help a perl apprentice

tbc@hp-lsd.COS.HP.COM (Tim Chambers) (10/09/90)

I have just written my first perl script.  Ok, so it's not as cryptic as TECO
(which I never did master) but it's still a pretty nonintuitive language to
me.

So I ask the perl wizards to help me get used to programming in this new
beasty.  I'm enclosing the perl script, the sh script I started with, and an
example of the program's input.  My questions are:

1.  Why doesn't the "next unless" line I've commented out do what I want it
    to?  (Trust me, it doesn't on my HP 9000 Series 300 HP-UX implementation.)

2.  Although I realize this isn't a teaching center, would anyone care to
    offer some advice on how to optimize my perl implementation of the
    algorithm?  The algorithm I use is (1) capture in an array, (2) sort the
    array, (3) format and print the array elements.  I found it clumsy to do
    all in perl, but it still runs faster than the filters I use in the shell
    script.  Are there perl-isms I am missing that make the algorithm run more
    efficiently?  (Especially my by_free sorting subroutine.)

##############################################################################
#!/usr/local/bin/perl
# $Header: mydf,v 1.3 90/10/08 17:17:19 tbc Exp $

sub by_free {
    @a = split(' ', $a);
    @b = split(' ', $b);
    $a[3] lt $b[3] ? 1 : $a[3] gt $b[3] ? -1 : 0;
}

open (BDF_PIPE, "bdf $* |");

# output
print "Filesystem                     Free (Mb) %Used\n";
print "------------------------------|---------|-----\n";
for ($i=0;<BDF_PIPE>;$i++) {
# why doesn't this work?!
#    next unless (/^.*\%.*$/);
    next unless /^[^ ]+ /;
    next if /Mounted/;
    @lines[$i] = $_;
};

close (BDF_PIPE);

@sortedlines = sort by_free @lines;

foreach $line (@sortedlines) {
    ($f1, $f2, $f3, $avail, $capacity, $dirname) = split(' ', $line);
    printf '%s', $dirname;
    for ($i = 30; $i > length($dirname); $i--) {
	printf ('.');
	}
    printf "%10.1f%6s\n", $avail / 1000, $capacity;
};

print "\n";

##############################################################################
#!/bin/sh
# $Header: mydf,v 1.2 90/09/06 08:42:15 tbc Exp $
echo "Filesystem                     Free (Mb) %Used"
echo "------------------------------|---------|-----"

# first filter to get around bdf's "pretty" format, then use awk
bdf $* | cut -c36-80 | fgrep '%' | sort -nr | \
awk '{ printf("%s", $3)
for (i = 30 ; i > length($3) ; i --) printf(".")
printf ("%10.1f%6s\n", $1 / 1000, $2) }'

echo

##############################################################################
# this is what "bdf" does, in case it's not on non-HP-UX systems
$ bdf
Filesystem           kbytes    used   avail capacity Mounted on
/dev/root            487022  405811   32508    93%   /
/dev/dsk/17s7        278082  240761   23416    91%   /dsk17.07
/dev/dsk/17s3        278082  202040   48233    81%   /users4
/dev/dsk/0s2         278082  188859   61414    75%   /disk3
/dev/dsk/0s4         119429   74980   32506    70%   /users2
/dev/dsk/0s3         230127  166056   41058    80%   /users
hp-lsd:/usr/spool/notes
                     387657  248645  131258    65%   /usr/spool/notes
hplsdry:/work2/hp64000/hptwk/twk_db
                      90607   69076   17000    80%   /usr/hp64000/hptwk/twk_db

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (10/24/90)

In article <18840001@hp-lsd.COS.HP.COM> tbc@hp-lsd.COS.HP.COM (Tim Chambers) writes:
: I have just written my first perl script.  Ok, so it's not as cryptic as TECO
: (which I never did master) but it's still a pretty nonintuitive language to
: me.

Strange, seems pretty intuitive to me...  :-)

And no, I wasn't the person who wrote a Fortran compiler in TECO.

: So I ask the perl wizards to help me get used to programming in this new
: beasty.  I'm enclosing the perl script, the sh script I started with, and an
: example of the program's input.  My questions are:
: 
: 1.  Why doesn't the "next unless" line I've commented out do what I want it
:     to?  (Trust me, it doesn't on my HP 9000 Series 300 HP-UX implementation.)

Probably because of a bug I just fixed, presuming you're on patchlevel 36 or 37.

: 2.  Although I realize this isn't a teaching center, would anyone care to
:     offer some advice on how to optimize my perl implementation of the
:     algorithm?  The algorithm I use is (1) capture in an array, (2) sort the
:     array, (3) format and print the array elements.  I found it clumsy to do
:     all in perl, but it still runs faster than the filters I use in the shell
:     script.  Are there perl-isms I am missing that make the algorithm run more
:     efficiently?  (Especially my by_free sorting subroutine.)

I'll talk about your script point by point, and then give how I'd write it.

: ##############################################################################
: #!/usr/local/bin/perl
: # $Header: mydf,v 1.3 90/10/08 17:17:19 tbc Exp $
: 
: sub by_free {
:     @a = split(' ', $a);
:     @b = split(' ', $b);

You realize that this splits each line multiple times?  This is most of
your extra overhead.

:     $a[3] lt $b[3] ? 1 : $a[3] gt $b[3] ? -1 : 0;

As of 36, you can just say

	$b[3] <=> $a[3];

: }
: 
: open (BDF_PIPE, "bdf $* |");

A nit, but you want

  open (BDF_PIPE, "bdf @ARGV |");

or better,

  open (BDF_PIPE, "bdf @ARGV |") || die "Can't run bdf: $!\n";

: # output
: print "Filesystem                     Free (Mb) %Used\n";
: print "------------------------------|---------|-----\n";

Maybe a hair more efficient and readable (your call) to say:

print <<EndHeader;
Filesystem                     Free (Mb) %Used
------------------------------|---------|-----
EndHeader

: for ($i=0;<BDF_PIPE>;$i++) {

In general, use of index variables (such as $i) is a warning sign that
there's a better way to do it in Perl.

: # why doesn't this work?!
: #    next unless (/^.*\%.*$/);

The bug I mentioned.  The optimizer thought it could do tail matching on /%$/.
But it's silly to write the pattern that way anyway.  Why not

	next unless /%/;

But it's better not to reject lines based on something toward the end
of the line.

:     next unless /^[^ ]+ /;

This is better, in that it fails at the beginning of the line on the lines
it fails on.  But when it succeeds, why check more than one character?
I'd also invert the logic:

	next if /^ /;

:     next if /Mounted/;

Again, this is checking for something at the end.  Why not this?

	next if /^Filesystem/;

:     @lines[$i] = $_;

This also stores info you aren't going to use later.  In my example
below I'll turn this logic inside out.

: };
: 
: close (BDF_PIPE);

Not strictly necessary on an input pipe, but it doesn't hurt anything, and
is good documentation.  It IS necessary to close OUTPUT pipes when you're done
with them.

: @sortedlines = sort by_free @lines;

This is basically an extra scratch variable, since you can insert the
sort into the foreach.  (Though it doesn't hurt, because when you insert
the sort into the foreach, it ends up making an internal scratch variable
anyway...)

: foreach $line (@sortedlines) {
:     ($f1, $f2, $f3, $avail, $capacity, $dirname) = split(' ', $line);

Yet another split.  If you split on /\s+/, you could do the nfs filesystems
too, since $f1 would just be null.  I include the nfs filesystems in mine.

:     printf '%s', $dirname;

printf is slower than print.  Should be just

	print $dirname;

:     for ($i = 30; $i > length($dirname); $i--) {
: 	printf ('.');
: 	}

Again, use print, not printf.  Better, don't use a loop at all:

	print '.' x (30 - length($dirname));

:     printf "%10.1f%6s\n", $avail / 1000, $capacity;
: };
: 
: print "\n";

Okay, here's my version:

#!/usr/local/bin/perl

open (BDF_PIPE, "bdf @ARGV |");

print <<End;
Filesystem                     Free (Mb) %Used
------------------------------|---------|-----
End

while (<BDF_PIPE>) {
    ($fs, $kbytes, $used, $avail, $capacity, $dirname) = split;
    next unless $kbytes > 0;
    # next unless $fs;		# uncomment to delete nfs filesystems
    $line{$avail} = sprintf("%30.30s%10.1f%6s\n",
			$dirname . '.' x 30,
			$avail / 1000, $capacity);
}

sub revnum { $b <=> $a; }

foreach $key (sort revnum keys(%line)) {
    print $line{$key};
}

print "\n";

We split just once per line, using a split that will leave $fs null on
the nfs lines.  We find a single test that weeds out both the header
and the 1st of each pair of nfs lines.  We make an associative array
keyed on the field we will sort on, formatting the output string
right then and there.

After we've finished the input, we just do a simple reverse numeric sort
on the keys of the line array, and print out each value.  Actually,
we could scrap that loop too and use a slice like this:

    print @line{sort revnum keys(%line)}, "\n";

That *doesn't* make a scratch variable internally.

Larry

tchrist@convex.COM (Tom Christiansen) (10/24/90)

In article <10080@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV 
(Larry Wall) writes an excellent analysis of how to optimize a beginning
perl script.  

One little picky point I have is the pipe open.  I first posted about this
kind of thing in <5016@convex.convex.com> entitled "Testing successful
pipe opens" on 31 January for archive nuts.  It's been awhile, so I'll
bring it up again for people who weren't reading then.

>  open (BDF_PIPE, "bdf @ARGV |");
>
>or better,
>
>  open (BDF_PIPE, "bdf @ARGV |") || die "Can't run bdf: $!\n";
>

That will almost never fail.  You might get back EGAIN or EPROCLIM
due to fork failure, but far more often, the fork succeeds but
the execvp fails, and you get no notification of this.  This is 
exactly what happened to me when I ran the script, since on my 
system, the command is "df" not "bdf".  This is one of those little
gotchas in perl.


One solution is to test for whether $? is set after the close.  Note
that Larry said that close is not strictly necessary on an input pipe.
This, I would say, is a good enough reason to close it and make sure
the close went well.

Thus:
	close BDF_PIPE  || die "bdf pipe didn't close: $?";

although at PL18 the close doesn't correctly fail on setting $?; it does
do the right thing at later patches.  For earlier versions, use:

	close BDF_PIPE;
	die "bdf pipe didn't close: $?" if $?;

You can also find out at the time of the open, although this is trickier.
A solution for a command without metacharacters would be like this:

 ($pid = open (BDF_PIPE, "bdf @ARGV |")) && kill(0, $pid)
    || die "can't start pipe: $!";

or for people preferring C syntax to shell:

    if (!($pid = open (BDF_PIPE, "bdf @ARGV |")) || !kill(0, $pid))
	{ die "can't start pipe: $!"; }


If, however, there are shell meta-characters, the shell may not yet
have failed by the time the kill happens, so for that case you would
want to introduce a delay with sleep, (although I dislike it) as in:

 ($pid = open (BDF_PIPE, "cmd >foo |")) && (sleep(1), kill(0, $pid))
    || die "can't start pipe: $!";

or 

    ($pid = open(BDF_PIPE, "$cmd >$foo |")) || die "fork failed: $!";
    sleep 1;
    kill(0, $pid) || die "cmd \"$cmd > $foo\" failed: $?";


Note that if you're opening for output to a non-existent command, you
will eventually get hit with a SIGPIPE.  I often install a plumber
function for this:

    sub plumber { die "$0: plumber: broken pipe"; }
    SIG{'PIPE'} = 'plumber';

I thought that now that we have caller(), I should be able to catch
where this actually happened to me, but that doesn't seem to work.  It
complains that "There is no caller" when I try to use caller from the
&plumber routine, which is too bad.

--tom

fuchs@it.uka.de (Harald Fuchs) (10/24/90)

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>: open (BDF_PIPE, "bdf $* |");
>A nit, but you want
>  open (BDF_PIPE, "bdf @ARGV |");
>or better,
>  open (BDF_PIPE, "bdf @ARGV |") || die "Can't run bdf: $!\n";

Why? open() returns the pid of the subprocess, and this should never
be zero even if there is no executable `bdf' in our $PATH.

Is there a canonical way to check if an open() involving a pipe succeeded?
--

Harald Fuchs <fuchs@it.uka.de> <fuchs%it.uka.de@relay.cs.net> ...
<fuchs@telematik.informatik.uni-karlsruhe.dbp.de>   *gulp*

dave@vlsi-mentor.jpl.nasa.gov (David Hayes) (10/25/90)

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>In general, use of index variables (such as $i) is a warning sign that
>there's a better way to do it in Perl.

Really? Please enlighten me....

I want to process an array of stuff @stuff. Lets say that I want to 
do a "tolower" on each element of stuff (an array of text strings).

The current way I do this is (yes I am a C programmer to the core):
                                        
for($i=0; $i<$#stuff; $i++) {
	$stuff[$i] =~ tr/A-Z/a-z/;
}

Is there a better way?
----
Dave Hayes  dave@vlsi-mentor.jpl.nasa.gov   
			{ucbvax,ames}!elroy!vlsi-mentor!dave
"Attempt to find truth by realizing that it will generally find YOU."
-- 
----
Dave Hayes  dave@vlsi-mentor.jpl.nasa.gov   
			{ucbvax,ames}!elroy!vlsi-mentor!dave
"Attempt to find truth by realizing that it will generally find YOU."

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (10/25/90)

In article <1990Oct25.063526.1571@vlsi-mentor.jpl.nasa.gov> dave@vlsi-mentor.jpl.nasa.gov (David Hayes) writes:
: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
: >In general, use of index variables (such as $i) is a warning sign that
: >there's a better way to do it in Perl.
: 
: Really? Please enlighten me....
: 
: I want to process an array of stuff @stuff. Lets say that I want to 
: do a "tolower" on each element of stuff (an array of text strings).
: 
: The current way I do this is (yes I am a C programmer to the core):
:                                         
: for($i=0; $i<$#stuff; $i++) {
: 	$stuff[$i] =~ tr/A-Z/a-z/;
: }
: 
: Is there a better way?

Yes.  There are some iterators that cause the loop variable to refer to
each array element directly, much like a C programmer would use a pointer to
iterate through an array, only without having to dereference anything.
You can simply say

    foreach $elem (@stuff) {
	$elem =~ tr/A-Z/a-z/;
    }

or, more briefly,

    for (@stuff) {	# equivalent to "foreach $_ (@stuff)"
	tr/A-Z/a-z/;
    }

or even more briefly,

    grep(tr/A-Z/a-z/, @stuff);

The other place that index variables pop up unnecessarily is when people
write linear searches down a numerically indexed array, when they should be
using associative arrays to do a direct lookup, or using one of the other
built-in searching mechanisms.

Larry

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (10/26/90)

In article <107608@convex.convex.com> tchrist@convex.COM (Tom Christiansen) writes:
: You can also find out at the time of the open, although this is trickier.
: A solution for a command without metacharacters would be like this:
: 
:  ($pid = open (BDF_PIPE, "bdf @ARGV |")) && kill(0, $pid)
:     || die "can't start pipe: $!";
: 
: or for people preferring C syntax to shell:
: 
:     if (!($pid = open (BDF_PIPE, "bdf @ARGV |")) || !kill(0, $pid))
: 	{ die "can't start pipe: $!"; }
: 
: 
: If, however, there are shell meta-characters, the shell may not yet
: have failed by the time the kill happens...

It's worse than that.  You're assuming vfork semantics.  If they don't
have vfork, the parent can continue executing even before the child has
a chance to exec.

: ...so for that case you would
: want to introduce a delay with sleep, (although I dislike it) as in:
: 
:  ($pid = open (BDF_PIPE, "cmd >foo |")) && (sleep(1), kill(0, $pid))
:     || die "can't start pipe: $!";
: 
: or 
: 
:     ($pid = open(BDF_PIPE, "$cmd >$foo |")) || die "fork failed: $!";
:     sleep 1;
:     kill(0, $pid) || die "cmd \"$cmd > $foo\" failed: $?";

sleep(1) isn't necessarily good enough either, since it can sleep anywhere
from 0 to 1 seconds.

In general this is unnecessary, since you simply get an immediate eof on
the pipe read, and then you can pick up the exit status with close, as
Tom rightly pointed out.

: Note that if you're opening for output to a non-existent command, you
: will eventually get hit with a SIGPIPE.  I often install a plumber
: function for this:
: 
:     sub plumber { die "$0: plumber: broken pipe"; }
:     SIG{'PIPE'} = 'plumber';

You can arrange for a faulty exec on an input pipe to give you a SIGPIPE too:

	$pid = open(IN,"-|");
	die "Can't fork: $!\n" unless defined $pid;
	unless ($pid) {
	    exec "program", "that", "may", "not", "exist";
	    kill('PIPE', getppid);
	    exit 1;
	}

: I thought that now that we have caller(), I should be able to catch
: where this actually happened to me, but that doesn't seem to work.  It
: complains that "There is no caller" when I try to use caller from the
: &plumber routine, which is too bad.

Yeah, that'd be nice, but signals happen asynchronously, and you don't
necessarily have a consistent context, so it currently pretends you don't
have a caller.  A sort subroutine doesn't have a context either, currently,
though that has no problems with context consistency, and could be done
fairly efficiently (we'd only have to establish the context once per sort,
not once for each sub call).

I could probably make signal handlers have a context, but I'll have to
look at the consequences a little more.

Larry

dupuy@cs.columbia.edu (Alexander Dupuy) (10/29/90)

In article <10080@jpl-devvax.JPL.NASA.GOV> Larry Wall writes:

| #!/usr/local/bin/perl
| 
| open (BDF_PIPE, "bdf @ARGV |");
| 
| print <<End;
| Filesystem                     Free (Mb) %Used
| ------------------------------|---------|-----
| End
| 
| while (<BDF_PIPE>) {
|     ($fs, $kbytes, $used, $avail, $capacity, $dirname) = split;
|     next unless $kbytes > 0;
|     # next unless $fs;		# uncomment to delete nfs filesystems
|     $line{$avail} = sprintf("%30.30s%10.1f%6s\n",
| 			   $dirname . '.' x 30,
| 			   $avail / 1000, $capacity);
| }
| 
| sub revnum { $b <=> $a; }
| 
| foreach $key (sort revnum keys(%line)) {
|     print $line{$key};
| }
| 
| print "\n";
| 
| We split just once per line, using a split that will leave $fs null on
| the nfs lines.  We find a single test that weeds out both the header
| and the 1st of each pair of nfs lines.  We make an associative array
| keyed on the field we will sort on, formatting the output string
| right then and there.

Just hope that you never have two filesystems that are both 86% full, since you
will only see the output for one of them.  You have to be careful when manually
optimizing a program that you don't change the semantics inadvertently...

I would use the $dirname field as the key, and do an assoc array lookup in the
revnum function, with the performance penalty, in order to know that I'm
getting the right answer.

@alex

--
-- 
inet: dupuy@cs.columbia.edu
uucp: ...!rutgers!cs.columbia.edu!dupuy