[comp.lang.perl] Arrays and me :

painter@sequoia.execu.com (Tom Painter) (12/12/90)

I'd like to split a passwd file into a multi-dimensional array.  I'd
like one dimension to be the relative line number in the file, and the
other to be the field in the passwd file.  Such that the following
would print the GCOS field from the 50th entry

while (<PASSWD>) {
		$i++;
		@passwd[$i] = split(/:/);
}
printf "%s\n", $passwd[5,50];

However, I end up with only the logname field in the array.  While I could
list the parts out, I feel confident that someone has a clever answer. Please
suggest away.

Thanks

Tom
-- 
-----------------------------------------------------------------------------
Tom Painter                             UUCP: ...!cs.utexas.edu!execu!painter
Execucom Systems Corp., Austin, Texas   Internet: painter@execu.com
(512) 327-7070                                    execu!painter@cs.utexas.edu
Disclaimer: My Company?  They'll claim all my waking hours, not my opinions.
-----------------------------------------------------------------------------

tchrist@convex.COM (Tom Christiansen) (12/12/90)

In article <29144@sequoia.execu.com> painter@sequoia.execu.com (Tom Painter) writes:
>
>I'd like to split a passwd file into a multi-dimensional array.  I'd
>like one dimension to be the relative line number in the file, and the
>other to be the field in the passwd file.  Such that the following
>would print the GCOS field from the 50th entry
>
>while (<PASSWD>) {
>		$i++;
>		@passwd[$i] = split(/:/);
>}
>printf "%s\n", $passwd[5,50];
>
>However, I end up with only the logname field in the array.  While I could
>list the parts out, I feel confident that someone has a clever answer. Please
>suggest away.

Let's start out with the basics.  Your very biggest mistake is that
perl doesn't really and truly have honest-to-goodness first-class 
multidimensional arrays as you are trying to use them.  You might
consult question #17 of the FAQ.

Here are some other things:  Remember that perl arrays are by default
0-based, which makes the GCOS field index 4, not 5.  Also, since you have
$. as the current line number, there's no need to keep $i.  And when you
split into @passwd[$i], you're splitting into an array slice of length
one, which means that everything but the initial field of split is
discarded, leaving you the login, which you pronounce logname and C
programmers pronounce pw_name.  Then you say $passwd[5,50], which is
probably to your surprise really merely asking for element 50 because that
comma is the C comma operator.  And don't waste time using printf when a
simple print will do nicely.
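
A small, self-contained illustration of the two pitfalls just described (the one-element slice and the comma in a subscript) might look like this. The sample line and the variable names are invented for the demonstration, and a modern perl with warnings enabled will likely complain about both constructs:

```perl
# Sketch only: the passwd-style line below is invented sample data.
my $line = "root:*:0:0:Super User:/:/bin/sh";

# Assigning split's result to a one-element slice keeps only the
# first field; the rest of the list has nowhere to go.
my @passwd;
@passwd[1] = split(/:/, $line);
print "slice got: $passwd[1]\n";

# Inside a subscript the comma is the comma operator: the 5 is
# evaluated and thrown away, so this is simply element 50.
my @list = (0 .. 99);
my $elem = $list[5,50];
print "list[5,50] is $elem\n";
```

Writing `$passwd[1] = ...` instead of the slice would not help either, since split in scalar context yields a field count rather than the fields.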

I'm going to show you a bunch of ways to do what you've said you want to
do.  But you know, I can't help but question whether you really want to do
this.  Why do you want to index by line number, of all funky things,
rather than uid or login or whatnot?

To create an array such as you'd like will be very slow.  For example,
here's my password file:

    % wc /etc/passwd
    1850    4436  152509 /etc/passwd

Parsing out password files is considered the wrong way of doing things.
You really ought to be using getpwent.  But we'll get to that presently.

The first thing I'll do is a relatively literal translation of your code
into something that does more what you seem to want to do.  I'm going to
use perl's multidimensional array emulation technique of passing multiple
subscripts to an associative array reference, as opposed to an indexed
array.  Don't worry right now that it's just an emulation.  For what
you're doing, this is quite convenient.

    # METHOD 1
    while (<PASSWD>) {
	split(/:/);
	for $fld ( 0..8 ) {
	    $passwd{ $fld, $. } = $_[$fld];
	}
    }
    for $i (0..8) {
	printf "%d %s\n", $i, $passwd{$i,50};
    }

Notice I cannot do an aggregate (slice) assignment, because these aren't
really multidimensional arrays that you can take slices of (at least not
that way you can't.)
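
To make the emulation a little more concrete, here is a minimal sketch that runs without a real passwd file. The two records and the variable names are invented; the only thing taken from perl itself is that multiple subscripts on an associative array are joined with `$;` into a single string key:

```perl
# Sketch only: the two records are invented sample data.
my @lines = (
    "root:*:0:0:Super User:/:/bin/sh",
    "uucp:*:4:8:UUCP:/usr/spool/uucp:/bin/sh",
);

my %passwd;
my $lineno = 0;
for my $line (@lines) {
    $lineno++;
    my @f = split(/:/, $line);
    for my $fld (0 .. $#f) {
        # Stored under the single key join($;, $fld, $lineno).
        $passwd{$fld, $lineno} = $f[$fld];
    }
}

print "gcos of entry 2: ", $passwd{4,2}, "\n";

# Each key is one flat string, so a "row" has to be rebuilt by hand
# rather than taken as a slice.
my @row = map { $passwd{$_, 2} } 0 .. 6;
print join("|", @row), "\n";
```

This is why the slice assignment is unavailable: there is no inner array to slice, only one flat table of joined keys.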

This method runs in this amount of time: 

      11.030066 real        9.374684 user        1.020688 sys

and produces this output on my system:

    0 RSuucp
    1 *
    2 14
    3 40
    4 BTL research UNIX-to-UNIX Copy
    5 /mnt/null
    6 /usr/adm/admonish/No-Account

    7
    8

Yes, it's true, I didn't chop the newline from the shell.  Anyway, 
that's RRRREEEEAAAALLLLLLLLYYYY SSSSLLLLOOOOWWWW.  There are two
reasons for this: it's a big password file, and those splits are
expensive multiplied 1850 times.  You don't really need indices
7 and 8 yet, but you might later; wait and see.

Here's another way:

    # METHOD 2
    $passwd[1 + $.] = <PASSWD> until eof(PASSWD);
    split(/:/, $passwd[50]);
    for $i (0..8) {
	printf "%d %s\n", $i, $_[$i];
    }

This runs in this time and produces the same output:

    2.271481 real        0.654493 user        1.345908 sys

which is a lot better, albeit to my mind still a tad slow.  The reason
it's an order of magnitude faster is that I don't do all those splits or
go creating all those array elements.

If you wanted, you could grab the gcos this way:

    $gcos = (split(/:/, $passwd[50]))[4];

As you can see, delaying the split until you really need it is really much
better.  In case you wonder about the  1+$.  part, it's because the $.
doesn't get bumped until after the read, and I wanted $passwd[50] to still
be the same guy.  I think I'd write a function &gcos like this to cache
the value for me so I never have to split more than once:

    # GCOS 1
    sub gcos {
	unless (defined $gcos{$_[0]}) {
	    $gcos{$_[0]} = (split(/:/, $passwd[$_[0]]))[4];
	} 
	$gcos{$_[0]};
    } 

I'm assuming that I'm calling it with 50 or whatnot, your line number.
(What a funny thing to do!)  If you're going to be using small integers,
why don't you use this:

    # GCOS 2
    sub gcos {
	unless (defined $gcos[$_[0]]) {
	    $gcos[$_[0]] = (split(/:/, $passwd[$_[0]]))[4];
	} 
	$gcos[$_[0]];
    } 

Now, remember I said that you don't really want to parse the password file
by hand?  That's because you don't really know what it looks like, for one
thing.  Consider for example NIS, nee the Yellow Plague, and the way it
handles +@netgroup and +foo::: entries and all.  Another good reason to
use these routines is that you might well have a hashed passwd file even
if you're not using YP.

Here's a method that's much more portable, based on method #1.

    # METHOD 3
    setpwent;
    while (@_ = getpwent) {
	$i++;
	for $fld ( 0..8 ) {
	    $passwd{ $fld, $i } = $_[$fld];
	}
    }
    endpwent;
    for $i (0..8) {
	printf "%d %s\n", $i, $passwd{$i,50};
    }

It runs in this time:

      11.062443 real        8.832377 user        1.186499 sys

That's not very much better than method 1.  One reason is you have to
iterate through the whole file anyway, rather than just asking for the one
value you need.  You're also making all those darn array values that you
may not ever use anyway.

Now method 3 also produces different output:

    0 RSuucp
    1 *
    2 14
    3 40
    4 0
    5
    6 BTL research UNIX-to-UNIX Copy
    7 /mnt/null
    8 /usr/adm/admonish/No-Account

That's because the getpw* functions are defined in perl to 
return the following list.  (I told you I'd use indices
7 and 8.)

    ($name,$passwd,$uid,$gid,$quota,$comment,$gcos,$dir,$shell) = getpwent;

Now, we can speed this up a bit by just keeping the gcoses (sounds
like some kind of mental disorder, doesn't it?) with this code:

    # METHOD 4
    setpwent;
    0 while $gcos[++$i] = (getpwent)[6];
    endpwent;
    print "gcos[50] is ", $gcos[50], "\n";

Which finally runs in respectable time:

        0.642865 real        0.404234 user        0.126877 sys

And I feel much better about perl again.

On the other hand, why suck in the whole password file if all you want is
the gcos field for line (line??? I still can't see how that makes sense)
number 50.  Let's assume you really want uid 50.   Code your gcos function
this way:

    # GCOS 3
    sub gcos {
	unless (defined $gcos{$_[0]}) {
	    $gcos{$_[0]} = (getpwuid($_[0]))[6];
	} 
	$gcos{$_[0]};
    } 

You might wish to index by login name instead.  Just use getpwnam
where getpwuid is being used.
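
A self-contained sketch of that lookup-on-demand idea, assuming a Unix-like system where the getpw* functions are implemented. The names gcos_for_uid and %gcos_cache are hypothetical, chosen only for this example, and the gcos field is taken as element 6 of the list shown earlier:

```perl
# Sketch only: gcos_for_uid and %gcos_cache are illustrative names.
my %gcos_cache;

sub gcos_for_uid {
    my ($uid) = @_;
    unless (exists $gcos_cache{$uid}) {
        # In list context getpwuid returns
        # ($name,$passwd,$uid,$gid,$quota,$comment,$gcos,$dir,$shell,...)
        # or an empty list when there is no such user.
        my @pw = getpwuid($uid);
        $gcos_cache{$uid} = @pw ? $pw[6] : undef;
    }
    return $gcos_cache{$uid};
}

my $gcos = gcos_for_uid($<);    # $< is the real uid of this process
print "gcos for uid ", $<, ": ",
      (defined $gcos ? $gcos : "(no entry)"), "\n";
```

Testing the cache with exists rather than defined means a uid with no entry is looked up only once.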

Now, for those people who feel they just can't live without real (or at
least realer) multidimensional arrays for whatever the reason, here are a
couple ways of coming closer to doing that.  I still believe that this is
not what you really would like to do, but for the sake of completeness,
I'll show you anyway.  Do consider whether you honestly need the whole
passwd file in memory all the time and already split up into pieces.  I 
remain dubious.

First, we'll invoke journeyman magic by constructing an array of array
names, and load and store into it through an eval.  Here's the code;
it produces the same output as the first ones:

    # METHOD 5
    while (<PASSWD>) {
	split(/:/);
	eval "\@pass$. = \@_";
    }
    for $i (0..8) {
	printf "%d %s\n", $i, eval "\$pass" . 50 . '[$i]';
    }

It ran in this much time:

   10.885047 real        8.753669 user        0.842107 sys

Now, I could just print out $pass50[$i], but I wanted to show
the general case.

I did try joining the split inside the eval:

    # METHOD 6
    while (<PASSWD>) {
	eval "\@pass$. = split(/:/)";
    }

But it ends up running more slowly this way.  

       11.797876 real        9.447561 user        0.923269 sys

I've a strong suspicion that it's because the split regexp is getting
recompiled in each eval.  Sadly, you can't trick it by putting a /:/
outside the loop and then using // as your regexp (which normally means
the last regexp and saves the recompilation) because split interprets //
to mean to split on the null string, i.e., a character at a time.
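
A quick way to see that difference for yourself (the sample string is arbitrary):

```perl
# An empty pattern splits between every character, colons included.
my @chars  = split(//,  "a:b:c");
# The intended behaviour: split on the colon.
my @fields = split(/:/, "a:b:c");

print scalar(@chars), " pieces vs ", scalar(@fields), " fields\n";
```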

Now if you caught all that, it's time to go into still heavier wizardry,
at least by most people's standards.  We're going to use the *foo type
globbing notation to construct an array of array references.  This is
actually a bit faster this way, and anyway, I'm a bit of a (computer) speed
freak.  Here's the code:

    # METHOD 7
    while (<PASSWD>) {
	split(/:/);
	*passwd = "pass$.";
	@passwd = @_;
    }
    *passwd = 'pass' . 50;
    for $i (0..8) {
	printf "%d %s\n", $i, $passwd[$i];
    }

Which trims off a couple seconds:

        9.131857 real        7.367163 user        0.686693 sys

I can save myself some more time by storing the output of
split directly into the right entry.

    # METHOD 8
    while (<PASSWD>) {
	*passwd = "pass$.";
	@passwd = split(/:/);

    }
    *passwd = 'pass' . 50;
    for $i (0..8) {
	printf "%d %s\n", $i, $passwd[$i];
    }

This runs in this much time:

        8.237005 real        6.832681 user        0.655225 sys

Now, buried deep in the perl mannovel, Larry has written the 
following rather ominous warning:

     Assignment to *name is currently recommended only inside a
     local().  You can actually assign to *name anywhere, but the
     previous referent of *name may be stranded forever.  This
     may or may not bother you.

Well, I'm not sure whether I should be bothered, since it runs
fine this way, but I dutifully made a local for it and tried again:

    # METHOD 9
    while (<PASSWD>) {
	local(*passwd) = "pass$.";
	@passwd = split(/:/);
    }
    local(*passwd) = 'pass' . 50;
    for $i (0..8) {
	printf "%d %s\n", $i, $passwd[$i];
    }

However, as is to be expected, this ran slower:

        9.130706 real        7.162618 user        0.756463 sys

That's because that apparent local declaration of *passwd is really a
run-time statement, and we need to build up 1850 versions of *passwd
before we exit that block.
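
For anyone who wants to poke at the *name aliasing in isolation, here is a minimal sketch with no file involved. The record in @pass50 is invented sample data, and the example deliberately uses package variables, since typeglobs know nothing about "my" variables:

```perl
# Sketch only: @pass50 holds one invented record.
our (@pass50, @passwd);
@pass50 = split(/:/, "uucp:*:4:8:UUCP:/usr/spool/uucp:/bin/sh");

{
    # Assigning a name to the glob makes @passwd another name for
    # @pass50 until this block exits and local() undoes it.
    local(*passwd) = "pass" . 50;
    print "aliased gcos: $passwd[4]\n";
    $passwd[4] = "changed through the alias";
}

# The write went to @pass50, and @passwd is back to being empty.
print "original now: $pass50[4]\n";
print "passwd has ", scalar(@passwd), " elements\n";
```

The alias is to the whole symbol table entry, so $passwd and %passwd are redirected along with @passwd while the local is in effect.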

I must close by reiterating my suggestion to only call getpw...()
on the thing you really want and not try to suck in the whole passwd
file at once.

I do hope these are enough suggestions for you. :-)

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
"With a kernel dive, all things are possible, but it sure makes it hard
 to look at yourself in the mirror the next morning."  -me

tchrist@convex.COM (Tom Christiansen) (12/13/90)

In article <484@decvax.decvax.dec.com.UUCP> evans@decvax.DEC.COM writes:
:In article <29144@sequoia.execu.com>, painter@sequoia.execu.com (Tom Painter) writes:
:|> 
:|> I'd like to split a passwd file into a multi-dimensional array.  
:How about the following:

Yes, that works.

:	# read all of the file into the @lines array
:	open(PASSWD,"/etc/passwd");
:	push(@lines,$_) while (<PASSWD>);

	@lines = <PASSWD>;
is faster.

:	close(PASSWD);
:
:	# print the 5 field of the 50th entry
:	printf "%s\n", &get_field($lines[50],5);
:
:	# subroutine which decomposes : separated lines
:	sub get_field
:	{   local($line,$field) = @_;
:	    local(@fields) = split(/:/,$line);
:	    return(($#fields >= $field) ? $fields[$field] : "");
:	}

If you dereference past the end of the array, you get null anyway.
You could actually just make this:

    sub get_field { (split(/:/,$_[0]))[$_[1]]; } 

Or else inline it.
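
A short sketch of the one-liner in use, with an invented sample record, showing that a subscript past the last field simply comes back undefined rather than raising an error:

```perl
sub get_field { (split(/:/, $_[0]))[$_[1]] }

# Sketch only: invented sample record.
my $line = "uucp:*:4:8:UUCP:/usr/spool/uucp:/bin/sh";

my $gcos    = get_field($line, 4);
my $missing = get_field($line, 20);   # past the end: no error

print "gcos: $gcos\n";
print "field 20 is ",
      (defined $missing ? "'$missing'" : "undefined"), "\n";
```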

But those are just speed optimizations, and they don't
make all *that* much difference.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
"With a kernel dive, all things are possible, but it sure makes it hard
 to look at yourself in the mirror the next morning."  -me

tchrist@convex.COM (Tom Christiansen) (12/13/90)

I wrote:

:    # METHOD 2
:    $passwd[1 + $.] = <PASSWD> until eof(PASSWD);
:    split(/:/, $passwd[50]);
:    for $i (0..8) {
:	printf "%d %s\n", $i, $_[$i];
:    }
:
:This runs in this time and produces the same output:
:
:    2.271481 real        0.654493 user        1.345908 sys
:
:which is a lot better, albeit to my mind still a tad slow.  

Silly me -- I even mentioned this in another post. I can just do:

    @passwd = <PASSWD>;
    unshift(@passwd, ''); # fix subscripts

and now it runs this fast:

    0.650994 real        0.380643 user        0.175185 sys

which is better still.  Why all the system time difference?
I'm not sure.  I did run it a few times so as to get it all in 
cache, but I did that for both versions.  Larry, any ideas?
Are you calling sbrk more efficiently (like once for a bunch)
with the gobble-up-the-whole-file method?
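
The slurp-then-split-on-demand approach can be tried without touching /etc/passwd. In this sketch an in-memory filehandle stands in for the real file (that needs a reasonably modern perl), and the two records are invented:

```perl
# Sketch only: an in-memory filehandle stands in for /etc/passwd.
my $data = "root:*:0:0:Super User:/:/bin/sh\n"
         . "uucp:*:4:8:UUCP:/usr/spool/uucp:/bin/sh\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my @passwd = <$fh>;       # list context: read every line in one go
close($fh);
unshift(@passwd, '');     # fix subscripts so $passwd[1] is line 1

# Split only the one line that is actually wanted.
my $gcos = (split(/:/, $passwd[2]))[4];
print "gcos on line 2: $gcos\n";
```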

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
"With a kernel dive, all things are possible, but it sure makes it hard
 to look at yourself in the mirror the next morning."  -me

painter@sequoia.execu.com (Tom Painter) (12/14/90)

In article <110684@convex.convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>In article <29144@sequoia.execu.com> painter@sequoia.execu.com (Tom Painter) writes:
>>
>>I'd like to split a passwd file into a multi-dimensional array.
>> [...stuff deleted...]
>
>Let's start out with the basics.  Your very biggest mistake is that
>perl doesn't really and truly have honest-to-goodness first-class 
>multidimensional arrays as you are trying to use them.  You might
>consult question #17 of the FAQ.

I realize this, but apparently you can emulate it with an associative array:

 		$name{$x, $y} = 'Tom';

I suppose that the question was: How do I fill a pseudo-multi-dimensional
array?  Preferably, I'd like to fill it with the output of a split command.
I had hoped that there was a clever Perl (minimalistic) answer.

>Here are some other things:  Remember that perl arrays are by default
>0-based, which makes the GCOS field index 4, not 5.  Also, since you have
>$. as the current line number, there's no need to keep $i.  And when you
>split into @passwd[$i], you're splitting into an array slice of length
>one, which means that everything but the initial field of split is
>discarded, leaving you the login, which you pronounce logname and C
>programmers pronounce pw_name.  Then you say $passwd[5,50], which is
>probably to your surprise really merely asking for element 50 because that
>comma is the C comma operator.  And don't waste time using printf when a
>simple print will do nicely.

Now, I have to wonder why every simple question asked is turned into a
"Let's bash the novice" festival.  My example seemed to be sufficient
to ask the question that I asked.  I'll address the objections.  I
already reset the array base to 1, I suppose that I should've included
that line.  I used (<PASSWD>) for the example, however I'm dealing with
multiple input files so I want to count lines.  Obviously, what I had
was wrong [@passwd[$i]] or I wouldn't have posted the question (No, I
wasn't surprised when it didn't work.).  I'm not a C programmer, so
what they call logname (login, pw_name) is not important to my
question.  The "C comma operator" may be a glaring fact to you but to
me it doesn't mean a thing.  And finally, the printf line was what was
left after I stripped the non-essential portions from the original.
While it could be better in its current use, that's not necessary in
terms of the question.

Now, I think that a number of the points are valid in the context of
teaching netreaders the fundamentals of Perl, but I think that the tone
could use some work.  If you want to call me an idiot, send me mail.
But please restrict your posts to helpful suggestions or solutions.  I
have to wonder how many people out there are hesitant to post given the
thrashing that you usually put me through. :-)

BTW, I did pick up a number of helpful hints, once I got past the first
section.  Thanks.

Tom

P.S. Yea, I should've mailed this...
-- 
-----------------------------------------------------------------------------
Tom Painter                             UUCP: ...!cs.utexas.edu!execu!painter
Execucom Systems Corp., Austin, Texas   Internet: painter@execu.com
(512) 327-7070                                    execu!painter@cs.utexas.edu
Disclaimer: My Company?  They'll claim all my waking hours, not my opinions.
-----------------------------------------------------------------------------

tchrist@convex.COM (Tom Christiansen) (12/15/90)

There was no bashing intended in the post.  Sometimes I was trying to be
brief and summarize (like the first couple paragraphs) while at other
times I was trying to be funny (like with the logname pronunciation).  I
was honestly just trying to help.  I'm truly sorry the original poster
took it any other way.  All novice questions are welcome.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
"With a kernel dive, all things are possible, but it sure makes it hard
 to look at yourself in the mirror the next morning."  -me

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (12/18/90)

In article <111418@convex.convex.com> tchrist@convex.COM (Tom Christiansen) writes:
: There was no bashing intended in the post.  Sometimes I was trying to be
: brief and summarize (like the first couple paragraphs) while at other
: times I was trying to be funny (like with the logname pronunciation).  I
: was honestly just trying to help.  I'm truly sorry the original poster
: took it any other way.  All novice questions are welcome.

WARNING:  Anyone who can't handle "funny" shouldn't buy the Perl Book,
in which we (attempt to) crack a few jokes.  (There were actually more in
the draft, but we took out some that came across as patronizing, at the
request of some perceptive reviewers--one of which was Tom!  Ah, well... :-)

Larry