[comp.lang.perl] sorting multi-dimensional arrays, or pointers in perl

tchrist@convex.com (Tom Christiansen) (05/15/91)
[Yesterday I offered to demonstrate sorting multi-dimensional arrays on
arbitrary keys efficiently and generically if anyone had any interest in
this.  Well, my mailbox says there was a good deal of interest, so it here
it all is.]


The first thing to think about when sorting N-dim arrays in perl is that
there is no such thing, or rather, their internal represenation is the
same as a normal assoc array with the keys joined together by the $;
variable. This means that to do any comparisons, you'll need to split the
keys.

Let's say we want to sort on keys 2, then 1, then 0.  You could approach
it this way (assume numeric sort):


    for $elt (sort by_keys keys %array) {
	print $array{$elt}, "\n";
    }
	    
    sub sort_keys {
	@a = split($;, $array{$a});
	@b = split($;, $array{$b});

	$a[2] <=> $b[2] ||
	$a[1] <=> $b[1] ||
	$a[0] <=> $b[0];
    } 

This approach, though, is one fraught with problems.  One problem is that
it's got too much hardcoded: both the name of the array and which elements
to compare are hard-wired into the code.  Yesterday, I showed how to avoid
this by using pass-by-name and dynamic scoping.  You could also use an
eval to compose the subroutine on the fly.

A bigger problem is performance: to split the key on EACH compare means
something like N*logN splits, where N is the number of keys you've got,
assuming perl's sort is an O(N*logN) one, which I think it is if it calls
qsort().  This problem arises whenever what you're sorting is not really
just $a and $b, but rather some piece of them.  If you find yourself
splitting or unpacking in a sort subroutine, or pulling things out with a
regexp match, then you're very probably doing too much work.   

A case where this would be ok to do this anyway would be when your N were
very small.  I do such a thing in my man program when I'm sorting
available man pages (like stty.1 and stty.3c) based on a user-defined
precedence.  There are almost never more than topics than match on one
hand, so I don't mind pulling out the section from the page name with 
a regexp match each time.  Here's the sort subroutine I use there, where
what's being passed is filenames of man pages. 

    sub bysection {
	local ($ext1, $ext2, $precedence1, $precedence2, $section1, $section2);

	($section1, $ext1) = $a =~ m:/man([^/]+)/.*\.([^.]+)(\.Z)?$:;
	($section2, $ext2) = $b =~ m:/man([^/]+)/.*\.([^.]+)(\.Z)?$:;

	$ext1 = $section1 if $ext1 !~ /^${section1}.*/;  # like mano/gtty.3
	$ext2 = $section2 if $ext2 !~ /^${section2}.*/;

	$precedence1 = $MANSECT{$ext1} || $MANSECT{substr($ext1,0,1)};
	$precedence2 = $MANSECT{$ext2} || $MANSECT{substr($ext2,0,1)};

	$precedence1 == $precedence2  # if same defined precedence
	    ? $a cmp $b : 	      # use lexical compare 
	    $precedence2 <=> $precedence1; 
    }


This is also a good case for why perl needs its own built-in sort
operator.  While sometimes it's more efficient to call the sort binary,
and you should not be afraid to do so when this is the case, you still
need the built-in for the times when then are auxiliary criteria for your
sort that /usr/bin/sort simply cannot handle.

Back to sorting on just pieces.   As in any case where you're sorting on
pieces of the keys, you want to pull out the important pieces and stick
these in a list (numerically-indexed array) and then sort on just the
indices.  Let's look at a simpler case of this first.  Consider df output:
often you'd like to see it sorted not in mount order, but rather in some
other order, such as the amount of space used of free or by the percentage
of free space.  Now, you could of course do this:

	df | sort +$1rn 

where $1 is the column to sort on.  But then you're header gets at the
bottom of your display, and anyway, it'll prove a good demo.

Here's the little script:

    #!/usr/bin/perl
    $field = shift || 4;
    open(DF,"df -t 4.2|");
    print $_ = <DF>;   # output header first
    print &sort_by_field(<DF>);
    close(DF);
    exit $?;

    sub sort_by_field {
	local(@whole) = @_;
	local(@f);

	for (@whole) { push(@f, (split(' '))[$field]); }
	@whole[sort byfield $[..$#whole];
    }
    sub byfield { $f[$b] <=> $f[$a]; }

The default field to sort on will be #4, the percentage of free space.
Instead of passing all the lines to sort and constantly splitting, I first
build up an @f array, consisting of just the field I care about.
That's what this is doing:

    for (@whole) { push(@f, (split(' '))[$field]); }

I then take all the subscripts of @whole, which are $[ .. $#whole, and
pass these to by sort routine.  The key here is that the indices for both
the @whole array and the smaller @f one are the same, so if I sort one, I
sort the other.  The &byfield routine knows to use its parameters to
subscript @f.  The return from the sort will be the subscripts rearranged
according to the @f ordering, and I use this return list to subscript the
original @whole array.  Since I want a reverse sort, I switch $a and $b
around in the <=> comparison.  

Also notice the use of the @whole[@list] notation: it's important to
remember to use $x[] for a single scalar value, @x[] for a list.  I keep
reminding people this because it seems hard for folks to catch on at
first.  I got a program in the mail yesterday where this was the problem.
Just remember that the symbol ($, @) governs the object or objects
returned, not the object accessed (ok, kinda).  Thus $x[] and $x{} return
single values, @x[] and @x{} both return multiple values.

If you're with me so far, it's time to take the big plunge and go to 
our real topic for today: efficiently sorting N-dimensional arrays
in aritrary orders.  Assume that your array has 3 subscripts.  You 
should first go through it and make three parallel lists.

    @keys = keys %array;
    for $key (@keys) {
	($x, $y, $z) = split($;, $key);
	push(@x, $a);
	push(@y, $y);
	push(@z, $z);
    } 


Now your @x, @y, and @z lists contain their respective parts of the
composite keys, and these are ordered the same way that the keys built-in
pulls them out of %array.  Assume we're going to sort on 2, 1, and 0 as in
the first example given in this article.  This will do it for you:

    for $elt (sort by_keys @keys) {
	print $array{$elt}, "\n";
    }
    
    sub by_keys {
	$y[$a] <=> $y[$b] ||
	$z[$a] <=> $z[$b] ||
	$x[$a] <=> $x[$b];
    } 

It's true that %array is hard-coded, but that's ok.  If you surround
the preceding code fragments with this, that solves the problem:

    &sort_by_keys(*foo_array);  
    &sort_by_keys(*bar_array); # both 3-dim arrays

    sub sort_by_keys {
	local(*array) = @_;
	local(@x, @y, @z, @keys);
	....
    } 

There's a problem, though:  the order in &by_keys is still hard-wired in,
because I've got the names $y, $z, and $x specifically coded.  Sometimes
this is a bad thing, because you can't know the order before hand.   Imagine
that you call &sort_by_keys this way instead, including the desired ordering
in the call:
    
    &sort_by_keys(*foo_array, 2, 1, 0);
    &sort_by_keys(*bar_array, 0, 2, 1);

Now the set of functions looks like this:

    sub sort_by_keys {
	local(*array, @order) = @_;
	....
    } 

This way @order is the list of subscripts.  Here comes the
tricky part now.  After you've done pulled out your keys
into @x, @y, and @z, do this:
    
	local(*key0, *key1, *key2) = (*x, *y, *z)[@order];

This makes @key0, @key1, and @key2 aliases to whatever permututation of
@x, @y, and @z is given in the @order list.   I could have put my
pointers in a list:

    @ptrs = (*x, *y, *z);
    # then later...
    local(*key0, *key1, *key2) = @ptrs[@order];

This is what I do in the full-blown program at the end of this article.

Now you just make your &by_keys routine like this:

    sub by_keys {
	$key0[$a] <=> $key0[$b] ||
	$key1[$a] <=> $key1[$b] ||
	$key2[$a] <=> $key2[$b];
    } 

And it will all work out just fine.  Isn't that nifty?  And they say perl
hasn't got pointers.  Hah!

(You might ask yourself, what happens if @order is (2,1).  Well, that leave
@keys2 undefined, which is ok for dereferencing.  Unlike certain other
languages we fight, er, work with, perl is a foregiving language won't
coredump in your face just for dereferencing something that's undefined.
You get a nice undefined value back, which is for nearly all intents
and purposes, as good as a null value.)

One nice thing about this approach is that it uses no evals.  Evals are
sure nice, and make all thing possible, but they do run more slowly that a
straight reference, even through this magic symbol table pointer
dereference via the *name notation.  And besides, if we ever get a
perl-to-C compiler, evals will probably be out of the question.

Following is a real program that uses these techniques and a couple more.
Its purpose in life is to read accounting data from "sa -e" output, and
produce reports sorted on a variety of possible fields.  Since it takes a
long time to go through sa, you want to read the data only once, even if
you're doing multiple reports.  The possible fields are:

    1: 	user name
    2: 	group name
    3: 	activity name  # a convex special 
    4: 	cpu usage 
    5:	kcore ticks

If you invoke the program as

    actsum 512 415 32

then it will produce three different reports.  Each set of digits
represents the order you want that report sorted in.  There's also a -r
flag for reverse sorting.  This is implemented by keeping a flag around,
and if set, negating the return value of the sort routine.

I keep two associative arrays around, %cpu and %kcore, which keep a
running sum of these things used as indexed by the tuple ($uid, $gid,
$aid).   I split up the keys as described above into 3 parallel arrays
(@uids, @gids, @aids), and then also stuff the values of %cpu into @cpus
and those of %kcore into @kcore.  This gives me 5 parallel lists whose
indices are all they same, that is (keys %cpus).

I use pointers for a couple more things in this program: first, since
your sa output won't look like mine unless you're running a 9.0 or
greater version of ConvexOS, I've included a small set of representative
data in the program's DATA segment.  But you can use -a to override
this and really call sa.  I do it this way:

    unless ($opt_a) {
	*ACCT = *DATA;  # lost the old ACCT pointer; oh well...
    } else {
	$kid = open(ACCT, "sa -e|") || die "can't open pipe to sa: $!";
    }

So that makes ACCT an alias for DATA unless $opt_a is turned on.

The other use of pointers and dynamic scope is possibly abusive.  Some
of you have seen this before.  It's my old &id function for translating
uids and gids (and in this case, aids) from numeric to ascii.

    $uname  = &id(*uid);
    $gname  = &id(*gid);
    $aname  = &id(*aid);

    sub id {
	local(*id) = @_;  # NAME CHANGE ALERT
	$id{$id} = &id unless defined $id{$id};
	$id{$id};
    }

    sub uid { (getpwuid($id))[0]; }
    sub gid { (getgrgid($id))[0]; }
    sub aid { "#$id"; }  # %aid was already pre-loaded


Basically, the local(*id) assigment changes all instances of the name "id"
to be either "uid", "gid", or "aid" from that point forward ACCORDING TO
RUN TIME, not compile time -- dynamic scoping and all, you know.  So for
example if $uid{$uid} isn't defined, it calls getpwuid($uid) to get the
answer.  If you didn't catch all that, then don't feel bad.  This is one
of those things you just have to stare at a while, I guess.

Here's what the program outputs on my machine with this data set:

    % actsum -r 512 415 321

	    (reverse sorted by kcore ticks, user id, group id)
    uid      gid      aid                   cpu            kcore

    twebb    bin      default              0.04          7215238
    mbecker  swenv    default             35.26           173749
    root     bin      default             72.25           127521
    twebb    swenv    default             22.78            68966
    rmingee  swenv    default              2.88             1345
    tharel   swenv    default              2.27              494
    mbecker  swenv    comm_doc             0.03               17
    mbecker  swenv    test                 0.06               13
    root     zero     default              0.07               13


	    (reverse sorted by cpu time, user id, kcore ticks)
    uid      gid      aid                   cpu            kcore

    root     bin      default             72.25           127521
    mbecker  swenv    default             35.26           173749
    twebb    swenv    default             22.78            68966
    rmingee  swenv    default              2.88             1345
    tharel   swenv    default              2.27              494
    root     zero     default              0.07               13
    mbecker  swenv    test                 0.06               13
    twebb    bin      default              0.04          7215238
    mbecker  swenv    comm_doc             0.03               17


	    (reverse sorted by activity id, group id, user id)
    uid      gid      aid                   cpu            kcore

    mbecker  swenv    comm_doc             0.03               17
    mbecker  swenv    test                 0.06               13
    mbecker  swenv    default             35.26           173749
    twebb    swenv    default             22.78            68966
    rmingee  swenv    default              2.88             1345
    tharel   swenv    default              2.27              494
    twebb    bin      default              0.04          7215238
    root     bin      default             72.25           127521
    root     zero     default              0.07               13


You don't have my /etc/{passwd,group,activities} files, so you won't
get back the same names (or perhaps any at all).  But the sort order
will be the same.  Read through the program.  I think it's pretty 
instructive.  


--tom

#########################################################################

#!/usr/bin/perl

@names = ('user id', 'group id', 'activity id', 'cpu time', 'kcore ticks');

require 'getopts.pl';

&Getopts('rna') || &usage;

$reverse = $opt_r;

&usage unless @ARGV;

sub usage {
    select(STDERR);
    print <<EOF;
usage: $0 [-r -a] keys [ keys ... ]
	where each keys arg describes a separate report
	eg: $0 123 231 54 14
	    would generate 4 reports.  The first would be
	    ordered on columns 1, 2, and 3; the second on
	    columns 2, 3, and 1; the third on columns
	    5 and 4; the fourth on columns 1 and 4.
EOF
    print "\tcolumns are: ", join(', ', @names), "\n";
    print <<EOF;
		and begin at column 1, not 0.
	-r means to use a reverse sort
	-a means to really run sa for the data
	-n means leave ids numeric
EOF
    
    exit 1;
} 


unless ($opt_a) {
    *ACCT = *DATA;  # lost the old ACCT pointer; oh well...
} else {
    $kid = open(ACCT, "sa -e|") || die "can't open pipe to sa: $!";
} 


if ($opt_n) {
    eval <<'REDEFINE';
	    sub uid { "#$id"; } 
	    sub gid { "#$id"; } 
REDEFINE
} else {
    &load_activities;  # only meaningful on convexen
}

sub INFANTICIDE {
    if ($kid) {
	warn "interrupted after reading $. records\n";
	kill $kid;
    }
} 

$SIG{'INT'} = 'INFANTICIDE';
while (<ACCT>) {
    ($command, $starting_time, $cpu, $tty, $uid, $gid, $aid,
    $elapsed_time, $kilocore, $avio, $system_time, $concur, $sched) = split;

    if ($uid eq '') {
	warn "bad data line $.: $_";
	next;
    } 

    $kilocore =~ s/k\*sec//;		 # strip noise

    $kcore {$uid, $gid, $aid} += $kilocore;
    $cpu   {$uid, $gid, $aid} += $cpu;
} 
$SIG{'INT'} = 'DEFAULT';

for (@keys = keys %kcore) {
    ($uid, $gid, $aid) = split($;, $_);

    push(@uids,   $uid);
    push(@gids,   $gid);
    push(@aids,   $aid);
}

@cpus = values %cpu;
@kcores = values %kcore;

@key_ptrs = (*uids, *gids, *aids, *cpus, *kcores);

for (@ARGV) {
    &usage unless /^[1-5]+$/;
    @order = split(//);
    grep (--$_, @order); # make 0-based
    &display(@order);
} 

exit;

################################################

sub display {
    local(@indices) = @_;
    print "\n" if $count++;
    print "\t(";
    print 'reverse ' if $reverse;
    print "sorted by ", join(", ", @names[@indices]), ")\n";
    printf "%-8s %-8s %-16s %8s %16s\n\n", 
	"uid", "gid", "aid", "cpu", "kcore";

    &printkeys(&sort_indices(@indices));
    print "\n" unless $count == @ARGV;
} 

sub printkeys {
    local(@ordered_indices) = @_;
    local($uname, $gname, $aname);
    local($last_uname, $last_gname, $last_aname);
    local($i);

    for $i (@ordered_indices) {
	($uid, $gid, $aid) = ($uids[$i], $gids[$i], $aids[$i]);

	($uname, $gname, $aname) = (&id(*uid), &id(*gid), &id(*aid));

	printf "%-8s %-8s %-16s %8.2f %16d\n", 
	    $uname, $gname, $aname,
	    $cpu{$keys[$i]}, $kcore{$keys[$i]};
    } 
} 


sub sort_indices {
    local(@key_order) = @_;
    local(*key0, *key1, *key2, *key3, *key4) = @key_ptrs[@key_order];
    sort _by_id 0..$#key0;
} 

#########
# yes, all the keyN's might not be defined, and won't be
# unless the user asks for a 5-key sort, but perl is a forgiving
# language and won't (shouldn't?) core dump in your face.
#
sub _by_id {
    ($reverse ? -1 : 1) *
    ($key0[$a] <=> $key0[$b] ||
     $key1[$a] <=> $key1[$b] ||
     $key2[$a] <=> $key2[$b] ||
     $key3[$a] <=> $key3[$b] ||
     $key4[$a] <=> $key4[$b]  );
} 

sub id {
    local(*id) = @_;  # NAME CHANGE ALERT
    $id{$id} = &id unless defined $id{$id};
    $id{$id};
} 

sub uid { (getpwuid($id))[0]; } 
sub gid { (getgrgid($id))[0]; } 
sub aid { "#$id"; }  # %aid was already pre-loaded

sub load_activities {
    $ACTIVITIES = '/etc/activities';
    open ACTIVITIES || warn "cannot open $ACTIVITIES: $!";
    while (<ACTIVITIES>) {
	chop;
	($name, $number) = split(/:/);
	$aid{$number} = $name;
    } 
} 

# here's what a line looks like from "sa -e" on a convex:

__END__
xtmexecute  673249103    0.18 ttypa      1290    304          0   42.62s       66k*sec        9io    0.13    1.00    d
sh          673249103    0.03 ttypa      1290    304          0   42.66s        1k*sec        0io    0.02    1.00    d
lf          673249144    1.42 ttyr1      1177    304          0    2.16s      284k*sec       10io    1.38    1.00    d
cocc        673249060   28.78 ttyq1     25630    304          0   87.93s   165687k*sec       27io    1.00    1.00    d
cc          673249028    0.05 ttyq1     25630    304          0  120.45s        9k*sec        3io    0.05    1.00    d
vers        673249148    0.06 ttyq1     25630    304          100    0.14s       13k*sec        4io    0.06    1.00    d
make*       673249028    0.03 ttyq1     25630    304          2000000002  120.69s       17k*sec        2io    0.02    1.00    d
rm          673249149    0.02 ttyq1     25630    304          0    0.09s        4k*sec        2io    0.02    1.00    d
xtmexecute  673249146    0.15 ttypa      1290    304          0   13.73s       41k*sec        8io    0.11    1.00    d
sh          673249146    0.05 ttypa      1290    304          0   13.79s        9k*sec        0io    0.04    1.00    d
ld.fast     673249149    4.93 ttyq1     25630    304          0   11.68s     7067k*sec      136io    2.70    1.00    d
cc          673249149    0.02 ttyq1     25630    304          0   11.74s        6k*sec        0io    0.02    1.00    d
make*       673249149    0.02 ttyq1     25630    304          0   11.86s        7k*sec        0io    0.02    1.00    d
make        673249027    0.53 ttyq1     25630    304          0  133.42s      232k*sec       12io    0.13    1.00    d
Core_S1X    673249161    0.63 ttyq1     25630    304          0    1.48s      574k*sec       18io    0.29    1.00    d
perl        673249162    0.24 ttyq1     25630    304          0    0.48s      151k*sec        7io    0.11    1.00    d
rm          673249164    0.04 ttyq1     25630    304          0    0.04s       12k*sec        0io    0.04    1.00    d
cocc        673249130   14.06 ttyp4     25534    304          0   39.03s    54302k*sec       25io    0.71    1.00    d
cc          673249125    0.04 ttyp4     25534    10          0   44.54s  7215238k*sec        2io    0.04    1.00    d
vers        673249169    0.03 ttyp4     25534    304          0    0.05s       13k*sec        1io    0.03    1.00    d
make*       673249125    0.03 ttyp4     25534    304          0   44.65s        8k*sec        0io    0.02    1.00    d
rm          673249169    0.02 ttyp4     25534    304          0    0.02s        4k*sec        0io    0.02    1.00    d
cpp         673249170    2.40 ttyp4     25534    304          0    4.94s      863k*sec       73io    0.41    1.00    d
xtmexecute  673249160    0.15 ttypa      1290    304          0   15.23s       41k*sec        8io    0.11    1.00    d
sh          673249160    0.04 ttypa      1290    304          0   15.29s        5k*sec        0io    0.04    1.00    d
spucmd      673249177    0.03 __            0      0          0    3.19s        0k*sec        1io    0.03    1.00    d
sh          673249177    0.04 __            0      0          0    3.24s       13k*sec        0io    0.03    1.00    d
clear       673249182    0.04 ttyr1      1177    304          0    0.11s       10k*sec        4io    0.03    1.00    d
pwd         673249182    0.09 ttyr1      1177    304          0    0.22s       25k*sec        4io    0.09    1.00    d
lf          673249183    0.45 ttyr1      1177    304          0    0.49s       89k*sec        5io    0.44    1.00    d
lf          673249184    0.06 ttyr1      1177    304          0    0.07s       23k*sec        0io    0.06    1.00    d
sa          673248965   72.22 __            0     10          0  220.30s   127516k*sec      372io    3.06    1.00    d
logger      673249186    0.03 __            0     10          0    0.13s        5k*sec        6io    0.02    1.00    d
xtmexecute  673249175    2.25 ttypa      1290    304          0   11.48s     1177k*sec       22io    1.37    1.00    d
sh          673249175    0.03 ttypa      1290    304          0   11.51s        5k*sec        0io    0.02    1.00    d
cocc        673249175    6.09 ttyp4     25534    304          0   16.54s    13733k*sec       24io    0.37    1.00    d
cc          673249170    0.05 ttyp4     25534    304          0   21.64s        8k*sec        2io    0.04    1.00    d
vers        673249191    0.05 ttyp4     25534    304          0    0.06s       17k*sec        2io    0.05    1.00    d
make*       673249169    0.03 ttyp4     25534    304          0   21.76s       13k*sec        0io    0.02    1.00    d
rm          673249191    0.02 ttyp4     25534    304          0    0.03s        5k*sec        0io    0.02    1.00    d
tdpr        673249194    0.21 ttyr1      1177    304          0    0.74s       63k*sec       25io    0.10    1.00    d
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
		"So much mail, so little time."