tchrist@convex.com (Tom Christiansen) (05/15/91)
[Yesterday I offered to demonstrate sorting multi-dimensional arrays on
arbitrary keys efficiently and generically if anyone had any interest in
this. Well, my mailbox says there was a good deal of interest, so it here
it all is.]
The first thing to think about when sorting N-dim arrays in perl is that
there is no such thing, or rather, their internal represenation is the
same as a normal assoc array with the keys joined together by the $;
variable. This means that to do any comparisons, you'll need to split the
keys.
Let's say we want to sort on keys 2, then 1, then 0. You could approach
it this way (assume numeric sort):
for $elt (sort by_keys keys %array) {
print $array{$elt}, "\n";
}
sub sort_keys {
@a = split($;, $array{$a});
@b = split($;, $array{$b});
$a[2] <=> $b[2] ||
$a[1] <=> $b[1] ||
$a[0] <=> $b[0];
}
This approach, though, is one fraught with problems. One problem is that
it's got too much hardcoded: both the name of the array and which elements
to compare are hard-wired into the code. Yesterday, I showed how to avoid
this by using pass-by-name and dynamic scoping. You could also use an
eval to compose the subroutine on the fly.
A bigger problem is performance: to split the key on EACH compare means
something like N*logN splits, where N is the number of keys you've got,
assuming perl's sort is an O(N*logN) one, which I think it is if it calls
qsort(). This problem arises whenever what you're sorting is not really
just $a and $b, but rather some piece of them. If you find yourself
splitting or unpacking in a sort subroutine, or pulling things out with a
regexp match, then you're very probably doing too much work.
A case where this would be ok to do this anyway would be when your N were
very small. I do such a thing in my man program when I'm sorting
available man pages (like stty.1 and stty.3c) based on a user-defined
precedence. There are almost never more than topics than match on one
hand, so I don't mind pulling out the section from the page name with
a regexp match each time. Here's the sort subroutine I use there, where
what's being passed is filenames of man pages.
sub bysection {
local ($ext1, $ext2, $precedence1, $precedence2, $section1, $section2);
($section1, $ext1) = $a =~ m:/man([^/]+)/.*\.([^.]+)(\.Z)?$:;
($section2, $ext2) = $b =~ m:/man([^/]+)/.*\.([^.]+)(\.Z)?$:;
$ext1 = $section1 if $ext1 !~ /^${section1}.*/; # like mano/gtty.3
$ext2 = $section2 if $ext2 !~ /^${section2}.*/;
$precedence1 = $MANSECT{$ext1} || $MANSECT{substr($ext1,0,1)};
$precedence2 = $MANSECT{$ext2} || $MANSECT{substr($ext2,0,1)};
$precedence1 == $precedence2 # if same defined precedence
? $a cmp $b : # use lexical compare
$precedence2 <=> $precedence1;
}
This is also a good case for why perl needs its own built-in sort
operator. While sometimes it's more efficient to call the sort binary,
and you should not be afraid to do so when this is the case, you still
need the built-in for the times when then are auxiliary criteria for your
sort that /usr/bin/sort simply cannot handle.
Back to sorting on just pieces. As in any case where you're sorting on
pieces of the keys, you want to pull out the important pieces and stick
these in a list (numerically-indexed array) and then sort on just the
indices. Let's look at a simpler case of this first. Consider df output:
often you'd like to see it sorted not in mount order, but rather in some
other order, such as the amount of space used of free or by the percentage
of free space. Now, you could of course do this:
df | sort +$1rn
where $1 is the column to sort on. But then you're header gets at the
bottom of your display, and anyway, it'll prove a good demo.
Here's the little script:
#!/usr/bin/perl
$field = shift || 4;
open(DF,"df -t 4.2|");
print $_ = <DF>; # output header first
print &sort_by_field(<DF>);
close(DF);
exit $?;
sub sort_by_field {
local(@whole) = @_;
local(@f);
for (@whole) { push(@f, (split(' '))[$field]); }
@whole[sort byfield $[..$#whole];
}
sub byfield { $f[$b] <=> $f[$a]; }
The default field to sort on will be #4, the percentage of free space.
Instead of passing all the lines to sort and constantly splitting, I first
build up an @f array, consisting of just the field I care about.
That's what this is doing:
for (@whole) { push(@f, (split(' '))[$field]); }
I then take all the subscripts of @whole, which are $[ .. $#whole, and
pass these to by sort routine. The key here is that the indices for both
the @whole array and the smaller @f one are the same, so if I sort one, I
sort the other. The &byfield routine knows to use its parameters to
subscript @f. The return from the sort will be the subscripts rearranged
according to the @f ordering, and I use this return list to subscript the
original @whole array. Since I want a reverse sort, I switch $a and $b
around in the <=> comparison.
Also notice the use of the @whole[@list] notation: it's important to
remember to use $x[] for a single scalar value, @x[] for a list. I keep
reminding people this because it seems hard for folks to catch on at
first. I got a program in the mail yesterday where this was the problem.
Just remember that the symbol ($, @) governs the object or objects
returned, not the object accessed (ok, kinda). Thus $x[] and $x{} return
single values, @x[] and @x{} both return multiple values.
If you're with me so far, it's time to take the big plunge and go to
our real topic for today: efficiently sorting N-dimensional arrays
in aritrary orders. Assume that your array has 3 subscripts. You
should first go through it and make three parallel lists.
@keys = keys %array;
for $key (@keys) {
($x, $y, $z) = split($;, $key);
push(@x, $a);
push(@y, $y);
push(@z, $z);
}
Now your @x, @y, and @z lists contain their respective parts of the
composite keys, and these are ordered the same way that the keys built-in
pulls them out of %array. Assume we're going to sort on 2, 1, and 0 as in
the first example given in this article. This will do it for you:
for $elt (sort by_keys @keys) {
print $array{$elt}, "\n";
}
sub by_keys {
$y[$a] <=> $y[$b] ||
$z[$a] <=> $z[$b] ||
$x[$a] <=> $x[$b];
}
It's true that %array is hard-coded, but that's ok. If you surround
the preceding code fragments with this, that solves the problem:
&sort_by_keys(*foo_array);
&sort_by_keys(*bar_array); # both 3-dim arrays
sub sort_by_keys {
local(*array) = @_;
local(@x, @y, @z, @keys);
....
}
There's a problem, though: the order in &by_keys is still hard-wired in,
because I've got the names $y, $z, and $x specifically coded. Sometimes
this is a bad thing, because you can't know the order before hand. Imagine
that you call &sort_by_keys this way instead, including the desired ordering
in the call:
&sort_by_keys(*foo_array, 2, 1, 0);
&sort_by_keys(*bar_array, 0, 2, 1);
Now the set of functions looks like this:
sub sort_by_keys {
local(*array, @order) = @_;
....
}
This way @order is the list of subscripts. Here comes the
tricky part now. After you've done pulled out your keys
into @x, @y, and @z, do this:
local(*key0, *key1, *key2) = (*x, *y, *z)[@order];
This makes @key0, @key1, and @key2 aliases to whatever permututation of
@x, @y, and @z is given in the @order list. I could have put my
pointers in a list:
@ptrs = (*x, *y, *z);
# then later...
local(*key0, *key1, *key2) = @ptrs[@order];
This is what I do in the full-blown program at the end of this article.
Now you just make your &by_keys routine like this:
sub by_keys {
$key0[$a] <=> $key0[$b] ||
$key1[$a] <=> $key1[$b] ||
$key2[$a] <=> $key2[$b];
}
And it will all work out just fine. Isn't that nifty? And they say perl
hasn't got pointers. Hah!
(You might ask yourself, what happens if @order is (2,1). Well, that leave
@keys2 undefined, which is ok for dereferencing. Unlike certain other
languages we fight, er, work with, perl is a foregiving language won't
coredump in your face just for dereferencing something that's undefined.
You get a nice undefined value back, which is for nearly all intents
and purposes, as good as a null value.)
One nice thing about this approach is that it uses no evals. Evals are
sure nice, and make all thing possible, but they do run more slowly that a
straight reference, even through this magic symbol table pointer
dereference via the *name notation. And besides, if we ever get a
perl-to-C compiler, evals will probably be out of the question.
Following is a real program that uses these techniques and a couple more.
Its purpose in life is to read accounting data from "sa -e" output, and
produce reports sorted on a variety of possible fields. Since it takes a
long time to go through sa, you want to read the data only once, even if
you're doing multiple reports. The possible fields are:
1: user name
2: group name
3: activity name # a convex special
4: cpu usage
5: kcore ticks
If you invoke the program as
actsum 512 415 32
then it will produce three different reports. Each set of digits
represents the order you want that report sorted in. There's also a -r
flag for reverse sorting. This is implemented by keeping a flag around,
and if set, negating the return value of the sort routine.
I keep two associative arrays around, %cpu and %kcore, which keep a
running sum of these things used as indexed by the tuple ($uid, $gid,
$aid). I split up the keys as described above into 3 parallel arrays
(@uids, @gids, @aids), and then also stuff the values of %cpu into @cpus
and those of %kcore into @kcore. This gives me 5 parallel lists whose
indices are all they same, that is (keys %cpus).
I use pointers for a couple more things in this program: first, since
your sa output won't look like mine unless you're running a 9.0 or
greater version of ConvexOS, I've included a small set of representative
data in the program's DATA segment. But you can use -a to override
this and really call sa. I do it this way:
unless ($opt_a) {
*ACCT = *DATA; # lost the old ACCT pointer; oh well...
} else {
$kid = open(ACCT, "sa -e|") || die "can't open pipe to sa: $!";
}
So that makes ACCT an alias for DATA unless $opt_a is turned on.
The other use of pointers and dynamic scope is possibly abusive. Some
of you have seen this before. It's my old &id function for translating
uids and gids (and in this case, aids) from numeric to ascii.
$uname = &id(*uid);
$gname = &id(*gid);
$aname = &id(*aid);
sub id {
local(*id) = @_; # NAME CHANGE ALERT
$id{$id} = &id unless defined $id{$id};
$id{$id};
}
sub uid { (getpwuid($id))[0]; }
sub gid { (getgrgid($id))[0]; }
sub aid { "#$id"; } # %aid was already pre-loaded
Basically, the local(*id) assigment changes all instances of the name "id"
to be either "uid", "gid", or "aid" from that point forward ACCORDING TO
RUN TIME, not compile time -- dynamic scoping and all, you know. So for
example if $uid{$uid} isn't defined, it calls getpwuid($uid) to get the
answer. If you didn't catch all that, then don't feel bad. This is one
of those things you just have to stare at a while, I guess.
Here's what the program outputs on my machine with this data set:
% actsum -r 512 415 321
(reverse sorted by kcore ticks, user id, group id)
uid gid aid cpu kcore
twebb bin default 0.04 7215238
mbecker swenv default 35.26 173749
root bin default 72.25 127521
twebb swenv default 22.78 68966
rmingee swenv default 2.88 1345
tharel swenv default 2.27 494
mbecker swenv comm_doc 0.03 17
mbecker swenv test 0.06 13
root zero default 0.07 13
(reverse sorted by cpu time, user id, kcore ticks)
uid gid aid cpu kcore
root bin default 72.25 127521
mbecker swenv default 35.26 173749
twebb swenv default 22.78 68966
rmingee swenv default 2.88 1345
tharel swenv default 2.27 494
root zero default 0.07 13
mbecker swenv test 0.06 13
twebb bin default 0.04 7215238
mbecker swenv comm_doc 0.03 17
(reverse sorted by activity id, group id, user id)
uid gid aid cpu kcore
mbecker swenv comm_doc 0.03 17
mbecker swenv test 0.06 13
mbecker swenv default 35.26 173749
twebb swenv default 22.78 68966
rmingee swenv default 2.88 1345
tharel swenv default 2.27 494
twebb bin default 0.04 7215238
root bin default 72.25 127521
root zero default 0.07 13
You don't have my /etc/{passwd,group,activities} files, so you won't
get back the same names (or perhaps any at all). But the sort order
will be the same. Read through the program. I think it's pretty
instructive.
--tom
#########################################################################
#!/usr/bin/perl
@names = ('user id', 'group id', 'activity id', 'cpu time', 'kcore ticks');
require 'getopts.pl';
&Getopts('rna') || &usage;
$reverse = $opt_r;
&usage unless @ARGV;
sub usage {
select(STDERR);
print <<EOF;
usage: $0 [-r -a] keys [ keys ... ]
where each keys arg describes a separate report
eg: $0 123 231 54 14
would generate 4 reports. The first would be
ordered on columns 1, 2, and 3; the second on
columns 2, 3, and 1; the third on columns
5 and 4; the fourth on columns 1 and 4.
EOF
print "\tcolumns are: ", join(', ', @names), "\n";
print <<EOF;
and begin at column 1, not 0.
-r means to use a reverse sort
-a means to really run sa for the data
-n means leave ids numeric
EOF
exit 1;
}
unless ($opt_a) {
*ACCT = *DATA; # lost the old ACCT pointer; oh well...
} else {
$kid = open(ACCT, "sa -e|") || die "can't open pipe to sa: $!";
}
if ($opt_n) {
eval <<'REDEFINE';
sub uid { "#$id"; }
sub gid { "#$id"; }
REDEFINE
} else {
&load_activities; # only meaningful on convexen
}
sub INFANTICIDE {
if ($kid) {
warn "interrupted after reading $. records\n";
kill $kid;
}
}
$SIG{'INT'} = 'INFANTICIDE';
while (<ACCT>) {
($command, $starting_time, $cpu, $tty, $uid, $gid, $aid,
$elapsed_time, $kilocore, $avio, $system_time, $concur, $sched) = split;
if ($uid eq '') {
warn "bad data line $.: $_";
next;
}
$kilocore =~ s/k\*sec//; # strip noise
$kcore {$uid, $gid, $aid} += $kilocore;
$cpu {$uid, $gid, $aid} += $cpu;
}
$SIG{'INT'} = 'DEFAULT';
for (@keys = keys %kcore) {
($uid, $gid, $aid) = split($;, $_);
push(@uids, $uid);
push(@gids, $gid);
push(@aids, $aid);
}
@cpus = values %cpu;
@kcores = values %kcore;
@key_ptrs = (*uids, *gids, *aids, *cpus, *kcores);
for (@ARGV) {
&usage unless /^[1-5]+$/;
@order = split(//);
grep (--$_, @order); # make 0-based
&display(@order);
}
exit;
################################################
sub display {
local(@indices) = @_;
print "\n" if $count++;
print "\t(";
print 'reverse ' if $reverse;
print "sorted by ", join(", ", @names[@indices]), ")\n";
printf "%-8s %-8s %-16s %8s %16s\n\n",
"uid", "gid", "aid", "cpu", "kcore";
&printkeys(&sort_indices(@indices));
print "\n" unless $count == @ARGV;
}
sub printkeys {
local(@ordered_indices) = @_;
local($uname, $gname, $aname);
local($last_uname, $last_gname, $last_aname);
local($i);
for $i (@ordered_indices) {
($uid, $gid, $aid) = ($uids[$i], $gids[$i], $aids[$i]);
($uname, $gname, $aname) = (&id(*uid), &id(*gid), &id(*aid));
printf "%-8s %-8s %-16s %8.2f %16d\n",
$uname, $gname, $aname,
$cpu{$keys[$i]}, $kcore{$keys[$i]};
}
}
sub sort_indices {
local(@key_order) = @_;
local(*key0, *key1, *key2, *key3, *key4) = @key_ptrs[@key_order];
sort _by_id 0..$#key0;
}
#########
# yes, all the keyN's might not be defined, and won't be
# unless the user asks for a 5-key sort, but perl is a forgiving
# language and won't (shouldn't?) core dump in your face.
#
sub _by_id {
($reverse ? -1 : 1) *
($key0[$a] <=> $key0[$b] ||
$key1[$a] <=> $key1[$b] ||
$key2[$a] <=> $key2[$b] ||
$key3[$a] <=> $key3[$b] ||
$key4[$a] <=> $key4[$b] );
}
sub id {
local(*id) = @_; # NAME CHANGE ALERT
$id{$id} = &id unless defined $id{$id};
$id{$id};
}
sub uid { (getpwuid($id))[0]; }
sub gid { (getgrgid($id))[0]; }
sub aid { "#$id"; } # %aid was already pre-loaded
sub load_activities {
$ACTIVITIES = '/etc/activities';
open ACTIVITIES || warn "cannot open $ACTIVITIES: $!";
while (<ACTIVITIES>) {
chop;
($name, $number) = split(/:/);
$aid{$number} = $name;
}
}
# here's what a line looks like from "sa -e" on a convex:
__END__
xtmexecute 673249103 0.18 ttypa 1290 304 0 42.62s 66k*sec 9io 0.13 1.00 d
sh 673249103 0.03 ttypa 1290 304 0 42.66s 1k*sec 0io 0.02 1.00 d
lf 673249144 1.42 ttyr1 1177 304 0 2.16s 284k*sec 10io 1.38 1.00 d
cocc 673249060 28.78 ttyq1 25630 304 0 87.93s 165687k*sec 27io 1.00 1.00 d
cc 673249028 0.05 ttyq1 25630 304 0 120.45s 9k*sec 3io 0.05 1.00 d
vers 673249148 0.06 ttyq1 25630 304 100 0.14s 13k*sec 4io 0.06 1.00 d
make* 673249028 0.03 ttyq1 25630 304 2000000002 120.69s 17k*sec 2io 0.02 1.00 d
rm 673249149 0.02 ttyq1 25630 304 0 0.09s 4k*sec 2io 0.02 1.00 d
xtmexecute 673249146 0.15 ttypa 1290 304 0 13.73s 41k*sec 8io 0.11 1.00 d
sh 673249146 0.05 ttypa 1290 304 0 13.79s 9k*sec 0io 0.04 1.00 d
ld.fast 673249149 4.93 ttyq1 25630 304 0 11.68s 7067k*sec 136io 2.70 1.00 d
cc 673249149 0.02 ttyq1 25630 304 0 11.74s 6k*sec 0io 0.02 1.00 d
make* 673249149 0.02 ttyq1 25630 304 0 11.86s 7k*sec 0io 0.02 1.00 d
make 673249027 0.53 ttyq1 25630 304 0 133.42s 232k*sec 12io 0.13 1.00 d
Core_S1X 673249161 0.63 ttyq1 25630 304 0 1.48s 574k*sec 18io 0.29 1.00 d
perl 673249162 0.24 ttyq1 25630 304 0 0.48s 151k*sec 7io 0.11 1.00 d
rm 673249164 0.04 ttyq1 25630 304 0 0.04s 12k*sec 0io 0.04 1.00 d
cocc 673249130 14.06 ttyp4 25534 304 0 39.03s 54302k*sec 25io 0.71 1.00 d
cc 673249125 0.04 ttyp4 25534 10 0 44.54s 7215238k*sec 2io 0.04 1.00 d
vers 673249169 0.03 ttyp4 25534 304 0 0.05s 13k*sec 1io 0.03 1.00 d
make* 673249125 0.03 ttyp4 25534 304 0 44.65s 8k*sec 0io 0.02 1.00 d
rm 673249169 0.02 ttyp4 25534 304 0 0.02s 4k*sec 0io 0.02 1.00 d
cpp 673249170 2.40 ttyp4 25534 304 0 4.94s 863k*sec 73io 0.41 1.00 d
xtmexecute 673249160 0.15 ttypa 1290 304 0 15.23s 41k*sec 8io 0.11 1.00 d
sh 673249160 0.04 ttypa 1290 304 0 15.29s 5k*sec 0io 0.04 1.00 d
spucmd 673249177 0.03 __ 0 0 0 3.19s 0k*sec 1io 0.03 1.00 d
sh 673249177 0.04 __ 0 0 0 3.24s 13k*sec 0io 0.03 1.00 d
clear 673249182 0.04 ttyr1 1177 304 0 0.11s 10k*sec 4io 0.03 1.00 d
pwd 673249182 0.09 ttyr1 1177 304 0 0.22s 25k*sec 4io 0.09 1.00 d
lf 673249183 0.45 ttyr1 1177 304 0 0.49s 89k*sec 5io 0.44 1.00 d
lf 673249184 0.06 ttyr1 1177 304 0 0.07s 23k*sec 0io 0.06 1.00 d
sa 673248965 72.22 __ 0 10 0 220.30s 127516k*sec 372io 3.06 1.00 d
logger 673249186 0.03 __ 0 10 0 0.13s 5k*sec 6io 0.02 1.00 d
xtmexecute 673249175 2.25 ttypa 1290 304 0 11.48s 1177k*sec 22io 1.37 1.00 d
sh 673249175 0.03 ttypa 1290 304 0 11.51s 5k*sec 0io 0.02 1.00 d
cocc 673249175 6.09 ttyp4 25534 304 0 16.54s 13733k*sec 24io 0.37 1.00 d
cc 673249170 0.05 ttyp4 25534 304 0 21.64s 8k*sec 2io 0.04 1.00 d
vers 673249191 0.05 ttyp4 25534 304 0 0.06s 17k*sec 2io 0.05 1.00 d
make* 673249169 0.03 ttyp4 25534 304 0 21.76s 13k*sec 0io 0.02 1.00 d
rm 673249191 0.02 ttyp4 25534 304 0 0.03s 5k*sec 0io 0.02 1.00 d
tdpr 673249194 0.21 ttyr1 1177 304 0 0.74s 63k*sec 25io 0.10 1.00 d
--
Tom Christiansen tchrist@convex.com convex!tchrist
"So much mail, so little time."