tchrist@convex.com (Tom Christiansen) (05/15/91)
[Yesterday I offered to demonstrate sorting multi-dimensional arrays on arbitrary keys efficiently and generically if anyone had any interest in this. Well, my mailbox says there was a good deal of interest, so it here it all is.] The first thing to think about when sorting N-dim arrays in perl is that there is no such thing, or rather, their internal represenation is the same as a normal assoc array with the keys joined together by the $; variable. This means that to do any comparisons, you'll need to split the keys. Let's say we want to sort on keys 2, then 1, then 0. You could approach it this way (assume numeric sort): for $elt (sort by_keys keys %array) { print $array{$elt}, "\n"; } sub sort_keys { @a = split($;, $array{$a}); @b = split($;, $array{$b}); $a[2] <=> $b[2] || $a[1] <=> $b[1] || $a[0] <=> $b[0]; } This approach, though, is one fraught with problems. One problem is that it's got too much hardcoded: both the name of the array and which elements to compare are hard-wired into the code. Yesterday, I showed how to avoid this by using pass-by-name and dynamic scoping. You could also use an eval to compose the subroutine on the fly. A bigger problem is performance: to split the key on EACH compare means something like N*logN splits, where N is the number of keys you've got, assuming perl's sort is an O(N*logN) one, which I think it is if it calls qsort(). This problem arises whenever what you're sorting is not really just $a and $b, but rather some piece of them. If you find yourself splitting or unpacking in a sort subroutine, or pulling things out with a regexp match, then you're very probably doing too much work. A case where this would be ok to do this anyway would be when your N were very small. I do such a thing in my man program when I'm sorting available man pages (like stty.1 and stty.3c) based on a user-defined precedence. There are almost never more than topics than match on one hand, so I don't mind pulling out the section from the page name with a regexp match each time. Here's the sort subroutine I use there, where what's being passed is filenames of man pages. sub bysection { local ($ext1, $ext2, $precedence1, $precedence2, $section1, $section2); ($section1, $ext1) = $a =~ m:/man([^/]+)/.*\.([^.]+)(\.Z)?$:; ($section2, $ext2) = $b =~ m:/man([^/]+)/.*\.([^.]+)(\.Z)?$:; $ext1 = $section1 if $ext1 !~ /^${section1}.*/; # like mano/gtty.3 $ext2 = $section2 if $ext2 !~ /^${section2}.*/; $precedence1 = $MANSECT{$ext1} || $MANSECT{substr($ext1,0,1)}; $precedence2 = $MANSECT{$ext2} || $MANSECT{substr($ext2,0,1)}; $precedence1 == $precedence2 # if same defined precedence ? $a cmp $b : # use lexical compare $precedence2 <=> $precedence1; } This is also a good case for why perl needs its own built-in sort operator. While sometimes it's more efficient to call the sort binary, and you should not be afraid to do so when this is the case, you still need the built-in for the times when then are auxiliary criteria for your sort that /usr/bin/sort simply cannot handle. Back to sorting on just pieces. As in any case where you're sorting on pieces of the keys, you want to pull out the important pieces and stick these in a list (numerically-indexed array) and then sort on just the indices. Let's look at a simpler case of this first. Consider df output: often you'd like to see it sorted not in mount order, but rather in some other order, such as the amount of space used of free or by the percentage of free space. Now, you could of course do this: df | sort +$1rn where $1 is the column to sort on. But then you're header gets at the bottom of your display, and anyway, it'll prove a good demo. Here's the little script: #!/usr/bin/perl $field = shift || 4; open(DF,"df -t 4.2|"); print $_ = <DF>; # output header first print &sort_by_field(<DF>); close(DF); exit $?; sub sort_by_field { local(@whole) = @_; local(@f); for (@whole) { push(@f, (split(' '))[$field]); } @whole[sort byfield $[..$#whole]; } sub byfield { $f[$b] <=> $f[$a]; } The default field to sort on will be #4, the percentage of free space. Instead of passing all the lines to sort and constantly splitting, I first build up an @f array, consisting of just the field I care about. That's what this is doing: for (@whole) { push(@f, (split(' '))[$field]); } I then take all the subscripts of @whole, which are $[ .. $#whole, and pass these to by sort routine. The key here is that the indices for both the @whole array and the smaller @f one are the same, so if I sort one, I sort the other. The &byfield routine knows to use its parameters to subscript @f. The return from the sort will be the subscripts rearranged according to the @f ordering, and I use this return list to subscript the original @whole array. Since I want a reverse sort, I switch $a and $b around in the <=> comparison. Also notice the use of the @whole[@list] notation: it's important to remember to use $x[] for a single scalar value, @x[] for a list. I keep reminding people this because it seems hard for folks to catch on at first. I got a program in the mail yesterday where this was the problem. Just remember that the symbol ($, @) governs the object or objects returned, not the object accessed (ok, kinda). Thus $x[] and $x{} return single values, @x[] and @x{} both return multiple values. If you're with me so far, it's time to take the big plunge and go to our real topic for today: efficiently sorting N-dimensional arrays in aritrary orders. Assume that your array has 3 subscripts. You should first go through it and make three parallel lists. @keys = keys %array; for $key (@keys) { ($x, $y, $z) = split($;, $key); push(@x, $a); push(@y, $y); push(@z, $z); } Now your @x, @y, and @z lists contain their respective parts of the composite keys, and these are ordered the same way that the keys built-in pulls them out of %array. Assume we're going to sort on 2, 1, and 0 as in the first example given in this article. This will do it for you: for $elt (sort by_keys @keys) { print $array{$elt}, "\n"; } sub by_keys { $y[$a] <=> $y[$b] || $z[$a] <=> $z[$b] || $x[$a] <=> $x[$b]; } It's true that %array is hard-coded, but that's ok. If you surround the preceding code fragments with this, that solves the problem: &sort_by_keys(*foo_array); &sort_by_keys(*bar_array); # both 3-dim arrays sub sort_by_keys { local(*array) = @_; local(@x, @y, @z, @keys); .... } There's a problem, though: the order in &by_keys is still hard-wired in, because I've got the names $y, $z, and $x specifically coded. Sometimes this is a bad thing, because you can't know the order before hand. Imagine that you call &sort_by_keys this way instead, including the desired ordering in the call: &sort_by_keys(*foo_array, 2, 1, 0); &sort_by_keys(*bar_array, 0, 2, 1); Now the set of functions looks like this: sub sort_by_keys { local(*array, @order) = @_; .... } This way @order is the list of subscripts. Here comes the tricky part now. After you've done pulled out your keys into @x, @y, and @z, do this: local(*key0, *key1, *key2) = (*x, *y, *z)[@order]; This makes @key0, @key1, and @key2 aliases to whatever permututation of @x, @y, and @z is given in the @order list. I could have put my pointers in a list: @ptrs = (*x, *y, *z); # then later... local(*key0, *key1, *key2) = @ptrs[@order]; This is what I do in the full-blown program at the end of this article. Now you just make your &by_keys routine like this: sub by_keys { $key0[$a] <=> $key0[$b] || $key1[$a] <=> $key1[$b] || $key2[$a] <=> $key2[$b]; } And it will all work out just fine. Isn't that nifty? And they say perl hasn't got pointers. Hah! (You might ask yourself, what happens if @order is (2,1). Well, that leave @keys2 undefined, which is ok for dereferencing. Unlike certain other languages we fight, er, work with, perl is a foregiving language won't coredump in your face just for dereferencing something that's undefined. You get a nice undefined value back, which is for nearly all intents and purposes, as good as a null value.) One nice thing about this approach is that it uses no evals. Evals are sure nice, and make all thing possible, but they do run more slowly that a straight reference, even through this magic symbol table pointer dereference via the *name notation. And besides, if we ever get a perl-to-C compiler, evals will probably be out of the question. Following is a real program that uses these techniques and a couple more. Its purpose in life is to read accounting data from "sa -e" output, and produce reports sorted on a variety of possible fields. Since it takes a long time to go through sa, you want to read the data only once, even if you're doing multiple reports. The possible fields are: 1: user name 2: group name 3: activity name # a convex special 4: cpu usage 5: kcore ticks If you invoke the program as actsum 512 415 32 then it will produce three different reports. Each set of digits represents the order you want that report sorted in. There's also a -r flag for reverse sorting. This is implemented by keeping a flag around, and if set, negating the return value of the sort routine. I keep two associative arrays around, %cpu and %kcore, which keep a running sum of these things used as indexed by the tuple ($uid, $gid, $aid). I split up the keys as described above into 3 parallel arrays (@uids, @gids, @aids), and then also stuff the values of %cpu into @cpus and those of %kcore into @kcore. This gives me 5 parallel lists whose indices are all they same, that is (keys %cpus). I use pointers for a couple more things in this program: first, since your sa output won't look like mine unless you're running a 9.0 or greater version of ConvexOS, I've included a small set of representative data in the program's DATA segment. But you can use -a to override this and really call sa. I do it this way: unless ($opt_a) { *ACCT = *DATA; # lost the old ACCT pointer; oh well... } else { $kid = open(ACCT, "sa -e|") || die "can't open pipe to sa: $!"; } So that makes ACCT an alias for DATA unless $opt_a is turned on. The other use of pointers and dynamic scope is possibly abusive. Some of you have seen this before. It's my old &id function for translating uids and gids (and in this case, aids) from numeric to ascii. $uname = &id(*uid); $gname = &id(*gid); $aname = &id(*aid); sub id { local(*id) = @_; # NAME CHANGE ALERT $id{$id} = &id unless defined $id{$id}; $id{$id}; } sub uid { (getpwuid($id))[0]; } sub gid { (getgrgid($id))[0]; } sub aid { "#$id"; } # %aid was already pre-loaded Basically, the local(*id) assigment changes all instances of the name "id" to be either "uid", "gid", or "aid" from that point forward ACCORDING TO RUN TIME, not compile time -- dynamic scoping and all, you know. So for example if $uid{$uid} isn't defined, it calls getpwuid($uid) to get the answer. If you didn't catch all that, then don't feel bad. This is one of those things you just have to stare at a while, I guess. Here's what the program outputs on my machine with this data set: % actsum -r 512 415 321 (reverse sorted by kcore ticks, user id, group id) uid gid aid cpu kcore twebb bin default 0.04 7215238 mbecker swenv default 35.26 173749 root bin default 72.25 127521 twebb swenv default 22.78 68966 rmingee swenv default 2.88 1345 tharel swenv default 2.27 494 mbecker swenv comm_doc 0.03 17 mbecker swenv test 0.06 13 root zero default 0.07 13 (reverse sorted by cpu time, user id, kcore ticks) uid gid aid cpu kcore root bin default 72.25 127521 mbecker swenv default 35.26 173749 twebb swenv default 22.78 68966 rmingee swenv default 2.88 1345 tharel swenv default 2.27 494 root zero default 0.07 13 mbecker swenv test 0.06 13 twebb bin default 0.04 7215238 mbecker swenv comm_doc 0.03 17 (reverse sorted by activity id, group id, user id) uid gid aid cpu kcore mbecker swenv comm_doc 0.03 17 mbecker swenv test 0.06 13 mbecker swenv default 35.26 173749 twebb swenv default 22.78 68966 rmingee swenv default 2.88 1345 tharel swenv default 2.27 494 twebb bin default 0.04 7215238 root bin default 72.25 127521 root zero default 0.07 13 You don't have my /etc/{passwd,group,activities} files, so you won't get back the same names (or perhaps any at all). But the sort order will be the same. Read through the program. I think it's pretty instructive. --tom ######################################################################### #!/usr/bin/perl @names = ('user id', 'group id', 'activity id', 'cpu time', 'kcore ticks'); require 'getopts.pl'; &Getopts('rna') || &usage; $reverse = $opt_r; &usage unless @ARGV; sub usage { select(STDERR); print <<EOF; usage: $0 [-r -a] keys [ keys ... ] where each keys arg describes a separate report eg: $0 123 231 54 14 would generate 4 reports. The first would be ordered on columns 1, 2, and 3; the second on columns 2, 3, and 1; the third on columns 5 and 4; the fourth on columns 1 and 4. EOF print "\tcolumns are: ", join(', ', @names), "\n"; print <<EOF; and begin at column 1, not 0. -r means to use a reverse sort -a means to really run sa for the data -n means leave ids numeric EOF exit 1; } unless ($opt_a) { *ACCT = *DATA; # lost the old ACCT pointer; oh well... } else { $kid = open(ACCT, "sa -e|") || die "can't open pipe to sa: $!"; } if ($opt_n) { eval <<'REDEFINE'; sub uid { "#$id"; } sub gid { "#$id"; } REDEFINE } else { &load_activities; # only meaningful on convexen } sub INFANTICIDE { if ($kid) { warn "interrupted after reading $. records\n"; kill $kid; } } $SIG{'INT'} = 'INFANTICIDE'; while (<ACCT>) { ($command, $starting_time, $cpu, $tty, $uid, $gid, $aid, $elapsed_time, $kilocore, $avio, $system_time, $concur, $sched) = split; if ($uid eq '') { warn "bad data line $.: $_"; next; } $kilocore =~ s/k\*sec//; # strip noise $kcore {$uid, $gid, $aid} += $kilocore; $cpu {$uid, $gid, $aid} += $cpu; } $SIG{'INT'} = 'DEFAULT'; for (@keys = keys %kcore) { ($uid, $gid, $aid) = split($;, $_); push(@uids, $uid); push(@gids, $gid); push(@aids, $aid); } @cpus = values %cpu; @kcores = values %kcore; @key_ptrs = (*uids, *gids, *aids, *cpus, *kcores); for (@ARGV) { &usage unless /^[1-5]+$/; @order = split(//); grep (--$_, @order); # make 0-based &display(@order); } exit; ################################################ sub display { local(@indices) = @_; print "\n" if $count++; print "\t("; print 'reverse ' if $reverse; print "sorted by ", join(", ", @names[@indices]), ")\n"; printf "%-8s %-8s %-16s %8s %16s\n\n", "uid", "gid", "aid", "cpu", "kcore"; &printkeys(&sort_indices(@indices)); print "\n" unless $count == @ARGV; } sub printkeys { local(@ordered_indices) = @_; local($uname, $gname, $aname); local($last_uname, $last_gname, $last_aname); local($i); for $i (@ordered_indices) { ($uid, $gid, $aid) = ($uids[$i], $gids[$i], $aids[$i]); ($uname, $gname, $aname) = (&id(*uid), &id(*gid), &id(*aid)); printf "%-8s %-8s %-16s %8.2f %16d\n", $uname, $gname, $aname, $cpu{$keys[$i]}, $kcore{$keys[$i]}; } } sub sort_indices { local(@key_order) = @_; local(*key0, *key1, *key2, *key3, *key4) = @key_ptrs[@key_order]; sort _by_id 0..$#key0; } ######### # yes, all the keyN's might not be defined, and won't be # unless the user asks for a 5-key sort, but perl is a forgiving # language and won't (shouldn't?) core dump in your face. # sub _by_id { ($reverse ? -1 : 1) * ($key0[$a] <=> $key0[$b] || $key1[$a] <=> $key1[$b] || $key2[$a] <=> $key2[$b] || $key3[$a] <=> $key3[$b] || $key4[$a] <=> $key4[$b] ); } sub id { local(*id) = @_; # NAME CHANGE ALERT $id{$id} = &id unless defined $id{$id}; $id{$id}; } sub uid { (getpwuid($id))[0]; } sub gid { (getgrgid($id))[0]; } sub aid { "#$id"; } # %aid was already pre-loaded sub load_activities { $ACTIVITIES = '/etc/activities'; open ACTIVITIES || warn "cannot open $ACTIVITIES: $!"; while (<ACTIVITIES>) { chop; ($name, $number) = split(/:/); $aid{$number} = $name; } } # here's what a line looks like from "sa -e" on a convex: __END__ xtmexecute 673249103 0.18 ttypa 1290 304 0 42.62s 66k*sec 9io 0.13 1.00 d sh 673249103 0.03 ttypa 1290 304 0 42.66s 1k*sec 0io 0.02 1.00 d lf 673249144 1.42 ttyr1 1177 304 0 2.16s 284k*sec 10io 1.38 1.00 d cocc 673249060 28.78 ttyq1 25630 304 0 87.93s 165687k*sec 27io 1.00 1.00 d cc 673249028 0.05 ttyq1 25630 304 0 120.45s 9k*sec 3io 0.05 1.00 d vers 673249148 0.06 ttyq1 25630 304 100 0.14s 13k*sec 4io 0.06 1.00 d make* 673249028 0.03 ttyq1 25630 304 2000000002 120.69s 17k*sec 2io 0.02 1.00 d rm 673249149 0.02 ttyq1 25630 304 0 0.09s 4k*sec 2io 0.02 1.00 d xtmexecute 673249146 0.15 ttypa 1290 304 0 13.73s 41k*sec 8io 0.11 1.00 d sh 673249146 0.05 ttypa 1290 304 0 13.79s 9k*sec 0io 0.04 1.00 d ld.fast 673249149 4.93 ttyq1 25630 304 0 11.68s 7067k*sec 136io 2.70 1.00 d cc 673249149 0.02 ttyq1 25630 304 0 11.74s 6k*sec 0io 0.02 1.00 d make* 673249149 0.02 ttyq1 25630 304 0 11.86s 7k*sec 0io 0.02 1.00 d make 673249027 0.53 ttyq1 25630 304 0 133.42s 232k*sec 12io 0.13 1.00 d Core_S1X 673249161 0.63 ttyq1 25630 304 0 1.48s 574k*sec 18io 0.29 1.00 d perl 673249162 0.24 ttyq1 25630 304 0 0.48s 151k*sec 7io 0.11 1.00 d rm 673249164 0.04 ttyq1 25630 304 0 0.04s 12k*sec 0io 0.04 1.00 d cocc 673249130 14.06 ttyp4 25534 304 0 39.03s 54302k*sec 25io 0.71 1.00 d cc 673249125 0.04 ttyp4 25534 10 0 44.54s 7215238k*sec 2io 0.04 1.00 d vers 673249169 0.03 ttyp4 25534 304 0 0.05s 13k*sec 1io 0.03 1.00 d make* 673249125 0.03 ttyp4 25534 304 0 44.65s 8k*sec 0io 0.02 1.00 d rm 673249169 0.02 ttyp4 25534 304 0 0.02s 4k*sec 0io 0.02 1.00 d cpp 673249170 2.40 ttyp4 25534 304 0 4.94s 863k*sec 73io 0.41 1.00 d xtmexecute 673249160 0.15 ttypa 1290 304 0 15.23s 41k*sec 8io 0.11 1.00 d sh 673249160 0.04 ttypa 1290 304 0 15.29s 5k*sec 0io 0.04 1.00 d spucmd 673249177 0.03 __ 0 0 0 3.19s 0k*sec 1io 0.03 1.00 d sh 673249177 0.04 __ 0 0 0 3.24s 13k*sec 0io 0.03 1.00 d clear 673249182 0.04 ttyr1 1177 304 0 0.11s 10k*sec 4io 0.03 1.00 d pwd 673249182 0.09 ttyr1 1177 304 0 0.22s 25k*sec 4io 0.09 1.00 d lf 673249183 0.45 ttyr1 1177 304 0 0.49s 89k*sec 5io 0.44 1.00 d lf 673249184 0.06 ttyr1 1177 304 0 0.07s 23k*sec 0io 0.06 1.00 d sa 673248965 72.22 __ 0 10 0 220.30s 127516k*sec 372io 3.06 1.00 d logger 673249186 0.03 __ 0 10 0 0.13s 5k*sec 6io 0.02 1.00 d xtmexecute 673249175 2.25 ttypa 1290 304 0 11.48s 1177k*sec 22io 1.37 1.00 d sh 673249175 0.03 ttypa 1290 304 0 11.51s 5k*sec 0io 0.02 1.00 d cocc 673249175 6.09 ttyp4 25534 304 0 16.54s 13733k*sec 24io 0.37 1.00 d cc 673249170 0.05 ttyp4 25534 304 0 21.64s 8k*sec 2io 0.04 1.00 d vers 673249191 0.05 ttyp4 25534 304 0 0.06s 17k*sec 2io 0.05 1.00 d make* 673249169 0.03 ttyp4 25534 304 0 21.76s 13k*sec 0io 0.02 1.00 d rm 673249191 0.02 ttyp4 25534 304 0 0.03s 5k*sec 0io 0.02 1.00 d tdpr 673249194 0.21 ttyr1 1177 304 0 0.74s 63k*sec 25io 0.10 1.00 d -- Tom Christiansen tchrist@convex.com convex!tchrist "So much mail, so little time."