chip@tct.uucp (Chip Salzenberg) (01/30/91)
According to brnstnd@kramden.acf.nyu.edu (Dan Bernstein):
>Be serious, Karl.  We're talking about seven simple commands in a
>pipeline versus nearly 30 lines of perl to do the same job: namely, get
>a list of all names where there are two different executables in PATH.

So maybe Karl did it the long way.  Here's my version of pathdup in
Perl, and it's only ten lines long:

========================================================================
for $dir (split(/:/, $ENV{'PATH'})) {
    opendir(DIR, length($dir) ? $dir : ".") || next;
    while (defined($file = readdir(DIR))) {
        next if $file eq "." || $file eq ".." || !-x "$dir/$file";
        $M{$file} = 1 if defined($P{$file});
        $P{$file} .= "\0$dir/$file";
    }
    closedir(DIR);
}
grep(grep(length && (print $_, "\n"), split(/\0/, $P{$_})), sort keys(%M));
========================================================================

I even tested it.  :-)

>It's a hell of a lot simpler to let ls, sort, and uniq do the work than
>to bother traversing directories, stat()ing files, and keeping arrays of
>one thing or another by hand.  Surely you agree.

Except that your version also uses rev (what's that?) and BSD-specific
options to ls.  The simplest script in the world is useless to me if it
doesn't work on my system.  In contrast, my Perl script works on all
OSs on which readdir() et al are available.

If readdir() is not available, and if the ability to distinguish
filenames with newlines is not important -- as is apparent from your
willingness to use ls -- then `cd $dir && ls -a` may be used instead of
opendir()/readdir()/closedir().

Your shell script is a marvel of tool usage.  But to me it's a "dancing
bear" script: what's amazing isn't that it's robust and portable -- it
isn't -- but that it works at all.

>Tom knows the shell script is a lot simpler ...

Somewhat simpler (not a lot), but broken for my SysV environment.
"Anything that works is better than anything that doesn't."  --me
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
"All this is conjecture of course, since I *only* post in the nude.
Nothing comes between me and my t.b. Nothing." -- Bill Coderre
tchrist@convex.COM (Tom Christiansen) (01/30/91)
From the keyboard of chip@tct.uucp (Chip Salzenberg):
:>Be serious, Karl.  We're talking about seven simple commands in a
:>pipeline versus nearly 30 lines of perl to do the same job: namely, get
:>a list of all names where there are two different executables in PATH.
:
:So maybe Karl did it the long way.  Here's my version of pathdup in
:Perl, and it's only ten lines long:

[ clever perl script deleted ]

I think he was talking about my script, not Karl's.  About two-thirds
of those 30 lines were debugging, error messages, and added space for
legibility.  Since I've been criticized for verbosity for putting in
such features, they've been duly excised.

Chip's \0 stuff is sure the right way to go if you're that concerned.
Mine may list things whose directories have spaces in them.  I'm not
sure how easy it is to discern files with \n's in them from separate
entries in Chip's output.

Here are 7 lines (although I only see 5 lines of code), and it does run
pretty fast.  Notice I don't count files linked together -- if you
comment out that code, the trailing ``|| $f{$_...'' part of line 4, the
user time is cut in half.  It'll still run much faster than the shell
script without that.  Try it -- the differences are dramatic.

1  for $d (split(/:/, $ENV{'PATH'})) {
2      next if $seen{$d}++ || $d =~ /^\.?$/ || !chdir($d) || !opendir(DIR,'.');
3      while (defined($_ = readdir(DIR))) {
4          $b{$_} .= " $d" unless /^\.{1,2}$/ || !(-x && -f _) || $f{$_,(stat(_))[0,1]}++;
5      }
6  }
7  for (keys %b) { print $_, ":", $b{$_}, "\n" if rindex($b{$_}, ' '); }

But I like the $files{$file,$dev,$ino} part (as it read originally)
because I didn't want to count the same real file twice.  I've got a
lot linked in from /bin and /etc into /usr/bin and /usr/etc.  Detecting
this was one of the script's goals.  Doing that in a shell script isn't
always feasible.  What about hard links?
That's one of the major problems with these dancing-bear shell-script
examples (wonderful phrase, Chip!): because of the munge-until-done
strategy, it's hard to get feedback between the passes without kluges.

Surely for some things the pipe and backquote stuff are great; here's
an old tcsh alias of mine from before I got into ksh shell functions:

    alias vall "vi '+/\!:1' `grep -l \!:1 *.[^oa]`"

I wouldn't feel obliged to write this as a script.  But for the more
complicated tasks, such as here where we're looking at all the files in
your path, finding dups, and weeding out false positives, the speed
hits and kluginess become too much to tolerate.

Maybe different people have different tolerance levels for gross kluges
and slow code.  I'm less tolerant of the latter than the former.  :-)

--tom
-- 
"Hey, did you hear Stallman has replaced /vmunix with /vmunix.el?  Now
he can finally have the whole O/S built-in to his editor like he always
wanted!" --me (Tom Christiansen <tchrist@convex.com>)
chip@tct.uucp (Chip Salzenberg) (01/31/91)
According to tchrist@convex.COM (Tom Christiansen):
>I think he was talking about my script, not Karl's.

He certainly was; sorry for the misunderstanding.

>Chip's \0 stuff is sure the right way to go if you're that concerned.

After I posted my script, I realized that _trailing_ \0 is easier to
deal with, since the simplest output format is then "s/\0/\n/".  But I
went back to the leading \0 when I changed my output format.

>I'm not sure how easy it is to discern files with \n's in them from
>separate entries in Chip's output.

A directory that ends in a newline could go undetected.  Obviously I
could octalize all the names before printing them, but that takes the
output loop from grep() to for(), upping the line count by three.  :-)

>Here are 7 lines (although I only see 5 lines of code), and it does
>run pretty fast.

But you cheated by omitting closedir(DIR).

>Notice I don't count files linked together.

Ah, that feature is useful.  But your implementation skips the useful
heuristic of checking st_nlink.  Here's my latest version, still only
ten lines long (with closedir()), and with a better output format
("command dir1 dir2"):

===============================================================================
for $dir (split(/:/, $ENV{'PATH'})) {
    opendir(DIR, length($dir) ? $dir : ".") || next;
    while (defined($file = readdir(DIR))) {
        $_ = "$dir/$file";
        $D{$file} .= "\0$dir" unless m#/\.\.?$# || !-f $_ || !-x _
            || (stat(_))[3] > 1 && $F{$file,(stat(_))[0,1]}++;
    }
    closedir(DIR);
}
grep(((($x = $D{$_}) =~ s/\0/ /g) > 1 && print "$_\t$x\n"), sort keys(%D));
===============================================================================

Try getting that output format from a shell script...  :-)
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
"I want to mention that my opinions whether real or not are MY opinions."
             -- the inevitable William "Billy" Steinmetz
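[Editorial aside: Chip's `(stat(_))[3] > 1` guard consults the link count
before paying for the dev/ino bookkeeping, since only a multiply linked file
can appear under two names.  The identity it relies on -- two hard links
sharing one inode -- can be checked from the shell.  A minimal sketch; the
filenames are invented for the demonstration:]

```shell
#!/bin/sh
# Sketch: two hard links to one file report the same inode number,
# which is how a dev/ino key avoids counting "the same real file" twice.
tmp=`mktemp -d`
echo hello > "$tmp/orig"
ln "$tmp/orig" "$tmp/alias"            # hard link, not a copy
ino1=`ls -i "$tmp/orig"  | awk '{print $1}'`
ino2=`ls -i "$tmp/alias" | awk '{print $1}'`
test "$ino1" = "$ino2" && echo "same inode, so same file"
rm -rf "$tmp"
```

[The link count visible in `ls -l` (here 2) is exactly what the
`(stat(_))[3] > 1` test reads as field 3 of Perl's stat().]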
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (01/31/91)
In article <27A5C105.1019@tct.uucp> chip@tct.uucp (Chip Salzenberg) writes:
> So maybe Karl did it the long way.  Here's my version of pathdup in
> Perl, and it's only ten lines long:

Chip, you just typed ten lines of perl, including a few hundred
characters, to achieve basically the same effect as

  $ ls -F `echo $PATH | tr : '\012'` | sed -n 's/\*$//p' | sort | uniq -d

or, in csh,

  % ls -F $path | sed -n 's/\*$//p' | sort | uniq -d

Are you sure you're spending your programming time wisely?

---Dan
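[Editorial aside: Dan's sh pipeline can be tried on a toy PATH; the directory
and command names below are invented, and a standard ls/sed/sort/uniq
toolchain is assumed:]

```shell
#!/bin/sh
# Build two fake PATH directories with one overlapping executable name.
tmp=`mktemp -d`
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/a" "$tmp/dir1/b" "$tmp/dir2/b" "$tmp/dir2/c"
chmod +x "$tmp"/dir1/* "$tmp"/dir2/*

# ls -F appends '*' to executables; sed keeps only those lines and
# strips the marker; sort | uniq -d prints names seen more than once.
ls -F "$tmp/dir1" "$tmp/dir2" | sed -n 's/\*$//p' | sort | uniq -d
# prints: b

rm -rf "$tmp"
```

[The per-directory header lines that ls emits for multiple arguments end in
':' rather than '*', so the sed filter discards them for free.]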
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (01/31/91)
Tom and Chip now have packed 7-line and 10-line Perl scripts to achieve
the effect of a half-line shell script.  This comes after Tom's 36-line
Perl script to accomplish the same thing as a 20-token, 100-character
shell script.  (At the end of this article I develop, step by step, an
even simpler pipeline for the repeats-in-path problem.)

Folks, the tools available from the shell are simply more powerful than
Perl's function calls.  They don't cover every situation, but when they
do the job cleanly, they provide much more compact solutions than Perl.

In article <1991Jan30.055113.26485@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
> It'll still run much faster than the shell
> script without that.  Try it -- the differences are dramatic.

This is an outright lie.  ls takes by far the bulk of the time in the
shell scripts I've posted; ls's file tree walk code has been heavily
optimized over the years; perl doesn't run faster than ls.  I suspect
Tom is comparing ls *with* -Fl to his script *without* stat or -x, and
that's cheating.

> But I like the $files{$file,$dev,$ino} part (as it read originally)
> because I didn't want to count the same real file twice.

This, Chip, is what you missed.  Tom's original 36-line Perl script
does this; my 20-token shell script does it.

> because of the munge-until-done
> strategy,

A shell script is not munged.  A shell script grows.  Take a sample
problem: find all names that correspond to two different executables in
$path (csh).  To build a shell script for this, imagine the data
flowing from one filter to another.

You need to look at all executables in $path?  Fine, start with full
information about every file in $path:

  ls -ilF $path

(use L to go through symbolic links).  Now extract the executables;
ls marks them with a trailing *:

  ls -ilF $path | sed -n 's/\*$//p'

Now eliminate repeated lines that talk about the same file.
One utility both brings lines together and eliminates duplicates:

  ls -ilF $path | sed -n 's/\*$//p' | sort -u

Now strip the extra information, leaving just the filenames of all
different executables in $path:

  ls -ilF $path | sed -n 's/\*$//p' | sort -u | sed 's/.* //'

Finally, sort into order and extract the repetitions:

  ls -ilF $path | sed -n 's/\*$//p' | sort -u | sed 's/.* //' | sort | uniq -d

Compare this simple script to Tom's original 36-line monster.  Was this
really so hard to develop?  Is ``munge until done'' really an accurate
description of data-flow programming?  Is this script such a
maintenance nightmare?  (To port it to sh, for instance, you replace
$path with the standard invocation `echo $PATH | tr : '\012'`.  Is that
so painful?)

This is just one of many strategies for attacking the problem.  You
build shell scripts naturally out of the data available and out of what
you can transform that data into.  I daresay there is no such simple
strategy in Perl.

---Dan
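[Editorial aside: the sh port Dan mentions can be exercised end to end on a
scratch PATH.  All names below are invented.  The interesting case is the
inode step: a hard-linked pair should collapse under sort -u, while two
genuinely different executables with the same name survive to uniq -d:]

```shell
#!/bin/sh
tmp=`mktemp -d`
mkdir "$tmp/dir1" "$tmp/dir2"
echo x > "$tmp/dir1/bar"; echo x > "$tmp/dir2/bar"            # two distinct files
echo y > "$tmp/dir1/foo"; ln "$tmp/dir1/foo" "$tmp/dir2/foo"  # one file, two links
chmod +x "$tmp"/dir1/* "$tmp"/dir2/*

p="$tmp/dir1:$tmp/dir2"
ls -ilF `echo "$p" | tr : '\012'` |
    sed -n 's/\*$//p' | sort -u | sed 's/.* //' | sort | uniq -d
# "bar" is reported: two executables, two different inodes.
# "foo" is one file under two names; on typical systems its two listing
# lines (same inode, size, date) are identical, so sort -u collapses
# them and foo is not reported.

rm -rf "$tmp"
```

[The "total" and directory-header lines that ls -l emits never end in '*',
so the same sed filter drops them.]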
karl@ficc.ferranti.com (Karl Lehenbauer) (01/31/91)
In article <1991Jan30.055113.26485@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>I think he was talking about my script, not Karl's.  About two-thirds of
>those 30 lines were debugging, error messages, and added space for
>legibility.  Since I've been criticized for verbosity for putting
>in such features, they've been duly excised.

Yes, my Tcl script contained 15 lines of code.  There were 21 comments
and blank lines to improve readability and maintainability.  ("Though a
program be but three lines long, someday it will have to be
maintained." -- Tao of Programming)

Below, I have crushed it into six lines just to show it could be done.
Yes, it still works.  You can get it down to one line by grouping the
remaining lines together with semicolons where the lines don't end with
open curly brackets.

Compare this to the alt.sources posting of pathdups.tcl, consider in
general the hacked-out shell scripts you've seen, and I think most
people will agree that

  least_number_of_lines_of_code != most_easily_understood_program

foreach dir [split : [getenv PATH]] { cd $dir
  foreach file [globok *] { if [info exists files($file)] {
      set files($file) [concat $files($file) $dir]; set dups($file) ""
    } else { set files($file) $dir } } }
set index "" ; while {[next_element dups index file]} {
  print "$file => $files($file)\n" }
-- 
-- "If it ain't too broke, don't fix it."  -- me, with apologies to Bert Lantz
merlyn@iwarp.intel.com (Randal L. Schwartz) (02/01/91)
In article <27A72A04.3BCA@tct.uucp>, chip@tct (Chip Salzenberg) writes:
| Here's my latest version, still only ten lines long (with closedir()),
| and with a better output format ("command dir1 dir2"):
|
| ===========================================================================
| for $dir (split(/:/, $ENV{'PATH'})) {
|     opendir(DIR, length($dir) ? $dir : ".") || next;
|     while (defined($file = readdir(DIR))) {
|         $_ = "$dir/$file";
|         $D{$file} .= "\0$dir" unless m#/\.\.?$# || !-f $_ || !-x _
|             || (stat(_))[3] > 1 && $F{$file,(stat(_))[0,1]}++;
|     }
|     closedir(DIR);
| }
| grep(((($x = $D{$_}) =~ s/\0/ /g) > 1 && print "$_\t$x\n"), sort keys(%D));
| ===========================================================================
|
| Try getting that output format from a shell script...  :-)

OK, here's the one I wrote sorta blind (after watching the Perl code
spin by without paying much detail).  It prints:

  e: /bin/e,/usr/bin/e /usr/ucb/e
  mail: /bin/mail,/usr/bin/mail /usr/ucb/mail

on stock SunOS 4.1.  Note that the links are shown separated by commas.

I didn't try to get it down into as few lines as possible.  I save that
for .signatures.  :-)  I think this version is a lot clearer about what
it does.  But maybe that's because *I* wrote it.  :-)

##################################################
#!/usr/bin/perl
for $bin (split(/:/, $ENV{'PATH'})) {
    opendir(D, $bin) || next;
    for $base (readdir(D)) {
        $_ = "$bin/$base";
        next unless -x && -f _;
        $exec{$base} .= "$_\0";
    }
    closedir(D);
}
for (sort keys %exec) {
    @exec = split(/\0/, $exec{$_});
    next unless @exec > 1;
    undef %list;
    for (@exec) {
        ($dev, $ino) = stat $_;
        $list{"$dev $ino"} .= "$_\0";
    }
    @vals = sort values %list;
    next unless @vals > 1;
    print "$_:";
    for (@vals) {
        chop;
        s/\0/,/g;
        print " $_";
    }
    print "\n";
}
##################################################

Just another Perl hacker,
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/01/91)
In article <LH497M@xds2.ferranti.com> karl@ficc.ferranti.com (Karl Lehenbauer) writes:
> foreach dir [split : [getenv PATH]] { cd $dir
>   foreach file [globok *] { if [info exists files($file)] {
>       set files($file) [concat $files($file) $dir]; set dups($file) ""
>     } else { set files($file) $dir } } }
> set index "" ; while {[next_element dups index file]} {
>   print "$file => $files($file)\n" }

Ye gods.  You don't handle links, you don't extract the executables,
and this is all to avoid

  #!/bin/csh -f
  ls -ilLF $path | sed -n 's/\*$//p' | sort -u | sed 's/.* //' | sort | uniq -d

Yes, it is a portability problem that you have to remove the L on
machines without symbolic links.  Yes, the script has a few
limitations.  But it does the job, it is within 15% of optimal
efficiency on this machine, and it is far easier to write, maintain,
and understand.

---Dan
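[Editorial aside: the L flag matters here because ls -F classifies the
symlink itself, marking it '@', while with -L the link is dereferenced and
classified by its target.  A quick check, with invented scratch names,
assuming an ls with BSD-style -F/-L behavior (as in 4BSD or GNU ls):]

```shell
#!/bin/sh
# A symlink to an executable: '@' without -L, '*' with it.
tmp=`mktemp -d`
echo x > "$tmp/real"; chmod +x "$tmp/real"
ln -s real "$tmp/lnk"
ls -F  "$tmp" | grep lnk     # lnk@  (classifies the link itself)
ls -LF "$tmp" | grep lnk     # lnk*  (classifies the executable target)
rm -rf "$tmp"
```

[So without -L, an executable reachable only through a symlinked name would
never pick up the trailing '*' that the sed stage keys on.]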
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/02/91)
Summary: On a more normal machine (a DECsystem running Ultrix), Tom's
Perl script is 6% faster than my csh script.  However, it uses four
times as much memory, six times as much I/O, and more page faults.
Furthermore, where my script is just one line with six simple tools
piped together, Tom's original script has 28 lines of code, and his
most condensed (and hence incomprehensible) version is still 5 lines of
code.  Furthermore, Tom's script does not solve the problem as *he*
stated it; my script does.

So: Is it worth several times more space and several times more
programmer effort to get a 6% speedup (and fail to solve the problem)?
I think not.

In article <1991Feb01.180524.26619@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
> Ok, here are the problems with Dan's latest round of pipes:
> 1) It runs considerably slower (percentage-wise) than the perl version
>    in both system and user times.

Tom, your Convex is hardly the generic benchmark machine, and you're
cheating by not pointing out memory use.  Here are times from a
DECsystem-5820 running Ultrix 4.0:

  Tom's Perl version:  2.3u 3.8s 0:20  30% 279+267k 166+25io 6pf+0w
  My csh version:      2.1u 4.6s 0:06 100%  74+104k   0+34io 5pf+0w

A Sun produces similar results.  Well, folks, look at those times.
Tom's Perl version does run faster overall -- 0.4 seconds out of 6.7, a
6% difference.  It also uses four times as much memory, six times as
much I/O, and more page faults.

Tom, you're also ignoring the most basic human engineering factors.  My
shell script is ONE LINE: one TRIVIAL line, with six obvious tools
piped together in sequence.  Your shortest Perl version has FIVE lines
packed with unobvious, unreadable tricks and no obvious data flow; your
shortest readable version has TWENTY-EIGHT lines of code.  The shell
script is trivial to write, trivial to maintain, and trivial to
understand.  The same cannot be said of your Perl versions.  Sorry for
shouting, but you keep ignoring this issue.
Okay, folks, we have two scripts that do basically the same job.  One
of them runs 6% faster, but is five times longer and uses about four
times more memory.  How much memory do you have on your machine?  Can
you spare four times your old memory use?  Are you willing to write one
fifth as many programs just to get a 6% speed improvement?  Are you
sure that speed improvement will remain when you're swapping every two
seconds?

> 2) It doesn't list where the collisions occur, which
>    was, after all, my goal in writing it in the first place.

BIG FUCKING DEAL.  Your stupid script puts in all sorts of extra
garbage, which is against every principle of tool design.  You have
``which''; learn to use it.  Learn to write code once instead of
copying it into every single new program.

> 3) It screws up on pathological filenames.

I addressed this at length in another posting, which you missed since
you and your kill file have been acting so manly.

> 4) It gets some things truly wrong:
>   6 -rwxr--r--  1 tchrist      4542 Mar  4  1990 /mnt/tchrist/scripts/errlogd
>   6 -rwxr--r--  1 root         4865 Nov 19 16:58 /usr/adm/bin/errlogd
> There's no collision there -- I can't execute the one in /usr/adm/bin
> even with that in my path.  Dan's version thinks I can, whereas mine
> and Randal's and Chip's all know better.

I dispute that claim.  When you say ``duplicate filenames of different
executables in PATH,'' the average admin expects a useful list,
something that might point out problems or inconsistencies in the
system.  Should this list depend on which user happens to execute it
(or on which directory he happens to be in)?  I think not, for the same
reason that I don't like ``which'' output relative to the current
directory -- really for the same reason that I don't like the
system-level concept of a single current working directory.  To put it
simply: the problem statement doesn't depend on the current user, so
the solution shouldn't either.
Now I know you prefer to be literal rather than sensible when being
sensible would lose an argument.  But if we take your original problem
definition, YOUR script is truly WRONG.  YOU defined the problem as
only talking about ``executables,'' not ``things that the current user
can execute.''  /usr/adm/bin/errlogd is an executable, whether you can
execute it or not.

---Dan
tchrist@convex.COM (Tom Christiansen) (02/02/91)
Gods, it's not enough that I should kill the guy -- two people have
actually *MAILED* me his last posting.  ***Sigh.***

Ok, here are the problems with Dan's latest round of pipes:

1) It runs considerably slower (percentage-wise) than the perl version
   in both system and user times.  Witness:

     $ time perl clash.tom > /dev/null
     5.286052 real 2.546462 user 2.135102 sys
     $ time csh -f clash.dan > /dev/null
     8.468490 real 4.665696 user 3.077168 sys

2) It doesn't list where the collisions occur, which was, after all,
   my goal in writing it in the first place.

3) It screws up on pathological filenames.

4) It gets some things truly wrong:

     6 -rwxr--r--  1 tchrist      4542 Mar  4  1990 /mnt/tchrist/scripts/errlogd
     6 -rwxr--r--  1 root         4865 Nov 19 16:58 /usr/adm/bin/errlogd

   There's no collision there -- I can't execute the one in
   /usr/adm/bin even with that in my path.  Dan's version thinks I can,
   whereas mine and Randal's and Chip's all know better.  I didn't
   check the Tcl version.

I can't figure out what to do with this thread.  I really wish it would
die.  [So don't post to it, Tom.  Yeah, yeah.]  I readily agree that
it's cluttering this newsgroup.  I'm going to send it to
alt.religion.computers, since it's basically a religious issue and so
doesn't belong in comp.lang.perl; debating it here would be like
arguing whether women should exist in soc.women -- none too cool.  If
they don't like it there, they can redirect to alt.flame, I guess.
There is actually a thread going in alt.religion.computers already
about the pipes vs. programming approach.

--tom

PS: Go check out comp.org.usenix for Lori's amusing contest posting.
-- 
"Hey, did you hear Stallman has replaced /vmunix with /vmunix.el?  Now
he can finally have the whole O/S built-in to his editor like he always
wanted!" --me (Tom Christiansen <tchrist@convex.com>)
shaver@convex.com (Dave Shaver) (02/05/91)
Dan Bernstein writes:
>Summary: On a more normal machine (a DECsystem running Ultrix), Tom's
>Perl script is 6% faster than my csh script.
>[followed by some interesting stuff mixed with useless flames.]

Come on, guys.  Each solution has value.  Move this flame fest totally
out of alt.sources.d.  I'm sick of listening to you bicker.