[alt.sources.d] "Simple" but non-portable == useless.

chip@tct.uucp (Chip Salzenberg) (01/30/91)

According to brnstnd@kramden.acf.nyu.edu (Dan Bernstein):
>Be serious, Karl. We're talking about seven simple commands in a
>pipeline versus nearly 30 lines of perl to do the same job: namely, get
>a list of all names where there are two different executables in PATH.

So maybe Karl did it the long way.  Here's my version of pathdup in
Perl, and it's only ten lines long:

========================================================================
for $dir (split(/:/, $ENV{'PATH'})) {
	opendir(DIR, length($dir) ? $dir : ".") || next;
	while (defined($file = readdir(DIR))) {
		next if $file eq "." || $file eq ".." || !-x "$dir/$file";
		$M{$file} = 1 if defined($P{$file});
		$P{$file} .= "\0$dir/$file";
	}
	closedir(DIR);
}
grep(grep(length && (print $_, "\n"), split(/\0/,$P{$_})), sort keys(%M));
========================================================================

I even tested it.  :-)

>It's a hell of a lot simpler to let ls, sort, and uniq do the work than
>to bother traversing directories, stat()ing files, and keeping arrays of
>one thing or another by hand. Surely you agree.

Except that your version also uses rev (what's that?) and BSD-specific
options to ls.  The simplest script in the world is useless to me if
it doesn't work on my system.

In contrast, my Perl script works on all OSs on which readdir() et al
are available.  If readdir() is not available, and if the ability to
distinguish filenames with newlines is not important -- as is apparent
from your willingness to use ls -- then `cd $dir && ls -a` may be used
instead of opendir()/readdir()/closedir().
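
For what it's worth, here's a rough sketch of that substitution (untested,
and assuming a reasonably well-behaved ls); the rest of the loop is the
same as before:

for $dir (split(/:/, $ENV{'PATH'})) {
	$d = length($dir) ? $dir : ".";
	# `ls -a` via backquotes replaces opendir()/readdir()/closedir();
	# filenames containing newlines get mangled here, as noted above.
	for $file (split(/\n/, `cd $d && ls -a`)) {
		next if $file eq "." || $file eq ".." || !-x "$dir/$file";
		$M{$file} = 1 if defined($P{$file});
		$P{$file} .= "\0$dir/$file";
	}
}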

Your shell script is a marvel of tool usage.  But to me it's a
"dancing bear" script: What's amazing isn't that it's robust and
portable -- it isn't -- but that it works at all.

>Tom knows the shell script is a lot simpler ...

Somewhat simpler (not a lot), but broken for my SysV environment.
"Anything that works is better than anything that doesn't."  --me
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
   "All this is conjecture of course, since I *only* post in the nude.
    Nothing comes between me and my t.b.  Nothing."   -- Bill Coderre

tchrist@convex.COM (Tom Christiansen) (01/30/91)

From the keyboard of chip@tct.uucp (Chip Salzenberg):
:>Be serious, Karl. We're talking about seven simple commands in a
:>pipeline versus nearly 30 lines of perl to do the same job: namely, get
:>a list of all names where there are two different executables in PATH.
:
:So maybe Karl did it the long way.  Here's my version of pathdup in
:Perl, and it's only ten lines long:

[ clever perl script deleted ]

I think he was talking about my script, not Karl's.  About two-thirds of
those 30 lines were debugging, error messages, and added space for
legibility.  Since I've been criticized for verbosity for putting
in such features, they've been duly excised.

Chip's \0 stuff is sure the right way to go if you're that concerned.  Mine
may list things whose directories have spaces in them.  I'm not sure how
easy it is to discern files with \n's in them from separate entries in
Chip's output.

Here are 7 lines (although I only see 5 lines of code), and it does run
pretty fast.  Notice I don't count files linked together -- if you
comment out that code, the trailing ``|| $f{$_...'' part of line 4, the
user time is cut in half.  It'll still run much faster than the shell
script without that.  Try it -- the differences are dramatic.

1 for $d (split(/:/, $ENV{'PATH'})) {
2  next if $seen{$d}++ || $d =~ /^\.?$/ || !chdir($d) || !opendir(DIR,'.'); 
3  while (defined($_ = readdir(DIR))) {
4   $b{$_} .= " $d" unless /^\.{1,2}$/ || !(-x&&-f_) || $f{$_,(stat(_))[0,1]}++;
5  } 
6 } 
7 for (keys %b) { print $_, ":", $b{$_}, "\n" if rindex($b{$_},' '); }

But I like the $files{$file,$dev,$ino} part (as it read originally)
because I didn't want to count the same real file twice.  I've got a lot
linked in from /bin and /etc into /usr/bin and /usr/etc.  Detecting this
was one of the script's goals.  Doing that in a shell script isn't always
feasible.  What about hard links?
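
In case the packed version obscures it, here's the dev/inode idea unrolled
by itself -- just an illustrative sketch with made-up paths, not part of
the real script:

# count each real file only once by keying a hash on (device, inode)
foreach $path ('/bin/mail', '/usr/bin/mail', '/usr/ucb/mail') {
	($dev, $ino) = stat($path);
	next unless defined($dev);      # stat failed, skip it
	next if $seen{$dev, $ino}++;    # a hard link to a file we've already seen
	print "$path\n";                # first time we've hit this real file
}

That's all the ``$f{$_,(stat(_))[0,1]}++'' on line 4 is doing, with the
filename folded into the hash key as well.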

That's one of the major problems with these dancing-bear shell-script
examples (wonderful phrase, Chip!): because of the munge-until-done
strategy, it's hard to get feedback between the passes without
kluges.  Surely for some things the pipe and backquote stuff are great:
here's an old tcsh alias of mine before I got into ksh shell functions:

    alias vall	"vi '+/\!:1' `grep -l \!:1 *.[^oa]`"

I wouldn't feel obliged to write this as a script.  But for the more
complicated tasks, such as here where we're looking at all the files in
your path, finding dups, and weeding out false positives, the speed hits
and kluginess become too much to tolerate.  Maybe different people have
different tolerance levels for gross kluges and slow code.  I'm less
tolerant of the latter than the former. :-)

--tom
--
"Hey, did you hear Stallman has replaced /vmunix with /vmunix.el?  Now
 he can finally have the whole O/S built-in to his editor like he
 always wanted!" --me (Tom Christiansen <tchrist@convex.com>)

chip@tct.uucp (Chip Salzenberg) (01/31/91)

According to tchrist@convex.COM (Tom Christiansen):
>I think he was talking about my script, not Karl's.

He certainly was; sorry for the misunderstanding.

>Chip's \0 stuff is sure the right way to go if you're that concerned.

After I posted my script, I realized that _trailing_ \0 is easier to
deal with, since the simplest output format is "s/\0/\n/".  But I went
back to the leading \0 when I changed my output format.

>I'm not sure how easy it is to discern files with \n's in them from
>separate entries in Chip's output.

A directory that ends in newline could go undetected.  Obviously I
could octalize all the names before printing them, but that takes the
output loop from grep() to for(), upping the line count by three. :-)
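
For the record, that for() loop might look something like this sketch
(untested, and keyed to the %D hash used in the version later in this
article):

for (sort keys %D) {
	next unless (($x = $D{$_}) =~ s/\0/ /g) > 1;
	# octalize unprintables so an embedded newline can't fake an extra entry
	($name = $_) =~ s/([^ -~])/sprintf("\\%03o", ord($1))/ge;
	$x =~ s/([^ -~])/sprintf("\\%03o", ord($1))/ge;
	print "$name\t$x\n";
}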

>Here are 7 lines (although I only see 5 lines of code), and it does run
>pretty fast.

But you cheated by omitting closedir(DIR).

>Notice I don't count files linked together.

Ah, that feature is useful.  But your implementation skips the useful
heuristic of checking st_nlink.

Here's my latest version, still only ten lines long (with closedir()),
and with a better output format ("command dir1 dir2"):

===============================================================================
for $dir (split(/:/, $ENV{'PATH'})) {
	opendir(DIR, length($dir) ? $dir : ".") || next;
	while (defined($file = readdir(DIR))) {
		$_ = "$dir/$file";
		$D{$file} .= "\0$dir" unless m#/\.\.?$# || !-f $_ || !-x _
		    || (stat(_))[3] > 1 && $F{$file,(stat(_))[0,1]}++;
	}
	closedir(DIR);
}
grep(((($x = $D{$_}) =~ s/\0/ /g) > 1 && print "$_\t$x\n"), sort keys(%D));
===============================================================================

Try getting that output format from a shell script...  :-)
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
 "I want to mention that my opinions whether real or not are MY opinions."
             -- the inevitable William "Billy" Steinmetz

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (01/31/91)

In article <27A5C105.1019@tct.uucp> chip@tct.uucp (Chip Salzenberg) writes:
> So maybe Karl did it the long way.  Here's my version of pathdup in
> Perl, and it's only ten lines long:

Chip, you just typed ten lines of perl, comprising a few hundred
characters, to achieve basically the same effect as

  $ ls -F `echo $PATH | tr : '\012'` | sed -n 's/\*$//p' | sort | uniq -d

or, in csh,

  % ls -F $path | sed -n 's/\*$//p' | sort | uniq -d

Are you sure you're spending your programming time wisely?

---Dan

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (01/31/91)

Tom and Chip now have packed 7-line and 10-line Perl scripts to achieve
the effect of a half-line shell script. This comes after Tom's 36-line
Perl script to accomplish the same thing as a 20-token, 100-character
shell script. (At the end of this article I develop, step by step, an
even simpler pipeline for the repeats-in-path problem).

Folks, the tools available from the shell are simply more powerful than
Perl's function calls. They don't cover every situation, but when they
do the job cleanly, they provide much more compact solutions than Perl.

In article <1991Jan30.055113.26485@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
> It'll still run much faster than the shell
> script without that.  Try it -- the differnces are dramatic.

This is an outright lie. ls takes by far the bulk of the time in the
shell scripts I've posted; ls's file tree walk code has been heavily
optimized over the years; perl doesn't run faster than ls. I suspect Tom
is comparing ls *with* -Fl to his script *without* stat or -x, and
that's cheating.

> But I like the $files{$file,$dev,$ino} part (as it read originally)
> because I didn't want to count the same real file twice.

This, Chip, is what you missed. Tom's original 36-line shell script does
this; my 20-token shell script does it.

> because of the munge-until-done
> strategy,

A shell script is not munged. A shell script grows. Take a sample
problem: Find all names that correspond to two different executables in
$path (csh). To build a shell script for this, imagine the data flowing
from one filter to another. You need to look at all executables in
$path? Fine, start with full information about every file in $path:

  ls -ilF $path   (use L to go through symbolic links)

Now extract the executables. ls marks them with a trailing *:

  ls -ilF $path | sed -n 's/\*$//p'

Now eliminate repeated lines that talk about the same file. One utility
both brings lines together and eliminates duplicates:

  ls -ilF $path | sed -n 's/\*$//p' | sort -u

Now strip the extra information, leaving just the filenames of all
different executables in $path:

  ls -ilF $path | sed -n 's/\*$//p' | sort -u | sed 's/.* //'

Finally, sort into order and extract the repetitions:

  ls -ilF $path | sed -n 's/\*$//p' | sort -u | sed 's/.* //' | sort | uniq -d

Compare this simple script to Tom's original 36-line monster. Was this
really so hard to develop? Is ``munge until done'' really an accurate
description of data-flow programming? Is this script such a maintenance
nightmare? (To port it to sh, for instance, you replace $path with the
standard invocation `echo $PATH | tr : '\012'`. Is that so painful?)

This is just one of many strategies for attacking the problem. You build
shell scripts naturally out of the data available and out of what you
can transform that data into. I daresay there is no such simple strategy
in Perl.

---Dan

karl@ficc.ferranti.com (Karl Lehenbauer) (01/31/91)

In article <1991Jan30.055113.26485@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>I think he was talking about my script, not Karl's.  About two-thirds of
>those 30 lines were debugging, error messages, and added space for
>legibility.  Since I've been criticized for verbosity for putting
>in such features, they've been duly excised.

Yes, my Tcl script contained 15 lines of code. There were 21 comments and blank
lines to improve readability and maintainability.  ("Though a program be but
three lines long, someday it will have to be maintained." -- Tao of Programming)

Below, I have crushed it into six lines just to show it could be done.  Yes, 
it still works.  You can get it down to one line by grouping the remaining 
lines together with semicolons where the lines don't end with open curly
brackets.  Compare this to the alt.sources posting of pathdups.tcl, and consider
in general the hacked-out shell scripts you've seen; I think most people
will agree that
	least_number_of_lines_of_code != most_easily_understood_program

foreach dir [split : [getenv PATH]] { cd $dir
  foreach file [globok *] { if [info exists files($file)] {
      set files($file) [concat $files($file) $dir]; set dups($file) ""
    } else { set files($file) $dir } } }
set index "" ; while {[next_element dups index file]} {
	print "$file => $files($file)\n" }
-- 
-- "If it ain't too broke, don't fix it."  -- me, with apologies to Bert Lantz

merlyn@iwarp.intel.com (Randal L. Schwartz) (02/01/91)

In article <27A72A04.3BCA@tct.uucp>, chip@tct (Chip Salzenberg) writes:
| Here's my latest version, still only ten lines long (with closedir()),
| and with a better output format ("command dir1 dir2"):
| 
| ===============================================================================
| for $dir (split(/:/, $ENV{'PATH'})) {
| 	opendir(DIR, length($dir) ? $dir : ".") || next;
| 	while (defined($file = readdir(DIR))) {
| 		$_ = "$dir/$file";
| 		$D{$file} .= "\0$dir" unless m#/\.\.?$# || !-f $_ || !-x _
| 		    || (stat(_))[3] > 1 && $F{$file,(stat(_))[0,1]}++;
| 	}
| 	closedir(DIR);
| }
| grep(((($x = $D{$_}) =~ s/\0/ /g) > 1 && print "$_\t$x\n"), sort keys(%D));
| ===============================================================================
| 
| Try getting that output format from a shell script...  :-)

OK, here's the one I wrote sorta blind (after watching the Perl code
spin by without paying much attention to the details).  It prints:

	e: /bin/e,/usr/bin/e /usr/ucb/e
	mail: /bin/mail,/usr/bin/mail /usr/ucb/mail

on stock SunOS 4.1.  Note that the links are shown separated by commas.
I didn't try to get it down into as few lines as possible.  I save
that for .signatures. :-)

I think this version is a lot clearer about what it does.  But maybe
that's because *I* wrote it. :-)

##################################################
#!/usr/bin/perl

for $bin (split(/:/, $ENV{'PATH'})) {
	opendir(D,$bin) || next;
	for $base (readdir(D)) {
		$_ = "$bin/$base";
		next unless -x && -f _;
		$exec{$base} .= "$_\0";
	}
	closedir(D);
}

for (sort keys %exec) {
	@exec = split(/\0/,$exec{$_});
	next unless @exec > 1;
	undef %list;
	for (@exec) {
		($dev, $ino) = stat $_;
		$list{"$dev $ino"} .= "$_\0";
	}
	@vals = sort values %list;
	next unless @vals > 1;
	print "$_:";
	for (@vals) {
		chop;
		s/\0/,/g;
		print " $_";
	}
	print "\n";
}
##################################################

Just another Perl hacker,
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/01/91)

In article <LH497M@xds2.ferranti.com> karl@ficc.ferranti.com (Karl Lehenbauer) writes:
> foreach dir [split : [getenv PATH]] { cd $dir
>   foreach file [globok *] { if [info exists files($file)] {
>       set files($file) [concat $files($file) $dir]; set dups($file) ""
>     } else { set files($file) $dir } } }
> set index "" ; while {[next_element dups index file]} {
> 	print "$file => $files($file)\n" }

Ye gods. You don't handle links, you don't extract the executables, and
this is all to avoid

  #!/bin/csh -f
  ls -ilLF $path | sed -n 's/\*$//p' | sort -u | sed 's/.* //' | sort | uniq -d

Yes, it is a portability problem that you have to remove the L on
machines without symbolic links. Yes, the script has a few limitations.
But it does the job, it is within 15% of optimal efficiency on this
machine, and it is far easier to write, maintain, and understand.

---Dan

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/02/91)

Summary: On a more normal machine (a DECsystem running Ultrix), Tom's
Perl script is 6% faster than my csh script. However, it uses four times
as much memory, six times as much I/O, and more page faults. Furthermore,
where my script is just one line with six simple tools piped together,
Tom's original script has 28 lines of code, and his most condensed (and
hence incomprehensible) version is still 5 lines of code. Furthermore,
Tom's script does not solve the problem as *he* stated it; my script
does.

So: Is it worth several times more space and several times more
programmer effort to get a 6% speedup (and fail to solve the problem)?
I think not.

In article <1991Feb01.180524.26619@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
> Ok, here are the problems with Dan's latest round of pipes:
> 1) It runs considerably slower (percentage-wise) than the perl version 
>    in both system and user times.

Tom, your Convex is hardly the generic benchmark machine, and you're
cheating by not pointing out memory use. Here are times from a
DECsystem-5820 running Ultrix 4.0.

  Tom's Perl version: 2.3u 3.8s 0:20 30% 279+267k 166+25io 6pf+0w
  My csh version: 2.1u 4.6s 0:06 100% 74+104k 0+34io 5pf+0w

A Sun produces similar results.

Well, folks, look at those times. Tom's Perl version does run faster
overall---0.4 seconds out of 6.7, a 6% difference. It also uses four
times as much memory, six times as much I/O, and more page faults.

Tom, you're also ignoring the most basic human engineering factors. My
shell script is ONE LINE: one TRIVIAL line, with six obvious tools piped
together in sequence. Your shortest Perl version has FIVE lines packed
with unobvious, unreadable tricks and no obvious data flow; your
shortest readable version has TWENTY-EIGHT lines of code. The shell
script is trivial to write, trivial to maintain, and trivial to
understand. The same cannot be said of your Perl versions. Sorry for
shouting, but you keep ignoring this issue.

Okay, folks, we have two scripts that do basically the same job. One of
them runs 6% faster, but is five times longer and uses about four times
more memory. How much memory do you have on your machine? Can you spare
four times your old memory use? Are you willing to write one fifth as
many programs just to get a 6% speed improvement? Are you sure that
speed improvement will remain when you're swapping every two seconds?

> 2) It doesn't list where the collisions occur, which 
>    was, after all, my goal in writing it in the first place.

BIG FUCKING DEAL. Your stupid script puts in all sorts of extra garbage,
which is against every principle of tool design. You have ``which'';
learn to use it. Learn to write code once instead of copying it into
every single new program.

> 3) It screws up on pathological filenames.

I addressed this at length in another posting, which you missed since
you and your kill file have been acting so manly.

> 4) It gets some things truly wrong: 
>    6 -rwxr--r--  1 tchrist      4542 Mar  4  1990 /mnt/tchrist/scripts/errlogd
>    6 -rwxr--r--  1 root         4865 Nov 19 16:58 /usr/adm/bin/errlogd
>    There's no collision there -- I can't execute the one in /usr/adm/bin
>    even with that in my path.  Dan's version thinks I can, whereas mine 
>    and Randal's and Chip's all know better.

I dispute that claim. When you say ``duplicate filenames of different
executables in PATH,'' the average admin expects a useful list,
something that might point out problems or inconsistencies in the
system. Should this list depend on which user happens to execute it (or
on which directory he happens to be in)? I think not, for the same
reason that I don't like ``which'' output relative to the current
directory---really for the same reason that I don't like the
system-level concept of a single current working directory. To put it
simply: The problem statement doesn't depend on the current user, so the
solution shouldn't either.

Now I know you prefer to be literal rather than sensible when being
sensible would lose an argument. But if we take your original problem
definition, YOUR script is truly WRONG. YOU defined the problem as only
talking about ``executables,'' not ``things that the current user can
execute.'' /usr/adm/bin/errlogd is an executable, whether you can
execute it or not.

---Dan

tchrist@convex.COM (Tom Christiansen) (02/02/91)

Gods, it's not enough that I should kill the guy -- two people have
actually *MAILED* me his last posting.  ***Sigh.***

Ok, here are the problems with Dan's latest round of pipes:

1) It runs considerably slower (percentage-wise) than the perl version 
   in both system and user times.  Witness:

    $ time perl clash.tom > /dev/null
        5.286052 real        2.546462 user        2.135102 sys
    $ time csh -f clash.dan > /dev/null
        8.468490 real        4.665696 user        3.077168 sys

2) It doesn't list where the collisions occur, which 
   was, after all, my goal in writing it in the first place.

3) It screws up on pathological filenames.

4) It gets some things truly wrong: 

   6 -rwxr--r--  1 tchrist      4542 Mar  4  1990 /mnt/tchrist/scripts/errlogd
   6 -rwxr--r--  1 root         4865 Nov 19 16:58 /usr/adm/bin/errlogd

   There's no collision there -- I can't execute the one in /usr/adm/bin
   even with that in my path.  Dan's version thinks I can, whereas mine 
   and Randal's and Chip's all know better.  I didn't check the TCL version.

I can't figure out what to do with this thread.  I really wish it would
die.  [So don't post to it, Tom.  Yeah, yeah.]  I readily agree that it's
cluttering this newsgroup.  I'm going to send it to alt.religion.computers, 
since it's basically a religious issue, it doesn't belong in comp.lang.perl;
sending it there would be like arguing in soc.women about whether women
should exist -- none too cool.  If they don't like it in
alt.religion.computers, they can redirect to alt.flame, I guess.  There is
already a thread going in alt.religion.computers about the pipes vs
programming approach.

--tom

PS:  Go check out comp.org.usenix for Lori's amusing contest posting.  
--
"Hey, did you hear Stallman has replaced /vmunix with /vmunix.el?  Now
 he can finally have the whole O/S built-in to his editor like he
 always wanted!" --me (Tom Christiansen <tchrist@convex.com>)

shaver@convex.com (Dave Shaver) (02/05/91)

Dan Bernstein writes:
>Summary: On a more normal machine (a DECsystem running Ultrix), Tom's
>Perl script is 6% faster than my csh script.
>[followed by some interesting stuff mixed with useless flames.]

Come on, guys.  Each solution has value.  Move this flame fest totally
out of alt.sources.d.  I'm sick of listening to you bicker.