[comp.unix.wizards] cmp

rbj@cmr.icst.nbs.gov (Root Boy Jim) (06/17/88)

? > While we're on the subject of efficiency, cmp is coded wrong. It should
? > first stat the two files to be compared. If the character count is different,
? > so are the files. And files tend to be different more often than the same.
? 
? This is *only* useful if the "-s" flag is being used, and it's not
? reading from stdin.  How often does that happen?  Isn't common usage
? "cmp foo bar", not "cmp -s foo bar"?
? 
? --keith

Well you have a good point, and force me to restate my case. Mostly
what I want to know is whether they are the same or different, not where,
and I want it to speak if different and be silent if they are identical.
Thus, I want the equivalent of a command like so:

	alias compare 'cmp -s \!^ \!$ || echo \!^ and \!$ differ'

Of course I am free to implement such a command if I wish. And often
times it would be much faster than vanilla cmp.  Once again we have
proved that the conclusions we reach depend on the premises we take
for granted.
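
A minimal sketch of the stat-first check in C, assuming a POSIX-style
stat() and the S_ISREG macro (an older system would test st_mode against
S_IFMT instead). The name sizes_differ is made up for illustration and is
not part of any cmp. It only answers whether the sizes already prove a
difference; equal sizes still mean the contents have to be read.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not taken from any real cmp.
 * Returns  1 if both are regular files of different sizes (so they differ),
 *          0 if the sizes match or size is not meaningful (pipe, tty, ...),
 *         -1 if either file cannot be stat'ed.
 */
int sizes_differ(const char *path_a, const char *path_b)
{
    struct stat sa, sb;

    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0)
        return -1;

    /* st_size says nothing useful for non-regular files; the caller
       has to fall back to reading the data in that case. */
    if (!S_ISREG(sa.st_mode) || !S_ISREG(sb.st_mode))
        return 0;

    return sa.st_size != sb.st_size;
}
```

As noted in the quoted reply, this shortcut only pays off in the silent
same-or-different case (as with cmp -s) and when neither input is stdin.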

This whole subject started from context diffs, so since I have your
ear bent I will bend it a bit more. Actually, this has to do with
recursive diffs.

1) Symlinks. Suppose I have two trees, say /usr/include and /usr/outclude,
   but each has a symlink sys pointing to the same place. Do I really
   want to follow them just to say the subtrees are identical? We can't
   use -h (how did -h get to stand for symlink anyway?), but maybe we
   can define YAO to {not,} follow symlinks.

2) Suppose they point to different places? Follow or not, depending on option.

3) This sets us up for removing identical files in the first tree. I wrote
   a (n UGLY) script which does this; it is nontrivial and slow. Of course
   it would be aided and abetted by an "if -l file echo file is symlink"
   switch in csh. Maybe "if -h file..." :-)

4) Diff -r prints "Binary files X and Y differ", but always does a diff
   on source files. Often I would prefer "Source files X and Y differ".
   YAO, -q sez "just tell me they're different". More elegantly/foolishly,
   -p prog sez which prog to run instead of diff. I prefer -q.

5) Kudos for carefully distinguishing output messages by types:
   "Only in ...", "Files ... are identical", "Binary ... differ", etc.
   This allows me to "diff -r | grep ^X" where X is one of B, F, O.
   The addition of "Source files ... differ" would also be unique.
   A tiny nit: Perhaps the "Only" messages should say "Old file: $1/name"
   and "New file: $2/name". I'm not entirely satisfied with those
   messages, but you get the idea.

6) Another symlink glitch: ls -F prints symlinks to directories with a
   trailing '/', so there is no easy way to distinguish them from real
   directories. How about a trailing '\' instead? More abstruse is
   printing a symlink that points to nowhere with a trailing '?', but
   I don't really care about that and it is extra work to do. The former
   idea I have come to cherish since I hacked it in tho.

7) I don't ask for much, do I :-?
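
The symlink tests wished for in items 3 and 6 come down to comparing
lstat() with stat(). A rough C sketch, assuming a system that has lstat()
and S_ISLNK; the function name symlink_kind and its return codes are
invented here for illustration, not an existing csh or ls feature.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper.
 * Returns 0 if path is not a symlink (or does not exist),
 *         1 if path is a symlink whose target can be stat'ed,
 *         2 if path is a symlink that points nowhere.
 */
int symlink_kind(const char *path)
{
    struct stat link_info, target_info;

    /* lstat() describes the link itself rather than what it points to. */
    if (lstat(path, &link_info) != 0 || !S_ISLNK(link_info.st_mode))
        return 0;

    /* stat() follows the link, so it fails when the target is missing. */
    return stat(path, &target_info) == 0 ? 1 : 2;
}
```

A modified ls -F could print '\' for a result of 1 on a directory target
and '?' for a result of 2, along the lines suggested in item 6.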

	(Root Boy) Jim Cottrell	<rbj@icst-cmr.arpa>
	National Bureau of Standards
	Flamer's Hotline: (301) 975-5688
	The opinions expressed are solely my own
	and do not reflect NBS policy or agreement
	Careful with that VAX Eugene!
	I'm having a BIG BANG THEORY!!

andrew@alice.UUCP (06/17/88)

on most implementations, cmp has two getchars (or getc's) in the inner loop.
we got a factor of five improvement by reading in blocks and using
memcmp.
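
for illustration, a rough sketch of the block-read-plus-memcmp loop in C.
this is not the code referred to above: the function name files_differ and
the 8192-byte block size are assumptions, and it only reports same or
different, without the byte offset and line number that cmp normally prints.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8192   /* arbitrary choice for this sketch */

/*
 * Hypothetical helper.
 * Returns 0 if the files are identical, 1 if they differ,
 * -1 if a file cannot be opened or a read error occurs.
 */
int files_differ(const char *path_a, const char *path_b)
{
    static char buf_a[BLOCK_SIZE], buf_b[BLOCK_SIZE];
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    size_t na, nb;
    int result = 0;

    if (fa == NULL || fb == NULL) {
        if (fa != NULL) fclose(fa);
        if (fb != NULL) fclose(fb);
        return -1;
    }

    do {
        /* One read and one memcmp per block instead of a getc pair per byte. */
        na = fread(buf_a, 1, BLOCK_SIZE, fa);
        nb = fread(buf_b, 1, BLOCK_SIZE, fb);
        if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
            result = 1;
            break;
        }
    } while (na == BLOCK_SIZE);   /* a short block means end of file */

    if (ferror(fa) || ferror(fb))
        result = -1;

    fclose(fa);
    fclose(fb);
    return result;
}
```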

jaw@eos.UUCP (James A. Woods) (06/22/88)

From article <7993@alice.UUCP>, by andrew@alice.UUCP:
> 
> 
> on most implementations, cmp has two getchars (or getc's) in the inner loop.
> we got a factor of five improvement by reading in blocks and using
> memcmp.

the 'cmp' on the cray two here is one of the rare unix commands which
is vectorized, though some poor soul had to code the loop as a fortran
subroutine (no vector C here).  it doesn't count lines (though nobody
cares, i don't believe this is inherently nonvectorizable).
you might think a regular byte-oriented 'cmp' would be i/o bound on
such a beast -- not true by a long shot; incidentally this was one motivation
for my development of boyer/moore/gosper 'egrep' two years back.