[comp.unix.questions] Unix diff

@RELAY.CS.NET,@humus.huji.ac.IL:amoss@batata.bitne (Amos Shapira) (01/13/88)

Does anyone know where to find a description of the algorithm
used in Unix diff(1)?
I know that it was invented by someone named Harold Stone, nothing more.

Thanks in advance,
--Amos Shapira

P.S.	If you think that there is another newsgroup that might
	be able to help me, please send me a message.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/13/88)

In article <11229@brl-adm.ARPA> @RELAY.CS.NET,@humus.huji.ac.IL:amoss@batata.bitne (Amos Shapira) writes:
>Does anyone know where to find a description of the algorithm
>used in Unix diff(1)?

It is described in Bell Labs CSTR #41, "An Algorithm for Differential
File Comparison" by J. W. Hunt and M. D. McIlroy (June 1976).
Unfortunately this seems to be out of print.  I think there were some
errors in the report, too.  I attempted to rewrite "diff" once, and
found some minor speed improvements that could be made to the stock
UNIX version.  The comments in the UNIX version were fairly accurate
and complete, and they're probably more useful than the CSTR anyway.

I seem to recall a CACM article some time in the past several years
that presented a supposedly better diff algorithm, and it may have
also discussed the UNIX diff algorithm.

g-rh@cca.CCA.COM (Richard Harter) (01/14/88)

In article <7068@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <11229@brl-adm.ARPA> @RELAY.CS.NET,@humus.huji.ac.IL:amoss@batata.bitne (Amos Shapira) writes:
>>Does anyone know where to find a description of the algorithm
>>used in Unix diff(1)?
>
>It is described in Bell Labs CSTR #41, "An Algorithm for Differential
>File Comparison" by J. W. Hunt and M. D. McIlroy (June 1976).
>Unfortunately this seems to be out of print.  I think there were some
>errors in the report, too.  I attempted to rewrite "diff" once, and
>found some minor speed improvements that could be made to the stock
>UNIX version.  The comments in the UNIX version were fairly accurate
>and complete, and they're probably more useful than the CSTR anyway.
>
>I seem to recall a CACM article some time in the past several years
>that presented a supposedly better diff algorithm, and it may have
>also discussed the UNIX diff algorithm.

	I used to have a copy of the Bell report, which of course had
disappeared when I had occasion to actually write a diff program.  My
recollection is that the original is rather obscure.  However, the
essential idea is fairly simple, and has four steps:

(1)	Hash each line in each file to produce an array of integers
	(a sketch of one possible line hash follows this list).
(2)	Sort the two arrays of hash integers, retaining pointers back
	to the original lines.
(3)	Compare the two lists and extract the longest common ascending
	subsequence.
(4)	Relate the l.a.s. back to the original files.
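
A minimal sketch of step (1) in C, assuming a simple multiplicative
hash over each line (the hash diff itself uses may well differ):

#include <stdio.h>

/* Hash one line into an unsigned integer.  Purely illustrative;
   any reasonable string hash would do. */
unsigned long
hashline(const char *line)
{
	unsigned long h = 0;

	while (*line != '\0' && *line != '\n')
		h = h * 31 + (unsigned char)*line++;
	return h;
}

/* Produce one hash per line of a file; returns the number of lines.
   (Lines longer than BUFSIZ get split here; a real program would cope.) */
long
hashfile(FILE *fp, unsigned long *hashes, long max)
{
	char buf[BUFSIZ];
	long n = 0;

	while (n < max && fgets(buf, sizeof buf, fp) != NULL)
		hashes[n++] = hashline(buf);
	return n;
}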

	There is a straightforward algorithm which is mathematically
rather pretty for step three -- if the hash integers are all unique.
The tricky part is what to do when there are duplicates, i.e. multiple
occurrences of the same text line in the file.  The solution I used was
to ignore the possibility of mis-assignment of duplicates during the
l.a.s. pass and do a gap fill afterwards; I don't know how diff handles this.

	This was written as part of a larger program, so I don't have
a comparison of 'mydiff' versus diff.  However I did do timing analyses
on 'mydiff'.  It turns out that the critical part of the code is the
sort.  In this case, the arrays being sorted are uniformly distributed
numbers, and a histogram math sort wins, since you can write an O(n)
sort.  The l.a.s. code can be made O(n) in space and time.  The gap
fill code is very fast but is tricky to get right.

	The sort that I wrote, by the way, has a line of code that
breaks the optimizer on the Masscomp C compiler (unless they have
finally gotten a fix in on it).  It runs as follows:

for (i=0;i<n;i++) y[c[*ir1++]++] = *ir2++;

That little gem is a one pass sort loop!  (I won't try to explain it.)
Apparently this particular optimizer gets confused about the levels
of indirection.
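
For context, here is roughly the shape of histogram (counting) sort
that loop belongs to -- a sketch only; the names y, c, ir1 and ir2
follow the fragment above, everything else is assumed:

/* Histogram ("math") sort of n records whose keys lie in 0..k-1.
   ir1[] holds the keys, ir2[] the records, y[] receives the records
   in key order, and c[] is scratch space of size k. */
void
histsort(int *ir1, int *ir2, int *y, int *c, int n, int k)
{
	int i, cnt, tot;

	for (i = 0; i < k; i++)		/* clear the histogram */
		c[i] = 0;
	for (i = 0; i < n; i++)		/* count each key */
		c[ir1[i]]++;
	for (i = 0, tot = 0; i < k; i++) {
		cnt = c[i];		/* exclusive prefix sums: c[i] becomes */
		c[i] = tot;		/* the first output slot for key i     */
		tot += cnt;
	}
	for (i = 0; i < n; i++)		/* the one-pass distribution loop */
		y[c[*ir1++]++] = *ir2++;
}

After the prefix-sum pass, c[] holds the first output slot for each
key, which is all the one-line distribution pass needs.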

	In the original report, I believe that they assumed that
the probability of two lines hashing to the same hash value was so
small that one didn't have to go back afterwards and check.  I don't
know if this assumption is still retained in diff.  The line hashing
(and the recheck, if you do it) takes a significant amount of time but is
still dominated by the sort time.
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

ok@quintus.UUCP (Richard A. O'Keefe) (01/14/88)

There is a public domain version of diff(1) coded in C that you can
get from a comp.sources.misc archive.

	Subject: v02i001: Public domain diff
	Keywords: diff
	X-Archive: comp.sources.misc/8801/1
	Comp.sources.misc: Volume 2, Issue 1
	Archive-Name: pd-diff
	Submitted-By: blarson@skat.usc.edu (Bob Larson)

I believe it started life as part of the DECUS goodies.  I have it,
but haven't installed it yet.

The trouble with diff(1) is that it assumes that the only changes are
insertions and deletions.  So if you swap two functions, it reports
two changes.  The method which appeared in CACM can spot moves as well
as insertions and deletions.  I have a Pascal implementation of that
method, which works well enough to use but not well enough to publish.
There are some things you need that the article didn't explain.

jaw@aurora.UUCP (James A. Woods) (01/15/88)

also investigate the webb miller / eugene myers code presented in
"a file comparison program", software practice and experience,
november 1985. 

they (miller was acm algorithms editor, now at univ. of arizona)
claim a factor of four speed increase over standard 'diff' for
typical files.

schwartz@gondor.cs.psu.edu (Scott E. Schwartz) (01/16/88)

In article <1571@aurora.UUCP> jaw@aurora.UUCP (James A. Woods) writes:
>also investigate the webb miller / eugene myers code presented in
>"a file comparison program", software practice and experience,
>november 1985. 
>
>they (miller was acm algorithms editor, now at univ. of arizona)
>claim a factor of four speed increase over standard 'diff' for
>typical files.

Webb is actually at Penn State.  Gene Myers is at Arizona, except that
he is here this year.

Recently they came up with an enhancement to their algorithm
("Haslock's refinement") that doubles (in the worst case) the speed
of the original algorithm.  Best case (in which all edits are insertions
or all deletions) runs in linear time.

To be fair, there are cases where diff is faster.  Diff does best when
the longest common subsequence is short, i.e. lots of diffs.  Fcomp
(as they call it) does best when the edit script is short, i.e. few diffs.


-- Scott Schwartz            schwartz@gondor.cs.psu.edu

g-rh@cca.CCA.COM (Richard Harter) (01/16/88)

In article <3219@psuvax1.psu.edu> schwartz@gondor.cs.psu.edu (Scott E. Schwartz) writes:
>>also investigate the webb miller / eugene myers code presented in
>
>Webb is actually at Penn State.  Gene Myers is at Arizona, except that
>he is here this year.
>
>Recently they came up with an enhancement to their algorithm
>("Haslock's refinement") that doubles (in the worst case) the speed
>of the original algorithm.  Best case (in which all edits are insertions
>or all deletions) runs in linear time.
>
>To be fair, there are cases where diff is faster.  Diff does best when
>the longest common subsequence is short, i.e. lots of diffs.  Fcomp
>(as they call it) does best when the edit script is short, i.e. few diffs.

But, but, but...  Finding the longest common subsequence is a linear
process.  Or at least it is sufficiently close to being linear that
it makes no difference.  I don't know how diff does it or fcomp does
it, but the algorithm that I use builds an ordered list of intermediate
subsequence heads.  Each new item goes onto a list head.  The trick to
making it quasi-linear is to do an interpolation search when placing
a new item into the list of intermediates.  [This isn't really linear,
but, given the conditions, might as well be.]

As I remarked earlier, my experience is that the time to find the 
longest common subsequence is dominated by the sort time, even if a
linear sort is being used.  (By about a factor of 4 to 1, as I recall.)
Am I missing something?  Do diff and/or fcomp bypass the sort?
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

blm@cxsea.UUCP (01/16/88)

James A. Woods (jaw@aurora.UUCP) writes:
|also investigate the webb miller / eugene myers code presented in
|"a file comparison program", software practice and experience,
|november 1985. 
|they (miller was acm algorithms editor, now at univ. of arizona)
|claim a factor of four speed increase over standard 'diff' for
|typical files.


I've got a copy of this article, and entered the C implementation
included in the article (OK, I had the secretary here enter it).
Anyway, it does appear to be quite a bit faster than diff (the System V,
Release 2 diff).  The algorithm is a nice one, but the
implementation would need to be tweaked some before I would consider it
usable every day.  (Note that this should not be construed as a flame
against the authors of the article.  The implementation is
straightforward, and demonstrates the algorithm nicely.)  For instance,
it uses a LOT of memory, and when I profiled it, about 94% of the time
was spent in malloc!

One of the things I found most interesting about the algorithm is that it
is basically the same algorithm that Gosling Emacs uses to calculate
optimal screen updates!

If someone plans on implementing a public-domain diff supporting -b and
-c, this algorithm is definitely worth looking at.  Just don't use the
implementation in the article as is; expect to spend a lot of time
optimizing it and decreasing its memory usage.

-- 
Brian L. Matthews                               "A power tool is not a toy.
...{mnetor,uw-beaver!ssc-vax}!cxsea!blm          Unix is a power tool."
+1 206 251 6811
Computer X Inc. - a division of Motorola New Enterprises

chip@ateng.UUCP (Chip Salzenberg) (01/19/88)

In article <532@cresswell.quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The trouble with diff(1) is that it assumes that the only changes are
>insertions and deletions.  So if you swap two functions, it reports
>two changes.  The method which appeared in CACM can spot moves as well
>as insertions and deletions.

The "hdiff" program does move detection.  It's in the comp.sources.unix
archives.
-- 
Chip Salzenberg                 UUCP: "{codas,uunet}!ateng!chip"
A T Engineering                 My employer's opinions are a trade secret.
    "Anything that works is better than anything that doesn't."  -- me

schwartz@gondor.cs.psu.edu (Scott E. Schwartz) (01/23/88)

In article <23307@cca.CCA.COM> g-rh@CCA.CCA.COM.UUCP (Richard Harter) writes:
>But, but, but...  Finding the longest common subsequence is a linear
>process.  Or at least it is sufficiently close to being linear that
>it makes no difference.  I don't know how diff does it or fcomp does
>it, but the algorithm that I use builds an ordered list of intermediate
>subsequence heads.  Each new item goes onto a list head.  The trick to
>making it quasi-linear is to do an interpolation search when placing
>a new item into the list of intermediates.  [This isn't really linear,
>but, given the conditions, might as well be.]

I don't know if finding the longest common subsequence is a linear process
in general.  All of the published algorithms I've seen have some sort of
dependency on the input data that stops one from making that claim.

Which conditions are you referring to?  Also, when you say quasi-linear,
what does that really mean?  Diff runs in O((n+r) log n) time, where
n is the size of the input, and r is the number of pairs of positions
in which the two sequences match (cf. Hunt & Szymanski).

>As I remarked earlier, my experience is that the time to find the 
>longest common subsequence is dominated by the sort time, even if a
>linear sort is being used.  (By about a factor of 4 to 1, as I recall.)
>Am I missing something?  Do diff and/or fcomp bypass the sort?

Diff does a sort, so you are basically correct there.  Fcomp does no
sorting at all, and uses a completely different algorithm in any case.


-- Scott Schwartz            schwartz@gondor.cs.psu.edu

g-rh@cca.UUCP (01/23/88)

In article <3238@psuvax1.psu.edu> schwartz@gondor.cs.psu.edu (Scott E. Schwartz) writes:
>In article <23307@cca.CCA.COM> g-rh@CCA.CCA.COM.UUCP (Richard Harter) writes:
>>But, but, but...  Finding the longest common subsequence is a linear
>>process.  Or at least it is sufficiently close to being linear that
>>it makes no difference.  I don't know how diff does it or fcomp does
>>it, but the algorithm that I use builds an ordered list of intermediate
>>subsequence heads.  Each new item goes onto a list head.  The trick to
>>making it quasi-linear is to do an interpolation search when placing
>>a new item into the list of intermediates.  [This isn't really linear,
>>but, given the conditions, might as well be.]
>
>I don't know if finding the longest common subsequence is a linear process
>in general.  All of the published algorithms I've seen have some sort of
>dependency on the input data that stops one from making that claim.
>
>Which conditions are you referring to?  Also, when you say quasi-linear,
>what does that really mean?  Diff runs in O((n+r) log n) time, where
>n is the size of the input, and r is the number of pairs of positions
>in which the two sequences match (cf. Hunt & Szymanski).
>
	Since what I use may not be the same as diff, and may be an
unpublished algorithm, I will run through the steps and take it from
there.

(1)	Hash each line in each file into an integer to produce two
arrays of integers, representing the files.  Also create an index
array for each file array.

(2)	Sort the file arrays by permuting the index arrays.

(3)	Go through the index arrays to get two arrays of matches
(a rough sketch in C follows this list).

(4)	Sort the match list for the old file, and permute the match
list for the new file to correspond.

(5)	Extract the longest ascending subsequence of the match list
for the new file.  This is the longest common subsequence of the two
files.

(6)	Make a backfill pass to account for identical lines that were
misaligned.

(7)	Convert the l.c.s. to difference format.
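
A rough C sketch of step (3), with assumed names and with duplicate
hash values paired naively (the backfill of step (6) is what would
clean up any mis-pairings); this is an illustration, not the actual code:

/* Walk the two index arrays, which have been permuted so that
   hasha[idxa[i]] and hashb[idxb[j]] are ascending, and emit pairs of
   (old line, new line) whose hashes are equal. */
long
matchpass(unsigned long *hasha, long *idxa, long na,
	  unsigned long *hashb, long *idxb, long nb,
	  long *moldline, long *mnewline)
{
	long i = 0, j = 0, nmatch = 0;

	while (i < na && j < nb) {
		unsigned long ha = hasha[idxa[i]];
		unsigned long hb = hashb[idxb[j]];

		if (ha < hb)
			i++;
		else if (ha > hb)
			j++;
		else {
			/* equal hashes: record a match, advance both sides */
			moldline[nmatch] = idxa[i++];
			mnewline[nmatch] = idxb[j++];
			nmatch++;
		}
	}
	return nmatch;		/* number of matched line pairs */
}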

	The step in question is step five, which operates on a permuted
array of indices with some integers missing.  This is a little tricky, so
I am going to lift the documentation of the routine that does it.

"The algorithm runs as follows:  Suppose that, for a given sequence,
we have determined ascending subsequences of all possible lengths
such that each is headed by the largest possible value for subsequences
of that length.  Now consider a new sequence formed by adding a new value
to the head of the first sequence.  There are three possible case --
the new value .le. the head of the longest subsequence; it is .gt.
that the head of the longest but .le. than some shorter subsequence,
or it is .gt. the heads of all subsequences.  In the first case the
new subsequence has a subsequence which is one value longer than the
longest subsequence, headed by the added value.  In the second case
let k be the length of the subsequence such that the new value is .le.
its head but .gt. than the next longer subsequence.  In this case the
new subsequence of length k+1 headed by the new value and followed
by the old subsequence of length k.  In the final case the new value
is a new 'best' sequence of length one."

	There is more, but that's the key.  There are various ingenuities
which are not relevant here.  The point is that the algorithm constructs
an ordered array of list heads, all data being integers in the range
1 to n.  At each step you get a new value which must be placed in the
array; if it is smaller than everything already there it extends the
array, otherwise it replaces the value it is just larger than.

	The factor that is important is the time it takes to locate
the right position for the new item.  This particular code checks to
see if the item is a new longest or a new 'best' of length one.  (If
the files match, each item is an extension).  In the general case one
has to do a search.  The code uses an interpolation search.  Since
the values are integers in the range 1...n the search is fast.  This
is why I said quasi-linear.  The task of determining the actual O() is
not trivial.
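
To make this concrete, here is a compact sketch of the l.a.s. step in
C.  It scans the match values from right to left and keeps, for each
length, the largest value that can head an ascending subsequence of
that length, much as described above; a plain binary search stands in
for the interpolation search (both find the same slot, interpolation
is just faster on uniform data).  Names and details are assumed, not
taken from the routine quoted above:

#include <stdlib.h>

/* Longest ascending subsequence of a[0..n-1]: returns its length and
   fills out[] with the indices of one such subsequence. */
long
longasc(long *a, long n, long *out)
{
	long *heads  = malloc((n + 1) * sizeof(long));	/* best head per length */
	long *headix = malloc((n + 1) * sizeof(long));	/* index of that head */
	long *next   = malloc(n * sizeof(long));	/* link to rest of subsequence */
	long len = 0, i, m = 0;

	for (i = n - 1; i >= 0; i--) {
		long lo = 1, hi = len, k = 0;

		/* find the largest k with heads[k] > a[i]; heads[] is decreasing */
		while (lo <= hi) {
			long mid = (lo + hi) / 2;
			if (heads[mid] > a[i]) { k = mid; lo = mid + 1; }
			else                   { hi = mid - 1; }
		}
		heads[k + 1] = a[i];	/* a[i] now heads a subsequence of length k+1 */
		headix[k + 1] = i;
		next[i] = (k > 0) ? headix[k] : -1;
		if (k + 1 > len)
			len = k + 1;	/* the array of heads grew by one */
	}
	if (len > 0)			/* walk the links to recover one l.a.s. */
		for (i = headix[len]; i >= 0; i = next[i])
			out[m++] = i;
	free(heads); free(headix); free(next);
	return len;
}

With an interpolation search over heads[] in place of the binary
search, each placement costs roughly O(log log r) on uniformly
distributed values, which is the quasi-linear behaviour claimed above.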

	I don't know how close this is to the algorithm that diff uses.
The length of the array of subsequence heads ends up being the number
of matches (by position, not by value).  The n is the number of matches
by value rather than the file length(s).  If you used a general purpose
binary search, you would get O(n * log(r)) for the l.a.s. algorithm.  This
suggests that diff is, you should excuse the expression, doing it
differently.

	This code actually uses three sorts -- one on each of the hash
arrays and one on the match indices of the old file.  In this situation
an O(n) math sort is feasible.  Let n_o be the old file length, n_n
the new file length, n_v the number of matches before order is taken
into account, and r the number of matches after order is taken into
account.  If the search is O(n_v * log(log(r))), which I suspect it is, the
time is O(n_o) + O(n_n) + O(n_v) + O(n_v * log(log(r))).  In principle
the latter term dominates.  In practice, the sort times dominate.


>Diff does a sort, so you are basically correct there.  Fcomp does no
>sorting at all, and uses a completely different algorithm in any case.

	Color me curious.  I understand l.c.s. difference schemes and
sliding window schemes.  Can you give a quick summary of the idea of
fcomp?
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

guy@gorodish.Sun.COM (Guy Harris) (01/23/88)

> >The trouble with diff(1) is that it assumes that the only changes are
> >insertions and deletions.  So if you swap two functions, it reports
> >two changes.  The method which appeared in CACM can spot moves as well
> >as insertions and deletions.
> 
> The "hdiff" program does move detection.  It's in the comp.sources.unix
> archives.

That's because, according to the comments, it implements the algorithm
described in the CACM paper, which was:

	"A Technique for Isolating Differences Between Files", Paul Heckel,
	Interactive Systems Consultants, CACM, Volume 21, Number 4, April 1978,
	pp. 264-268.

The "h" stands for "Heckel".  The paper mentions the Hunt/McIlroy paper in
passing, saying:

	The only implementation of (Longest Common Subsequence) for file
	comparison of which the author is aware takes linear time and space for
	most practical cases but takes worst-case O(m*n*log n) time and O(m*n)
	space.  As a practical LCS implementation, it uses several clever
	techniques and is neither concise nor easy to understand.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com