@RELAY.CS.NET,@humus.huji.ac.IL:amoss@batata.bitne (Amos Shapira) (01/13/88)
Does anyone know where to find a description of the algorithm used in
Unix diff(1)?  I know that it was invented by someone named Harold
Stone, nothing more.  Thanks in advance,
	--Amos Shapira
P.S.  If you think that there is another newsgroup that might be able
to help me, then please send me a message.
gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/13/88)
In article <11229@brl-adm.ARPA> @RELAY.CS.NET,@humus.huji.ac.IL:amoss@batata.bitne (Amos Shapira) writes:
>Does anyone know where to find a description of the algorithm
>used in Unix diff(1)?

It is described in Bell Labs CSTR #41, "An Algorithm for Differential
File Comparison" by J. W. Hunt and M. D. McIlroy (June 1976).
Unfortunately this seems to be out of print.  I think there were some
errors in the report, too.  I attempted to rewrite "diff" once, and
found some minor speed improvements that could be made to the stock
UNIX version.  The comments in the UNIX version were fairly accurate
and complete, and they're probably more useful than the CSTR anyway.

I seem to recall a CACM article some time in the past several years
that presented a supposedly better diff algorithm, and it may have
also discussed the UNIX diff algorithm.
g-rh@cca.CCA.COM (Richard Harter) (01/14/88)
In article <7068@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <11229@brl-adm.ARPA> @RELAY.CS.NET,@humus.huji.ac.IL:amoss@batata.bitne (Amos Shapira) writes:
>>Does anyone know where to find a description of the algorithm
>>used in Unix diff(1)?
>
>It is described in Bell Labs CSTR #41, "An Algorithm for Differential
>File Comparison" by J. W. Hunt and M. D. McIlroy (June 1976).
>Unfortunately this seems to be out of print.  I think there were some
>errors in the report, too.  I attempted to rewrite "diff" once, and
>found some minor speed improvements that could be made to the stock
>UNIX version.  The comments in the UNIX version were fairly accurate
>and complete, and they're probably more useful than the CSTR anyway.
>
>I seem to recall a CACM article some time in the past several years
>that presented a supposedly better diff algorithm, and it may have
>also discussed the UNIX diff algorithm.

I used to have a copy of the Bell report, which of course had
disappeared when I had occasion to actually write a diff program.  My
recollection is that the original is rather obscure.  However the
essential idea is fairly simple, and has four steps:

(1) Hash each line in each file to produce an array of integers.

(2) Sort the two arrays of hash integers, retaining pointers back to
    the original lines.

(3) Compare the two lists and extract the longest common ascending
    subsequence.

(4) Relate the l.a.s. back to the original files.

There is a straightforward algorithm for step three which is
mathematically rather pretty -- if the hash integers are all unique.
The tricky part is what to do when there are duplicates, i.e.
multiple occurrences of the same text line in the file.  The solution
I used was to ignore the possibility of mis-assignment of duplicates
during the l.a.s. pass and do a gap fill afterwards; I don't know how
diff handles this.  This was written as part of a larger program, so
I don't have a comparison of 'mydiff' versus diff.
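[For concreteness, steps one and two might be sketched in C as
follows.  This is a hypothetical reconstruction, not Harter's code;
the multiplicative hash, the fixed line limit, and the use of qsort(3)
are all assumptions made for the sketch. --ed.]

```c
#include <stdlib.h>

#define MAXLINES 1000           /* arbitrary limit for the sketch */

static unsigned hv[MAXLINES];   /* hash value of each line */

/* Step 1: hash a line to an integer.  A simple multiplicative
 * hash -- an assumed choice, not the one the original code used. */
static unsigned line_hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

/* Comparison function: order line indices by their hash values. */
static int by_hash(const void *a, const void *b)
{
    unsigned x = hv[*(const int *)a], y = hv[*(const int *)b];
    return (x > y) - (x < y);
}

/* Steps 1-2: hash every line of one file, then sort an index array
 * by hash, retaining pointers back to the original line numbers. */
void hash_and_sort(char *lines[], int n, int idx[])
{
    int i;
    for (i = 0; i < n; i++) {
        hv[i] = line_hash(lines[i]);
        idx[i] = i;
    }
    qsort(idx, n, sizeof idx[0], by_hash);
}
```

[The same is done for both files; step three then walks the two
sorted index arrays in parallel. --ed.]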
However I did do timing analyses on 'mydiff'.  It turns out that the
critical part of the code is the sort.  In this case, the arrays
being sorted are uniformly distributed numbers, and a histogram math
sort wins, since you can write an O(n) sort.  The l.c.a.s. code can
be made O(n) in space and time.  The gap fill code is very fast but
is tricky to get right.

The sort that I wrote, by the way, has a line of code that breaks the
optimizer on the Masscomp C compiler (unless they have finally gotten
a fix in on it).  It runs as follows:

	for (i = 0; i < n; i++)
		y[c[*ir1++]++] = *ir2++;

That little gem is a one-pass sort loop!  (I won't try to explain
it.)  Apparently this particular optimizer gets confused about the
levels of indirection.

In the original report, I believe that they assumed that the
probability of two lines hashing to the same hash value was so small
that one didn't have to go back afterwards and check.  I don't know
if this assumption is still retained in diff.  The line hashing (and
the recheck, if you do it) use significant amounts of time but are
still dominated by the sort time.
-- 
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS Inc.
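[Spelled out, the loop is the output pass of a counting ("histogram")
sort.  Only the one-pass loop itself comes from the post; the
scaffolding around it is a reconstruction. --ed.]

```c
#include <stdlib.h>

/* Counting ("histogram") sort of n records whose keys are integers
 * in the range 0..k-1.  After the histogram and prefix-sum passes,
 * c[v] holds the next output slot for key v -- which is what makes
 * the one-line loop from the post work. */
void hist_sort(const int *key, const int *val, int *y, int n, int k)
{
    int *c = calloc(k, sizeof *c);
    const int *ir1 = key, *ir2 = val;
    int i;

    for (i = 0; i < n; i++)          /* histogram the keys */
        c[key[i]]++;
    for (i = 1; i < k; i++)          /* prefix sums: end positions */
        c[i] += c[i - 1];
    for (i = k - 1; i > 0; i--)      /* shift: c[v] = start of key v */
        c[i] = c[i - 1];
    c[0] = 0;

    /* the one-pass sort loop from the post */
    for (i = 0; i < n; i++)
        y[c[*ir1++]++] = *ir2++;

    free(c);
}
```

[The sort is stable, which matters when the keys are hash values with
duplicates. --ed.]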
ok@quintus.UUCP (Richard A. O'Keefe) (01/14/88)
There is a public domain version of diff(1) coded in C that you can
get from a comp.sources.misc archive:

	Subject: v02i001: Public domain diff
	Keywords: diff
	X-Archive: comp.sources.misc/8801/1
	Comp.sources.misc: Volume 2, Issue 1
	Archive-Name: pd-diff
	Submitted-By: blarson@skat.usc.edu (Bob Larson)

I believe it started life as part of the DECUS goodies.  I have it,
but haven't installed it yet.

The trouble with diff(1) is that it assumes that the only changes are
insertions and deletions.  So if you swap two functions, it reports
two changes.  The method which appeared in CACM can spot moves as
well as insertions and deletions.  I have a Pascal implementation of
that method, which works well enough to use but not well enough to
publish.  There are some things you need that the article didn't
explain.
jaw@aurora.UUCP (James A. Woods) (01/15/88)
also investigate the webb miller / eugene myers code presented in "a file comparison program", software practice and experience, november 1985. they (miller was acm algorithms editor, now at univ. of arizona) claim a factor of four speed increase over standard 'diff' for typical files.
schwartz@gondor.cs.psu.edu (Scott E. Schwartz) (01/16/88)
In article <1571@aurora.UUCP> jaw@aurora.UUCP (James A. Woods) writes:
>also investigate the webb miller / eugene myers code presented in
>"a file comparison program", software practice and experience,
>november 1985.
>
>they (miller was acm algorithms editor, now at univ. of arizona)
>claim a factor of four speed increase over standard 'diff' for
>typical files.

Webb is actually at Penn State.  Gene Myers is at Arizona, except
that he is here this year.

Recently they came up with an enhancement to their algorithm
("Haslock's refinement") that doubles (in the worst case) the speed
of the original algorithm.  The best case (in which all edits are
insertions or all are deletions) runs in linear time.

To be fair, there are cases where diff is faster.  Diff does best
when the longest common subsequence is short, i.e. lots of diffs.
Fcomp (as they call it) does best when the edit script is short,
i.e. few diffs.
-- 
Scott Schwartz
schwartz@gondor.cs.psu.edu
g-rh@cca.CCA.COM (Richard Harter) (01/16/88)
In article <3219@psuvax1.psu.edu> schwartz@gondor.cs.psu.edu (Scott E. Schwartz) writes:
>>also investigate the webb miller / eugene myers code presented in
>
>Webb is actually at Penn State.  Gene Myers is at Arizona, except that
>he is here this year.
>
>Recently they came up with an enhancement to their algorithm
>("Haslock's refinement") that doubles (in the worst case) the speed
>of the original algorithm.  The best case (in which all edits are
>insertions or all are deletions) runs in linear time.
>
>To be fair, there are cases where diff is faster.  Diff does best when
>the longest common subsequence is short, i.e. lots of diffs.  Fcomp
>(as they call it) does best when the edit script is short, i.e. few diffs.

But, but, but...  Finding the longest common subsequence is a linear
process.  Or at least it is sufficiently close to being linear that
it makes no difference.  I don't know how diff or fcomp does it, but
the algorithm that I use builds an ordered list of intermediate
subsequence heads.  Each new item goes onto a list head.  The trick
to making it quasi-linear is to do an interpolation search when
placing a new item into the list of intermediates.  [This isn't
really linear, but, given the conditions, might as well be.]

As I remarked earlier, my experience is that the time to find the
longest common subsequence is dominated by the sort time, even if a
linear sort is being used.  (By about a factor of 4 to 1, as I
recall.)  Am I missing something?  Do diff and/or fcomp bypass the
sort?
-- 
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS Inc.
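[An interpolation search probes where the target "ought" to be
instead of at the midpoint, which is what makes the placement step
cheap when the values are roughly uniform.  A generic sketch, not
Harter's code, returning the position of the first element .ge. the
target: --ed.]

```c
/* Interpolation search in a sorted array of distinct integers.
 * Returns the index of the leftmost element >= target, or n if the
 * target is larger than everything.  For roughly uniform values the
 * expected probe count is O(log log n). */
int interp_search(const int *a, int n, int target)
{
    int lo = 0, hi = n - 1;

    if (target <= a[0])
        return 0;
    if (target > a[n - 1])
        return n;               /* would extend the array */

    /* invariant: a[lo] < target <= a[hi] */
    while (hi - lo > 1) {
        /* probe proportionally to where target falls in [a[lo],a[hi]] */
        int mid = lo + (int)((long)(target - a[lo]) * (hi - lo)
                             / (a[hi] - a[lo]));
        if (mid <= lo)
            mid = lo + 1;       /* keep the probe strictly inside */
        else if (mid >= hi)
            mid = hi - 1;
        if (a[mid] < target)
            lo = mid;
        else
            hi = mid;
    }
    return hi;
}
```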
blm@cxsea.UUCP (01/16/88)
James A. Woods (jaw@aurora.UUCP) writes:
|also investigate the webb miller / eugene myers code presented in
|"a file comparison program", software practice and experience,
|november 1985.
|they (miller was acm algorithms editor, now at univ. of arizona)
|claim a factor of four speed increase over standard 'diff' for
|typical files.

I've got a copy of this article, and entered the C implementation
included in the article (OK, I had the secretary here enter it).
Anyway, it does appear to be quite a bit faster than diff (the System
V, Release 2 diff).  The algorithm is a nice algorithm, but the
implementation would need to be tweaked some before I would consider
it usable every day.  (Note that this should not be construed as a
flame against the authors of the article.  The implementation is
straightforward, and demonstrates the algorithm nicely.)  For
instance, it uses a LOT of memory, and when I profiled it, about 94%
of the time was spent in malloc!

One of the things I found most interesting about the algorithm is
that it is basically the same algorithm that Gosling Emacs uses to
calculate optimal screen updates!

If someone plans on implementing a public-domain diff supporting -b
and -c, this algorithm is definitely worth looking at.  Just don't
use the implementation in the article, or else spend a lot of time
optimizing it and decreasing its memory usage.
-- 
Brian L. Matthews                       "A power tool is not a toy.
...{mnetor,uw-beaver!ssc-vax}!cxsea!blm  Unix is a power tool."
+1 206 251 6811
Computer X Inc. - a division of Motorola New Enterprises
chip@ateng.UUCP (Chip Salzenberg) (01/19/88)
In article <532@cresswell.quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The trouble with diff(1) is that it assumes that the only changes are
>insertions and deletions.  So if you swap two functions, it reports
>two changes.  The method which appeared in CACM can spot moves as well
>as insertions and deletions.

The "hdiff" program does move detection.  It's in the
comp.sources.unix archives.
-- 
Chip Salzenberg               UUCP: "{codas,uunet}!ateng!chip"
A T Engineering               My employer's opinions are a trade secret.
"Anything that works is better than anything that doesn't."  -- me
schwartz@gondor.cs.psu.edu (Scott E. Schwartz) (01/23/88)
In article <23307@cca.CCA.COM> g-rh@CCA.CCA.COM.UUCP (Richard Harter) writes:
>But, but, but...  Finding the longest common subsequence is a linear
>process.  Or at least it is sufficiently close to being linear that
>it makes no difference.  I don't know how diff or fcomp does it, but
>the algorithm that I use builds an ordered list of intermediate
>subsequence heads.  Each new item goes onto a list head.  The trick
>to making it quasi-linear is to do an interpolation search when
>placing a new item into the list of intermediates.  [This isn't
>really linear, but, given the conditions, might as well be.]

I don't know if finding the longest common subsequence is a linear
process in general.  All of the published algorithms I've seen have
some sort of dependency on the input data that stops one from making
that claim.

Which conditions are you referring to?  Also, when you say
quasi-linear, what does that really mean?  Diff runs in
O((n+r) log n) time, where n is the size of the input, and r is the
number of pairs of positions in which the two sequences match
(cf. Hunt & Szymanski).

>As I remarked earlier, my experience is that the time to find the
>longest common subsequence is dominated by the sort time, even if a
>linear sort is being used.  (By about a factor of 4 to 1, as I
>recall.)  Am I missing something?  Do diff and/or fcomp bypass the
>sort?

Diff does a sort, so you are basically correct there.  Fcomp does no
sorting at all, and uses a completely different algorithm in any
case.
-- 
Scott Schwartz
schwartz@gondor.cs.psu.edu
g-rh@cca.UUCP (01/23/88)
In article <3238@psuvax1.psu.edu> schwartz@gondor.cs.psu.edu (Scott E. Schwartz) writes:
>In article <23307@cca.CCA.COM> g-rh@CCA.CCA.COM.UUCP (Richard Harter) writes:
>>But, but, but...  Finding the longest common subsequence is a linear
>>process.  Or at least it is sufficiently close to being linear that
>>it makes no difference.  I don't know how diff or fcomp does it, but
>>the algorithm that I use builds an ordered list of intermediate
>>subsequence heads.  Each new item goes onto a list head.  The trick
>>to making it quasi-linear is to do an interpolation search when
>>placing a new item into the list of intermediates.  [This isn't
>>really linear, but, given the conditions, might as well be.]
>
>I don't know if finding the longest common subsequence is a linear
>process in general.  All of the published algorithms I've seen have
>some sort of dependency on the input data that stops one from making
>that claim.
>
>Which conditions are you referring to?  Also, when you say
>quasi-linear, what does that really mean?  Diff runs in
>O((n+r) log n) time, where n is the size of the input, and r is the
>number of pairs of positions in which the two sequences match
>(cf. Hunt & Szymanski).

Since what I use may not be the same as diff, and may be an
unpublished algorithm, I will run through the steps and take it from
there.

(1) Hash each line in each file into an integer to produce two
    arrays of integers representing the files.  Also create an index
    array for each file array.

(2) Sort the file arrays by permuting the index arrays.

(3) Go through the index arrays to get two arrays of matches.

(4) Sort the match list for the old file, and permute the match list
    for the new file to correspond.

(5) Extract the longest ascending subsequence of the match list for
    the new file.  This is the longest common subsequence of the two
    files.

(6) Make a backfill pass to account for identical lines that were
    misaligned.

(7) Convert the l.c.s. to difference format.
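[Step three, simplified to operate directly on the two sorted hash
arrays -- and ignoring duplicate lines, which the earlier post called
the tricky part -- might look like this.  A reconstruction for
illustration, not Harter's code: --ed.]

```c
/* Walk two sorted hash arrays in parallel and record each hash
 * value that appears in both files.  Duplicates within a file are
 * deliberately not handled here; a real implementation must deal
 * with them (or backfill afterwards, as the posts describe).
 * Returns the number of matches written to out[]. */
int extract_matches(const unsigned *ha, int na,
                    const unsigned *hb, int nb,
                    unsigned *out)
{
    int i = 0, j = 0, m = 0;

    while (i < na && j < nb) {
        if (ha[i] < hb[j])
            i++;
        else if (ha[i] > hb[j])
            j++;
        else {
            out[m++] = ha[i];   /* present in both files */
            i++;
            j++;
        }
    }
    return m;
}
```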
The step in question is step five, which operates on a permuted array
of indices with some integers missing.  This is a little tricky, so I
am going to lift the documentation of the routine that does it.

    "The algorithm runs as follows: Suppose that, for a given
    sequence, we have determined ascending subsequences of all
    possible lengths such that each is headed by the largest possible
    value for subsequences of that length.  Now consider a new
    sequence formed by adding a new value to the head of the first
    sequence.  There are three possible cases -- the new value is
    .le. the head of the longest subsequence; it is .gt. the head of
    the longest but .le. the head of some shorter subsequence; or it
    is .gt. the heads of all subsequences.  In the first case the new
    sequence has a subsequence which is one value longer than the
    longest subsequence, headed by the added value.  In the second
    case let k be the length of the subsequence such that the new
    value is .le. its head but .gt. the head of the next longer
    subsequence.  In this case there is a new subsequence of length
    k+1 headed by the new value and followed by the old subsequence
    of length k.  In the final case the new value is a new 'best'
    sequence of length one."

There is more, but that's the key.  There are various ingenuities
which are not relevant here.  The point is that the algorithm
constructs an ordered array of list heads, all data being integers in
the range 1 to n.  At each step you get a new value which must be
placed in the array; if it is smaller than all the heads it extends
the array, otherwise it replaces the head it is just larger than.
The factor that is important is the time it takes to locate the right
position for the new item.  This particular code first checks to see
if the item is a new longest or a new 'best' of length one.  (If the
files match, each item is an extension.)  In the general case one has
to do a search.  The code uses an interpolation search.  Since the
values are integers in the range 1...n, the search is fast.  This is
why I said quasi-linear.
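[In the mirror-image formulation -- tracking, for each length, the
smallest value that can end a subsequence of that length -- this is
the familiar O(n log n) longest ascending subsequence construction.
A sketch, using a binary search where the post uses interpolation;
it returns only the length, and the backfill pass and recovery of
the subsequence itself are omitted: --ed.]

```c
#include <stdlib.h>

/* Length of the longest strictly ascending subsequence of a[0..n-1].
 * tail[k] holds the smallest value that can end an ascending
 * subsequence of length k+1; each new value either replaces the
 * first tail >= it, or extends the array. */
int las_length(const int *a, int n)
{
    int *tail = malloc(n * sizeof *tail);
    int len = 0, i;

    for (i = 0; i < n; i++) {
        /* binary search for the first tail >= a[i] */
        int lo = 0, hi = len;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (tail[mid] < a[i])
                lo = mid + 1;
            else
                hi = mid;
        }
        tail[lo] = a[i];        /* replace, or extend if lo == len */
        if (lo == len)
            len++;
    }
    free(tail);
    return len;
}
```

[Because the values here are indices in 1...n, the binary search can
be swapped for the interpolation search described above. --ed.]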
The task of determining the actual O() is not trivial.  I don't know
how close this is to the algorithm that diff uses.  The length of the
array of subsequence heads ends up being the number of matches (by
position, not by value).  The n is the number of matches by value
rather than the file length(s).  If you used a general purpose binary
search, you would get O(n*log(r)) for the l.a.s. algorithm.  This
suggests that diff is, you should excuse the expression, doing it
differently.

This code actually uses three sorts -- one on each of the hash arrays
and one on the match indices of the old file.  In this situation an
O(n) math sort is feasible.  Let n_o be the old file length, n_n the
new file length, n_v the number of matches before order is taken into
account, and r the number of matches after order is taken into
account.  If the search is O(n_v*log(log(r))), which I suspect it is,
the time is O(n_o) + O(n_n) + O(n_v) + O(n_v*log(log(r))).  In
principle the latter factor dominates.  In practice, the sort times
dominate.

>Diff does a sort, so you are basically correct there.  Fcomp does no
>sorting at all, and uses a completely different algorithm in any
>case.

Color me curious.  I understand l.c.s. difference schemes and sliding
window schemes.  Can you give a quick summary of the idea of fcomp?
-- 
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS Inc.
guy@gorodish.Sun.COM (Guy Harris) (01/23/88)
> >The trouble with diff(1) is that it assumes that the only changes are
> >insertions and deletions.  So if you swap two functions, it reports
> >two changes.  The method which appeared in CACM can spot moves as well
> >as insertions and deletions.
>
> The "hdiff" program does move detection.  It's in the comp.sources.unix
> archives.

That's because, according to the comments, it implements the
algorithm described in the CACM paper, which was:

	"A Technique for Isolating Differences Between Files", Paul
	Heckel, Interactive Systems Consultants, CACM, Volume 21,
	Number 4, April 1978, pp. 264-268.

The "h" stands for "Heckel".  The paper mentions the Hunt/McIlroy
paper in passing, saying:

	The only implementation of (Longest Common Subsequence) for
	file comparison of which the author is aware takes linear
	time and space for most practical cases but takes worst-case
	O(m*n*log n) time and O(m*n) space.  As a practical LCS
	implementation, it uses several clever techniques and is
	neither concise nor easy to understand.

	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com