tim (08/02/82)
To: tim
Subject: diff algorithm

Do you think that you could ask on the net if anyone knows of the original reference for the algorithm that diff uses? The source says that it is "due to Harold Stone" but gives no other references. I thought I remembered a CACM citation in the source, but I was wrong - that CACM reference was for where they used a shellsort. I could check in the library under the name Stone, but I thought it might be easier to ask on the net, if you think that's reasonable.

The reason I am interested in the algorithm is that I am unhappy with the source compare program I have now and would like to write a better one, but the algorithm is not immediately apparent when reading the source for diffreg.c. Let me know what you think.

Can anybody give an answer here?

Tim Curry
USENET: ucf-cs!tim
ARPANET: tim.ucf-cs@udel-relay
jhf (08/04/82)
I have a copy of a report, apparently from Bell Labs, entitled "An algorithm for differential file comparison," by J. W. Hunt and M. D. McIlroy, Computing Science Technical Report #41. Because it is a Xerox copy of the published report, without a cover, I'm not absolutely sure it was issued by Bell Labs. At any rate, this is the only detailed account of the diff algorithm I have seen.

The reason I am posting this information to the net, rather than replying to the request by mail, is that I want to follow up with another question about diff. It's a clever algorithm, and its greatest virtue is that it produces optimal results (except for perturbations caused by hashing collisions), but its limitation is said to be that it can't handle arbitrarily long files.

My question is this: what's wrong with running the algorithm piecewise over big files? That is, finding the differences in the first few thousand lines of the files, then restarting at the point of the last difference found, and so forth. I suppose the results would be nonoptimal, but they would be piecewise optimal, and presumably the quality of the results would vary with the size of the pieces, giving one a nice tradeoff between memory requirements and quality of output. Notice that the usual dumb algorithm for file comparison is obtained as a degenerate case with a very small (1- or 2-line) piece size.

This all seems so obvious. Am I missing something?
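The piecewise scheme described above can be sketched concretely. What follows is my own illustration, not code from diff itself: it uses Python's difflib.SequenceMatcher as a stand-in for the optimal line-matching step, runs it over fixed-size windows, keeps only the edits up to the last match found in each window, and restarts both files there. The function name piecewise_diff and the piece parameter are invented for the example.

```python
import difflib

def piecewise_diff(a, b, piece=1000):
    """Diff line lists a and b one window at a time (piecewise-optimal only)."""
    i = j = 0
    ops = []
    while i < len(a) or j < len(b):
        wa, wb = a[i:i + piece], b[j:j + piece]
        sm = difflib.SequenceMatcher(None, wa, wb)
        # Drop the zero-length sentinel block that SequenceMatcher appends.
        blocks = [blk for blk in sm.get_matching_blocks() if blk.size > 0]
        final = i + piece >= len(a) and j + piece >= len(b)
        if not blocks or final:
            # No resync point in this window (or last window): take every
            # edit as-is.  This is where nonoptimality creeps in - matches
            # straddling the window boundary are missed.
            for tag, a0, a1, b0, b1 in sm.get_opcodes():
                if tag != 'equal':
                    ops.append((tag, i + a0, i + a1, j + b0, j + b1))
            i += len(wa)
            j += len(wb)
        else:
            # Keep edits only up to the end of the last match, then restart
            # both files there, so the next window begins in sync.
            last = blocks[-1]
            stop_a, stop_b = last.a + last.size, last.b + last.size
            for tag, a0, a1, b0, b1 in sm.get_opcodes():
                if a0 >= stop_a and b0 >= stop_b:
                    break
                if tag != 'equal':
                    ops.append((tag, i + a0, i + a1, j + b0, j + b1))
            i += stop_a
            j += stop_b
    return ops
```

With a window of one or two lines this degenerates into the usual dumb compare mentioned above; with a window larger than both files it reduces to a single optimal run, so the piece size is exactly the memory/quality tradeoff in question.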
johnl (08/06/82)
There's nothing wrong with doing "diff" piecewise on very large files. The "bdiff" program that came with PWB and now with 3.0 does exactly that.

John Levine
decvax!cca!ima!johnl
djj (08/06/82)
Re: running diff in pieces

What problems would arise if a block of changes exceeded the piece size chosen? It seems to me that this piecewise diff would certainly generate nonoptimal answers. Might this be a gap in the logic?

Dave Johnson
BTL - Piscataway
gb (08/09/82)
"A Technique for Isolating Differences Between Files" by Paul Heckel in the April 1978 issue of CACM has an algorithm for diff'ing that he claims is better than the Hunt McIlroy algorithm. It certainly looks simple enough and runs in linear time.