[net.unix-wizards] diff alg ref query

tim (08/02/82)

	To: tim
	Subject: diff algorithm

	Do you think that you could ask on the net if anyone
	knows of the original reference on the algorithm that
	diff uses ?  The source says that it is "due to Harold
	Stone" but gives no other references.  I thought I
	remembered a CACM number in source but I was wrong -
	the CACM # was for where they used a shellsort.
	I could check in the libarary using the name Stone
	but I thought it might be easier to ask on the net
	if you think thats reasonable.  The reason I am
	interested in the algorithm is that I am unhappy
	with the source compare program I have now and
	would like to write a better one but the algorithm
	is not immediately apparent when reading the source for
	diffreg.c.  Let me know what you think.

Can anybody give an answer here?

			Tim Curry
			USENET:  ucf-cs!tim
			ARPANET: tim.ucf-cs@udel-relay

jhf (08/04/82)

I have a copy of report, apparently from Bell Labs, entitled "An algorithm
for differential file comparison," by J.  W.  Hunt and M.  D.  McIlroy,
Computing Science Technical Report #41.  Because it is a Xerox copy of the
published report, without a cover, I'm not absolutely sure it was issued by
Bell Labs.  At any rate, this is the only detailed account of the diff
algorithm I have seen.

The reason I am posting this information to the net, rather than replying
to the request by mail, is that I want to follow up with another question
about diff - It's a clever algorithm, and its greatest virtue is that it
produces optimal results (except for perturbations caused by hashing
collisions), but its limitation is said to be that it can't handle
arbitrarily long files.  My question is this:  What's wrong with running
the algorithm piecewise over big files?  That is, finding the differences
in the first few thousand lines of the files, then restarting at the points
of the last difference found, and so forth.  I suppose that the results
would be nonoptimal, but they would be piecewise optimal, and presumably,
the quality of the results would vary with the size of the pieces, giving
one a nice tradeoff between memory requirements and quality of otuput.
Notice that the usual dumb algorithm for file comparison is obtained as a
degenerate case with a very small (1 or 2 lines) piece size.

This all seems so obvious.  Am I missing something?

johnl (08/06/82)

Threre's nothing wrong with doing "diff" piecewise on very large files.
The "bdiff" program that came with PWB and now with 3.0 does exactly that.

John Levine decvax!cca!ima!johnl

djj (08/06/82)

Re - running diff in pieces

What problems would arise if a block of changes exceeded the "piece-size"
chosen?  It seems to me that this piece-wise diff would certainly generate
non-optimal answers.  Might this be a gap in the logic?

Dave Johnson
BTL - Piscataway

gb (08/09/82)

"A Technique for Isolating Differences Between Files" by Paul Heckel in the
April 1978 issue of CACM has an algorithm for diff'ing that he claims is
better than the Hunt McIlroy algorithm.  It certainly looks simple enough
and runs in linear time.