[comp.compression] compressing diffs

gudeman@cs.arizona.edu (David Gudeman) (04/25/91)

Some programs such as SCCS and RCS store multiple versions of a source
file in a single version file.  They work by storing edit scripts such
as are produced by diff.  For example SCCS stores only the first
version in its entirety, all other versions are stored in the form of
an edit script that edits the original file to the new one.

These aggregate files have several characteristics that would make
them good candidates for keeping in compressed form: they are not
accessed frequently, they have special formats anyway (so compression
would not make them less accessible), and they can get very large.

Would this sort of compression benefit much from special compression
algorithms?  For example, do diffs have any special properties that
could be used?  The process of making an edit script seems similar to
some compression algorithms, could compression and diff'ing be
combined?

Could you use an "mutual-dictionary" compression method (one where you
build the dictionary over all versions of the file) to get an effect
similar to edit scripts (since it would tend to put common strings in
the dictionary)?  As far as I know, edit scripts are line-based (or
otherwise coarse-grained), would a mutual-dictionary compression
method be finer-grained?  Is this already being done and my software
tools book is outdated?

Enquiring minds want to know.
--
					David Gudeman
gudeman@cs.arizona.edu
noao!arizona!gudeman