[comp.sources.d] Shortening context diffs

davison@drivax.UUCP (Wayne Davison) (05/05/90)

I was just recently contemplating context diffs (I was mailing a 140k context
diff and had applied a 50k patch to rn), when I thought that while new-style
context diffs are much nicer than the old, we could save even more space if
we optimized the change-bar case.  And thus was born the "protext diff."

Briefly, a protext diff is a context diff in which each hunk lists its changes
and lines of context once, in a single section, rather than once for the old
file and once for the new.  It merges the two line-number headers into one
line giving the starting line and section length for the old ('-') and the
new ('+') file.  It also shortens the initial '+', '-', or ' ' field to one
character, and offers the option of using a '.' instead of a ' ' so that
context lines survive the trip around the net better.  I am also advocating
the use of patch's Index: line to indicate the file name, rather than the
***/--- comments.

For comparison, here's a simple context diff:

*** orig/file	Wed May  4 22:19:48 1990
--- file	Wed May  4 22:19:54 1990
***************
*** 15,22 ****
  one
  two
  three
! OLD VERSION
  four
  five
  six
  seven
--- 15,23 ----
  one
  two
  three
! NEW VERSION
  four
+ EXTRA LINE
  five
  six
  seven

which looks like this in protext diff format:

Index: file
@@-15,8+15,9@@
 one
 two
 three
-OLD VERSION
+NEW VERSION
 four
+EXTRA LINE
 five
 six
 seven
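
To make the mapping concrete, here is a rough Python sketch of how one hunk
of a new-style context diff could be folded into the format above.  It is not
frob's code: the function name context_hunk_to_protext and its arguments are
made up for illustration, and it assumes the two halves of the hunk have
already been split out and agree on their context lines.

```python
def context_hunk_to_protext(old_start, old_lines, new_start, new_lines, blank=" "):
    """Merge the two halves of a new-style context hunk into one protext hunk.

    old_lines/new_lines keep their two-character prefixes ("  ", "- ", "+ ", "! ").
    blank is the marker for context lines (" " or the net-safe ".").
    """
    # diff leaves out a half that contains no changes; rebuild it from the
    # context lines of the other half so both halves can be walked together.
    if not old_lines:
        old_lines = [l for l in new_lines if l.startswith("  ")]
    if not new_lines:
        new_lines = [l for l in old_lines if l.startswith("  ")]

    body = []
    i = j = 0
    while i < len(old_lines) or j < len(new_lines):
        # Removed or changed lines from the old half go out first ...
        while i < len(old_lines) and old_lines[i][0] in "-!":
            body.append("-" + old_lines[i][2:])
            i += 1
        # ... followed by added or changed lines from the new half.
        while j < len(new_lines) and new_lines[j][0] in "+!":
            body.append("+" + new_lines[j][2:])
            j += 1
        if i < len(old_lines) and j < len(new_lines):
            # A context line appears in both halves; emit it once.
            body.append(blank + old_lines[i][2:])
            i += 1
            j += 1
        elif i < len(old_lines) or j < len(new_lines):
            raise ValueError("the two halves disagree on their context lines")

    header = "@@-%d,%d+%d,%d@@" % (old_start, len(old_lines),
                                   new_start, len(new_lines))
    return [header] + body
```

Fed the two halves of the example hunk, this yields the @@-15,8+15,9@@ header
followed by the ten body lines shown above.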

I've created a program (currently called "frob") that will take as input new-
or old-style context diffs plus the new protext diff format, and generate a
protext or new-style context diff as output (the default is to toggle the
diff's format unless you override it).

In addition, I've also extended Larry Wall's patch program to scan for and
parse the protext diff format.
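
As a rough idea of what that parsing involves, the Python sketch below reads
one protext hunk and applies it to a file held as a list of lines.  It is only
an illustration, not the actual change to patch: the names HEADER and
apply_protext_hunk are invented here, and unlike patch it makes no attempt to
find a hunk that has shifted.  It simply fails if the text does not match at
the stated line.

```python
import re

# Header form taken from the example: @@-OLDSTART,OLDLEN+NEWSTART,NEWLEN@@
HEADER = re.compile(r"^@@-(\d+),(\d+)\+(\d+),(\d+)@@$")

def apply_protext_hunk(lines, hunk):
    """Return a copy of lines with one protext hunk applied (no fuzz, no offset)."""
    match = HEADER.match(hunk[0])
    if match is None:
        raise ValueError("not a protext hunk header: %r" % hunk[0])
    old_start, old_len, new_start, new_len = map(int, match.groups())

    pos = old_start - 1          # 0-based index of the next original line
    out = list(lines[:pos])      # everything before the hunk is kept as-is
    produced = 0                 # lines written for the new version of the section
    for line in hunk[1:]:
        tag, text = line[:1], line[1:]
        if tag in (" ", ".", "-"):
            # Context and removed lines must both match the original text.
            if pos >= len(lines) or lines[pos] != text:
                raise ValueError("hunk does not match at line %d" % (pos + 1))
            pos += 1
        if tag in (" ", ".", "+"):
            # Context and added lines both appear in the new version.
            out.append(text)
            produced += 1
        elif tag != "-":
            raise ValueError("unknown marker: %r" % tag)
    if pos - (old_start - 1) != old_len or produced != new_len:
        raise ValueError("section lengths disagree with the header")
    return out + list(lines[pos:])
```

Both ' ' and '.' are accepted as the context marker, so a diff that took the
net-safe option is read the same way.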

If people like the protext diff concept, I'll post the patch to "patch" and
the code for "frob", and then people could start using the new patch format.
Then, after a few months of confusion and getting everyone up to speed, we
could actually start saving some net bandwidth.  Later, the protext diff
format could be added to diff programs and the need for frob would eventually
die out.

Comments?  Do you think it's worth pursuing?  If so, any design issues we
should consider?

Here are a few real-world examples of protext diff savings in action (sizes
are in bytes).  All the patches have been trimmed of comments (automatically,
by frob's -s option) and checked for accuracy.  Patches marked with an
asterisk (*) were distributed as old-style context diffs, so their savings
are quite a bit larger than for those distributed as new-style context diffs.
Since frob can generate new-style from old-style, I've included in the last
column the size each of those patches would have been as a new-style context
diff, just in case you wanted to know.

Patch           context  protext   Saves    (new-style)
============    =======  =======   =====    ===========
rn patch 41      61050    32952    46.0% *    (45995)
rn patch 42      63262    35087    44.5% *    (48879)
rn patch 43      62212    34917    43.9% *    (41469)
rn patch 44       1854     1002    46.0% *     (1548)
rn patch 45      61732    44136    28.5%
rn patch 46      50830    26077    48.7% *    (38367)

C news 24Aug89   50211    34325    31.6%
C news 14Sep89   46093    34232    25.7%
C news 13Nov89   44530    31313    29.7%
C news 10Jan90   53315    40228    24.5%
C news 16Jan90   49912    39926    20.0%
C news 17Jan90   52755    39526    25.0%

perl patch 10    42482    30047    29.3%
perl patch 11    47175    32951    30.2%
perl patch 12    31363    22684    27.7%
perl patch 13    31799    23096    27.4%
perl patch 14    32109    23850    25.7%

gcc1.36to1.37   400085   313776    21.6%

My own patch    144046   106639    26.0%
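
The Saves column is just the fraction of the context diff's size that the
protext version drops.  A small Python check, with the helper name
percent_saved invented for this note, recomputes a few of the rows:

```python
def percent_saved(context_size, protext_size):
    """Percentage of the context diff's size that the protext diff saves."""
    return round(100.0 * (context_size - protext_size) / context_size, 1)

# (patch, context size, protext size) copied from the table above
rows = [
    ("rn patch 41", 61050, 32952),
    ("gcc1.36to1.37", 400085, 313776),
    ("My own patch", 144046, 106639),
]
for name, context_size, protext_size in rows:
    print("%-14s %5.1f%%" % (name, percent_saved(context_size, protext_size)))
```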
-- 
Wayne Davison           \  /| / /| \/ /| /(_)    davison%drivax@uts.amdahl.com
davison@drivax.UUCP    (_)/ |/ /\| / / |/  \         ...!amdahl!drivax!davison