[comp.compression] Compression figures

victor@watson.ibm.com (Victor Miller) (04/15/91)

I've just come back from DCC '91 (which was a pretty good conference),
and was reminded of one of my pet peeves: reporting compression
figures.  There doesn't seem to be any standard way.  The following
seem to be in use: (Orig = original size, New=compressed size)

1) Orig/New
2) New/Orig
3) (Orig-New)/Orig
4) Orig/(B*New)
	Here B is either the byte size (in bits/byte) or, for images,
the number of bits/pel.

In addition, 2 and 3 are often reported as percentages, without
specifying the % sign.  I really don't like 3): it gives the amount
saved.  It does have the property that the bigger the value, the better
the compression.  I actually favor 4), because it gives a normalized
measure of compression: it doesn't matter what the byte size is: if
you have 7 bit bytes, or 8 bit bytes the figure comes out the same.
On the other hand, 1) isn't too bad, since it gives a larger figure
for better compression.  At the very least, I would plead with people
reporting compression figures to state explicitly the method they use
in calculating these results, since there isn't any standard.
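
For concreteness, here is a minimal sketch of the four figures computed
side by side.  It takes the formulas literally as listed above; the
function name, the choice of bytes as the unit for both sizes, the
default B=8, and the sample sizes are all assumptions made for
illustration, not anything from the conference.

```python
def compression_figures(orig, new, b=8):
    """Return the four figures from the list above, keyed 1..4.

    orig, new: original and compressed sizes in the same unit
               (assumed here: bytes).
    b:         bits per byte or bits per pel (assumed default: 8).
    """
    return {
        1: orig / new,           # 1) Orig/New
        2: new / orig,           # 2) New/Orig
        3: (orig - new) / orig,  # 3) (Orig-New)/Orig
        4: orig / (b * new),     # 4) Orig/(B*New), as written above
    }


# Hypothetical example: a 20000-byte file compressed to 2000 bytes.
figs = compression_figures(orig=20000, new=2000)
for method, value in figs.items():
    print(f"{method}) {value:.4f}")
```

The same pair of sizes gives 10, 0.1, 0.9 and 1.25 depending on the
method, which is why a bare number without the method is hard to
interpret.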

--
			Victor S. Miller
			Vnet and Bitnet:  VICTOR at WATSON
			Internet: victor@watson.ibm.com
			IBM, TJ Watson Research Center

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (04/16/91)

In article <VICTOR.91Apr15114057@irt.watson.ibm.com> victor@watson.ibm.com writes:
> I've just come back from DCC '91 (which was a pretty good conference),
> and was reminded of one of my pet peeves: reporting compression
> figures.  There doesn't seem to be any standard way.

I'm all in favor of reporting the original size and the compressed size,
at least in research papers. It's hard to get more direct than that. Any
extra information is at best a minor convenience.

> 1) Orig/New
> 2) New/Orig
> 3) (Orig-New)/Orig
> 4) Orig/(B*New)

The problem with (1) is that for a 20K file, the difference between 2K
and 1.81K compressed is magnified (1000% versus 1105%), while the vastly
more important difference between 40K and 45K compressed is shrunk (50%
versus 44%). (4) pleases the information theorist types but is rather
annoying to use in practice.
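
The arithmetic behind those percentages can be checked with a short
sketch.  The function name is hypothetical; the sizes are the ones
quoted above, in K.

```python
ORIG_K = 20.0  # original file size in K, from the example above


def orig_over_new_percent(new_k, orig_k=ORIG_K):
    """Method (1), Orig/New, expressed as a percentage."""
    return 100.0 * orig_k / new_k


# Two good results (2K, 1.81K) and two expansions (40K, 45K).
for new_k in (2.0, 1.81, 40.0, 45.0):
    print(f"{new_k}K compressed: {orig_over_new_percent(new_k):.0f}%")
```

A 0.19K change at the small end moves the figure by about 105 points,
while a 5K change at the large end moves it by about 6.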

Between (2) and (3)... well, data points:

Script started on Tue Apr 16 10:43:00 EDT 1991
csh> compress -v < /etc/hosts > /dev/null
Compression: 63.46%
csh> yabba -v < /etc/hosts > /dev/null
In: 311614 chars  Out: 103117 chars  Y'ed to: 33%
csh> yabba -^ < /etc/hosts > /dev/null
In: 311614 chars  Out: 103117 chars  Y'ed by: 67%
csh> Script done on Tue Apr 16 10:44:00 EDT 1991

I hope -^ (-v inverted) is sufficiently mnemonic for people who like (3).
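
As a cross-check, the two yabba percentages follow from the In and Out
counts in the transcript.  This is only a sketch: the function names are
made up, and rounding to the nearest whole percent is an assumption
about how the figures are printed.

```python
IN_CHARS = 311614   # "In:" count from the transcript above
OUT_CHARS = 103117  # "Out:" count from the transcript above


def reduced_to_percent(n_in, n_out):
    """Method (2), New/Orig, as a percentage."""
    return 100.0 * n_out / n_in


def reduced_by_percent(n_in, n_out):
    """Method (3), (Orig-New)/Orig, as a percentage."""
    return 100.0 * (n_in - n_out) / n_in


to_pct = reduced_to_percent(IN_CHARS, OUT_CHARS)
by_pct = reduced_by_percent(IN_CHARS, OUT_CHARS)
print(f"to: {to_pct:.0f}%  by: {by_pct:.0f}%")
```

The two figures always sum to 100%, so either one can be recovered from
the other once you know which is being reported.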

---Dan

wayne@csri.toronto.edu (Wayne Hayes) (04/21/91)

In article <VICTOR.91Apr15114057@irt.watson.ibm.com> victor@watson.ibm.com writes:
> I've just come back from DCC '91 (which was a pretty good conference),
> and was reminded of one of my pet peeves: reporting compression
> figures.  There doesn't seem to be any standard way.
>
> 1) Orig/New
> 2) New/Orig
> 3) (Orig-New)/Orig
> 4) Orig/(B*New)

I much prefer 2) over all the others.  It's the most obvious: it gives
you the size of the new one compared to the old one.  That's usually
how size differences are compared:   (new thing) vs (old thing).  So
30% means it's 3/10ths the size of the original.  This is COMPRESSION,
so we should be thinking "smaller is better", not "bigger is better."
My vote goes whole-heartedly behind 2).  It's also the least ambiguous.
I always get 1 and 3 confused: does 100% mean nothing happened or
we got a factor of 2?  With 2), it means nothing happened.  Easy.
Yes, it takes a bit of a view shift, but once you're used to it, it's
the easiest method to understand by far.

-- 
NOTICE: Due to the complexity of nearly all topics, the opinions expressed
above are in continual process of formation and may be changed without notice.

Wayne Hayes     INTERNET: wayne@csri.utoronto.ca        CompuServe: 72401,3525