victor@watson.ibm.com (Victor Miller) (04/15/91)
I've just come back from DCC '91 (which was a pretty good conference),
and was reminded of one of my pet peeves: reporting compression
figures. There doesn't seem to be any standard way. The following
seem to be in use (Orig = original size, New = compressed size):

1) Orig/New
2) New/Orig
3) (Orig-New)/Orig
4) Orig/(B*New)

Here B is either the byte size (as in bits/byte) or the number of
bits/pel (as in bits/pel). In addition, 2 and 3 are often reported as
percentages, without specifying the % sign.

I really don't like 3): it gives the amount saved. It does have the
property that the bigger the value, the better the compression. I
actually favor 4), because it gives a normalized measure of
compression: it doesn't matter what the byte size is; if you have
7 bit bytes or 8 bit bytes, the figure comes out the same. On the
other hand, 1) isn't too bad, since it gives a larger figure for
better compression.

At the very least, I would plead with people reporting compression
figures to state explicitly the method they use in calculating these
results, since there isn't any standard.

--
Victor S. Miller
Vnet and Bitnet: VICTOR at WATSON
Internet: victor@watson.ibm.com
IBM, TJ Watson Research Center
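[Editorial sketch, not part of the original post.] The four measures listed in the post above can be compared side by side by computing them all from the same pair of sizes. This is a minimal Python illustration. The function name `compression_measures`, the parameter name `bits_per_unit` (standing in for B), and the sizes 1000 and 250 are assumptions chosen for the example, not figures from the thread.

```python
def compression_measures(orig, new, bits_per_unit=8):
    """Compute the four figures listed in the post for one (orig, new) pair.

    orig and new are sizes in the same unit (for example bytes).
    bits_per_unit stands in for B; the name is hypothetical.
    """
    if orig <= 0 or new <= 0:
        raise ValueError("sizes must be positive")
    return {
        "1) Orig/New": orig / new,
        "2) New/Orig": new / orig,
        "3) (Orig-New)/Orig": (orig - new) / orig,
        "4) Orig/(B*New)": orig / (bits_per_unit * new),
    }


# Illustrative sizes only: 1000 units compressed down to 250 units.
measures = compression_measures(1000, 250)
for name, value in measures.items():
    print(f"{name}: {value}")
```

The same file and the same compressor yield four different numbers here, which is why a reported figure is hard to interpret unless the formula is stated alongside it.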
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (04/16/91)
In article <VICTOR.91Apr15114057@irt.watson.ibm.com> victor@watson.ibm.com writes:
> I've just come back from DCC '91 (which was a pretty good conference),
> and was reminded of one of my pet peeves: reporting compression
> figures. There doesn't seem to be any standard way.

I'm all in favor of reporting the original size and the compressed
size, at least in research papers. It's hard to get more direct than
that. Any extra information is at best a minor convenience.

> 1) Orig/New
> 2) New/Orig
> 3) (Orig-New)/Orig
> 4) Orig/(B*New)

The problem with (1) is that for a 20K file, the difference between 2K
and 1.81K compressed is magnified (1000% versus 1105%), while the
vastly more important difference between 40K and 45K compressed is
shrunk (50% versus 44%). (4) pleases the information theorist types
but is rather annoying to use in practice. Between (2) and (3)...
well, data points:

Script started on Tue Apr 16 10:43:00 EDT 1991
csh> compress -v < /etc/hosts > /dev/null
Compression: 63.46%
csh> yabba -v < /etc/hosts > /dev/null
In: 311614 chars  Out: 103117 chars  Y'ed to: 33%
csh> yabba -^ < /etc/hosts > /dev/null
In: 311614 chars  Out: 103117 chars  Y'ed by: 67%
csh>
Script done on Tue Apr 16 10:44:00 EDT 1991

I hope -^ (-v inverted) is sufficiently mnemonic for people who like (3).

---Dan
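[Editorial sketch, not part of the original post.] The percentages quoted in the post above can be reproduced with a few lines of arithmetic. The helper names below are hypothetical, and rounding to the nearest whole percent is an assumption about how the figures were printed. The sizes are the ones given in the post (a 20K file, and the yabba run on 311614 characters).

```python
def orig_over_new_pct(orig, new):
    """Measure (1), Orig/New, as a whole-number percentage."""
    return round(100 * orig / new)


def new_over_orig_pct(orig, new):
    """Measure (2), New/Orig, as a whole-number percentage ("to")."""
    return round(100 * new / orig)


def saved_pct(orig, new):
    """Measure (3), (Orig-New)/Orig, as a whole-number percentage ("by")."""
    return round(100 * (orig - new) / orig)


# The 20K file example: measure (1) for each compressed size in the post.
for new_k in (2, 1.81, 40, 45):
    print(f"20K -> {new_k}K: Orig/New = {orig_over_new_pct(20, new_k)}%")

# The yabba run from the transcript, as "to" and "by" percentages.
print("to:", new_over_orig_pct(311614, 103117), "%")
print("by:", saved_pct(311614, 103117), "%")
```

Under measure (1) a change of 0.19K in the output moves the figure by about 105 points, while a change of 5K moves it by about 6 points, which is the distortion the post describes.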
wayne@csri.toronto.edu (Wayne Hayes) (04/21/91)
In article <VICTOR.91Apr15114057@irt.watson.ibm.com> victor@watson.ibm.com writes:
> I've just come back from DCC '91 (which was a pretty good conference),
> and was reminded of one of my pet peeves: reporting compression
> figures. There doesn't seem to be any standard way.
>
> 1) Orig/New
> 2) New/Orig
> 3) (Orig-New)/Orig
> 4) Orig/(B*New)

I much prefer 2) over all the others. It's the most obvious: it gives
you the size of the new one compared to the old one. That's usually
how size differences are compared: (new thing) vs (old thing). So 30%
means it's 3/10ths the size of the original. This is COMPRESSION, so
we should be thinking "smaller is better", not "bigger is better."
My vote goes whole-heartedly behind 2).

It's also the least ambiguous. I always get 1 and 3 confused: does
100% mean nothing happened or we got a factor of 2? With 2), it means
nothing happened. Easy. Yes, it takes a bit of a view shift, but once
you're used to it, it's the easiest method to understand by far.

--
NOTICE: Due to the complexity of nearly all topics, the opinions
expressed above are in continual process of formation and may be
changed without notice.
Wayne Hayes  INTERNET: wayne@csri.utoronto.ca  CompuServe: 72401,3525
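[Editorial sketch, not part of the original post.] The question of what a given percentage means under measures 1), 2) and 3) can be checked by evaluating each formula for a file that did not shrink and for one that halved. The function name `as_percent` and the sizes 1000 and 500 are assumptions made for illustration.

```python
def as_percent(orig, new):
    """Return measures 1), 2) and 3) from the quoted list, as percentages."""
    return {
        1: 100 * orig / new,           # Orig/New
        2: 100 * new / orig,           # New/Orig
        3: 100 * (orig - new) / orig,  # (Orig-New)/Orig
    }


unchanged = as_percent(1000, 1000)  # output is the same size as the input
halved = as_percent(1000, 500)      # output is half the size of the input

print("unchanged:", unchanged)
print("halved:   ", halved)
```

For the unchanged file, 1) and 2) both give 100% and 3) gives 0%. For the halved file, 1) gives 200% while 2) and 3) both give 50%, so a bare "50%" cannot distinguish 2) from 3) in that case.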