ross@spam.ua.oz.au (Ross Williams) (06/27/91)
Two minor notes on data compression details:

Decompression Times
-------------------
When quoting speeds of decompression, is everyone agreed that the speed
should be given relative to the decompressed data? That is, if I say that
my decompression algorithm goes at 50K/second, does everyone agree that I
should mean that it is WRITING 50K/s, not READING 50K/second?

Averages
--------
I recently polished my algorithm test harness on my Macintosh and ran up
against a problem with averaging. My test harness runs the algorithm under
test on a suite of files (e.g. the Calgary corpus). For each file, it
records the compression performance (e.g. 3.48 bits per byte) in a
floating point variable. It also prints out the variable to two decimal
places (as seems to have become standard).

The problem arises when I get to the end of the corpus and want to print
out the average of the compression performances. The question is: should I
print out the average of the internal, highly accurate floating point
numbers, or should I print out the average of the printed results rounded
to 2dp?

I did some analysis and worked out that the "deep" average (the rounding
of the average of the unrounded values) and the "shallow" average (the
rounding of the average of the rounded values) can differ by up to one
rounded digit unit value (Example: try 1.5 and 2.5 rounding to 1dp).

Choosing the deep average means that if anyone attacks my statistics with
a calculator at a later date, they might conclude that I made a mistake.
Choosing the shallow average means losing information within the specified
rounding range unnecessarily. This problem must arise in science all the
time. Is there an accepted convention?

When pressed, I ended up implementing the shallow (average the rounded
results) average to avoid people thinking I made a mistake. (This
implementation was made difficult by a rounding error in the printf in my
C library!!!)

Ross Williams
ross@spam.ua.oz.au
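As a concrete illustration of the deep/shallow distinction, here is a
minimal C sketch. The three sample values and the choice of zero decimal
places are invented purely so that the two averages come out one unit
apart; they are not corpus figures, and the rounding rule (halves up) is
only one possible convention.

/* Deep vs. shallow averaging of per-file results (illustrative sketch).
 * "Deep"    = round only the final average of the full-precision values.
 * "Shallow" = round each per-file value first, then average the rounded
 *             values and round again.
 * The sample values below are made up; with halves rounding up and zero
 * decimal places they give 2 (deep) versus 1 (shallow). */
#include <stdio.h>
#include <math.h>

/* round x to `dp` decimal places, halves rounding up
 * (adequate for the positive values used here) */
static double round_dp(double x, int dp)
{
    double scale = pow(10.0, dp);
    return floor(x * scale + 0.5) / scale;
}

int main(void)
{
    double results[] = { 1.4, 1.4, 2.4 };   /* hypothetical per-file figures */
    int n = sizeof(results) / sizeof(results[0]);
    int dp = 0;                              /* print to 0 decimal places */
    double deep_sum = 0.0, shallow_sum = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        deep_sum    += results[i];
        shallow_sum += round_dp(results[i], dp);
    }

    printf("deep average    = %.0f\n", round_dp(deep_sum / n, dp));    /* 2 */
    printf("shallow average = %.0f\n", round_dp(shallow_sum / n, dp)); /* 1 */
    return 0;
}

With these inputs the full-precision average is 1.73, which prints as 2,
while the already-rounded values 1, 1 and 2 average to 1.33, which prints
as 1.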
cthombor@umnd-cs-gw.d.umn.edu (Clark Thomborson) (06/27/91)
As Ross guessed, the question of how to ``average'' a series of
performance figures has indeed been studied a lot. My reply is based on
Chapters 1 and 2 of Hennessy & Patterson's new text, {\it Computer
Architecture: A Quantitative Approach}, Morgan Kaufmann, 1990. This book
has lots of great stuff in it, beyond its discussion of performance
averaging, including a chapter on what I expect is the next ``standard''
workstation architecture, a RISC cpu with a Cray-like floating-point
pipe. No, I am not getting a kickback on their sales!

In brief, the hard part of performance averaging is to figure out the
application. It makes sense to report the total time required to process
a representative benchmark dataset. Perhaps this would be a directory
with a representative mix of files. This view lays bare the inherent
problem in interpreting ``average performance'': if you measure your
program's runtime on a directory that contains a lot of C source files,
and my intended use is for decompressing a single binary boot file, your
performance figure is of little use to me.

I disagree with Hennessy & Patterson on relatively minor points, in that
I think one should report mins and maxes, as well as averages. I also
like speed figures (Kbytes/second in the data compression business, MIPS
in the architecture business) more than they seem to: I can interpret
speeds more or less directly into a *rough* runtime estimate for my own
application, but I can't always figure out the appropriate scaling factor
to convert their ``total time'' into a runtime estimate for my
application.

For example, if you reported the range of decompression speeds you
observed on various types of files, I would probably be able to decide if
your compressor is fast enough for my binary boot file application. If,
instead, you reported just a total time (or an average speed) for
compressing a directory with a ``representative mix'' of binaries, text
files, and picture files, I would be worried. Perhaps the binaries ran a
lot faster or slower than the others (for some reason). In this case, the
total time or average speed would not tell me much about my application.
If all the files run at about the same speed, reporting mins and maxes
will reveal this; totals and averages will leave me in the dark.

Hennessy and Patterson point out that ``total time'' has a significant
advantage: it's relatively hard to cheat on its definition. For example,
the MIPS rating of a cpu has become discredited, due to the widespread
practice of choosing one's benchmark to maximize the reported figure. In
your context, reporting total time would dispose of the question of
whether one should measure decompression speed as time/input_size or
time/output_size.

I would argue that you should still report speeds, but *clearly state
your definition*. You should also report your runtime conditions: cpu,
clock speed, computer manufacturer and model number, operating system
name and version number, compiler name and version number, compiler
optimization setting, etc. If you are measuring compression performance
on a directory full of files, the IO configuration of your system becomes
important: at a minimum, you should disclose whether you are using a
floppy disk or a hard disk, but the disk model number and its bus type
(SCSI? ESDI?) are also important.

As for your question on ``deep'' versus ``shallow'' averages, the
accepted practice is to carry out all calculations at high precision,
then round just before printing. As you noticed, this means your
published data will not always look consistent, although it does mean
that every number is as accurately reported as possible. The only
exception I have noticed to this rule is that an accountant will
sometimes fudge the rounding of a number, so that the printed (and
accurately rounded) total is consistent with the sum of the (perhaps
inaccurately rounded) terms.

Clark
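A minimal sketch of the kind of report Clark is asking for, in C. The
decompress_file() routine below is a stand-in (it just burns a little CPU
and returns a made-up byte count), and the file names are examples; the
point is only that speed is defined as output bytes per second, that the
minimum and maximum are reported alongside the overall figure, and that
the total time is printed as well.

/* Sketch: report per-file decompression speed (output bytes per second)
 * plus the minimum, maximum, and overall speed over a corpus, and the
 * total time.  decompress_file() is a placeholder, not a real package. */
#include <stdio.h>
#include <time.h>

/* Stand-in for a real decompressor: burn some CPU so the timing code has
 * something to measure, and pretend a fixed number of bytes was written. */
static long decompress_file(const char *name)
{
    volatile double sink = 0.0;
    long i;
    (void)name;
    for (i = 0; i < 2000000L; i++) sink += i * 0.5;
    return 100000L;                 /* pretend 100000 bytes were written */
}

int main(void)
{
    const char *corpus[] = { "bib", "book1", "geo", "news", "obj1" };
    int n = sizeof(corpus) / sizeof(corpus[0]);
    double min_speed = 1e30, max_speed = 0.0;
    double total_bytes = 0.0, total_secs = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        clock_t start = clock();
        long out_bytes = decompress_file(corpus[i]);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        double speed;

        if (secs <= 0.0) secs = 1.0 / CLOCKS_PER_SEC;  /* clock resolution */
        speed = out_bytes / secs;         /* bytes of OUTPUT per second */

        printf("%-8s %8ld bytes out  %6.2f sec  %10.0f bytes/sec\n",
               corpus[i], out_bytes, secs, speed);

        if (speed < min_speed) min_speed = speed;
        if (speed > max_speed) max_speed = speed;
        total_bytes += out_bytes;
        total_secs  += secs;
    }

    printf("min %.0f  max %.0f  overall %.0f bytes/sec  total %.2f sec\n",
           min_speed, max_speed, total_bytes / total_secs, total_secs);
    return 0;
}

A real report would of course also carry the machine, compiler, and disk
details Clark lists; those cannot be captured in the code itself.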
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/28/91)
The ``average speed'' taken by a compressor over different types of input
files is meaningless. Do you know anybody who compresses exactly as many
bytes of ``book1'' per day as of ``obj1'', and exactly as many of
``news'' and ``trans'' and ``progp''? I don't. (Worse than that, it's a
harmonic average, so even if someone *does* compress exactly as much of
each type of file, the ``average speed'' will not equal his real speed.
``Average speed'' means if you spend the same amount of *time* on each
type of text, how much do you get compressed overall. Ugh.)

Most people deal with some combination of ``book1'' and ``progc''
(archives) and ``news''. Their average will be very different from the
overall average on any given text corpus. The *only* meaningful
statements you can make are about the boundary cases: ``The worst
observed speed from this compressor is 30K/sec, the best observed speed
is 60K/sec,'' and then how it does on each file type.

In fact, one excellent way to display benchmarks with varying parameters
(the parameter in this case being the input file type) is the following:

  (1) Select the benchmarks to be displayed: in this case, compression
      effectiveness and compression time, on a two-dimensional chart.
  (2) For each value of the parameter, plot a labelled point at the
      result.
  (3) Draw the convex hull of the points; i.e., connect the outer dots.

Repeat for each compression method under consideration.

Now what do the statistics mean on this chart? Each point reflects a
result for one method on one file type. The top of the convex hull for
one method is the maximum speed; the bottom is the minimum speed; the
left is the maximum effectiveness; the right is the minimum
effectiveness. The ``spread'' of the convex hull, i.e., its area, shows
how sensitive the method is to the file type. No matter what kind of
files someone compresses, his results will always be a convex combination
of some of those outer points. Plotting the ``average'' speed or
``average'' effectiveness means that you're taking the center of mass of
the points---as if everyone compressed just as much of every kind of
file---and ignoring the interesting regions on the edge of the convex
hull.

End of sermon A. Begin sermon B.

Even without meaningless averages there's way too much information for
one person to take in all at once. Another dose of reality: People don't
really want to get some abstract numeric measure of ``how good'' a
compressor is on all types of text. They know beforehand that they're
only interested in news, say, or progc. SO SHOW THE RESULTS FOR *ALL*
COMPRESSORS ON *ONE* FILE IN *ONE* PLACE! Someone who's deciding what to
do with his newsfeed wants to see a single row or column showing how
*every* compressor does on a sample of news.
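One possible sketch, in C, of step (3) of the display described in sermon
A: computing the convex hull of the per-file-type points with the
standard monotone-chain construction. The (effectiveness, speed) sample
points below are invented; the article itself contains no code, so this
is an illustration of the construction, not part of it.

/* Sketch: convex hull ("connect the outer dots") of per-file-type
 * results, each point being (compression effectiveness, speed) for one
 * file type.  Uses the monotone-chain method.  The sample points are
 * made up for illustration. */
#include <stdio.h>
#include <stdlib.h>

struct point { double x, y; };   /* x = effectiveness, y = speed */

static int cmp(const void *a, const void *b)
{
    const struct point *p = a, *q = b;
    if (p->x != q->x) return (p->x < q->x) ? -1 : 1;
    if (p->y != q->y) return (p->y < q->y) ? -1 : 1;
    return 0;
}

/* cross product of (o->a) x (o->b); > 0 means a counter-clockwise turn */
static double cross(struct point o, struct point a, struct point b)
{
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

/* Build the hull into h[], return the number of hull vertices. */
static int convex_hull(struct point *p, int n, struct point *h)
{
    int i, k = 0, lower;

    qsort(p, n, sizeof(p[0]), cmp);
    for (i = 0; i < n; i++) {                     /* lower hull */
        while (k >= 2 && cross(h[k - 2], h[k - 1], p[i]) <= 0) k--;
        h[k++] = p[i];
    }
    lower = k + 1;
    for (i = n - 2; i >= 0; i--) {                /* upper hull */
        while (k >= lower && cross(h[k - 2], h[k - 1], p[i]) <= 0) k--;
        h[k++] = p[i];
    }
    return k - 1;                                 /* last point == first */
}

int main(void)
{
    /* one (effectiveness, speed) point per file type -- made-up numbers */
    struct point pts[] = { {3.1, 55}, {2.4, 40}, {4.8, 62},
                           {3.9, 35}, {3.0, 50} };
    struct point hull[2 * 5];
    int i, m = convex_hull(pts, 5, hull);

    for (i = 0; i < m; i++)
        printf("hull vertex: effectiveness %.1f, speed %.1f K/sec\n",
               hull[i].x, hull[i].y);
    return 0;
}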
Witness this excerpt from my Y coding paper:

      bib  book1  book2   geo makelog   news  obj1   obj2 paper1 paper2
Z12 54094 394419 325147 78026   34757 230765 16528 160099  29433  40881
Z14 46817 357387 281354 77696   25647 202594 14048 138521  25077  37196
Z16 46528 332056 250759 77777   25647 182121 14048 128659  25077  36161

MW  41040 336793 230862 80296    8299 168652 13273 109266  22749  34234

AP2 47056 389702 297205 84582   20925 219665 13824 134547  26937  39415
AP6 40770 338046 261270 79471   14762 190502 13824 123323  22413  34637
AP3 40311 322178 228978 80106    8821 167896 13825 113296  22414  33320

Y2  46882 363339 287110 80817   28159 212617 13858 141783  26131  38037
Y6  40874 320622 256578 76275   21220 185097 13858 125900  22452  33671
Y3  40456 306813 229851 76695   14411 168287 13859 114323  22453  32733

See how easy it is to choose a news compressor? You can see immediately
that MW (squeeze, 300000-string table), Y3 (yabba -m300000), and AP3
(whap -m300000) have approximately the same effectiveness, while Z16
(compress -b16) is only slightly better than Y6 (yabba -m65533). This
table could be extended to any length for any number of compressors and
it would still be usable. You could even add compression and
decompression speeds, rankings between the compressors, etc. What's
important is that all the information for a single file type be collected
in one place.

Now does anyone really see a need for average compressor speeds or
average performance?

In article <894@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:

> When quoting speeds of decompression, is everyone agreed that the
> speed should be given relative to the decompressed data?

Yes. Otherwise the number cannot be compared directly to the compression
speed. Also, speed is a measure independent of the actual compression
method; speed per compressed byte is not.

> For each file, it records the compression
> performance (e.g. 3.48 bits per byte) in a floating point variable.

Why do you persist in that ugly bits-per-byte measure? People care about
how much space they save. Take out the factor of 8 already.

[ do you average before rounding or after? ]

Never do anything with rounded numbers unless you absolutely have to.

> I did some analysis and worked out that the "deep" average (the
> rounding of the average of the unrounded values) and the "shallow"
> average (the rounding of the average of the rounded values) can differ
> by up to one rounded digit unit value (Example: try 1.5 and 2.5
> rounding to 1dp).

1.5 and 2.5 both round to 2, so the average is the same either way.

> Choosing the deep average means that if anyone attacks my statistics
> with a calculator at a later date, they might conclude that I made a
> mistake.

Why do you think charts published in newspapers always have a disclaimer
like ``Percentages may not add up to 100% because of rounding''?

Anyway, if you do the sensible thing and include the original data (how
many bytes of output? how much time did it take, to your machine's clock
precision?), nobody can possibly attack your results.

---Dan
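As a small footnote to the last two answers (drop the factor of 8, and
publish the raw data), here is a trivial C sketch that derives both
summary figures from a pair of raw byte counts. The counts used are
arbitrary examples; the point is that anyone holding the two raw numbers
can recompute either figure, so no rounding dispute is possible.

/* Sketch: derive the usual summary figures from raw byte counts.
 * The sample sizes are arbitrary examples. */
#include <stdio.h>

int main(void)
{
    long in_bytes  = 100000;                 /* size of an input file        */
    long out_bytes = 40000;                  /* size of the compressed output */

    double bits_per_byte = 8.0 * out_bytes / in_bytes;
    double percent_saved = 100.0 * (1.0 - (double)out_bytes / in_bytes);

    printf("input %ld bytes, output %ld bytes\n", in_bytes, out_bytes);
    printf("%.2f bits per byte  =  %.1f%% of the space saved\n",
           bits_per_byte, percent_saved);
    return 0;
}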