[comp.compression] Reply to Dan Bernstein's criticisms of standard.

ross@spam.ua.oz.au (Ross Williams) (06/22/91)

CRIT_006: MORE ON BLOCKING
--------------------------
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Date: 21 Jun 91 20:43:38 GMT

|| To resolve  this mess, the  standard prohibits parameters.
|| The standard also forces the algorithm designer to specify how much memory
|| the algorithm will require.
|
|I find this completely unrealistic.
|
|Ross, it sounds like you're trying to specify a standard for modem
|(i.e., zero-delay) compression. But most compression in the real world
|is high-delay. In fact, someone compressing data for an archive, or for
|news batching, is much better off with control over the parameters. That
|way he can adjust them to fit his needs without swapping in an entirely
|different ``standard'' routine. He doesn't want to make the blocking
|explicit, or to deal with the oodles of information you've defined; he
|just wants to get a file or a bunch of files compressed as efficiently
|and effectively as possible.

I'm not  convinced. I  think that  allowing algorithm  parameters will
open up a  really disgusting can of  worms. I see no  harm in having a
library of exact tuned algorithms. Swapping between different standard
routines will be far easier than fiddling parameters (which can go out
of range etc).

If a user wants to create another  tuning, he can just fiddle with the
source code  for the algorithm until  he is happy and  then fiddle the
identity record to create a new  algorithm. The standard does not stop
people fiddling  with the  parameters of an  algorithm; it  just stops
them giving such vague objects standardized IDs.

(Another  argument is  that many  algorithms can  be implemented  more
efficiently if their parameters are statically specified).

The  user  of the  algorithm  doesn't  have  to  deal with  oodles  of
information to USE  an algorithm. Only if they  want a non-implemented
tuning.  In my  experience,  a handful  of  tunings  suffice for  most
algorithms.
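As a concrete picture of the "library of exact tuned algorithms" idea, each
tuning could carry its own identity record and standardized ID. (The field
names and lookup below are invented for illustration; they are not the
standard's actual record layout.)

```c
#include <stddef.h>

/* Sketch of an identity record for one exact tuning.  Field names are
   guesses at the kind of thing such a record would hold. */
typedef struct {
    const char *name;     /* human-readable name of this tuning        */
    unsigned    id;       /* standardized ID for this exact tuning     */
    size_t      memory;   /* memory the algorithm declares it needs    */
} cm_identity;

/* A "library" is then just a table of such records... */
static const cm_identity cm_library[] = {
    { "demo-small",  1,  16 * 1024 },
    { "demo-large",  2, 256 * 1024 },
};

/* ...and swapping algorithms is a table lookup, not parameter fiddling. */
static const cm_identity *cm_find(unsigned id)
{
    size_t i;
    for (i = 0; i < sizeof cm_library / sizeof cm_library[0]; i++)
        if (cm_library[i].id == id)
            return &cm_library[i];
    return NULL;            /* no such standardized tuning */
}
```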

|> However there are two very strong arguments for the block interface:
|>    EFFICIENCY: A stream interface requires one procedure call per byte
|>    processed.  For high speed applications this is unacceptable.
|
|Huh? UNIX's read() is a stream interface, and it sure doesn't require
|one procedure call per byte processed.

You're right. One can specify a block of bytes. Sorry, I got confused.

|>    DIRECTNESS: A stream interface will force many algorithms to become
|>    involved in unnecessary buffering.
|
|Your block interface forces *all* algorithms to become involved in
|unnecessary blocking. Again it sounds like you're concentrating on
|zero-delay modem compression. I don't think compressors should be forced
|into that model.

The interface does  not force the algorithms into  blocking, it forces
the user programs into it. This  is a different thing because the user
program is  in a better position  to decide how big  the blocks should
be.

I am wary of a stream interface because it is then impossible to place
a bound on the amount of output one could get from feeding in a single
byte of  input (during either  compression or decompression).  This is
more acceptable if the input and  output are hooked up to streams, but
if they are hooked up to memory there is a big problem.

Maybe a bound could be set as a parameter.

|> In contrast, a stream interface  can always be efficiently constructed
|> over the top  of a block interface.
|
|This is not true, especially given your requirement that the block
|compressor be memoryless.

The  interface  does  not  require   the  compression  routine  to  be
memoryless, only  that it close the  coding at the end  of each block.
Were  you referring  to  that,  or did  you  think  that the  standard
required contexts to be closed at the end of every ordinary block?

So where  does all  this leave us?  If the cost  of avoiding  a second
stream layer is low, I am in favour of it. I think that everyone could
be accommodated if the standard is modified as follows:

   * Introduce a flush flag. This effectively makes the interface
     a stream interface.
   * Somehow place an upper bound on the output that an algorithm can
     generate for a given input (i.e. place a bound on buffering).
     This could be:
     - Statically specified by the standard (e.g. 10K).
     - A property of the algorithm (in the identity record).
     - A parameter to the interface.
     I am in favour of a low (e.g. 1K) static bound.
   * Retain the memory block input and output interface.

How does that sound? I'm still thinking about it.
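To make the proposal concrete, the flush flag and the static output bound
might combine like this (all names and types are invented for illustration;
nothing like this is in the draft standard yet):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a block call with a flush flag, so a caller can drive it
   as a stream.  OUTPUT_BOUND is the proposed static bound (1K) on the
   output a single call may generate. */
#define OUTPUT_BOUND 1024

typedef struct {
    int flush;               /* nonzero: close the coding, emit all pending output */
    const unsigned char *in; /* caller's input block */
    size_t in_len;
    unsigned char out[OUTPUT_BOUND];
    size_t out_len;          /* set by the routine, always <= OUTPUT_BOUND */
} cm_block;

/* Trivial identity "compressor" standing in for a real algorithm. */
static int cm_compress(cm_block *b)
{
    if (b->in_len > OUTPUT_BOUND)
        return -1;           /* caller must respect the bound */
    memcpy(b->out, b->in, b->in_len);
    b->out_len = b->in_len;
    return 0;
}
```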

Ross Williams
ross@spam.ua.oz.au

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/23/91)

In article <873@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
> I'm not  convinced. I  think that  allowing algorithm  parameters will
> open up a  really disgusting can of  worms. I see no  harm in having a
> library of exact tuned algorithms.

I do.

Consider, for instance, my yabba coder. There's not only the memory
parameter -m, but a blocking test length -z and ``fuzz'' -Z that affect
compression. Twiddling -z and -Z can produce up to 10% improvements on
some types of files; there's no small set of best choices, and forcing
users into fixed values of -z and -Z, let alone -m, would be insane.

> If a user wants to create another  tuning, he can just fiddle with the
> source code  for the algorithm until  he is happy and  then fiddle the
> identity record to create a new  algorithm.

Uh-uh. This contradicts your stated goal of having the identification
numbers mean something for comparison purposes.

If you want to achieve that goal, just insist that everyone name all the
parameters (in my case, -m/-z/-Z) when quoting compression results. (It
gets even hairier when you're measuring speed and memory: for yabba,
you'd have to quote MOD and PTRS as well as the machine type.) Don't
force parameterized coders into a parameterless model.

> (Another  argument is  that many  algorithms can  be implemented  more
> efficiently if their parameters are statically specified).

Hardly.

> The  user  of the  algorithm  doesn't  have  to  deal with  oodles  of
> information to USE  an algorithm. Only if they  want a non-implemented
> tuning.

One advantage of a program-based interface is that the user is given
options. If he doesn't know about the options or doesn't want to deal
with them, he doesn't use them, but they're always there for
sophisticated users. You could mimic this in your library routine by
passing in an array of options, say of { int, int } pairs, with a
method-dependent meaning.
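That options-array idea might be sketched as follows (the option names and
defaults are invented here; the yabba flags are only mentioned as the sort of
thing the keys would stand for):

```c
#include <stddef.h>

/* Options as an array of { int, int } pairs with method-dependent
   meaning.  A naive user passes opts == NULL and gets the defaults. */
typedef struct { int key; int value; } cm_option;

enum { OPT_MEMORY = 1, OPT_BLOCK_TEST = 2 };  /* e.g. yabba's -m and -z */

/* Look up an option, falling back to a default when the caller did
   not mention it. */
static int cm_get_option(const cm_option *opts, size_t n, int key, int dflt)
{
    size_t i;
    for (i = 0; opts != NULL && i < n; i++)
        if (opts[i].key == key)
            return opts[i].value;
    return dflt;
}
```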

  [ streams versus blocks ]

I think the real argument here is over whether the compressor should
drive the I/O or vice versa. In modem applications, your standard is
fine: the I/O has to drive the compressor. But in most applications it's
much easier to work the other way round.

> |>    DIRECTNESS: A stream interface will force many algorithms to become
> |>    involved in unnecessary buffering.
> |Your block interface forces *all* algorithms to become involved in
> |unnecessary blocking. Again it sounds like you're concentrating on
> |zero-delay modem compression. I don't think compressors should be forced
> |into that model.
> The interface does  not force the algorithms into  blocking, it forces
> the user programs into it.

No, it also forces the algorithms into blocking. Many algorithms don't
want to deal with blocks. They want to deal with streams. Here, you can
notch this up as a separate criticism: You expect algorithms to have
enough special cases that they work well if you give them blocks of
three bytes at a time. But most real LZW-style compressors work horribly
on such short blocks. Most compressors do not support plaintext or
escape codes, and it is unrealistic to expect otherwise.

> I am wary of a stream interface because it is then impossible to place
> a bound on the amount of output one could get from feeding in a single
> byte of  input (during either  compression or decompression).

Again it sounds like you're focusing on modems. Why?

---Dan

gudeman@cs.arizona.edu (David Gudeman) (06/23/91)

I have to agree with Dan (readers of comp.lang.misc are probably
experiencing a feeling of unreality right now...) that parameters are
important.  Any standard that neglects them will be ignored by
everyone who implements an algorithm with natural parameters.

How about the following: a compression procedure with parameters will
use a global variable for each one (using some naming convention based
on the name of the procedure).  The user's program does not have to
refer to the global variable at all, and if it doesn't then some
natural default is used.  When you name such a procedure for
comparison to other procedures, any parameters that are not mentioned
are assumed to be the defaults.
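In C the convention might look like this (procedure and variable names are
illustrative only):

```c
/* Each parameter of foo_compress() lives in a global named after the
   procedure, preloaded with a natural default.  A caller who never
   touches the globals gets the default tuning. */
int foo_compress_memory = 100000;   /* natural default */
int foo_compress_fuzz   = 0;

/* Stand-in for the compressor: just reports how much memory the
   current parameter setting would make it allocate. */
static int foo_compress_table_bytes(void)
{
    return foo_compress_memory + 64 * foo_compress_fuzz;
}
```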

One advantage of this for C is that in those cases where you want to
hard code the parameters you can #define them in the source of the
compression function without changing the function's formal
parameters.  (It seemed that Dan doesn't believe there can be an
advantage in hard coding the parameters, which I find peculiar.  There
are many optimizations that can be done with known values over
variables.)
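A small example of the kind of optimization meant here (values invented): with
a #defined power-of-two table size, the slot computation compiles to a
constant mask instead of a runtime modulo by a variable.

```c
/* Hard-coded parameter: HASH_SIZE is a compile-time constant, so the
   compiler folds (HASH_SIZE - 1) into a literal mask. */
#define HASH_BITS 12
#define HASH_SIZE (1 << HASH_BITS)

static unsigned hash_slot(unsigned key)
{
    /* multiplicative hash; the & is constant-folded, no division */
    return (key * 2654435761u) & (HASH_SIZE - 1);
}
```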

As to the stream/block problem, I suggest that it is unreasonable to
try to force exactly _one_ interface on all algorithms and
applications.  Obviously there are applications for each and the
standard should apply to each.  In other words, there should be a
standard stream interface _and_ a standard block interface.  Let the
application programmer choose whichever interface is most appropriate
and let the compression implementer deal with any problems the given
compression procedure may have with either interface.  Add a field to
the identity records that says whether the algorithm works "best" with
streams, blocks, or either.

I'd also like to repeat someone else's suggestion of separating the
compression procedure and decompression procedure.  I don't understand
how this affects having a compression method as a first-class
function.  So you have two first-class functions for each compression
method instead of one, what's the big deal?  For that matter, I'd add
a third function -- the one that returns the identity record.  I don't
see any value to having this be part of the compression function
itself.
--
					David Gudeman
gudeman@cs.arizona.edu
noao!arizona!gudeman

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/23/91)

In article <4496@optima.cs.arizona.edu> gudeman@cs.arizona.edu (David Gudeman) writes:
> How about the following: a compression procedure with parameters will
> use a global variable for each one (using some naming convention based
> on the name of the procedure).  The user's program does not have to
> refer to the global variable at all, and if it doesn't then some
> natural default is used.

I expect that Ross will disapprove of this because he can't pass the set
of parameters around as arguments if they're global variables. If you
have an array argument giving any number of { parameter, value } pairs
in any order, this problem disappears.

> (It seemed that Dan doesn't believe there can be an
> advantage in hard coding the parameters, which I find peculiar.  There
> are many optimizations that can be done with known values over
> variables.)

It depends. Parameters like hash table size (and hash function!) have to
be hard-coded if you want any efficiency, as do ``parameters'' like the
type you use for storing hash values. But I've never seen a parameter
that affects compression and has to be hard-coded. (You get about a 0.4%
speedup by specifying a single value of ``min'' in yabba, for example.)
Finally, any good language lets you say ``let foo() be a version of
bar() where argument x is 57'' and optimizes that independently.

> As to the stream/block problem, I suggest that it is unreasonable to
> try to force exactly _one_ interface on all algorithms and
> applications.

Exactly. I imagine modem manufacturers would love this standard, for
example, while UNIX users and programmers would love a program-based
standard. Here's a challenge, Ross: Rewrite compress to conform to your
standard. If you can do that, publish it and explain why it's useful.
I'm always wary of ``standards'' that haven't been given at least two
unrelated implementations.

> I'd also like to repeat someone else's suggestion of separating the
> compression procedure and decompression procedure.

Ditto.

---Dan

gtoal@tharr.UUCP (Graham Toal) (06/23/91)

In article <3406.Jun2302.12.0491@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
:Exactly. I imagine modem manufacturers would love this standard, for
:example, while UNIX users and programmers would love a program-based
:standard. Here's a challenge, Ross: Rewrite compress to conform to your
:standard. If you can do that, publish it and explain why it's useful.
:I'm always wary of ``standards'' that haven't been given at least two
:unrelated implementations.
:
:> I'd also like to repeat someone else's suggestion of separating the
:> compression procedure and decompression procedure.
:
:Ditto.

That's exactly what I was trying to say when I referred rather tactlessly
(apologies, Ross) to 'pretentious twaddle'.  Imposed standards just don't
work.  Go ahead and implement the ideas you're suggesting, publish them
in source form on the net, and let the market decide.  I've a *lot* of
time for Dan because instead of moaning about compress, he went out and
implemented something that works.  (Whether it will replace compress is
something I'm sitting back and hopefully watching for...)

Incidentally, I've implemented something much like you describe.  For it
to be portable, you have to (yuk - I do it but I hate it) mandate that
none of the data structures used by compression or decompression are
larger than 32k bytes, even if this reduces the compression you get.

And separating out compression and decompression procs is worthwhile.

The biggest thing a standard could do for portability is this:
   1) Define the file format as a series of blocks; each block can
      be compressed with an arbitrary algorithm to a fixed interface.
   2) Define the procedure interface exactly (eg decomp(inblock,outblock,
         insize, outsize) -> status) and arrange for the encompassing
      compressed file TO INCLUDE THIS C CODE so that any decoder can
      extract it, link it in, and use it to extract the data.

This might seem a stupid suggestion but it's the only one I can
think of which will allow backwards and forwards compatibility.  Otherwise
you get into the pkzip thing of rushing around tracking newer versions
of algorithms and having decompression programs with twenty different
algorithms built in.
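For concreteness, the fixed decoder signature in point 2 might be sketched
like this (the status codes are invented, and outsize is made a pointer here
so the decoder can report how much it actually wrote; a trivial copy stands in
for a real decoder):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the one fixed interface every decoder is compiled to, so
   a reader can link in whatever decompressor a file carries. */
#define DECOMP_OK        0
#define DECOMP_TOO_SMALL 1

static int decomp(const unsigned char *inblock, unsigned char *outblock,
                  size_t insize, size_t *outsize)
{
    if (*outsize < insize)
        return DECOMP_TOO_SMALL;   /* caller's buffer cannot hold the result */
    memcpy(outblock, inblock, insize);  /* stand-in "decompression" */
    *outsize = insize;
    return DECOMP_OK;
}
```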

Graham
PS Ross, I probably wouldn't have shot from the lip in that last post
if you'd titled it 'request for comments' or even 'draft standard'...
-- 
(* Posted from tharr.uucp - Public Access Unix - +44 (234) 841503 *)