[comp.compression] Proposed standard: Replies to criticisms.

ross@spam.ua.oz.au (Ross Williams) (06/27/91)

I've given up numbering the latest standard criticisms (I can only
count to F :-) and trying to decide whether to topic stream them and
so on, so I've just bunched them all up and answered them in one go.

Thanks to everyone who has contributed so far. I appreciate the time
and effort and I think that a good standard will result.

Ross Williams
ross@spam.ua.oz.au

----------------------------------------------------------------------
|From: ian@airs.com (Ian Lance Taylor)

|>OK OK I'm being a martinet again.  Agreed. The standard has to be more
|>flexible  than I  have made  it about  where memory  comes from.  I am
|>getting uneasy about how the standard is loosening up though. I wanted
|>to  place a  bound  on the  stack so  that  procedures wouldn't  cause
|>trouble on machines with a small stack space (e.g. some microprocessor
|>embedded systems).
|
|Your standard seems to be driving at two different goals at once:
|
|    1) Allow programs to use several different compression subroutines.
|
|    2) Require compression subroutines to be portable between
|       machines.
|
|The point you are making above seems to relate only to goal 2.  I
|think the real benefit of the standard will be goal 1.  It's going to
|be awfully hard to make all compression routines portable to all
|machines anyhow.  That seems to me to be, as the C language standard
|would say, a ``quality of implementation'' issue for the compression
|routine itself.  Furthermore, an implementation on a machine with lots
|of stack space could use the stack; an implementation on a machine
|without stack space could use some other memory.  There shouldn't be a
|requirement that the same compression code work on different machines;
|only that if a different machine is supported, the same functionality
|is provided.
|
|Allowing programs to use different compression routines with a
|standard interface seems to me to be a worthy goal.  Specifying
|details about where the memory comes from seems needlessly machine
|specific.

I agree that there are (at least) two separate goals and that I have
not adequately separated them in my mind. I do, however, think that
the standard should address machine portability.

When I framed the standard, I realized that memory procurement was
highly MACHINE and environment specific (e.g. stack vs heap), and messy
besides, so I decided that the only way to allow machine-portable data
compression procedures was to let the user program (which can be as
non-portable as you like) work out where to get memory and what to do
if it is not available. This solution sidestepped the problem.

The result was a proposed standard that would be totally portable
between machines, subject to the portability of the programming
language, and here I agree that the responsibility is on the programmer
to write quality code. This fact does not negate the overwhelming
portability benefits of forcing the user program to go off and find
memory.

Where I went wrong was on LANGUAGE portability, not machine
portability, for as Jean-loup Gailly pointed out, by requiring memory
to be handed to the data compression procedure, I was making life very
hard for implementations in certain high level languages and for
explicitly procedurally recursive algorithms.

Suppose that high level languages could pass around pieces of memory
(which the user program could obtain from the stack or the heap) which
could then be used to make type-safe variable declarations and
procedure invocations. I suspect then that Jean-loup's objections
would vanish and so would the stack/heap issue.

Thus, tightening up stack memory and other forms of memory was really
my way of disallowing memory leakage into the compression procedure
when a deal on memory was supposed to have been made through the
parameter list. As it stands, my proposed standard does not specify
where memory comes from, only that the user program obtain some. But
the standard trips on the language(/memory) issue.

CONCLUSIONS: The proposed standard as it stands is totally flexible
about where memory is obtained, but it leaves it up to the user
program. However, feeding all the memory in through the parameter list
causes such problems in high level languages that I agree that
compression procedures should be allowed to get their own memory. Once
this decision is made, the flood gates open and we have to deal with
1) stack vs heap, 2) memory chunk sizes and heap fragmentation, 3)
failure to allocate. Under these conditions, imposing a limit on the
stack space used (beyond the bound the algorithm itself promises) seems
too restrictive.

Perhaps the algorithm could advertise how much heap and how much stack
it will use.
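
A sketch of what I have in mind (the field names are invented for
illustration and are not part of the proposal): the identity record
could carry the figures directly, so that a user program on a small
machine can reject an algorithm before ever calling it.

   /* Sketch only: identity record fields advertising memory usage. */
   struct identity_record {
      const char   *name;          /* algorithm name and tuning           */
      unsigned long id;            /* identifying number                  */
      unsigned long heap_bytes;    /* worst-case heap the procedure needs */
      unsigned long stack_bytes;   /* worst-case stack depth it will use  */
   };

   /* A user program can then refuse an algorithm that will not fit. */
   int fits(struct identity_record *id,
            unsigned long heap_avail, unsigned long stack_avail)
   {
      return id->heap_bytes <= heap_avail && id->stack_bytes <= stack_avail;
   }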

----------------------------------------------------------------------

|From: ronald@ecl014.UUCP (Ronald van Eijck)
|
|How about this:
|
|You have a standard command 'archive' (or whatever) with a standard syntax
|and a standard file format. In this format the filename etc. are stored
|and there is an identification code for every compression scheme.
|Now you have shared libraries that do the actual encoding/decoding. This
|would be a good solution for everybody.
|- The user has to know only one command syntax to do his work.
|- Archive shells do not have to support a lot of different syntaxes.
|- The programmer of a future compression scheme only has to make a library
|  with the encoding/decoding routines.

As far as I can tell, what you are proposing boils down to:

   1) The creation of a user-interface standard.
   2) The creation of a compressed file format standard.
      (including algorithm ids and so on).

Both good ideas for general purpose file compression. However, what I
am plugging for at the moment is a standard interface to the actual
algorithms themselves. This is a necessary component of your scheme as
can be seen from asking the question "How do the algorithms in the
libraries talk to the shell code?"

If the interface standard is not separated from your scheme above,
then the data compression algorithms are forced to become
unnecessarily involved in management information (e.g. storing ids)
which will make them harder to extract and place in tight, specific
applications such as in embedded systems.

|Of course this standard code has to be ported to a number of platforms
|but since almost every archive program is ported anyway this is not much
|of a problem.
|
|Identification codes have to be obtained from some central point where all
|the codes are registered.

No need. Just get everyone to generate random n-bit IDs. Make n large
enough so that a collision is extremely unlikely.
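
A sketch of the coin tossing (the generator below is only illustrative;
all that matters is that the IDs are chosen independently). With 32
bits, a thousand registered algorithms collide with probability of
roughly 1000*1000/2^33 - about one in ten thousand - so 64 bits would
be the safer choice.

   /* Sketch only: mint a random identity number for a new algorithm. */
   #include <stdio.h>
   #include <stdlib.h>
   #include <time.h>

   int main(void)
   {
      unsigned long id = 0;
      int i;
      srand((unsigned) time(NULL));
      for (i = 0; i < 32; i++)                 /* 32 coin tosses */
         id = (id << 1) | (unsigned long) (rand() & 1);
      printf("new algorithm id: %08lx\n", id);
      return 0;
   }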

----------------------------------------------------------------------

|From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
|
|In article <873@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
|> I'm not  convinced. I  think that  allowing algorithm  parameters will
|> open up a  really disgusting can of  worms. I see no  harm in having a
|> library of exact tuned algorithms.
|
|I do.
|
|Consider, for instance, my yabba coder. There's not only the memory
|parameter -m, but a blocking test length -z and ``fuzz'' -Z that affect
|compression. Twiddling -z and -Z can produce up to 10% improvements on
|some types of files; there's no small set of best choices, and forcing
|users into fixed values of -z and -Z, let alone -m, would be insane.

The reason why I am so heavy on parameters I think stems from seeing
dozens of papers in which they are not specified properly. In
addition, there is a need for users not to be confused by parameters.
These are not strong arguments and I am being swayed on the parameter
issue (so long as there are defaults and the name of the algorithm is
attached to the defaults).

|> If a user wants to create another  tuning, he can just fiddle with the
|> source code  for the algorithm until  he is happy and  then fiddle the
|> identity record to create a new  algorithm.
|
|Uh-uh. This contradicts your stated goal of having the identification
|numbers mean something for comparison purposes.

Absolutely not, because if the user changed anything he would have to
(according to standard) generate a new algorithm name and ID.

Actually I have been thinking about the ID lately and there are many
attributes upon which the ID could be based:

   - Algorithm.
   - Compressed data format (i.e. id says who can decompress).
   - Implementation.
   - Parameter settings.

So I might end up with a composite ID. I'm not sure.
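
For illustration only (none of this is settled), a composite record
might keep those attributes in separate fields so that they can be
compared independently:

   /* Illustration only: a composite identity. Two records that agree on
    * format_id can decompress each other's output, even if they differ
    * in implementation or parameter settings.
    */
   struct composite_id {
      unsigned long format_id;       /* compressed-data format            */
      unsigned long algorithm_id;    /* abstract algorithm                */
      unsigned long impl_id;         /* particular implementation         */
      char          params[64];      /* parameter settings, e.g. "M=4"    */
   };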

|If you want to achieve that goal, just insist that everyone name all the
|parameters (in my case, -m/-z/-Z) when quoting compression results. (It
|gets even hairier when you're measuring speed and memory: for yabba,
|you'd have to quote MOD and PTRS as well as the machine type.) Don't
|force parameterized coders into a parameterless model.

Well, maybe. But the algorithm designer has to commit himself on
parameters to some extent. SAKDC has 25 parameters and I can dial up a
multitude of adaptivities and memories and different order models.
However, users don't want this sort of choice. With too many
parameters, one is also open to the charge that one has tailored the
parameter values to the data (so called "overparameterization").

|> (Another  argument is  that many  algorithms can  be implemented  more
|> efficiently if their parameters are statically specified).
|
|Hardly.

Absolutely. You should read the other critics. I have you telling me
that dynamic parameters are not much less efficient than static ones.
Meanwhile, there's the mob who don't like passing a pointer to
read/write memory for the algorithm to use because they want to avoid
indirect addressing in their code and pull tricks with the address
range.

It is a fact that statically binding parameters allows lots of
optimizations. For example:

   * Elimination of code in the case where a parameter has the value
     1 (e.g. in LZRW1 when I decided to make the hash table of depth
     1 instead of depth n, resulting in the elimination of the code
     to compare the lengths of matches found in the history).

   * Loop unrollings (short ones).

   * Use shift operations for parameters that can be set to powers
     of two.

And so on.
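
A contrived fragment showing the kind of thing I mean (the names are
invented): with the table size fixed at a compile-time power of two,
the index computation reduces to a single mask, whereas the run-time
parameter forces a general modulus.

   /* Contrived illustration of static versus dynamic parameters. */
   #define TABLE_BITS 12                     /* static: fixed at compile time */
   #define TABLE_SIZE (1 << TABLE_BITS)

   unsigned index_static(unsigned h)
   {
      return h & (TABLE_SIZE - 1);           /* typically one AND instruction */
   }

   unsigned index_dynamic(unsigned h, unsigned table_size)
   {
      return h % table_size;                 /* general division              */
   }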

Nevertheless, the fact that static bindings allow higher speed is no
argument on its own against ALLOWING dynamic parameters, for an algorithm
author could create a version with parameters along with a couple of
optimized parameterless versions. My main arguments against parameters
are:

   * They tend to confuse users who don't know how to set the
     parameters.

   * They must be algorithm specific.
     It is hard to see how a user program unfamiliar with the algorithm
     could make sense of them.

   * They can overflow and be out of range and all the other things.

|> The  user  of the  algorithm  doesn't  have  to  deal with  oodles  of
|> information to USE  an algorithm. Only if they  want a non-implemented
|> tuning.
|
|One advantage of a program-based interface is that the user is given
|options. If he doesn't know about the options or doesn't want to deal
|with them, he doesn't use them, but they're always there for
|sophisticated users. You could mimic this in your library routine by
|passing in an array of options, say of { int, int } pairs, with a
|method-dependent meaning.

The idea of having default settings and then allowing the user to
specify otherwise at his own risk doesn't sound so bad so long as:

   * The settings do not affect memory consumption.
   * The compressor codes the settings into the compressed file
     so that the decompressing user doesn't have to know what
     the settings were.
   * People quote the settings they change.

|  [ streams versus blocks ]
|
|I think the real argument here is over whether the compressor should
|drive the I/O or vice versa. In modem applications, your standard is
|fine: the I/O has to drive the compressor. But in most applications it's
|much easier to work the other way round.

This may be so, but placing the compressor in charge means that you
have to tell it how to do IO, which is so messy in most programming
languages (requiring Ada-like generic superroutines or C pointers to
IO functions) as to completely destroy the standard. On the
other hand, such ego-centred compression code can nearly always be
recoded into procedure+instance_block, because the context information
nearly always consists of just a PC, a few buffer pointers and
possibly an explicit data stack.
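
As a sketch of what I mean by procedure+instance_block (a toy
run-length coder, not a real compressor; all the names are invented):
every piece of cross-call state lives in a block owned by the caller,
so the procedure can be handed input in pieces of any size and simply
resumes where it left off.

   /* Sketch: a trivial run-length coder recast as procedure+instance_block.
    * All cross-call state (the run in progress) lives in the caller's
    * block, so the procedure can be fed input in pieces of any size.
    */
   #include <stdio.h>

   struct rle_instance {
      int           run_byte;      /* byte of the run in progress, -1 if none */
      unsigned long run_len;       /* length of the run in progress           */
   };

   /* Consume in[0..in_len), append (count,byte) pairs to out; return the
    * number of output bytes written. The final run stays buffered in the
    * instance block (a real interface would add a flush flag to force it
    * out).
    */
   unsigned long rle_compress(struct rle_instance *ctx,
                              const unsigned char *in, unsigned long in_len,
                              unsigned char *out)
   {
      unsigned long i, o = 0;
      for (i = 0; i < in_len; i++) {
         if (in[i] == ctx->run_byte && ctx->run_len < 255) {
            ctx->run_len++;
         } else {
            if (ctx->run_byte >= 0) {
               out[o++] = (unsigned char) ctx->run_len;
               out[o++] = (unsigned char) ctx->run_byte;
            }
            ctx->run_byte = in[i];
            ctx->run_len  = 1;
         }
      }
      return o;
   }

   int main(void)
   {
      struct rle_instance ctx = { -1, 0 };
      unsigned char out[32];
      unsigned long n;
      n  = rle_compress(&ctx, (const unsigned char *) "aaab", 4, out);
      n += rle_compress(&ctx, (const unsigned char *) "bbbc", 4, out + n);
      printf("%lu output bytes so far; run of %lu pending\n", n, ctx.run_len);
      return 0;
   }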

I'm not saying that there won't be a little pain in converting some
algorithms over to the standard (all good standards cause lots of pain
:-), but I am sure that there will be very little loss of efficiency
and great benefits.

|> |>    DIRECTNESS: A stream interface will force many algorithms to become
|> |>    involved in unnecessary buffering.
|> |Your block interface forces *all* algorithms to become involved in
|> |unnecessary blocking. Again it sounds like you're concentrating on
|> |zero-delay modem compression. I don't think compressors should be forced
|> |into that model.
|> The interface does  not force the algorithms into  blocking, it forces
|> the user programs into it.
|
|No, it also forces the algorithms into blocking. Many algorithms don't
|want to deal with blocks. They want to deal with streams. Here, you can
|notch this up as a separate criticism: You expect algorithms to have
|enough special cases that they work well if you give them blocks of
|three bytes at a time. But most real LZW-style compressors work horribly
|on such short blocks. Most compressors do not support plaintext or
|escape codes, and it is unrealistic to expect otherwise.

You misunderstood me. In my original proposal, I assumed that it was
reasonable to expect user programs to do enough buffering to be able to
pass largish blocks to the compression procedure. The benefits of this
choice were 1) the user program got to choose the buffer lengths based
on how much memory was floating around, 2) the compression procedure
would not have to concern itself with buffering across calls as it
could assume that the user had given it a "reasonable" sized block. I
never expected the algorithms to work well on very small blocks. My
idea of a small block was 4K.

Following your earlier criticism, I now realize that a block interface
has the flaw of imposing a block structure ON THE DATA because it
forces the user program to add lengths into the compressed data. This
means that the interface will not be able to be used in older stream
compression systems (because of compatibility problems). I didn't mind
imposing a bit of buffering on the user (so as to give the nice clean
memory no-flush-flag interface) but I now realize that it is
not acceptable because the blocking is imposed on the data as well.

I am now in favour of modifying the standard to have a flush flag (see
justification document) and to impose a limit on the buffering
somehow. This would allow any algorithm to be used in block fashion or
stream fashion.

|> I am wary of a stream interface because it is then impossible to place
|> a bound on the amount of output one could get from feeding in a single
|> byte of  input (during either  compression or decompression).
|
|Again it sounds like you're focusing on modems. Why?

I'm NOT focussing on modems - you're the one who keeps mentioning them!!!
My logic goes something like this:

   - We NEED to make the algorithm a procedure (not a process).
   - We NEED to use memory blocks to feed bytes in and out.
     (although the blocks can be part of a stream interface).

(I hope you agree that the alternatives to these are horrific)

THEN:

As the algorithm is writing to an output block of memory, how do we
know how much memory to give it? We could feed in 100K in one
byte blocks and not get anything out, and then on the next call (in
which we pass another one byte block) receive 63479 output bytes.
Regardless of whether a block or a stream model is used (and I am now
in agreement with stream), one needs to place a bound on the
amount of buffering the algorithm is allowed to do or else there
is no way to allocate the output buffers.

Note: Use of memory buffers for an algorithm interface doesn't mean
that the user can't read and write the buffers to files, and with a
stream (but NOT process) interface there is no imposition on the data
format.
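
To make this concrete, here is one shape the revised interface might
take (all of it illustrative, not the wording of the standard): the
identity record bounds both the expansion and the number of bytes the
algorithm may hold back between calls, and the flush flag forces the
held-back bytes out. The caller can then always size an output block as
in_len * max_expansion + max_pending.

   /* Illustrative only: field and parameter names are invented. */
   struct identity {
      unsigned long memory;          /* working memory the caller must supply  */
      unsigned long max_expansion;   /* worst-case output bytes per input byte */
      unsigned long max_pending;     /* most bytes ever buffered across calls  */
   };

   typedef void (*compress_fn)(
      void *memory,                          /* working memory from the caller */
      const unsigned char *in,  unsigned long in_len,
      unsigned char *out,       unsigned long out_max,
      unsigned long *out_len,
      int flush);                            /* nonzero: emit everything held  */

Used one call at a time with flush set, this degenerates to the old
block interface; used with many calls and flush clear, it is a stream
interface, and the user program never has to write lengths into the
data.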

----------------------------------------------------------------------

|From: gtoal@tharr.UUCP (Graham Toal)
|
|This rather lengthy document sounds just like my suggestion last month
|for putting a wrapper round compression algorithms with a blocked
|interface - the blocking allowing fast seeking to random points within
|a compressed file.

Your idea is not what my proposed standard is about. My standard
merely seeks to define an interface between data compression
algorithms and the programs that wish to use them. The memory block
interface was suggested merely to simplify the interface to the
algorithm but has come under fire (and rightfully so) because it does
not admit stream compressors as a special case.

I missed your posting on the random-access blocking idea but it
sounds just like the idea I had in March 1989 to develop an invisible
file compression layer using blocking. I pursued the idea, developing
data structures for the file conversion layer and developing the fast
LZRW1 algorithm (specifically for this purpose -- hence the fast 68000
implementation), but for commercial reasons it never made it into
product form.

|The basic idea is fine but do we need 500+ lines of pretentious twaddle
|to describle it?

Yes we do. Absolutely. I like pretentious twaddle.

Without hundreds of lines of pretentious twaddle, details get missed
and stuffups occur. In computing, one trivial detail can sink a
program of 10^6 lines of code. As a mobster would say: NO LOOSE ENDS!

----------------------------------------------------------------------

|From: gudeman@cs.arizona.edu (David Gudeman)
|
|I have to agree with Dan (readers of comp.lang.misc are probably
|experiencing a feeling of unreality right now...) that parameters are
|important.  Any standard that neglects them will be ignored by
|everyone who implements an algorithm with natural parameters.

Note: I don't read comp.lang.misc.

I am being swayed. But there would have to be default values.

|How about the following: a compression procedure with parameters will
|use a global variable for each one (using some naming convention based
|on the name of the procedure).  The user's program does not have to
|refer to the global variable at all, and if it doesn't then some
|natural default is used.  When you name such a procedure for
|comparison to other procedures, any parameters that are not mentioned
|are assumed to be the defaults.

I see no reason to introduce global variables. The global variables
will not be accessible to user programs that link the algorithm in
dynamically and know only of the procedure they interface to. See also
next answer.

|One advantage of this for C is that in those cases where you want to
|hard code the parameters you can #define them in the source of the
|compression function without changing the function's formal
|parameters.  (It seemed that Dan doesn't believe there can be an
|advantage in hard coding the parameters, which I find peculiar.  There
|are many optimizations that can be done with known values over
|variables.)

If the user has the source code, he can do anything anyway. The real
case for parameters lies in the case of someone who only has an
executable.

|As to the stream/block problem, I suggest that it is unreasonable to
|try to force exactly _one_ interface on all algorithms and
|applications.  Obviously there are applications for each and the
|standard should apply to each.  In other words, there should be a
|standard stream interface _and_ a standard block interface.  Let the
|aplication programmer choose whichever interface is most appropriate
|and let the compression implementer deal with any problems the given
|compression procedure may have with either interface.  Add a field to
|the identity records that says whether the algorithm works "best" with
|streams, blocks, or either.

I think having two interfaces would be a big mistake. And unnecessary
too as I now believe that one interface can have both block and stream
as special cases (the stream is the more general).

|I'd also like to repeat someone else's suggestion of seperating the
|compression procedure and decompression procedure.  I don't understand
|how this effects having a compression method as a first-class
|function.  So you have two first-class functions for each compression
|method instead of one, what's the big deal?  For that matter, I'd add
|a third function -- the one that returns the identity record.  I don't
|see any value to having this be part of the compression function
|itself.

The big deal is that having two or three sets of functions instead of
one makes manipulating algorithm procedures as first class objects that
much more messy. The reduction from n>1 functions to n=1 functions is a
significant one because it means that users no longer need a structure
or record for each algorithm, just a pointer.

Backing up this argument is that two of the three different calls use
almost identical parameters.
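
The sort of manipulation I have in mind (sketch only, names invented):
with one entry point per algorithm, a registry is just an array of
pointers, and the caller selects the operation with a parameter.

   /* Sketch: one procedure per algorithm makes a registry trivial. */
   enum action { COMPRESS, DECOMPRESS, IDENTITY };

   typedef void (*algorithm)(enum action what, void *memory,
                             const unsigned char *in,  unsigned long in_len,
                             unsigned char *out, unsigned long out_max,
                             unsigned long *out_len);

   /* With separate compress/decompress/identity procedures, each entry
    * would have to be a record of three pointers instead.
    */
   algorithm registry[32];
   int       registered = 0;

   void register_algorithm(algorithm a)
   {
      if (registered < 32)
         registry[registered++] = a;
   }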

----------------------------------------------------------------------

From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)

|In article <4496@optima.cs.arizona.edu> gudeman@cs.arizona.edu (David Gudeman) writes:
|> How about the following: a compression procedure with parameters will
|> use a global variable for each one (using some naming convention based
|> on the name of the procedure).  The user's program does not have to
|> refer to the global variable at all, and if it doesn't then some
|> natural default is used.
|
|I expect that Ross will disapprove of this because he can't pass the set
|of parameters around as arguments if they're global variables. If you
|have an array argument giving any number of { parameter, value } pairs
|in any order, this problem disappears.

Agreed. If the standard does allow parameters, they must go through
the front door. What do you think of a string interface? E.g:

compress(comp,mem,inblk,outblk,"M=4,K=91,Z=512");

It would require a little parsing by the algorithm (on the
initialization call only) but would solve lots of format problems. The
string format could be semi-standardized.
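
The parsing burden would be small. A sketch of what the initialization
call might do with such a string (the parameter names and defaults are
invented):

   /* Sketch only: parse "M=4,K=91,Z=512" into named parameters,
    * falling back to defaults for anything not mentioned.
    */
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   struct params { long m, k, z; };

   void parse_params(const char *s, struct params *p)
   {
      p->m = 8; p->k = 16; p->z = 1024;         /* defaults */
      while (s && *s) {
         char key = *s;
         if (s[1] == '=') {
            long val = atol(s + 2);
            if      (key == 'M') p->m = val;
            else if (key == 'K') p->k = val;
            else if (key == 'Z') p->z = val;
         }
         s = strchr(s, ',');
         if (s) s++;                             /* skip the comma */
      }
   }

   int main(void)
   {
      struct params p;
      parse_params("M=4,K=91,Z=512", &p);
      printf("M=%ld K=%ld Z=%ld\n", p.m, p.k, p.z);
      return 0;
   }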

|> (It seemed that Dan doesn't believe there can be an
|> advantage in hard coding the parameters, which I find peculiar.  There
|> are many optimizations that can be done with known values over
|> variables.)
|
|It depends. Parameters like hash table size (and hash function!) have to
|be hard-coded if you want any efficiency, as do ``parameters'' like the
|type you use for storing hash values. But I've never seen a parameter
|that affects compression and has to be hard-coded. (You get about a 0.4%
|speedup by specifying a single value of ``min'' in yabba, for example.)
|Finally, any good language lets you say ``let foo() be a version of
|bar() where argument x is 57'' and optimizes that independently.

1) I don't understand.
2) Nothing HAS to be hard coded.
3) A compiler can only do so much with static parameters.

|> As to the stream/block problem, I suggest that it is unreasonable to
|> try to force exactly _one_ interface on all algorithms and
|> applications.
|
|Exactly. I imagine modem manufacturers would love this standard, for
|example, while UNIX users and programmers would love a program-based
|standard. Here's a challenge, Ross: Rewrite compress to conform to your
|standard. If you can do that, publish it and explain why it's useful.
|I'm always wary of ``standards'' that haven't been given at least two
|unrelated implementations.

YOU'RE the one with modems on the brain. I just chose the block
interface because it makes the standard simple (avoids superroutines
and other horrors - see earlier).

I now agree that the imposition of the blocking of my standard on the
DATA format makes this impossible (unless it can be assumed that the
program is always allowed to allocate two blocks of memory the size of
the file it is about to compress in which case it is TRIVIAL).

Note well: If I were allowed to use a slightly different data format
(in which compressed block lengths were inserted in the compressed
data), and allowed a tiny (e.g. 0.00001) decrease in compression
performance (for the flush (not context zap)) at the end of every
large (e.g. 100K) block, then it would be EASY. My point is that
the standard fails here not because of any objective inconvenience, but
because it fools with the data format. The revised stream/block
version won't.

I really don't think there would be any problem with implementing UNIX
compress if the flush flag were restored. As I understand it, the LZW
algorithm's context is very simple, consisting of just a table, an
index and possibly one or two coding scalars.
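
For what it's worth, a rough sketch of what that context might look
like (based on my understanding of LZW, not on the compress sources;
the field widths are illustrative):

   /* Rough sketch only: the cross-call state of an LZW compressor. */
   struct lzw_context {
      unsigned short parent[1 << 16];  /* the string table                  */
      unsigned char  suffix[1 << 16];
      unsigned long  next_code;        /* first unused code                 */
      unsigned long  current;          /* code of the string matched so far */
      unsigned int   code_bits;        /* current output code width         */
      unsigned long  bit_buffer;       /* partly assembled output bits      */
      int            bits_in_buffer;
   };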

As a separate point, I don't see how a program-based standard could be
specified at this level without assuming that languages have processes
or superroutines or procedures as parameters.

|> I'd also like to repeat someone else's suggestion of seperating the
|> compression procedure and decompression procedure.
|
|Ditto.

But why? If it were just a static module exporting functionality
I would agree. But this is a situation where programs may wish to
keep arrays of these algorithms. Why mess it all up when the
parameter lists of the calls are nearly the same?

----------------------------------------------------------------------

|From: gtoal@tharr.UUCP (Graham Toal)
|
|In article <3406.Jun2302.12.0491@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|:Exactly. I imagine modem manufacturers would love this standard, for
|:example, while UNIX users and programmers would love a program-based
|:standard. Here's a challenge, Ross: Rewrite compress to conform to your
|:standard. If you can do that, publish it and explain why it's useful.
|:I'm always wary of ``standards'' that haven't been given at least two
|:unrelated implementations.
|:
|:> I'd also like to repeat someone else's suggestion of seperating the
|:> compression procedure and decompression procedure.
|:
|:Ditto.
|
|That's exactly what I was trying to say when I referred rather tactlessly
|(apologies, Ross) to 'pretentious twaddle'.  Imposed standards just don't
|work.  Go ahead and implement the ideas you're suggesting, publish them
|in source form on the net, and let the market decide.  I've a *lot* of
|time for Dan because instead of moaning about compress, he went out and
|implemented something that works.  (Whether it will replace compress is
|something I'm sitting back and hopefully watching for...)

Imposed standards can and do work all the time (witness Ada, CDs). The
only ones you see failing are the ones that were so brain dead in the
first place that there was a mass revolt. Most standards have
problems, but at least they give you something to kick off.

I feel that the criticism stating (effectively) that I am not a DOer
is highly unfair. I COULD have posted to the net moaning about the
lack of a standard but instead I put in the time to create one.
Furthermore, I didn't just trust my judgement, I posted my standard to
the net to be criticised and modified. The result has been very
positive and with some of the modifications suggested I think we will
have a viable standard soon.

I don't see how I can "IMPLEMENT" my standard. I get the impression
that you think that in order for me to do that, I would have to
implement a work harness, a validation harness, and create conformant
implementations of dozens of algorithms. Well I'm not going to do that
and I don't think that I should have to. The standard is just
something that will exist and if people want to ignore it or propose
other standards then they can.

(As it happens I have been using a standard similar to the one
proposed (see lzrw_headers.h in my ftp archive on
sirius.itd.adelaide.edu in pub/compression) for two years and have
implemented five or six algorithms using it (LZRW1 is such an
algorithm). These algorithms were implemented with the standard in
mind though.)

I believe that if the flush flag (see justification document) were
reinstated, it would be very easy to develop a conformant version of
Unix compress and I don't think I have to demonstrate it. However, if
you find it hard to see how it can be done I might be forced to
develop it.

|Incidentally, I've implemented something much like you describe.  For it
|to be portable, you have to (yuk - I do it but I hate it) mandate that
|none of the data structures used by compression or decompression are
|larger than 32k bytes, even if this reduces the compression you get.

My standard proposal does not limit the memory available to a
compression algorithm. Instead it requires the algorithm to specify
how much it needs.

If your 32K limit is because of the problem of placing an upper bound
on buffering (so that memory blocks can be used for the input and
output to the algorithm), then you should restate the condition. There
is no need to limit the memory of the algorithm, only the degree of
buffering that it does.

|And seperating out compression and decompression procs is worthwhile.

People keep saying this but nobody will tell me why.
Is it one of the ten commandments or something?

I realize that it is generally neater to have three separate
procedures, but given the similarity of the parameter lists and the
ease of manipulating a single procedure as a first-class object, I
think it is a clear decision.

|The biggest thing a standard could do for portability is this:
|   1) Define the file format as a series of blocks; each block can
|      be compressed with an arbitrary algorithm to a fixed interface.
|   2) Define the procedure interface exactly (eg decomp(inblock,outblock,
|         insize, outsize) -> status) and arrange for the encompassing
|      compressed file TO INCLUDE THIS C CODE so that any decoder can
|      extract it, link it in, and use it to extract the data.
|
|This might seem a stupid suggestion but it's the only one I can
|think of which will allow backwards and forwards compatibility.  otherwise
|you get into the pkzip thing of rushing around tracking newer versions
|of algorithms and having decompression programs with twenty different
|algorithms built in.

This is the most general way of encoding data - as a computer program.
However, it does discriminate against enormous compression algorithms
such as SAKDC.

It would have to be a very strict, portable subset of C.

Anyway, as I said in an earlier reply, I am not trying to establish a
high level archiving/interface/data-format standard; I am just trying
to get a procedural interface standard going, which you admit above
that you will need anyway (although if the decoder can compile and
link the code, why can't it just run it as well? Then the decompression
program could be a full Unix C program).

|PS Ross, I probably wouldn't have shot from the lip in that last post
|if you'd titled it 'request for comments' or even 'draft standard'...

I have been using "standard", "standard proposal", etc in an
undisciplined manner. I do not expect it to be adopted instantly. I
just want to develop something that everyone can agree on. I believe
that as soon as the standard is agreed on, new algorithms will pop up
in that format and we can start collecting them and comparing them.

----------------------------------------------------------------------

|From: mjr@hussar.dco.dec.com (Marcus J. Ranum)
|
|        Maybe I missed it, but is there any consideration given to some
|kind of magic numbers to help identifying the proper means of decompressing
|compressed output produced by one of the "standard" compressors?
|
|mjr.

Yes. The designer of each algorithm tosses a coin 31 times to arrive
at an identity number which the algorithm then advertises to user
programs, which can then code the number into their compressed data.
The standard does not attempt to define a data format (e.g. containing
the number) as it is concerned only with the interface between the
algorithm and the user program.

----------------------------------------------------------------------

--<End of this batch of replies to criticisms>--

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/28/91)

I don't mean to be cruel, Ross, but I suggest that we all ignore your
``standard'' until you have squashed two or three separate compressors
(LZW, AP, and LZRW1, say) into the same mold and published the results.
This is how all standards should work: *Do something first*, and publish
it. Then tell us about the important parts of the interface, cutting out
irrelevant details that might take up a lot of code and spending time on
important restrictions that don't involve any single section of code.
Say ``the routine must do blah'' if you think blah is necessary for the
user. Say ``the user must not do foo'' if you don't think anyone will
ever want foo in routines written to the standard. Say ``the routine may
or may not do bar'' if you did it both ways. Voila, a standard.

None of this goes anywhere unless you first publish working code.

In article <899@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
  [ memory ]

The obvious solution is for the compression routine to provide its own
memory. This doesn't fit well with I/O-driven compression, though, or
with non-parametrized standards.

  [ pointers ]

Face it, Ross: There are a whole lot of compressors which cannot be
written efficiently in Pascal or even Ada. ``An interface that even
fools can use will be used only by fools''; well, by making your
standard wimpier and wimpier so that the routines can be run on wimpier
and wimpier machines in wimpier and wimpier languages, you pretty much
guarantee that it will only be used that way. If you're going to write
the standard before the implementation, you might as well take a lesson
from POSIX: Use pointers where they obviously belong, don't make design
decisions based on how difficult you think the interface will be to use
in Ada, and postpone other language bindings to the next decade.
 
> Perhaps the algorithm could advertise how much heap and how much stack
> it will use.

An excellent suggestion. (Naturally, my yabbawhap package includes a
``checkconf'' program which advertises this information, given the
compile parameters.)

> |Of course this standard code has to be ported to a number of platforms
> |but since almost every archive program is ported anyway this is not much
> |of a problem.
> |Identification codes have to be obtained from some central point where all
> |the codes are registered.
> No need. Just get everyone to generate random n-bit IDs. Make n large
> enough so that a collision is extremely unlikely.

Great. Now IDs are almost totally useless. What are you trying to
accomplish with those numbers?

  [ parameters ]
> The reason why I am so heavy on parameters I think stems from seeing
> dozens of papers in which they are not specified properly.

Writing a crippled compression standard is a poor way to express your
dissatisfaction with how papers are written.

> |> If a user wants to create another  tuning, he can just fiddle with the
> |> source code  for the algorithm until  he is happy and  then fiddle the
> |> identity record to create a new  algorithm.
> |Uh-uh. This contradicts your stated goal of having the identification
> |numbers mean something for comparison purposes.
> Absolutely not, because if the user changed anything he would have to
> (according to standard) generate a new algorithm name and ID.

Sorry I didn't make this clear. I publish yabba with a default of
-m65533. I give it some ID. Joe Shmoe changes the default to -m300000
and gives it another ID. Sally Shmoe changes the default to -m300000 and
gives it yet another ID.

Now if Joe and Sally choose their IDs randomly, how do we tell that
they're using the same method? Or, for that matter, that they've just
changed a parameter in my method?

> |If you want to achieve that goal, just insist that everyone name all the
> |parameters (in my case, -m/-z/-Z) when quoting compression results. (It
> |gets even hairier when you're measuring speed and memory: for yabba,
> |you'd have to quote MOD and PTRS as well as the machine type.) Don't
> |force parameterized coders into a parameterless model.
> Well, maybe. But the algorithm designer has to commit himself on
> parameters to some extent. SAKDC has 25 parameters and I can dial up a
> multitude of adaptivities and memories and different order models.

I fail to see how this matters. The algorithm is parametrized. Period.

> |> (Another  argument is  that many  algorithms can  be implemented  more
> |> efficiently if their parameters are statically specified).
> |Hardly.
> Absolutely. You should read the other critics. I have you telling me
> that dynamic parameters are not much less efficient than static ones.
> Meanwhile, there's the mob who don't like passing a pointer to
> read/write memory for the algorithm to use because they want to avoid
> indirect addressing in their code and pull tricks with the address
> range.

Right. Remember, we're talking about *can be implemented*, not can be
implemented under your awful set of restrictions.

Someone might have a compression method where he compares n to a
parameter, x, in the innermost loop. Does that mean that the algorithm
can be implemented more efficiently if their parameters are statically
specified? No, because every use of n can be ``flipped'' into x - n, and
this can be compared against 0.
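
In code, the flip is just this (a contrived pair of functions; both
compute the same thing):

   /* Contrived: track x - n instead of n so the loop test is against 0
    * rather than against the parameter x.
    */
   long work_naive(long n, long x)
   {
      long work = 0;
      while (n < x) { work++; n++; }       /* compares against x each pass */
      return work;
   }

   long work_flipped(long n, long x)
   {
      long work = 0, left = x - n;
      while (left > 0) { work++; left--; } /* compares against zero        */
      return work;
   }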

Sure, there will exist parameters (like the hash function) where you
just can't do something like this. You still don't have to make those
parameters fixed. You can just tell the compiler to optimize out a
separate version for those parameter values:

   int my_special_hash(h,c) int h; int c; { ... }

   static int generic_routine(input,hash)
   char *input;
   int (*hash)();
   {
    ... all the compression code, parameterized on ``hash'', goes here ...
   }

   static int optimized_routine(input)
   char *input;
   {
   #pragma begin-inline-fun generic_routine
    return generic_routine(input,my_special_hash);
   #pragma end-inline-fun generic_routine
   }

   int user_visible_routine(input,hash)
   char *input;
   int (*hash)();
   {
    if (hash == my_special_hash)
      return optimized_routine(input);
    return generic_routine(input,hash);
   }

Of course, this is oversimplified, and with current C compilers you have
to stick generic_routine inside one big, ugly (but portable) macro instead
of using pragmas. But it works, it avoids code duplication, and it gives
you all the benefits of fixing the parameter to begin with. In fact, if
the compiler inlines something like

   user_visible_routine(myinput,my_special_hash)

in the user code, it will end up *exactly* as efficient as if
my_special_hash had been used everywhere in the first place. The only
disadvantage is the extra space needed for generic_routine, and that can
be avoided if you stick the routines inside a library to begin with.

I believe the MIPS compilers can do all of the above, including inlining
from ucode libraries. They don't have the #pragma but that's just
syntactic sugar. Even under good old pcc, all you lose for the
parametrization is a comparison and an extra function call on each call
to the routine.

> It is a fact that statically binding parameters allows lots of
> optimizations.

Indeed... I don't complain that something can't be done. I just do it.
Are you so sure now that taking parameters out of the interface allows
better optimization?

>    * They tend to confuse users who don't know how to set the
>      parameters.

As long as every parameter has a default, this is moot.

>    * They must be algorithm specific.

So what?

>    * They can overflow and be out of range and all the other things.

True. Another advantage of a program-based interface is that it can take
the time to check such things.

> The idea of having default settings and then allowing the user to
> specify otherwise at his own risk dooesn't sound so bad so long as:
>    * The settings do not affect memory consumption.

There you go again.

How about you try to fit compress into this model. If you manage to
write one version which doesn't take too much memory for -b12, but
allows -b16, but doesn't have a memory-affecting parameter, and doesn't
allocate memory itself anyway, I'll be very impressed.

  [ compressor driving I/O versus I/O driving compressor ]
> This may be so, but placing the compressor in charge means that you
> have to tell it how to do IO, which is so messy in most programming
> languages (requiring Ada-like generic superroutines or C pointers to
> IO functions) as to completely destroy the standard.

See my general comments near the top. You have that choice: either use a
language which can compete with machine language, or make your standard
so wimpy that nobody will want to use it.

> I'm not saying that there won't be a little pain in converting some
> algorithms over to the standard (all good standards cause lots of pain
> :-), but I am sure that there will be very little loss of efficiency
> and great benefits.

So do it for compress. I want to see something that works just like
compress but uses your standard for an internal interface.

> |> I am wary of a stream interface because it is then impossible to place
> |> a bound on the amount of output one could get from feeding in a single
> |> byte of  input (during either  compression or decompression).
> |Again it sounds like you're focusing on modems. Why?
> I'm NOT focussing on modems - you're the one who keeps mentioning them!!!
> My logic goes something like this:
>    - We NEED to make the algorithm a procedure (not a process).
>    - We NEED to use memory blocks to feed bytes in and out.
>      (although the blocks can be part of a stream interface).
> (I hope you agree that the alternatives to these are horrific)

No, I don't agree with either one. The argument here is about the
second. You only need a bound on output because you have a block
interface, and unless you wanted to handle modems you could use a stream
interface instead. There's nothing horrific about ignoring modems and
sticking to a stream interface.

> |One advantage of this for C is that in those cases where you want to
> |hard code the parameters you can #define them in the source of the
> |compression function without changing the function's formal
> |parameters.  (It seemed that Dan doesn't believe there can be an
> |advantage in hard coding the parameters, which I find peculiar.  There
> |are many optimizations that can be done with known values over
> |variables.)
> If the user has the source code, he can do anything anyway. The real
> case for parameters lies in the case of someone who only has an
> executable.

See above. It doesn't matter whether you have the entire source or just
a library. Hard-coding parameters barely speeds anything up.

> Agreed. If the standard does allow parameters, they must go through
> the front door. What do you think of a string interface? E.g:
> compress(comp,mem,inblk,outblk,"M=4,K=91,Z=512");

Not too bad.

> I really don't think there would be any problem with implementing UNIX
> compress if the flush flag were restored. As I understand it, the LZW
> algorithm's context is very simple consisting of just a table, an
> index and possibly one or two coding scalars.

Well, don't just understand it. Do it. Then tell us what you did.

> As a separate point, I don't see how a program-based standard could be
> specified at this level without assuming that languages have processes
> or superroutines or procedures as parameters.

Hmmm? What do you mean by that?

> |> I'd also like to repeat someone else's suggestion of seperating the
> |> compression procedure and decompression procedure.
> |Ditto.
> But why?

Because if the procedures are in a library, separating them means you
can load only one into a program which needs to do only one thing. AP
coding (as in whap) makes a good example: the compressor is rather
complex and takes quite a bit of code space, but the decompressor is
trivial. A lot of programs just want to read compressed files...

> Imposed standards can and do work all the time (witness Ada, CDs).

Huh? Ada was indeed standardized before it was implemented, and it is a
royal flop. Very few people use it. The CD ``standard'' is just a
description of what people are doing; it works because it was
implemented first.

> Most standards have
> problems, but at least they give you something to kick off.

No. Most *implementations* have problems---most importantly,
incompleteness---but at least they do something positive. Standards are
negative. They restrict you, and their main failure is being overly
complete.

> I don't see how I can "IMPLEMENT" my standard. I get the impression
> that you think that in order for me to do that, I would have to
> implement a work harness, a validation harness, and create conformant
> implementations of dozens of algorithms.

I don't know about Graham. I think you should just prove that the
established compressors can be recoded into your interface without too
much work.

> (As it happens I have been using a standard similar to the one
> proposed (see lzrw_headers.h in my ftp archive on
> sirius.itd.adelaide.edu in pub/compression) for two years and have
> implemented five or six algorithms using it (LZRW1 is such an
> algorithm). These algorithms were implemented with the standard in
> mind though.)

Right. Which is why you should choose something like LZW, which does not
put a bound on output (well, 8x or so in theory, but not a fixed number
of bytes), and would have a much smaller user base if it were not
parametrized, and is naturally a stream algorithm.

  [ separating compressor and decompressor ]
> Is it one of the ten commandments or something?

Yes.

> I have been using "standard", "standard proposal", etc in an
> undisciplined manner. I do not expect it to be adopted instantly. I
> just want to develop something that everyone can agree on.

I don't care whether you call it a ``standard'', a ``proposed
standard'', a ``document'', a ``suggestion'', or an ``article''. I am
not going to write code to someone else's interface unless I can see
what its strengths and weaknesses are *in practice*. (Naturally, if you
develop a large enough body of code that takes advantage of the
interface somehow, that's a strong point.)

---Dan

jloup@nocturne.chorus.fr (Jean-Loup Gailly) (06/28/91)

ross=ross@spam.ua.oz.au (Ross Williams)
dan=brnstnd@kramden.acf.nyu.edu (Dan Bernstein)

dan> [ streams versus blocks ]
dan> I think the real argument here is over whether the compressor should
dan> drive the I/O or vice versa. In modem applications, your standard is
dan> fine: the I/O has to drive the compressor. But in most applications it's
dan> much easier to work the other way round.

ross> This may be so, but placing the compressor in charge means that you
ross> have to tell it how to do IO, which is so messy in most programming
ross> languages (requiring Ada-like generic superroutines or C pointers to
ross> IO functions) as to completely destroy the standard.

I agree with Dan. When the compressor drives the I/O, it keeps its internal
state when it outputs a block of compressed data and continues with the rest of
the input data. If the compressor is forced to stop when the single output
buffer is full, it has to save its internal state explicitly in the MEMORY
parameter suggested by Ross. If the compressor is stopped in the middle of a
deeply nested loop, it becomes messy to restore that state at the next
invocation of the compression routine. (The same arguments apply to
decompression.)

Even for modems, I don't see why the compressor cannot drive the I/O. This may
require some buffering of the data arriving from the modem before it is
compressed, but the OS usually takes care of this. If the compressor is so slow
that it cannot
handle the input rate on the average, you are in trouble anyway.  If we ignore
the details about setting the tty modem line correctly and determining the end
of the input data, a command such as

  compress < /dev/ttyxxx > foo

should work. In this example, compress drives the I/O.

You don't even need pointers to IO functions. The compression/decompression
algorithm can directly call functions readblock/writeblock which would have an
interface similar to that of the Unix read/write calls and would be well
specified by the compression standard. The compressor can easily construct
macros (sorry, inline functions) on top of these functions, similar to the
getc/putc macros.  For a modem, readblock would return whatever bytes have
already been received. For many algorithms, this interface is much easier to
use than that proposed by Ross.
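
To be concrete, the entry points could look like this (the signatures
and the buffer structure are only an illustration, following
read/write):

   /* Illustration only: entry points the standard could fix, with the same
    * shape as the Unix read/write calls. The compressor calls them whenever
    * it wants bytes; the user program links in whichever implementation
    * fits (file, modem line, block of memory).
    */
   #include <stdio.h>

   long writeblock(const unsigned char *buf, long count);

   /* One possible user-program implementation of readblock: read stdin. */
   long readblock(unsigned char *buf, long count)
   {
      return (long) fread(buf, 1, (size_t) count, stdin);
   }

   /* A getc-like macro layered on top: */
   struct inbuf { unsigned char data[4096]; long pos, len; };

   int refill(struct inbuf *b);

   #define GETBYTE(b) ((b)->pos < (b)->len ? (b)->data[(b)->pos++] : refill(b))

   int refill(struct inbuf *b)
   {
      b->len = readblock(b->data, (long) sizeof b->data);
      b->pos = 0;
      return b->len > 0 ? b->data[b->pos++] : -1;    /* -1 signals end */
   }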

Also, Ross justifies his single function for compression/decompression by
appealing to arrays of function pointers: "this is a situation where programs
may wish to keep arrays of these algorithms". Either you do or you do not have
pointers to functions, but you cannot have it both ways.

Although I said above that it is preferable to have the compressor drive the
I/O, this does make it more difficult to try multiple compression algorithms
concurrently. If your language does not give you concurrent threads or
coroutines, the easiest solution is to read the input file once per algorithm.
But this is costly for very high speed algorithms which are IO bound. Also,
this is not applicable if the input data must be read only once (modem or
pipe). Dan, do you have a suggestion for this?

Jean-loup Gailly
jloup@chorus.fr

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/29/91)

In article <11228@chorus.fr> jloup@nocturne.chorus.fr (Jean-Loup Gailly) writes:
> Although I said above that it is preferable to have the compressor drive the
> I/O, this does make it more difficult to try multiple compression algorithms
> concurrently. If your language does not give you concurrent threads or
> coroutines, the easiest solution is to read the input file once per algorithm.
> But this is costly for very high speed algorithms which are IO bound. Also,
> this is not applicable if the input data must be read only once (modem or
> pipe). Dan, do you have a suggestion for this?

Well, one advantage of passing in read-write function pointers to the
compression routine is that you can substitute routines to read from
memory instead. So the I/O problem disappears. (Actually, that may be a
bit too optimistic: the UNIX read-write interface demands an extra copy
into the buffer, and doesn't even allow for truly asynchronous I/O. I've
never seen a compressor that wants to write to an input block, so the
copies could, in principle, be avoided. The read-write interface thus
involves up to a 5% loss in speed on a Convex [trust me, I've measured
it!], probably around 15% on a Cray, and about 2% on a Sun 4. Boo hoo.)
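
For instance (a sketch; the struct is invented), a read() lookalike
backed by a block of memory:

   /* Sketch: a read()-style routine that pulls from memory, so the same
    * compressor can be driven from a file or from core.
    */
   #include <string.h>

   struct memsrc { const unsigned char *data; long len, pos; };

   long memread(struct memsrc *m, unsigned char *buf, long count)
   {
      long left = m->len - m->pos;
      if (count > left)
         count = left;
      memcpy(buf, m->data + m->pos, (size_t) count);  /* the extra copy */
      m->pos += count;
      return count;
   }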

---Dan