[comp.compression] Justification of proposed interface standard.

ross@spam.ua.oz.au (Ross Williams) (06/20/91)

NOTES ON THE DESIGN OF THE PROPOSED DATA COMPRESSION INTERFACE STANDARD
-----------------------------------------------------------------------
Author : Ross Williams.
Date   : 20-Jun-1991.

CONTENTS
 1. Introduction
 2. Typical Applications
 3. Control Structure
 4. Input Structure
 5. Flushing
 6. Buffering
 7. Definition of an Algorithm
 8. Working Memory
 9. Data Format
10. Generalizing the Interface
11. Final Note


1. Introduction
---------------
This document attempts to justify a selection of design decisions that
determined the proposed data compression interface standard.

2. Typical Applications
-----------------------
The following typical applications of data compression serve to ground
the  standard in  the practical.  The standard  was framed  with these
applications in mind.

UNIX FILTER COMPRESSOR: A Unix data compression program reads a stream
of bytes from standard input with  no foreknowledge of when the stream
is going to end, and writes  a compressed representation of the stream
of bytes to standard output.  The compression algorithm is implemented
as a process which has complete control over its control flow and IO.

MEMORY BLOCK  COMPRESSOR: A word  processing program divides  the text
being edited  into blocks  and stores each  block in  compressed form,
decompressing it  on demand. The  data compression algorithm  is fed a
block of bytes and is  expected to produce a compressed representation
of that block.

MODEM COMPRESSOR: An error correcting modem employs a data compression
algorithm  implemented  as  an  abstract data  type.  Sometimes,  when
screens of data are to be transmitted,  a block of 2K is presented for
compression.  At  other  times,  when  a  user  keystroke  has  to  be
transmitted without delay, a single byte is presented for compression.
In  both cases,  the resulting  representation  must be  coded in  the
context  of the  previous bytes  compressed. The  modem instructs  the
algorithm to forget its context between connect sessions.

BUFFERED FILE COMPRESSOR: A file  compression program compresses a set
of files  independently. It divides each  file into blocks of  50K and
feeds  the  blocks  in  succession to  a  data  compression  algorithm
implemented as an abstract data type. The algorithm must compress each
block in the context of previous blocks transmitted.

3. Control Structure
--------------------
The  applications  described above  differ  widely  in the  degree  of
control and  data retention capability  given to the  data compression
algorithm. Control ranges from tightest to loosest as follows:

   * Procedural         Example: Memory.
   * Abstract data type Example: Modem, File.
   * Process            Example: Unix.

In  defining a  standard,  a particular  level of  control  had to  be
decided upon. This was a fairly easy decision.

Procedural organizations do not allow a data compression algorithm to
retain information between invocations, which makes them unsuitable
for applications that must carry context across a stream of blocks.

Process organizations  are very general  but are messy and  awkward to
implement  in  languages  with  only   one  thread  of  control  (most
languages).

In choosing a control structure, the abstract data type is the clear
winner. An ADT structure allows context to be preserved but does not
wrench control from the user program.

4. Input Structure
------------------
The decision  made in  section 3  implies that  the user  program will
create blocks of  data and invoke a compression  procedure to compress
the data. The  compression procedure will be passed a  block of memory
which it may use to create a model of the data and to pass information
(such as the model) between invocations.

The next major design decision involves the imposition of structure
on the sequence of data blocks.

   * CONTEXT  : When should the algorithm ignore its model and start afresh?
   * FLUSHING : When should the algorithm complete a decompressible block?
   * CHUNKS   : What should be the size of data blocks fed to the algorithm?

By "ignore  its model" is meant  cease to use any  knowledge about the
data already seen  to compress the data about to  be seen. By flushing
is  meant  writing   zero  or  more  output  bytes   to  complete  the
representation of the bytes already received for compression.

In the  general case, the  processing of a  stream of blocks  of input
data can be modelled as follows:

The input data consists of
   Zero or more CONTEXT blocks consisting of
      One or more CLOSED blocks consisting of
         One or more PIECE blocks consisting of
            Zero or more bytes.

                                     Context Blocks
  +----------------------------------------+ +------------------------+
  |           Closed Blocks                | |      (similar over     |
  | +----------------+ +-----------+ +-+   | |       here too)        |
  | | Piece Blocks   | |           | | |   | |                        |
  | | +--+ +--+ +--+ | | +--+ +--+ | | |   | |                        |
  | | |  | |  | |  | | | |  | |  | | | |   | |                        |
  | | +--+ +--+ +--+ | | +--+ +--+ | | |   | |                        |
  | +----------------+ +-----------+ +-+   | |                        |
  +----------------------------------------+ +------------------------+
                                                                        ----->
      ^    ^    ^        ^    ^   <--- "Calls" to data compressor.      Input &
                     ^Flush coder buffer   ^Destroy model               Time

Piece blocks consist  of zero or more bytes. The  algorithm reads each
piece block and writes zero or more bytes to the output in response.

At the end of each context block, the compressor destroys its model of
the data and approaches the next context block with a clean slate.

At the  end of  each closed  block, the  compressor flushes  its coder
buffers. The output  produced during the processing of  a closed block
is a complete representation of the closed block.

At the  end of  each piece block  the compressor  writes zero  or more
bytes  to  the  output  but will  not  necessarily  flush  a  complete
representation of what it has read.

At this stage, we have an interface that looks roughly like this:

IN    action      - The action to be performed (identity,compress,decompress).
INOUT memory      - The compressor algorithm's ADT data (=activation record).
IN    inblock     - The input  block.
OUT   outblock    - The output block.
IN    context?    - Use the context from the previous block?
IN    flush?      - Flush the coder?

with the boolean flags (marked by "?") indicating whether the
algorithm should retain the context of previous blocks and whether it
should flush its coder.
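
To make this concrete, such an interface might be declared in C
roughly as follows. The identifiers below are invented purely for
this illustration; they are not the identifiers defined by the
proposed standard.

   /* Illustrative sketch only: these identifiers are invented for this
      example and are not those defined by the proposed standard.       */

   #include <stddef.h>

   typedef enum {
      COMPRESS_ACTION_IDENTITY,     /* Copy the input block unchanged.   */
      COMPRESS_ACTION_COMPRESS,     /* Compress inblock into outblock.   */
      COMPRESS_ACTION_DECOMPRESS    /* Decompress inblock into outblock. */
   } compress_action;

   typedef struct {
      unsigned char *bytes;         /* Start of the block.               */
      size_t         length;        /* Number of bytes in the block.     */
   } byte_block;

   /* One call per piece block.  The memory block doubles as the ADT's
      activation record: the algorithm keeps its model there between calls. */
   void compress_piece(compress_action   action,      /* IN    */
                       byte_block       *memory,      /* INOUT */
                       const byte_block *inblock,     /* IN    */
                       byte_block       *outblock,    /* OUT   */
                       int               use_context, /* IN    use previous context? */
                       int               flush);      /* IN    flush the coder?      */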

5. Flushing
-----------
The input structure described in section 4 is general and will satisfy
most users.  However, it is  also a bit  messy. It is  inevitable that
programmers will be confused by the context and flush flags and subtle
bugs will be created. It would  be desirable to simplify the interface
somehow.

The  following table  shows  how the  four  applications described  in
section 2 make use of the input structure described in section 4.

         +--------------------------------------------------------------+
         |  Context Blocks      Closed Blocks      Piece Blocks         |
+--------+--------------------------------------------------------------+
| UNIX   |  One                 One                One byte long        |
| MEMORY |  One                 One                One per closed block |
| MODEM  |  Many                Many               One per closed block |
| FILE   |  Many                Many               One per closed block |
+--------+--------------------------------------------------------------+

This table indicates that the input structure of piece blocks (and
hence the flush flag) may not be necessary. The only application that
needs more than one piece block per closed block is the Unix
compressor, which could quite easily circumvent the problem using
buffering.

In general, absence of an optional-flush option becomes awkward only
when the piece block size is very small (as in the case of the Unix
compressor). If the cost of flushing is (say) two bytes, forced
flushing on piece blocks of 1024 bytes carries a negligible cost
(about 0.2% overhead), whereas forced flushing of piece blocks of 1
byte carries an enormous cost (200% overhead).

Additional reasons for the decision not  to use a flush flag are given
below:

   1.  The  unit of  compression  is  the  closed block.  Because  the
    standard  cannot  guarantee  that  even   the  first  byte  of  an
    uncompressed closed  block can  be made  available until  the last
    byte of the compressed representation of the closed block has been
    received,  the  piece  block  level is  seen  to  be  a  buffering
    mechanism which is better performed externally.

   2. The  user program is better  able to assess the  availability of
    memory for buffering.

SUMMARY: Although this was a difficult decision, my vote goes to the
elimination of the flush flag and the compulsory imposition of
flushing at the end of each data block presented to the compression
algorithm.
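
With the flush flag eliminated, the illustrative C sketch from
section 4 shrinks to one call per closed block (again using the
invented types from that sketch, not identifiers from the proposal):

   /* Illustrative sketch (using the invented compress_action and
      byte_block types from the section 4 sketch): with the flush flag
      gone, every call hands the algorithm one complete closed block and
      receives a complete, independently decompressible representation. */
   void compress_closed_block(compress_action   action,       /* IN    */
                              byte_block       *memory,       /* INOUT */
                              const byte_block *inblock,      /* IN    one closed block    */
                              byte_block       *outblock,     /* OUT   flushed output      */
                              int               use_context); /* IN    keep prior context? */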

6. Buffering
------------
Various arguments can be raised for providing an interface that forces
the  compression algorithm  to  perform various  levels of  buffering.
Currently the interface allows the user to specify the size of input
blocks but not the size of output blocks. Some might desire a standard
in which the size  of the input and output blocks  can be specified by
the user with the algorithm buffering a stream in the middle.

The best argument  against buffering is that it  imposes an artificial
layer and  hides the closed  block structure. In  the end, it  must be
remembered that  data compression algorithms should  compress data and
that this means receiving some data and making it smaller. Any attempt
to hide the fact that the  data has become smaller (e.g. by buffering)
is a cosmetic  addition that (while useful) should  be located outside
the standard. Perhaps there could be a second-order standard for a shell
procedure that provides a soft buffered interface.
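
To illustrate what such a shell might look like, here is a minimal
sketch of an external buffering layer for the Unix filter
application. The function compress_block() is a hypothetical stand-in
for a conforming algorithm: it is assumed to compress one closed
block, flush, and return the number of output bytes produced; neither
it nor the figures used come from the proposal itself.

   /* Minimal sketch of a user-level buffering shell.  compress_block()
      is a hypothetical stand-in for a conforming algorithm: assumed to
      compress one closed block, flush, and return the number of output
      bytes produced.                                                    */

   #include <stdio.h>
   #include <stdlib.h>

   #define BLOCK_SIZE  50000           /* Closed-block size chosen by the user.   */
   #define MEMORY_SIZE 65536           /* Amount advertised by the algorithm
                                          (a made-up figure for the sketch).      */

   extern size_t compress_block(unsigned char *memory,
                                const unsigned char *in, size_t inlen,
                                unsigned char *out);

   int main(void)
   {
      static unsigned char inbuf [BLOCK_SIZE];
      static unsigned char outbuf[2 * BLOCK_SIZE];  /* Generous expansion margin. */
      unsigned char *memory = malloc(MEMORY_SIZE);
      size_t inlen;

      if (memory == NULL)
         return EXIT_FAILURE;          /* Error handling stays with the user.     */

      /* Read standard input a block at a time; each full or final partial
         block becomes one closed block handed to the compressor.          */
      while ((inlen = fread(inbuf, 1, BLOCK_SIZE, stdin)) > 0)
      {
         size_t outlen = compress_block(memory, inbuf, inlen, outbuf);
         fwrite(outbuf, 1, outlen, stdout);
      }

      free(memory);
      return EXIT_SUCCESS;
   }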

7. Definition of an Algorithm
-----------------------------
The definition of a data compression algorithm provided an interesting
challenge in the framing of the standard.

Authors  of  data compression  algorithms  typically  publish a  paper
describing the algorithm  in generic terms. The  "algorithm", which is
often  given  a  generic-sounding  name,  is  usually  specified  as a
parameterized  object  having  parameters  that  affect  the  tradeoff
between the following performance characteristics:

   COMPRESSION     : Compression yielded by the algorithm.
   MEMORY          : Memory required by the algorithm.
   SPEED           : Speed of the algorithm.
   CODE COMPLEXITY : Degree of difficulty to implement.
   ADAPTIVITY      : Speed at which the algorithm adapts to new data.
   GENERALITY      : Degree to which the algorithm is general rather than
                     tailored to specific data.

The presence of parameters and the often vague description of the
algorithms themselves do not help in the comparison of algorithms.
Those  comparing algorithms  are often  forced to  pick parameters  in
ignorance  of  the  true  impact of  a  particular  parameter  on  the
performance measures listed above.

If the standard were to provide  parameters in the interface which the
user  program could  use to  dial  up various  versions of  particular
algorithms, the  parameters provided  in the  interface would  vary in
their semantic meaning from algorithm to algorithm.

To resolve  this mess, the  standard prohibits parameters.  This means
that generic algorithms that have parameters must be instantiated with
specific static parameter values before  they can be standardized. The
standard also forces the algorithm designer to specify how much memory
the algorithm will require.

The benefits here  are enormous. By fixing algorithms  in granite, the
standard provides  fixed points that  investigators can refer  to when
comparing  algorithms  or  classes  of algorithm.  By  requiring  that
algorithm  designers choose  a  good set  of  parameters before  being
allowed  to   standardize  an  algorithm,  the   standard  places  the
responsibility for parameter  choice in the hands of those  who are in
the best position to tune the algorithm.

8. Working Memory
-----------------
To  an  extent, the  arguments  given  above  for the  elimination  of
algorithm parameters from the interface also apply to the parameter of
working memory. However,  memory is special because it  is a parameter
whose  semantics are  independent of  any particular  data compression
algorithm.  It  seems  quite  sensible  to  create  a  generic  memory
parameter  in  the interface,  whereas  the  creation of  "parameter1"
(changes  in   which  will   have  different  effects   for  different
algorithms) seems less justified.

One problem with allowing the user  to specify the memory that will be
given to an algorithm is that  it does not allow algorithms to specify
a minimum memory  requirement. Some algorithms need  a certain minimum
amount of memory (the exact amount being particular to the algorithm).
For such cases, an alternative standard would have to either provide a
method for the algorithm to convey the requirement to the user (as the
proposed  standard does)  or provide  some means  for an  algorithm to
signal failure (which increases the messiness of the interface).

Another disadvantage of user-controlled memory procurement is the
problem that it causes for the decompressor. If the compressor is
handed a quantity  of memory that is not fixed  for a given algorithm,
then the decompressor  has to be informed of the  exact amount as well
(the  compressor can  do this  in the  first  few bytes  of the  coded
message). More  importantly, under this scheme,  the decompressor does
not learn how much memory it will require of the user until it is
already executing and  has read the first few bytes  of the compressed
message.

After a lot of  thought I decided that the best  solution was to force
each algorithm  to advertise exactly  how much memory it  required for
compression and decompression.  It is then the user's job  to give the
algorithm  exactly   the  amount  of  memory   requested.  While  this
organization limits the capacity of an individual standardized
algorithm to exploit extra memory that could be made available to it,
it does allow the user program to choose one of many algorithms
depending on the memory that it knows will be available at the
compression and decompression ends. Furthermore, designers of a
particular algorithm
can release  versions of  an algorithm under  related names  that vary
only in their memory requirements. E.g:

   lzrw1-1K
   lzrw1-8K
   lzrw1-16K
   lzrw1-32K

The standard specifies that an algorithm  be given no more memory than
it requested.  This eliminates  the complication  of conveying  to the
decompressor that more memory was used. Algorithm designers who want
to exploit more memory can split their algorithms into differently
sized versions as described above.
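
A rough sketch of how such advertisement might look to the user
program is given below. The record type, field names and function are
invented for the illustration only; the lzrw1-16K name is taken from
the list above.

   /* Illustrative sketch only: one way an algorithm could advertise its
      fixed memory requirement.  The identifiers are invented for this
      example and are not drawn from the proposal's text.                */

   #include <stddef.h>
   #include <stdlib.h>

   typedef struct {
      const char *name;              /* e.g. "lzrw1-16K"                       */
      size_t      memory_bytes;      /* Exact working memory the algorithm
                                        requires for compression/decompression. */
   } compress_identity;

   /* Each conforming algorithm would export a constant identity record. */
   extern const compress_identity lzrw1_16K_identity;

   /* The user program reads the advertised figure, allocates exactly that
      much memory and hands it over; if the allocation fails, the user can
      fall back to an algorithm with a smaller advertised requirement.    */
   unsigned char *obtain_working_memory(const compress_identity *id)
   {
      return malloc(id->memory_bytes);  /* NULL => user chooses another algorithm. */
   }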

An added advantage of  this scheme is that it is likely  to lead to an
informal standardization  on a certain  set of memory sizes,  and this
will make it even easier to compare data compression algorithms.

In any design, there are always many arguments for including messy and
complicated features. However, in most  cases, the pain of eliminating
a feature is more than compensated by the benefits conferred by the
resulting simplicity.  I believe that  this issue of memory  is such a
case.

On a related point, the standard specifies that the algorithm be given
memory  rather  than  seek  it  out  from  its  environment  (e.g.  by
allocating stack space or through a malloc) because the capacity of
the environment to provide memory varies from system to system.
Examples:

   1) The Vax architecture does not permit stack frames greater than 64K.
   2) Heaps on the Macintosh sometimes become fragmented, making it
      impossible to allocate large contiguous chunks of memory.

Furthermore, if it were possible  for the data compression algorithm's
request for memory to fail, then  that failure condition would have to
be dealt with somehow, further muddying the interface.

By  forcing the  user  program  to front  up  with  the memory  before
invoking  the  data  compression  algorithm,  all  error  handling  is
performed outside  the data compression algorithm.  Furthermore, under
the  standard as  proposed, if  there is  not enough  memory to  run a
particular algorithm,  the user  program can choose  another algorithm
that requires less memory (assuming that a selection of algorithms
has been coded into the program - quite possible), whereas this is
not an option available to the data compression algorithm itself.

9. Data Format
--------------
The  standard  purposely  steers  away   from  defining  any  sort  of
compressed data format. However, in any real system employing the
standard,  the user  is  forced to  record  the following  information
somewhere (either statically or dynamically):

   1) The id of the algorithm used to compress a particular set of data.
   2) The length of each compressed block of data.

In addition, the  user may also wish to store  other attributes of the
compression operation.

As an  example, the standard  COULD specify that each  algorithm place
the length  of each compressed block  in the first four  bytes of each
compressed block. However, the standard sets no such conditions on the
data format.

The reason  for this is  purity of  standardization. The goal  of this
standard  is to  define the  interface  between the  user program  and
conforming data  compression algorithms. To  impose any format  on the
data would be to extend the standard "beyond its terms of reference".
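
Purely as an illustration of what remains the user's responsibility
(and emphatically not part of the standard), a file-oriented user
program might record the algorithm id and compressed block length in
a small header of its own devising:

   /* Illustrative sketch of a user-chosen per-block header.  Nothing
      here is mandated by the standard; the layout is invented for the
      example.  The user would write the fields in a fixed byte order
      ahead of each compressed block.                                  */

   #include <stdint.h>

   typedef struct {
      uint32_t algorithm_id;        /* Which standardized algorithm was used. */
      uint32_t compressed_length;   /* Number of compressed bytes following.  */
   } user_block_header;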

10. Generalizing the Interface
------------------------------
The interface  that the proposed  standard gives for  data compression
algorithms is likely to be similar for other general data transforming
operations such as cryptographic and error correcting transformations,
and it may be asked why a more general standard has not been developed
for generalized information transformers.

The reason I did not do this is that when I investigated the idea, I
found enough differences between the classes of generalized
transformation listed to put me off. In addition, the issues that
arose in  the development  of the  proposed standard  data compression
algorithm interface convinced me that it would be best first to define
standard interfaces for a variety  of classes of transformers (with an
eye for  later combination) and  then look for similarities  that will
allow them to be combined. While solving the more general problem may
sometimes be easier than solving the specific one (Polya), a
generalized interface risks missing important opportunities that are
specific to one class of transforming algorithm.

11. Final Note
--------------
This  document   gives  particular   reasons  for   particular  design
decisions. In  many cases the  reasons are an  oversimplification of a
cluster of factors pointing towards  a particular decision. The author
acknowledges that this document is sketchy and incomplete, but at this
point  would rather  spend  his time  defending  the standard  against
particular  criticisms   from  particular  people  than   polish  this
document. Further refinement of this document will take place once the
standard has been aired.

--<End of Data Compression Interface Standard Justification>--

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/22/91)

In article <860@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
> If the standard were to provide  parameters in the interface which the
> user  program could  use to  dial  up various  versions of  particular
> algorithms, the  parameters provided  in the  interface would  vary in
> their semantic meaning from algorithm to algorithm.
> To resolve  this mess, the  standard prohibits parameters.  This means
> that generic algorithms that have parameters must be instantiated with
> specific static parameter values before  they can be standardized. The
> standard also forces the algorithm designer to specify how much memory
> the algorithm will require.

I find this completely unrealistic.

Ross, it sounds like you're trying to specify a standard for modem
(i.e., zero-delay) compression. But most compression in the real world
is high-delay. In fact, someone compressing data for an archive, or for
news batching, is much better off with control over the parameters. That
way he can adjust them to fit his needs without swapping in an entirely
different ``standard'' routine. He doesn't want to make the blocking
explicit, or to deal with the oodles of information you've defined; he
just wants to get a file or a bunch of files compressed as efficiently
and effectively as possible.

---Dan