ross@spam.ua.oz.au (Ross Williams) (06/20/91)
NOTES ON THE DESIGN OF THE PROPOSED DATA COMPRESSION INTERFACE STANDARD
-----------------------------------------------------------------------
Author : Ross Williams.
Date : 20-Jun-1991.
CONTENTS
1. Introduction
2. Typical Applications
3. Control Structure
4. Input Structure
5. Flushing
6. Buffering
7. Definition of an Algorithm
8. Working Memory
9. Data Format
10. Generalizing the Interface
11. Final Note
1. Introduction
---------------
This document attempts to justify a selection of design decisions that
determined the proposed data compression interface standard.
2. Typical Applications
-----------------------
The following typical applications of data compression serve to ground
the standard in the practical. The standard was framed with these
applications in mind.
UNIX FILTER COMPRESSOR: A Unix data compression program reads a stream
of bytes from standard input with no foreknowledge of when the stream
is going to end, and writes a compressed representation of the stream
of bytes to standard output. The compression algorithm is implemented
as a process which has complete control over its control flow and IO.
MEMORY BLOCK COMPRESSOR: A word processing program divides the text
being edited into blocks and stores each block in compressed form,
decompressing it on demand. The data compression algorithm is fed a
block of bytes and is expected to produce a compressed representation
of that block.
MODEM COMPRESSOR: An error correcting modem employs a data compression
algorithm implemented as an abstract data type. Sometimes, when
screens of data are to be transmitted, a block of 2K is presented for
compression. At other times, when a user keystroke has to be
transmitted without delay, a single byte is presented for compression.
In both cases, the resulting representation must be coded in the
context of the previous bytes compressed. The modem instructs the
algorithm to forget its context between connect sessions.
BUFFERED FILE COMPRESSOR: A file compression program compresses a set
of files independently. It divides each file into blocks of 50K and
feeds the blocks in succession to a data compression algorithm
implemented as an abstract data type. The algorithm must compress each
block in the context of the previous blocks presented from the same
file.
3. Control Structure
--------------------
The applications described above differ widely in the degree of
control and data retention capability given to the data compression
algorithm. Control ranges from tightest to loosest as follows:
* Procedural          - Example: Memory block.
* Abstract data type  - Example: Modem, File.
* Process             - Example: Unix.
In defining a standard, a particular level of control had to be
decided upon. This was a fairly easy decision.
Procedural organizations do not allow a data compression algorithm to
retain information between invocations, which makes them unsuitable
for applications that must carry context across a stream of blocks.
Process organizations are very general but are messy and awkward to
implement in languages with only one thread of control (most
languages).
Of these control structures, the abstract data type is the clear
winner. An ADT structure allows context to be preserved between
invocations but does not wrench control from the user program.
4. Input Structure
------------------
The decision made in section 3 implies that the user program will
create blocks of data and invoke a compression procedure to compress
the data. The compression procedure will be passed a block of memory
which it may use to create a model of the data and to pass information
(such as the model) between invocations.
The next major design decision involves the imposition of structure on
the sequence of data blocks.
* CONTEXT : When should the algorithm ignore its model and start afresh?
* FLUSHING : When should the algorithm complete a decompressible block?
* CHUNKS : What should be the size of data blocks fed to the algorithm?
By "ignore its model" is meant cease to use any knowledge about the
data already seen to compress the data about to be seen. By flushing
is meant writing zero or more output bytes to complete the
representation of the bytes already received for compression.
In the general case, the processing of a stream of blocks of input
data can be modelled as follows:
The input data consists of
Zero or more CONTEXT blocks consisting of
One or more CLOSED blocks consisting of
One or more PIECE blocks consisting of
Zero or more bytes.
Context Blocks
+------------------------------------------+   +------------------------+
| Closed Blocks                            |   | (similar over          |
| +----------------+  +-----------+  +--+  |   |  here too)             |
| | Piece Blocks   |  |           |  |  |  |   |                        |
| | +--+ +--+ +--+ |  | +--+ +--+ |  |  |  |   |                        |
| | |  | |  | |  | |  | |  | |  | |  |  |  |   |                        |
| | +--+ +--+ +--+ |  | +--+ +--+ |  |  |  |   |                        |
| +----------------+  +-----------+  +--+  |   |                        |
+------------------------------------------+   +------------------------+
     ^    ^    ^         ^    ^       ^    <--- "Calls" to data compressor
                   ^Flush coder buffer     ^Destroy model
                                                              Time ----->
Piece blocks consist of zero or more bytes. The algorithm reads each
piece block and writes zero or more bytes to the output in response.
At the end of each context block, the compressor destroys its model of
the data and approaches the next context block with a clean slate.
At the end of each closed block, the compressor flushes its coder
buffers. The output produced during the processing of a closed block
is a complete representation of the closed block.
At the end of each piece block the compressor writes zero or more
bytes to the output but will not necessarily flush a complete
representation of what it has read.
At this stage, we have an interface that looks roughly like this:
IN action - The action to be performed (identity,compress,decompress).
INOUT memory - The compressor algorithm's ADT data (=activation record).
IN inblock - The input block.
OUT outblock - The output block.
IN context? - Use the context from the previous block?
IN flush? - Flush the coder?
with the boolean flags (marked by "?") indicating whether the
algorithm should carry over its context from the previous block and
whether it should flush its coder.
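As a rough illustration (and nothing more), such an interface might be
rendered in C as follows. All of the identifiers below (dc_action,
dc_algorithm and so on) are invented for this sketch, as is the
representation of blocks as pointer/length pairs; the standard itself
defines the actual names and types.

   /* Hypothetical rendering of the interface sketched above. */
   typedef enum {DC_IDENTITY, DC_COMPRESS, DC_DECOMPRESS} dc_action;

   void dc_algorithm(
      dc_action            action,   /* IN    : operation to perform     */
      void                *memory,   /* INOUT : algorithm's working      */
                                     /*         memory (its ADT data)    */
      const unsigned char *inblock,  /* IN    : input block              */
      unsigned long        inlen,
      unsigned char       *outblock, /* OUT   : output block             */
      unsigned long       *outlen,
      int                  context,  /* IN    : use context from the     */
                                     /*         previous block?          */
      int                  flush     /* IN    : flush the coder after    */
                                     /*         this block?              */
      );

Under this sketch, a modem-style caller would present each piece block
with context set, would set flush only at the end of each closed
block, and would clear context on the first piece of a new context
block.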
5. Flushing
-----------
The input structure described in section 4 is general and will satisfy
most users. However, it is also a bit messy. It is inevitable that
programmers will be confused by the context and flush flags and subtle
bugs will be created. It would be desirable to simplify the interface
somehow.
The following table shows how the four applications described in
section 2 make use of the input structure described in section 4.
+--------+----------------+---------------+----------------------+
|        | Context Blocks | Closed Blocks | Piece Blocks         |
+--------+----------------+---------------+----------------------+
| UNIX   | One            | One           | One byte long        |
| MEMORY | One            | One           | One per closed block |
| MODEM  | Many           | Many          | One per closed block |
| FILE   | Many           | Many          | One per closed block |
+--------+----------------+---------------+----------------------+
This table indicates that the piece block level of the input structure
(and hence the flush flag) may not be necessary. The only application
that makes real use of piece blocks smaller than a closed block is the
Unix compressor, which could quite easily circumvent the problem using
buffering.
In general, the absence of an optional-flush facility becomes awkward
only when the piece block size is very small (as in the case of the
Unix compressor). If the cost of flushing is (say) two bytes, forced
flushing of 1024-byte piece blocks carries a negligible cost (roughly
0.2% overhead), whereas forced flushing of one-byte piece blocks
carries an enormous cost (the two bytes of overhead alone triple the
size of each piece).
Additional reasons for the decision not to use a flush flag are given
below:
1. The unit of compression is the closed block. Because the
standard cannot guarantee that even the first byte of an
uncompressed closed block can be made available until the last
byte of the compressed representation of the closed block has been
received, the piece block level amounts to a buffering mechanism, and
buffering is better performed externally.
2. The user program is better able to assess the availability of
memory for buffering.
SUMMARY: Although a difficult decision, my vote goes to the
elimination of the flush flag and the compulsory imposition of
flushing at the end of each data block presented to the compression
algorithm.
6. Buffering
------------
Various arguments can be raised for providing an interface that forces
the compression algorithm to perform various levels of buffering.
Currently the interface allows the user to specify the size of input
blocks but not the size of output blocks. Some might desire a standard
in which the size of the input and output blocks can be specified by
the user with the algorithm buffering a stream in the middle.
The best argument against buffering is that it imposes an artificial
layer and hides the closed block structure. In the end, it must be
remembered that data compression algorithms should compress data and
that this means receiving some data and making it smaller. Any attempt
to hide the fact that the data has become smaller (e.g. by buffering)
is a cosmetic addition that (while useful) should be located outside
the standard. Perhaps there could be a second-order standard for a
shell procedure that provides a soft, buffered interface; a rough
sketch of such a shell follows.
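To make the suggestion concrete, here is one possible shape for such a
buffered shell, sitting entirely outside the standard. It assumes the
invented dc_algorithm declaration from the sketch in section 4;
SHELL_BLOCK, dc_shell and the other names are likewise invented for
this illustration, and the provision of sufficient output space is
left to the caller.

   #include <string.h>

   #define SHELL_BLOCK 32768UL         /* block size chosen by the shell */

   typedef struct {
      void          *memory;           /* algorithm's working memory,    */
                                       /* supplied by the user program   */
      unsigned char  in[SHELL_BLOCK];  /* pending uncompressed bytes     */
      unsigned long  used;             /* number of pending bytes        */
      int            started;          /* nonzero once a block has been  */
                                       /* compressed in this stream      */
   } dc_shell;

   /* Compress whatever has accumulated as one closed block and return  */
   /* the number of compressed bytes written to 'out'.                   */
   static unsigned long shell_flush(dc_shell *s, unsigned char *out)
   {
      unsigned long produced = 0;
      if (s->used > 0) {
         /* Flushing is compulsory (section 5); context is carried over  */
         /* from the previous block except on the very first call.       */
         dc_algorithm(DC_COMPRESS, s->memory, s->in, s->used,
                      out, &produced, /*context=*/s->started, /*flush=*/1);
         s->started = 1;
      }
      s->used = 0;
      return produced;
   }

   /* Accept n bytes from the user; return the number of compressed     */
   /* bytes appended to 'out' (zero until a whole SHELL_BLOCK is full). */
   unsigned long shell_write(dc_shell *s, const unsigned char *p,
                             unsigned long n, unsigned char *out)
   {
      unsigned long produced = 0;
      while (n > 0) {
         unsigned long room = SHELL_BLOCK - s->used;
         unsigned long take = (n < room) ? n : room;
         memcpy(s->in + s->used, p, (size_t)take);
         s->used += take;  p += take;  n -= take;
         if (s->used == SHELL_BLOCK)
            produced += shell_flush(s, out + produced);
      }
      return produced;
   }

At end of stream the user would call shell_flush once more to push out
the final partial block; initializing the structure and procuring the
algorithm's working memory (section 8) are likewise the user's job.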
7. Definition of an Algorithm
-----------------------------
The definition of a data compression algorithm provided an interesting
challenge in the framing of the standard.
Authors of data compression algorithms typically publish a paper
describing the algorithm in generic terms. The "algorithm", which is
often given a generic-sounding name, is usually specified as a
parameterized object having parameters that affect the tradeoff
between the following performance characteristics:
COMPRESSION : Compression yielded by the algorithm.
MEMORY : Memory required by the algorithm.
SPEED : Speed of the algorithm.
CODE COMPLEXITY : Degree of difficulty to implement.
ADAPTIVITY : Speed at which the algorithm adapts to new data.
GENERALITY : Degree to which algorithm is tailored for specific data.
The presence of parameters and the often vague descriptions of the
algorithms themselves do not help in the comparison of algorithms.
Those comparing algorithms are often forced to pick parameters in
ignorance of the true impact of a particular parameter on the
performance measures listed above.
If the standard were to provide parameters in the interface which the
user program could use to dial up various versions of particular
algorithms, the parameters provided in the interface would vary in
their semantic meaning from algorithm to algorithm.
To resolve this mess, the standard prohibits parameters. This means
that generic algorithms that have parameters must be instantiated with
specific static parameter values before they can be standardized. The
standard also forces the algorithm designer to specify how much memory
the algorithm will require.
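To illustrate the instantiation requirement, consider a generic
algorithm with a tunable hash table size. Everything below is invented
for this sketch (no such algorithm appears in the standard); the point
is only that the generic form cannot be standardized as it stands,
whereas each variant that fixes the parameter can be.

   /* Generic, parameterized form: NOT standardizable as it stands.     */
   void lzxx_compress_generic(unsigned long hash_entries, /* tuning knob */
                              const unsigned char *in, unsigned long inlen,
                              unsigned char *out, unsigned long *outlen);

   /* A standardizable instantiation: the parameter is fixed statically. */
   #define LZXX_4K_HASH_ENTRIES 4096UL

   void lzxx_4k_compress(const unsigned char *in, unsigned long inlen,
                         unsigned char *out, unsigned long *outlen)
   {
      lzxx_compress_generic(LZXX_4K_HASH_ENTRIES, in, inlen, out, outlen);
   }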
The benefits here are enormous. By fixing algorithms in granite, the
standard provides fixed points that investigators can refer to when
comparing algorithms or classes of algorithm. By requiring that
algorithm designers choose a good set of parameters before being
allowed to standardize an algorithm, the standard places the
responsibility for parameter choice in the hands of those who are in
the best position to tune the algorithm.
8. Working Memory
-----------------
To an extent, the arguments given above for the elimination of
algorithm parameters from the interface also apply to the parameter of
working memory. However, memory is special because it is a parameter
whose semantics are independent of any particular data compression
algorithm. It seems quite sensible to create a generic memory
parameter in the interface, whereas the creation of "parameter1"
(changes in which will have different effects for different
algorithms) seems less justified.
One problem with allowing the user to specify the memory that will be
given to an algorithm is that it does not allow algorithms to specify
a minimum memory requirement. Some algorithms need a certain minimum
amount of memory (the exact amount being particular to the algorithm).
For such cases, an alternative standard would have to either provide a
method for the algorithm to convey the requirement to the user (as the
proposed standard does) or provide some means for an algorithm to
signal failure (which increases the messiness of the interface).
Another disadvantage of user-controlled memory procurement is the
trouble that it causes for the decompressor. If the compressor is
handed a quantity of memory that is not fixed for a given algorithm,
then the decompressor has to be informed of the exact amount as well
(the compressor can do this in the first few bytes of the coded
message). More importantly, under this scheme, the decompressor does
not learn how much memory it will require of the user until it is
already executing and has read the first few bytes of the compressed
message.
After a lot of thought I decided that the best solution was to force
each algorithm to advertise exactly how much memory it required for
compression and decompression. It is then the user's job to give the
algorithm exactly the amount of memory requested. While this
organization limits the capacity of an individual algorithm (as
standardized) to exploit whatever extra memory could be made available
to it, it does allow the user program to choose one of many algorithms
depending on the memory that it knows will be available at the
compression and decompression ends. Furthermore, designers of a
particular algorithm can release versions of the algorithm under
related names that vary only in their memory requirements, e.g.:
lzrw1-1K
lzrw1-8K
lzrw1-16K
lzrw1-32K
The standard specifies that an algorithm be given no more memory than
it requested. This eliminates the complication of conveying to the
decompressor that more memory was used. Algorithm designers who want
to do this can split their algorithms as described above.
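By way of illustration only (the figures and all identifiers here are
invented, and dc_algorithm is the hypothetical interface from the
sketch in section 4), the advertisement might amount to nothing more
than a published constant, with the user procuring exactly that much
memory before the call:

   #include <stdlib.h>

   /* Hypothetical advertised requirement of an "lzrw1-8K"-style variant. */
   #define LZRW1_8K_MEM_COMPRESS ((unsigned long)(8 * 1024))

   int compress_with_lzrw1_8k(const unsigned char *in, unsigned long inlen,
                              unsigned char *out, unsigned long *outlen)
   {
      /* The USER procures the memory; the algorithm never allocates any. */
      void *memory = malloc((size_t)LZRW1_8K_MEM_COMPRESS);
      if (memory == NULL)
         return -1;     /* handled outside the algorithm; the caller      */
                        /* might fall back to a smaller variant instead   */
      dc_algorithm(DC_COMPRESS, memory, in, inlen, out, outlen,
                   /*context=*/0, /*flush=*/1);
      free(memory);
      return 0;
   }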
An added advantage of this scheme is that it is likely to lead to an
informal standardization on a certain set of memory sizes, and this
will make it even easier to compare data compression algorithms.
In any design, there are always many arguments for including messy and
complicated features. However, in most cases, the pain of eliminating
a feature is more than compensated by the benefits conferred by the
resulting simplicity. I believe that this issue of memory is such a
case.
On a related point, the standard specifies that the algorithm be given
memory rather than seek it out from its environment (e.g. by
allocating stack space or through a malloc) as the capacities for
environments to provide memory vary from system to system. Examples:
1) The Vax architecture does not permit stack frames greater than 64K.
2) Heaps on the Macintosh sometimes become fragmented, making it
impossible to allocate large contiguous chunks of memory.
Furthermore, if it were possible for the data compression algorithm's
request for memory to fail, then that failure condition would have to
be dealt with somehow, further muddying the interface.
By forcing the user program to front up with the memory before
invoking the data compression algorithm, all error handling is
performed outside the data compression algorithm. Furthermore, under
the standard as proposed, if there is not enough memory to run a
particular algorithm, the user program can choose another algorithm
that requires less memory (assuming that a selection of algorithms
have been coded into the program - quite possible), whereas this is
not an option available to the data compression algorithm itself.
9. Data Format
--------------
The standard purposely steers away from defining any sort of
compressed data format. Nevertheless, in any real system employing the
standard, the user is forced to record the following information
somewhere (either statically or dynamically):
1) The id of the algorithm used to compress a particular set of data.
2) The length of each compressed block of data.
In addition, the user may also wish to store other attributes of the
compression operation.
As an example, the standard COULD specify that each algorithm place
the length of each compressed block in the first four bytes of each
compressed block. However, the standard sets no such conditions on the
data format.
The reason for this is purity of standardization. The goal of this
standard is to define the interface between the user program and
conforming data compression algorithms. To impose any format on the
data would be to extend the standard "beyond its terms of reference".
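For concreteness, a user of the standard might therefore record what
it needs in a small header of its own devising, along the following
lines. This structure is purely illustrative and forms no part of the
standard.

   /* Purely illustrative user-defined wrapper; NOT part of the standard. */
   typedef struct {
      unsigned long algorithm_id;    /* which standardized algorithm      */
                                     /* produced the block                */
      unsigned long compressed_len;  /* length of the compressed block    */
                                     /* that follows                      */
      /* ...plus any other attributes the user cares to record...         */
   } user_block_header;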
10. Generalizing the Interface
------------------------------
The interface that the proposed standard gives for data compression
algorithms is likely to be similar for other general data transforming
operations such as cryptographic and error correcting transformations,
and it may be asked why a more general standard has not been developed
for generalized information transformers.
I did not do this because, when I investigated the idea, I found
enough differences between the classes of transformation listed to put
me off. In addition, the issues that arose in the development of the
proposed standard interface for data compression algorithms convinced
me that it would be best first to define standard interfaces for a
variety of classes of transformers (with an eye to later combination)
and only then look for the similarities that would allow them to be
combined. While solving the general problem may sometimes be easier
than solving the specific one (Polya), in the design of interfaces an
overly general interface may miss opportunities that are important for
one particular class of transforming algorithm.
11. Final Note
--------------
This document gives particular reasons for particular design
decisions. In many cases the reasons are an oversimplification of a
cluster of factors pointing towards a particular decision. The author
acknowledges that this document is sketchy and incomplete, but at this
point would rather spend his time defending the standard against
particular criticisms from particular people than polish this
document. Further refinement of this document will take place once the
standard has been aired.
--<End of Data Compression Interface Standard Justification>--

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/22/91)
In article <860@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:

> If the standard were to provide parameters in the interface which the
> user program could use to dial up various versions of particular
> algorithms, the parameters provided in the interface would vary in
> their semantic meaning from algorithm to algorithm.
> To resolve this mess, the standard prohibits parameters. This means
> that generic algorithms that have parameters must be instantiated with
> specific static parameter values before they can be standardized. The
> standard also forces the algorithm designer to specify how much memory
> the algorithm will require.

I find this completely unrealistic. Ross, it sounds like you're trying
to specify a standard for modem (i.e., zero-delay) compression. But
most compression in the real world is high-delay.

In fact, someone compressing data for an archive, or for news batching,
is much better off with control over the parameters. That way he can
adjust them to fit his needs without swapping in an entirely different
``standard'' routine. He doesn't want to make the blocking explicit, or
to deal with the oodles of information you've defined; he just wants to
get a file or a bunch of files compressed as efficiently and
effectively as possible.

---Dan