ross@spam.ua.oz.au (Ross Williams) (06/20/91)
DATA COMPRESSION INTERFACE STANDARD
===================================
Author : Ross Williams (ross@spam.ua.oz.au).
Date : 20-Jun-1991.
CONTENTS
--------
1. Introduction
1.1 Summary of the Standard
1.2 Motivation for the Standard
1.3 Benefits of the Standard
1.4 Scope of the Standard
1.5 Definitions
2. Interface Specification
2.1 Structure of the Interface
2.2 Parameters
2.3 Use of Memory
2.4 Safety Requirement
2.5 Safety Recommendations
2.6 Identity Record
3. Administration
3.1 Introduction
3.2 Revision History
3.3 Child Standards
3.4 Version Control
1. INTRODUCTION
---------------
1.1 Summary of the Standard
---------------------------
This standard defines an interface for communication between computer
programs that use data compression and implementations of data
compression algorithms. The standard constrains each data compression
algorithm to be implemented as a single procedure and defines the
information that must pass through the procedure's parameter list, and
the behaviour of the procedure.
The standard is programming language independent, seeking not to
control the details of the interface but rather the structure of the
interface and the nature of the information passed through it.
1.2 Motivation for the Standard
-------------------------------
There are now so many data compression algorithms with so many
different characteristics, written in so many different languages, on
so many different systems, that there is a pressing need for a
standard data compression algorithm interface. Currently it is no
small task to port a data compression algorithm that has been tailored
for one particular application or environment to another. This
portability difficulty may be justified in messier domains such as
user interfaces, but there seems to be no need for it in a domain as
generic as data compression (the manipulation of data).
1.3 Benefits of the Standard
----------------------------
Some benefits that will flow from this standard are listed below.
* The standard interface will allow programmers to write code that
uses data compression without concerning themselves with the
details of particular data compression algorithms. The programmer
will be able to write a program using the interface and then
quickly swap between various algorithms until the one best suited
to the application is found.
* The standard interface will assist those cataloging and comparing
algorithms.
* The standard interface will encourage the designers of data
compression algorithms to provide conforming implementations.
* The standard interface will make it easy for programmers to write
programs that select from a range of data compression algorithms
at run time.
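The last benefit can be made concrete with a sketch. Below is a
minimal C illustration of run-time algorithm selection via a lookup
table; the procedure type, the table layout, and the two dummy
"algorithms" are assumptions for illustration only, not part of the
standard.

```c
/* Sketch: selecting a compression procedure at run time.
   The procedure type and algorithm names are illustrative
   assumptions, not part of the standard. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*compress_proc)(const unsigned char *in, size_t in_len,
                              unsigned char *out, size_t *out_len);

/* Two dummy "algorithms" standing in for real conforming procedures. */
static void algo_copy(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t *out_len)
{
    memcpy(out, in, in_len);            /* identity transformation */
    *out_len = in_len;
}

static void algo_invert(const unsigned char *in, size_t in_len,
                        unsigned char *out, size_t *out_len)
{
    size_t i;
    for (i = 0; i < in_len; i++)        /* trivially reversible */
        out[i] = (unsigned char)~in[i];
    *out_len = in_len;
}

struct entry { const char *name; compress_proc proc; };

static const struct entry table[] = {
    { "copy",   algo_copy   },
    { "invert", algo_invert },
};

/* Select an algorithm by name; returns NULL if not found. */
compress_proc lookup(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].proc;
    return NULL;
}
```

Because every conforming procedure presents the same interface, the
caller needs no algorithm-specific code beyond this table.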
1.4 Scope of the Standard
-------------------------
This standard seeks to constrain the manner in which user programs
communicate with data compression algorithms. The standard is defined
in terms of information flows and protocols and does not restrict
itself to the details of any one particular programming language.
However, the standard does distinguish the C programming language as
the preferred language for benchmark showcase implementations.
Although minor changes in the interfacing will be required when
converting a conforming algorithm from one language to another, the
standard ensures that this process will be far easier than it would
have been if the user programs in each language had picked a different
interface paradigm.
The standard does not seek to control the format of compressed data.
1.5 Definitions
---------------
Algorithm - In this document short for "data compression algorithm"
which is an abstract procedural specification for an operation on
data. It is important to distinguish between an algorithm and an
implementation of an algorithm.
Algorithm implementation - A procedure written in a particular
programming language that implements a particular abstract data
compression algorithm.
Byte - In this document, a byte is defined to be eight bits. Memory is
measured in bytes.
Conforming Implementation - An implementation of a data compression
algorithm that conforms to this standard.
Conforming Procedure (CP) - A procedure written in a particular
programming language that conforms to this standard.
Date string - A date string is a standard string of length 11 having
the format "dd-mmm-yyyy" where dd is in the range "01".."31", mmm
is in the range "Jan","Feb",.."Dec" (case sensitive), and yyyy is
in the range "1900".."9999".
MAX_EXPANSION - This is a parameter of the standard which is currently
set at 1024 bytes. The parameter specifies the maximum expansion
(measured in bytes) per compressed block of data that a data
compression algorithm is permitted to make.
Printable character - A printable character is defined to be a byte in
the range [32,126]. In the ASCII character set, this corresponds
to the set of printable characters [' '..'~']. Note: On non-ASCII
machines this definition still stands --- the byte values are
still to be used. To print such a character on a non-ASCII machine
will require some kind of conversion.
Procedure - A procedure written in a conventional imperative
programming language.
RW - Ross Williams (ross@spam.ua.oz.au)
Standard string - A standard string is a variable length array of from
0 to 255 bytes having values that are members of the subrange of
printable characters.
User program - Any computer program employing a data compression
algorithm implementation.
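The "printable character" and "date string" definitions above can be
checked mechanically. The following C sketch is one possible
rendering; the function names are assumptions, and a child standard
would specify the real representation.

```c
/* Sketches of checks for the "printable character" and "date string"
   definitions.  The month comparison is case sensitive, as the
   definition requires. */
#include <assert.h>
#include <string.h>

/* A printable character is a byte in [32,126]. */
int is_printable(unsigned char b) { return b >= 32 && b <= 126; }

/* "dd-mmm-yyyy": dd in "01".."31", mmm in "Jan".."Dec",
   yyyy in "1900".."9999". */
int is_date_string(const char *s)
{
    static const char *months[12] = { "Jan","Feb","Mar","Apr","May","Jun",
                                      "Jul","Aug","Sep","Oct","Nov","Dec" };
    int i, dd, yyyy;
    if (strlen(s) != 11 || s[2] != '-' || s[6] != '-') return 0;
    if (s[0] < '0' || s[0] > '9' || s[1] < '0' || s[1] > '9') return 0;
    dd = (s[0] - '0') * 10 + (s[1] - '0');
    if (dd < 1 || dd > 31) return 0;
    for (i = 0; i < 12; i++)            /* case-sensitive month match */
        if (strncmp(s + 3, months[i], 3) == 0) break;
    if (i == 12) return 0;
    for (i = 7; i < 11; i++)
        if (s[i] < '0' || s[i] > '9') return 0;
    yyyy = (s[7] - '0') * 1000 + (s[8] - '0') * 100
         + (s[9] - '0') * 10 + (s[10] - '0');
    return yyyy >= 1900;                /* "1900".."9999" */
}
```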
2. INTERFACE SPECIFICATION
--------------------------
2.1 Structure of the Interface
------------------------------
2.1.1 A conforming implementation of a data compression algorithm
consists of one or more procedures, one of which is presented to the
user program as "the conforming procedure". This standard is not
concerned with the implementation of the conforming procedure, only
with its parameter list and its behaviour. The remainder of this
standard treats the conforming procedure, and all the procedures and
functions that it might call, as a single procedure.
2.1.2 A conforming procedure communicates only through its parameter
list.
2.2 Parameters
--------------
2.2.1 A conforming procedure must have a parameter list that conveys
no more and no less information than that conveyed by the following
"model" parameter list.
IN action - The action requested of the data compression procedure.
INOUT memory - A block of memory for use by the algorithm.
IN fresh - Is this the start of a new context block?
IN inblock - An array of input bytes.
OUT outblock - An array of output bytes.
OUT identity - Record specifying attributes of the algorithm.
The keywords IN, INOUT, and OUT are used in the Ada sense to describe
the flow of information at the procedure interface:
IN allows the procedure only to read a parameter.
OUT allows the procedure only to write the parameter.
INOUT allows the procedure to both read and write the parameter.
The fact that the procedure has a particular power (i.e. reading
and/or writing) over a parameter does not require the procedure to
exercise that power.
2.2.2 The parameter list in a particular programming language may look
quite different, with different names for parameters and perhaps fewer
or more parameters (e.g. the input block might be passed as two
parameters being a pointer to the start of the block and the length of
the block) and possibly in a different order. However, the information
flowing across the interface should be the same.
2.2.3 ACTION: The action parameter determines the major purpose of the
call and takes one of the following three values which are ideally
implemented using an enumeration type:
Identity
Compress
Decompress
The three values command the following actions:
Action "Identity" instructs the conforming procedure to write into the
parameter "identity", a record value giving information about the
underlying algorithm and its characteristics. The remaining parameters
(fresh,memory,inblock,outblock) must not be read or written.
Action "Compress" instructs the conforming procedure to write a
compressed representation of the input block to the output block.
Action "Decompress" instructs the conforming procedure to write a
decompressed representation of the input block to the output block.
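As a sketch only: in C, the three actions and the model parameter list
of 2.2.1 might map onto a procedure such as the one below. The type
names, the trivial "copy" algorithm, and the id value are assumptions
for illustration; a C child standard (section 3.3) would fix the real
details.

```c
/* Sketch of a conforming procedure in C following the model parameter
   list of 2.2.1.  All names and the trivial "copy" algorithm are
   illustrative assumptions only. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { IDENTITY, COMPRESS, DECOMPRESS } action_t;

typedef struct {            /* minimal subset of the 2.6 identity record */
    unsigned long id;
    size_t mem_compress;
    size_t mem_decompress;
    int    deterministic;
    const char *name;
} identity_t;

/* A trivial conforming procedure: "compression" is a plain copy, so
   decompression is a copy too and the transformation is reversible. */
void walrus(action_t action,
            unsigned char *memory,          /* INOUT: unused here */
            int fresh,                      /* IN: unused here    */
            const unsigned char *inblock,   /* IN                 */
            size_t in_len,
            unsigned char *outblock,        /* OUT                */
            size_t *out_len,
            identity_t *identity)           /* OUT                */
{
    (void)memory; (void)fresh;
    switch (action) {
    case IDENTITY:
        identity->id             = 0x12345678UL;  /* illustrative only */
        identity->mem_compress   = 0;
        identity->mem_decompress = 0;
        identity->deterministic  = 1;
        identity->name           = "COPY";
        break;
    case COMPRESS:
    case DECOMPRESS:
        memcpy(outblock, inblock, in_len);
        *out_len = in_len;
        break;
    }
}
```

Note that the input block is passed here as a pointer and a length,
which is one of the parameter-list variations explicitly permitted by
2.2.2.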
2.2.4 MEMORY: This parameter is a block of memory handed by the user
program to the data compression algorithm for its use during
compression or decompression.
2.2.5 FRESH: If and only if this boolean parameter is false, a
conforming procedure can assume that the parameter "memory" is the
output from a previous call to the conforming procedure in the same
mode (i.e. compression or decompression). With this information, the
algorithm may choose to use the model present in the parameter
"memory" at the end of the previous block of data to operate upon the
next block of data.
2.2.6 INBLOCK, OUTBLOCK - These are blocks of memory handed to the
compression algorithm for reading and writing. Inblock must only ever
be read. Outblock must only ever be written. In this description of
the standard, Inblock and Outblock are treated as
dynamically-variable-length arrays of bytes. The length of such an
array X can be obtained with the length function length(X).
Conforming algorithms on a particular implementation must be able to
process the largest blocks of input that could possibly be handed to
them. If an algorithm can only process (say) 64K, then it should be
wrapped in a loop before being encapsulated in a conforming procedure.
The standard does not place any constraint on the degree to which a
decompression operation can expand the data. It is assumed that if the
user compressed the data, the user knows how big it will be when it is
decompressed. If this causes trouble, the user can explicitly store
the length of the decompressed data along with the compressed data.
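The loop-wrapping suggested above can be sketched as follows. The
4-byte core limit and the copy "core" are assumptions chosen to make
the loop visible in a test; a real core limit would be something like
64K.

```c
/* Sketch: wrapping a size-limited core in a loop so a conforming
   procedure can accept arbitrarily large input blocks.  The tiny
   CORE_MAX and the copy "core" are illustrative assumptions. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CORE_MAX 4   /* maximum bytes the core can handle per call */

/* A stand-in core that can only process CORE_MAX bytes at a time. */
static size_t core_copy(const unsigned char *in, size_t len,
                        unsigned char *out)
{
    assert(len <= CORE_MAX);
    memcpy(out, in, len);
    return len;
}

/* Wrapper: feeds the core one chunk at a time until the whole
   input block has been processed. */
size_t process_all(const unsigned char *in, size_t len,
                   unsigned char *out)
{
    size_t done = 0, written = 0;
    while (done < len) {
        size_t chunk = len - done;
        if (chunk > CORE_MAX) chunk = CORE_MAX;
        written += core_copy(in + done, chunk, out + written);
        done += chunk;
    }
    return written;
}
```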
2.2.7 IDENTITY - This parameter allows the conforming procedure to
return information about its algorithm. Upon return, this parameter
has an undefined value unless the parameter "action" had the value
"identity" in which case the parameter contains a record value
conveying information about the algorithm. In particular, one of the
fields of the record contains the amount of memory that the algorithm
requires in the "memory" parameter for compression and decompression.
The record is discussed in more detail in section 2.6.
2.2.8 This description of these parameters has been highly informal. A
more formal definition of the semantics of these parameters can be
found in section 2.4.2. The specification in 2.4.2 supersedes much of
the informal discussion above.
2.3 Use of Memory
-----------------
2.3.1 A conforming procedure must read and write only the memory in
the parameters of its parameter list, with the exception of some
"extra" memory (which can take the form of registers or local stack
variables) which it may use for temporary storage, subject to the
following conditions:
1) The particular language/environment must ensure that the memory
will be available (i.e. failure during allocation is impossible).
2) The total size of the extra memory does not exceed 1024 bytes.
3) The memory is not used to transfer information in or out of the
instantiation of the conforming procedure.
2.3.2 A conforming procedure may use the memory contained in the
"memory" parameter in any way it chooses. However, if a conforming
procedure uses the "fresh" parameter to determine whether there is a
model present in the "memory" parameter, then it must also ALWAYS
write a model it can make sense of in the "memory" parameter before
terminating. See 2.4.2 for a more formal definition.
2.3.3 It should be noted that the presence and semantics of the
"memory" parameter allows multiple instances of a compression
algorithm to be running at the same time, thus allowing multiple
streams of data to be processed in an interleaved fashion.
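This interleaving property can be demonstrated with a toy algorithm
whose entire state lives in the "memory" parameter. The algorithm
below (add a running byte counter, stored in memory[0]) is an
assumption for illustration; it is reversible, which is all the
standard requires, and the "fresh" flag resets its state per 2.2.5.

```c
/* Sketch: because all state lives in "memory", two streams can be
   processed in interleaved fashion using two memory blocks.  The toy
   algorithm here is an illustrative assumption only. */
#include <assert.h>
#include <stddef.h>

/* compress != 0 encodes, compress == 0 decodes.
   fresh != 0 resets the model, per section 2.2.5. */
void toy(int compress, int fresh, unsigned char *memory,
         const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i;
    if (fresh) memory[0] = 0;       /* start of a new context block */
    for (i = 0; i < len; i++) {
        out[i] = compress ? (unsigned char)(in[i] + memory[0])
                          : (unsigned char)(in[i] - memory[0]);
        memory[0]++;                /* state persists across calls  */
    }
}
```

Handing each stream its own memory block is all that is needed; the
procedure itself holds no state between calls.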
2.4 Safety Requirement
----------------------
2.4.1 Perhaps surprisingly, this standard does not require that a
conforming data compression algorithm implementation actually compress
data, as it is mathematically impossible to do so for all input
values. Instead, the standard requires only that the candidate
compression procedure provide a reversible transformation and that it
not expand the data by more than a fixed constant amount of
MAX_EXPANSION bytes.
2.4.2 Conforming procedures must satisfy the following safety constraint.
BEGIN DEFINITION
WHERE
n : a constant integer in the range [1,infinity).
type bytearray = finite but variable-length array of bytes.
type vector = array(1..n) of bytearray.
A : variable vector.
B : constant set of vectors.
W,X,Y,Z : constant vector.
ID : constant identity record returned when action=identity.
CM,DM : constant non-negative integer.
model1 : variable type bytearray of size>=ID.mem_compress.
model2 : variable type bytearray of size>=ID.mem_decompress.
dummy : polymorphic dummy parameter list placeholder.
walrus : a candidate as a conforming procedure.
THE STANDARD DEFINES that for a candidate data compression procedure
named "walrus" to conform to the standard, there must exist exactly
one non-empty set B for each possible value of A such that:
{X=A}
walrus(compress,true ,model1,X1,Y1,dummy);
walrus(compress,false,model1,X2,Y2,dummy);
walrus(compress,false,model1,X3,Y3,dummy);
...
walrus(compress,false,model1,Xn,Yn,dummy);
{Y in B, AND FORALL i in 1..n, length(Yi)<=length(Ai)+MAX_EXPANSION}
AND
{W in B}
walrus(decompress,true ,model2,W1,Z1,dummy);
walrus(decompress,false,model2,W2,Z2,dummy);
walrus(decompress,false,model2,W3,Z3,dummy);
...
walrus(decompress,false,model2,Wn,Zn,dummy);
{Z=A}
AND
ID.deterministic=true => cardinality(B)=1
END DEFINITION
2.4.3 By making B a set, the definition above admits non-deterministic
algorithms (such as the LZRW1 algorithm) so long as they advertise the
fact by setting ID.deterministic=false. A non-deterministic algorithm
is one that non-deterministically chooses one of many representations
(possibly of differing lengths) for at least one possible input value.
2.4.4 The clause
FORALL i in 1..n, length(Yi)<=length(Ai)+MAX_EXPANSION
places a limit on the amount of expansion that a data compression
algorithm can perform during a compression operation. This requirement
means that algorithms that find that they have expanded data by more
than MAX_EXPANSION bytes will have to find some other method of
representing the data. As a last resort, an algorithm can always set a
copy flag at the start of the compressed block and simply represent
the data as itself.
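The copy-flag fallback can be sketched as follows. The flag values and
the deliberately bad stand-in "compressor" are assumptions for
illustration; the point is only that expansion is then bounded by one
flag byte, well within MAX_EXPANSION.

```c
/* Sketch of the copy-flag fallback: a leading flag byte says whether
   the rest of the block is compressed output or a verbatim copy, so
   expansion is bounded by one byte.  All names are illustrative
   assumptions. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FLAG_COPY       0
#define FLAG_COMPRESSED 1
#define MAX_EXPANSION   1024

/* A stand-in compressor that always expands badly.  Returns the size
   it produced, or 0 if that would exceed the caller's budget. */
static size_t bad_compress(const unsigned char *in, size_t len,
                           unsigned char *out, size_t out_max)
{
    size_t need = 2 * len + MAX_EXPANSION + 1;   /* always too big */
    (void)in; (void)out;
    return need <= out_max ? need : 0;
}

/* Compress with fallback: never expands by more than one flag byte. */
size_t safe_compress(const unsigned char *in, size_t len,
                     unsigned char *out)
{
    size_t n = bad_compress(in, len, out + 1, len);
    if (n == 0) {                     /* fall back to a raw copy */
        out[0] = FLAG_COPY;
        memcpy(out + 1, in, len);
        return len + 1;
    }
    out[0] = FLAG_COMPRESSED;
    return n + 1;
}
```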
2.5 Safety Recommendations
--------------------------
2.5.1 It is recommended that algorithms encode the following
information into their memory parameter for later checking.
* The algorithm id.
* A randomly chosen context block identification.
* The sequence number of the next block in the context.
* Whether the block was used for compression or decompression.
* A checksum of the block (to detect corruptions by the user).
2.5.2 It is recommended that algorithms encode the following
information into compressed blocks for later checking.
* The algorithm id.
* A randomly chosen context block identification.
* A sequence number for each block within a context.
* Length and checksum of each compressed block.
* Length and checksum of each decompressed block.
2.5.3 These measures will go a long way to ensuring that the algorithm
is being used according to specification. These recommendations are
recommendations only and cannot and should not be enforced as this
standard does not place any constraints on data formats. Also, some of
these recommendations may be prohibitively inefficient in some
applications.
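The checksum recommended above need not be elaborate. A Fletcher-style
sum is one plausible and cheap choice (an assumption here, not a
requirement of the standard), sketched below.

```c
/* Sketch: a Fletcher-16 style checksum suitable for the block
   integrity checks recommended above.  Choice of checksum is an
   illustrative assumption, not mandated by the standard. */
#include <assert.h>
#include <stddef.h>

unsigned int fletcher16(const unsigned char *data, size_t len)
{
    unsigned int a = 0, b = 0;   /* two running sums modulo 255 */
    size_t i;
    for (i = 0; i < len; i++) {
        a = (a + data[i]) % 255;
        b = (b + a) % 255;
    }
    return (b << 8) | a;
}
```

Unlike a plain additive sum, the second running sum makes the result
sensitive to byte order, which helps catch the block-sequencing errors
these recommendations are aimed at.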
2.6 Identity Record
-------------------
2.6.1 This section defines the information contained in identity
records returned in the parameter "identity" by conforming procedures
in response to a call with action=identity. The standard does not
define the exact details of the record, which will vary from language
to language. However, it is expected that the field names specified in
the various child standards will be very similar to the ones suggested
below.
2.6.2 The identity value returned by a conforming procedure in one
particular call must be the same as the value returned in all other
calls.
2.6.3 The identity record must contain the information contained in
the fields of the following "model" record:
record
id : natural in 0..(2^31)-1
mem_compress : natural
mem_decompress : natural
deterministic : boolean
name : standard string
version : standard string
date_creation : date string
date_release : date string
author : standard string
author_address : standard string
reference : standard string
mechanism : standard string
usage : standard string
legal : standard string
note : standard string
rec_min_context : natural
rec_min_block : natural
end record
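One possible C rendering of this model record is sketched below. The
field types are assumptions; a C child standard would fix them
exactly. Standard strings (0..255 printable bytes) are shown as
length-prefixed arrays and date strings as 11-byte arrays, matching
the definitions in section 1.5.

```c
/* Sketch of the model identity record in C.  All representation
   choices here are illustrative assumptions. */
#include <assert.h>

typedef struct {
    unsigned char length;       /* 0..255 */
    unsigned char text[255];    /* printable bytes in [32,126] */
} standard_string;

typedef struct {
    char text[11];              /* "dd-mmm-yyyy", not NUL-terminated */
} date_string;

typedef struct {
    unsigned long   id;               /* 0..(2^31)-1 */
    unsigned long   mem_compress;     /* bytes needed to compress */
    unsigned long   mem_decompress;   /* bytes needed to decompress */
    int             deterministic;    /* boolean */
    standard_string name;
    standard_string version;
    date_string     date_creation;
    date_string     date_release;
    standard_string author;
    standard_string author_address;
    standard_string reference;
    standard_string mechanism;
    standard_string usage;
    standard_string legal;
    standard_string note;
    unsigned long   rec_min_context;
    unsigned long   rec_min_block;
} identity_record;
```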
2.6.4 This section contains a brief description of each field.
2.6.4.1 id: The "id" field is a 31-bit value serving as a unique
identification number of the underlying algorithm. Designers of
algorithms must choose an id for their algorithm by tossing a fair
coin 31 times, once for each bit of the id. The tossing of a coin is a
requirement of this standard. Under no circumstances is a computer's
random number generator to be used to generate an id. In the unlikely
event that two algorithm designers have chosen the same id number, one
of the algorithm's numbers will be changed manually. There may be many
conforming procedures implemented in the same or different languages
that have the same id. This is allowed so long as they:
1) Can all decompress each other's data.
2) All use the same amount of memory.
3) Are either all deterministic or all non-deterministic.
2.6.4.2 mem_compress: The number of bytes of memory required by the
algorithm during a compression operation.
2.6.4.3 mem_decompress: The number of bytes of memory required by the
algorithm during a decompression operation.
2.6.4.4 deterministic: A boolean flag which is TRUE iff the algorithm
is guaranteed to generate at most one compressed representation for
each distinct input block.
2.6.4.5 name: The name of the algorithm (e.g. "LZRW1"). Note that the
identity of the algorithm rests in its id number, not its name.
2.6.4.6 version: Version information.
2.6.4.7 date_creation: The date on which this particular procedure was
first released.
2.6.4.8 date_release: The date of the most recent release of this
procedure.
2.6.4.9 author: The name of the author or publisher of the algorithm.
2.6.4.10 author_address: How to contact the author.
2.6.4.11 reference: A reference to a publication describing the
algorithm.
2.6.4.12 mechanism: A brief description of the way the algorithm
works.
2.6.4.13 usage: Hints on how the algorithm is best used.
2.6.4.14 legal: A message about the legal status of the code and the
algorithm.
2.6.4.15 note: Miscellaneous information about the algorithm.
2.6.4.16 rec_min_context: RECOMMENDED minimum length for context
blocks. Use of blocks of less than this length will seriously impair
compression. Note: The user is free to ignore this recommendation.
2.6.4.17 rec_min_block: RECOMMENDED number of bytes of overhead caused
by closing off a block (i.e. flushing the coder). Note: The user must
not RELY on this information in any way.
3. ADMINISTRATION
-----------------
3.1 Introduction
----------------
This section contains the administrative details of the standard. It
outlines what is expected of those constructing conforming data
compression algorithm implementations and records the details of the
standard's history.
3.2 Revision History
--------------------
This section gives the revision history of this standards document.
Date Name Action
---- ---- ------
19-Jun-1991 RW Created the first version of the standard.
3.3 Child Standards
-------------------
Although this standard specifies a generic language-independent
interface for data compression algorithm implementations, it is hoped
that child standards (Note: "substandard"s sounds bad!) will be
created for each programming language in which conforming data
compression algorithm implementations are written. Each child standard
should specify a procedure interface tightly enough to allow total
interchangeability of conforming procedures.
Although algorithm designers are free to implement conforming
procedures in whatever language they choose, it is hoped that they
will additionally create a version in C conforming to the C
child-standard. While this may seem inconvenient at the time, in the
long term, it will result in a large pool of interchangeable
algorithms whose performance characteristics on particular kinds of
data on particular kinds of machines can quickly be compared.
3.4 Version Control
-------------------
Once an algorithm designer has designed an algorithm, chosen an id
number and released the algorithm for others to use, there is only a
restricted set of changes that can be made to the algorithm without
having to define a new algorithm with a new id.
If an algorithm is modified, the revised algorithm should have either
a different id or a different version number.
A new algorithm with a new id must be created if one or more of the
following hold:
1 The old algorithm is unable to decompress any compressed data
generated by the new algorithm.
2 The new algorithm is unable to decompress any compressed data
generated by the old algorithm.
3 The algorithm has changed from deterministic to
non-deterministic. (Note: It is possible to do this without
violating 1 or 2).
4 The new algorithm requires more memory than the old algorithm.
These rules are designed to protect systems that have come to rely on
particular properties of an algorithm. Once an algorithm has been
created and been allocated an id, this standard does not support its
destruction.
--<End of Proposed Data Compression Interface Standard>--

ian@airs.com (Ian Lance Taylor) (06/21/91)
ross@spam.ua.oz.au (Ross Williams) writes:

>2.3 Use of Memory
>-----------------
>2.3.1 A conforming procedure must read and write only the memory in
>the parameters of its parameter list with the exception of some
>"extra" memory (which can take the form of registers or local stack
>variables) which it may use for temporary use, subject to the
>following conditions:
> 1) The particular language/environment must ensure that the memory
>    will be available (i.e. failure during allocation is impossible).
> 2) The total size of the extra memory does not exceed 1024 bytes.
> 3) The memory is not used to transfer information in or out of the
>    instantiation of the conforming procedure.

This doesn't quite make sense to me as written, although this is
really just nitpicking. The program itself is probably in memory, and
it is surely acceptable to read it. It is surely also acceptable to
use some sort of constant table during compression (a precomputed
frequency count, perhaps). Finally, restriction 3 appears to prohibit
the accumulation of statistics across several calls; if those
statistics do not affect the compression, but merely provide
information that the program can access in some non-standard fashion,
why prohibit them?

>2.6.3 The identity record must contain the information contained in
>the fields of the following "model" record:

How about a field for the version number of the standard, so that new
fields can be more easily added to this record?
--
Ian Taylor   ian@airs.com   uunet!airs!ian
First person to identify this quote wins a free e-mail message:
``If he could have moved, he would have gotten up and gone after the
man to thank him for wearing something so marvelously interesting.''
jloup@nocturne.chorus.fr (Jean-Loup Gailly) (06/21/91)
In article <859@spam.ua.oz>, ross@spam.ua.oz.au (Ross Williams) writes:

| 2.2 Parameters
| --------------
| 2.2.1 A conforming procedure must have a parameter list that conveys
| no more and no less information than that conveyed by the following
| "model" parameter list.
|
| [...]
| INOUT memory - A block of memory for use by the algorithm.
| [...]

How do you deal with segmented architectures such as the 8086 and 286
which impose a small limit (such as 64K) on the size of a segment?
(There are only a few million machines still using this architecture
:-) You could say that the MEMORY parameter is in fact a small array
of pointers to distinct segments, but how does the caller choose the
size of each segment? Allocating systematically 64K except possibly
for the last segment would generally be a waste of memory. (Take an
algorithm requiring two segments of 40K each.)

Even on non-segmented architectures, it may be cumbersome to force all
memory used by the algorithm to be contiguous. The data structures
used by a compression algorithm are usually much more complex than a
single linear array, so the algorithm has to somehow map these data
structures onto this linear sequence of bytes. This may be difficult
with some programming languages.

It is possible instead to add an INIT action which would let the
compression algorithm allocate the memory in an optimal fashion and
possibly return a failure boolean (or an Ada exception). Of course you
also need a CLOSE operation to deallocate this memory.

Another objection to the MEMORY parameter as proposed is that it is by
definition typeless. Even specific implementations in a strongly typed
programming language would have to use a general type which is not
suited to the algorithm. So the algorithm is forced to use type
conversions (casts in C, unchecked conversions in Ada) which are
generally not portable. For example, some implementations of Ada store
an array descriptor before the array data or in a separate location.
(Good implementations avoid the descriptor when the bounds are known
at compile time, but this is not required by the language.) Such
descriptors must be constructed by the compiler. The implementer of
the compression algorithm has no way to magically transform a raw
sequence of bytes into a properly structured array together with its
descriptor.

If you let the algorithm allocate the data, then the standard memory
allocation routine provided by the language can be used. The resulting
pointer(s) (access values in Ada terminology) are opaque objects which
can be stored in the MEMORY parameter. There is also at this point a
necessary type conversion, but it is much less troublesome; the
chances of getting back a valid pointer after the reverse conversion
are much higher. (Note that an Ada access value may be quite different
from a machine address on some implementations.)

In languages supporting opaque types, such as Ada and C++, it would be
preferable to get rid of all the unsafe type conversions completely
and use a different type of MEMORY parameter for each compression
algorithm. But again this requires the compression algorithm to export
a primitive to allocate this opaque object, since the caller of the
compression algorithm no longer knows how to allocate it.

In short, I suggest that to avoid problems with segmented
architectures and/or strongly typed languages, the memory used by a
compression algorithm be allocated by the algorithm itself. The
IDENTITY action would still determine the maximum amount of memory
that the algorithm is allowed to allocate.

Jean-loup Gailly
Chorus systemes, 6 av G. Eiffel, 78182 St-Quentin-en-Yvelines-Cedex, France
email: jloup@chorus.fr   Tel: +33 (1) 30 64 82 79   Fax: +33 (1) 30 57 00 66
brad@looking.on.ca (Brad Templeton) (06/21/91)
Compressors often have very tight inner loops where all the CPU time
is spent and where slight increases in efficiency can result in major
increases in speed. In this case, the difference between a static
array (say, for your hash table) and a dynamic one can actually be
real, particularly for the more efficient compressors that only do a
few instructions per loop.
--
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario
519/884-7473
gtoal@tharr.UUCP (Graham Toal) (06/23/91)
In article <859@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
>DATA COMPRESSION INTERFACE STANDARD
>===================================

I'm afraid I've missed out on this group for some time, and just got
back. This rather lengthy document sounds just like my suggestion last
month for putting a wrapper round compression algorithms with a
blocked interface - the blocking allowing fast seeking to random
points within a compressed file.

The basic idea is fine, but do we need 500+ lines of pretentious
twaddle to describe it?

G
--
(* Posted from tharr.uucp - Public Access Unix - +44 (234) 841503 *)