[comp.compression] Response to Jean's criticism of proposed standard.

ross@spam.ua.oz.au (Ross Williams) (06/21/91)

From: jloup@nocturne.chorus.fr (Jean-Loup Gailly)

CRIT_004: CONTIGUOUS MEMORY AND TYPING
--------------------------------------

>| 2.2 Parameters
>| --------------
>| 2.2.1 A conforming  procedure must have a parameter  list that conveys
>| no more  and no less information  than that conveyed by  the following
>| "model" parameter list.
>|
>|    [...]
>|    INOUT memory    - A block of memory for use by the algorithm.
>|    [...]
>
>How do you deal with segmented architectures such as the 8086 and 286
>which impose a small limit (such as 64K) on the size of a segment?
>(There are only a few million machines still using this architecture :-)
>You could say that the MEMORY parameter is in fact a small array of
>pointers to distinct segments, but how does the caller choose the
>size of each segment? Systematically allocating 64K except possibly for the
>last segment would generally be a waste of memory. (Take an algorithm
>requiring two segments of 40K each.)

I  wasn't  thinking  of  segmented architectures  when  I  framed  the
standard as I have  not used them much. The only  time I did program a
PC,  I used  a  Microsoft C  compiler  which seemed  to  hide all  the
nastiness behind  a large memory model...

Can't we just make all the PCs go away?

More seriously, I think the memory issues raised should be divided in
two:

   1) input and output blocks.
   2) working memory.

Although Jean didn't seem to raise the problem of input and output
blocks on segmented architectures, I think it should be dispatched now
so there is no confusion: I don't think there is a problem. Objects in
different segments can be fed to the algorithm as different blocks
within the same context block. No problem.

For the memory parameter I agree  that the standard as it stands has a
serious problem on segmented architectures.

>Even on non-segmented architectures, it may be cumbersome to force all
>memory used by the algorithm to be contiguous. The data structures used
>by a compression algorithm are usually much more complex than a single
>linear array, so the algorithm has to somehow map these data structures onto
>this linear sequence of bytes. This may be difficult in some programming
>languages.

In low-level languages such as C this is not so hard; one just writes
a simple allocator. In high-level languages there are typing and
representation problems.
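
For example, in C one can carve sub-allocations out of the single
block handed over by the caller with a few lines of pointer
arithmetic. A minimal sketch (none of these names come from the
standard; alignment is the only real subtlety):

   #include <stddef.h>

   /* Bump allocator over a caller-supplied block. Returns NULL when
      the block is exhausted. */
   typedef struct {unsigned char *base; size_t size; size_t used;} arena;

   static void *arena_alloc(arena *a, size_t n)
   {
      size_t align = sizeof(double);   /* conservative alignment */
      size_t start = ((a->used + align - 1) / align) * align;
      if (start + n > a->size) return NULL;
      a->used = start + n;
      return a->base + start;
   }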

One advantage of using a single contiguous block of memory is that it
absolutely nails down the amount of memory the algorithm is using. If
one is allocating bits and pieces of various types all over the place,
it is very easy to lose track of how much memory is actually being
used (e.g. hidden bytes for alignment, hidden heap management
pointers). By giving the programmer a sandpit with brick walls, there
are no excuses. This makes comparing the memory consumption of
algorithms easier.

>It is possible instead to add an INIT action which would let the compression
>algorithm allocate the memory in an optimal fashion and possibly
>return a failure boolean (or an Ada exception). Of course you also need
>a CLOSE operation to deallocate this memory.

One problem here is that many algorithms allocate heap items
incrementally as they build their data model. It seems silly for an
algorithm that is told it can have three megabytes to make 100000
allocation calls only to discover later that it has just 1K of data to
process! If the standard has to be changed so that the algorithm
allocates its own memory, then I am in favour of allowing it to do so
incrementally, while ensuring that it never uses more than it has been
allowed. The main argument against this is the possibility that the
algorithm could fail midstream.
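
If the algorithm does allocate incrementally, enforcing the bound is
not hard in practice: a thin wrapper over the language's allocator can
refuse any request that would exceed the quota. A sketch in C (the
names are illustrative only, and frees would need matching
bookkeeping):

   #include <stdlib.h>

   /* Hypothetical budgeted allocator: grants requests incrementally
      but never lets the running total exceed the agreed limit. */
   typedef struct {size_t limit; size_t used;} budget;

   static void *budget_alloc(budget *b, size_t n)
   {
      void *p;
      if (b->used + n > b->limit) return NULL;   /* over quota */
      p = malloc(n);
      if (p != NULL) b->used += n;
      return p;
   }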

>Another objection about the MEMORY parameter as proposed is that it is by
>definition typeless. Even specific implementations in a strongly typed
>programming language would have to use a general type which is not
>suited to the algorithm. So the algorithm is forced to use type
>conversions (casts in C, unchecked conversions in Ada) which are
>generally not portable.  For example some implementations of Ada store
>an array descriptor before the array data or in a separate location.
>(Good implementations avoid the descriptor when the bounds are known
>at compile time but this is not required by the language.) Such
>descriptors must be constructed by the compiler. The implementer of
>the compression algorithm has no way to magically transform a raw
>sequence of bytes into a properly structured array together with its
>descriptor.
>
>If you let the algorithm allocate the data, then the standard memory
>allocation routine provided by the language can be used. The resulting
>pointer(s) (access values in Ada terminology) are opaque objects which
>can be stored in the MEMORY parameter. There is also at this point a
>necessary type conversion but it is much less troublesome. The chances
>of getting back a valid pointer after the reverse conversion are much
>higher. (Note that an Ada access value may be quite different from a machine
>address on some implementations.) In languages supporting opaque types
>such as Ada and C++ it would be preferable to get rid of all the unsafe
>type conversions completely and use a different type for the MEMORY
>parameter for each compression algorithm. But again this requires
>the compression algorithm to export a primitive to allocate this
>opaque object since the caller of the compression algorithm no longer
>knows how to allocate it.
>
>In short, I suggest that to avoid problems with segmented architectures
>and/or strongly typed languages, the memory used by a compression
>algorithm be allocated by the algorithm itself. The IDENTITY action
>would still determine the maximum amount of memory that the algorithm
>is allowed to allocate.
>
>Jean-loup Gailly

I'm convinced. Although there are lots of advantages in handing over a
block of  memory to algorithms, I  agree that it is  too unpleasant in
civilized programming languages.

I think  I made this error  because I had C  on the brain. When  I was
writing the standard I had some vague idea that conforming
procedures in high  level languages would somehow be able  to turn the
memory  given to  them into  something strongly  typed (e.g.  hand the
memory over to be managed by a strongly typed heap manager). I assumed
that those  writing the child  standards for each language  would sort
out these difficulties. However, I now realize that this was a mistake
and that this is a reasonably deep problem that needs to be resolved.

I agree with your analysis and propose the following:

   1. The identity record specifies the maximum amount of memory to be
      used.
   2. This memory may be allocated from the heap only. (However, the
      procedure can still use up to 1024 bytes of other memory.)
   3. The fresh flag will be eliminated and two more actions added:
      a. Start context block - Initialize data structures.
      b. End context block   - Destroy data structures.
   4. The memory parameter will become an INOUT generic pointer in which
      the algorithm should place a reference to the root of its
      data structure during an init call.
   5. The procedure will convey a status at the end of every call. This
      can be done with exceptions in Ada, with an extra OUT parameter,
      or in C as the value of the function. Error conditions will be:

         memory_allocation_error
         bad_input_block
         bad_data_structures

      and anything else I can think of when I revise the standard. (A
      rough C sketch of this interface follows.)
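
As a sketch only (all identifiers here are illustrative, not proposed
standard text; points 3 to 5 above are marked in comments):

   typedef enum {
      START_CONTEXT,        /* initialize data structures (3a) */
      COMPRESS,
      DECOMPRESS,
      END_CONTEXT           /* destroy data structures (3b)    */
   } action;

   typedef enum {
      OK,
      MEMORY_ALLOCATION_ERROR,
      BAD_INPUT_BLOCK,
      BAD_DATA_STRUCTURES
   } status;

   /* The status is the value of the function (point 5). The memory
      parameter is the INOUT generic pointer (point 4): the algorithm
      stores the root of its heap structures there on START_CONTEXT
      and reads it back on later calls. */
   status conforming_procedure(action a, void **memory,
                               const unsigned char *in,
                               unsigned long in_len,
                               unsigned char *out,
                               unsigned long *out_len);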

I think  that it is a  shame to move  away from the simplicity  of the
block of memory parameter. However, I agree that it is necessary.

Jean: Thanks  for your  comments. Do  you agree  with the  proposal as
given above?

Ross Williams
ross@spam.ua.oz.au

PS: Please don't  think I'm a typeless troglodyte.  My SAKDC algorithm
is written in Ada and is  full of rock-solid highly typed abstractions
using great  big heap based data  structures. It's just that  my brain
has been warped by coding in C recently.

jloup@nocturne.chorus.fr (Jean-Loup Gailly) (06/22/91)

In article <866@spam.ua.oz>, ross@spam.ua.oz.au (Ross Williams) writes
in response to my earlier comments:

> Can't we just make all the PCs go away?

Agreed. All but one (mine).

> One problem here is that many algorithms allocate heap items
> incrementally as they build their data model. It seems silly for an
> algorithm that is told it can have three megabytes to make 100000
> allocation calls only to discover later that it has just 1K of data to
> process! If the standard has to be changed so that the algorithm
> allocates its own memory, then I am in favour of allowing it to do so
> incrementally, while ensuring that it never uses more than it has been
> allowed.

I also agree that incremental allocation is important for some algorithms.
But I don't think that you need bulletproof protection against
algorithms which do not follow the rules. If you cannot prove from the
source of your algorithm that the amount of required memory is bounded
by a certain value, you are in trouble anyway. Even in your original proposal
you could not guard against malicious use of the stack or heap without
checking the source code.

> I agree with your analysis and propose the following:
> 
>    1. The identity record specifies the maximum amount of memory to be
>       used.
>    2. This memory may be allocated from the heap only. (However, the
>       procedure can still use up to 1024 bytes of other memory.)

I think you mean that the memory must not be allocated before the
"Initialize data structures" call and must be deallocated no later
than the "Destroy data structures" call. This is sufficient to
disallow static data. There is no need to make an explicit reference
to "heap". I am not convinced that you must disallow stack usage.
Many architectures do not have small limits on the stack size, and
stack space is cheaper than heap space. Of course the part of
the data structures which must live across compression calls must
be in the heap and accessed through the MEMORY parameter, but temporary
data could be on the stack. (For example LZW decompression requires a
stack to reverse the bytes of a code string; a sketch follows.)
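
To make the point concrete, here is a sketch of that reversal loop in
C (prefix[] and suffix[] are the usual LZW dictionary arrays; all
details beyond the stack usage are elided):

   /* A code expands from last byte to first, so bytes are pushed
      while walking the prefix chain and then popped back out in the
      right order. The stack is automatic storage, not heap. */
   void emit_code(int code, const int prefix[],
                  const unsigned char suffix[],
                  void (*put)(unsigned char))
   {
      unsigned char stack[4096];
      int sp = 0;
      while (code >= 256) {             /* walk back through dictionary */
         stack[sp++] = suffix[code];
         code = prefix[code];
      }
      put((unsigned char)code);         /* root of chain: a literal byte */
      while (sp > 0)
         put(stack[--sp]);
   }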

>    4. The memory parameter will become an INOUT generic pointer in which
>       the algorithm should place a reference to the root of its
>       data structure during an init call.

Please allow this pointer to have a meaningful type for strongly
typed languages. In other words, the type may be different for
different algorithms. In C terminology allow (struct foo *) instead of
(void *).
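
For example (illustration only; "foo" stands for any particular
algorithm, and none of these declarations are proposed standard text):

   /* Each algorithm exports its own opaque state type, so the caller
      never casts through (void *). The struct is defined only inside
      the algorithm's own source file. */
   struct foo_state;

   struct foo_state *foo_start(unsigned long memory_limit);
   int  foo_compress(struct foo_state *s,
                     const unsigned char *in, unsigned long in_len,
                     unsigned char *out, unsigned long *out_len);
   void foo_end(struct foo_state *s);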

> Jean: Thanks  for your  comments. Do  you agree  with the  proposal as
> given above?

Yes. In particular it is nice to allow incremental allocation.

Jean-loup
jloup@chorus.fr

PS: My only strong disagreement is with being called Jean :-) Hyphenated
French names must not be split.

> PS: Please don't  think I'm a typeless troglodyte.  My SAKDC algorithm
> is written in Ada and is  full of rock-solid highly typed abstractions
> using great  big heap based data  structures. It's just that  my brain
> has been warped by coding in C recently.

I have myself fallen from the Ada language design team down to earth (C++) :-)
But I'd like to get a copy of your SAKDC.