ross@spam.ua.oz.au (Ross Williams) (06/22/91)
CRIT_006: MORE ON BLOCKING
--------------------------
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Date: 21 Jun 91 20:43:38 GMT

|| To resolve this mess, the standard prohibits parameters.
|| The standard also forces the algorithm designer to specify how much
|| memory the algorithm will require.
|
|I find this completely unrealistic.
|
|Ross, it sounds like you're trying to specify a standard for modem
|(i.e., zero-delay) compression. But most compression in the real world
|is high-delay. In fact, someone compressing data for an archive, or for
|news batching, is much better off with control over the parameters. That
|way he can adjust them to fit his needs without swapping in an entirely
|different ``standard'' routine. He doesn't want to make the blocking
|explicit, or to deal with the oodles of information you've defined; he
|just wants to get a file or a bunch of files compressed as efficiently
|and effectively as possible.

I'm not convinced. I think that allowing algorithm parameters will open
up a really disgusting can of worms. I see no harm in having a library
of exact tuned algorithms. Swapping between different standard routines
will be far easier than fiddling parameters (which can go out of range
etc).

If a user wants to create another tuning, he can just fiddle with the
source code for the algorithm until he is happy and then fiddle the
identity record to create a new algorithm. The standard does not stop
people fiddling with the parameters of an algorithm; it just stops them
giving such vague objects standardized IDs. (Another argument is that
many algorithms can be implemented more efficiently if their parameters
are statically specified).

The user of the algorithm doesn't have to deal with oodles of
information to USE an algorithm. Only if they want a non-implemented
tuning. In my experience, a handful of tunings suffice for most
algorithms.
|> However there are two very strong arguments for the block interface:
|> EFFICIENCY: A stream interface requires one procedure call per byte
|> processed. For high speed applications this is unacceptable.
|
|Huh? UNIX's read() is a stream interface, and it sure doesn't require
|one procedure call per byte processed.

You're right. One can specify a block of bytes. Sorry, I got confused.

|> DIRECTNESS: A stream interface will force many algorithms to become
|> involved in unnecessary buffering.
|
|Your block interface forces *all* algorithms to become involved in
|unnecessary blocking. Again it sounds like you're concentrating on
|zero-delay modem compression. I don't think compressors should be
|forced into that model.

The interface does not force the algorithms into blocking, it forces
the user programs into it. This is a different thing because the user
program is in a better position to decide how big the blocks should be.

I am wary of a stream interface because it is then impossible to place
a bound on the amount of output one could get from feeding in a single
byte of input (during either compression or decompression). This is
more acceptable if the input and output are hooked up to streams, but
if they are hooked up to memory there is a big problem. Maybe a bound
could be set as a parameter.

|> In contrast, a stream interface can always be efficiently constructed
|> over the top of a block interface.
|
|This is not true, especially given your requirement that the block
|compressor be memoryless.

The interface does not require the compression routine to be memoryless,
only that it close the coding at the end of each block. Were you
referring to that, or did you think that the standard required contexts
to be closed at the end of every ordinary block?

So where does all this leave us? If the cost of avoiding a second
stream layer is low, I am in favour of it.
I think that everyone could be accommodated if the standard is modified
as follows:

* Introduce a flush flag. This effectively makes the interface a stream
  interface.
* Somehow place an upper bound on the output that an algorithm can
  generate for a given input (i.e. place a bound on buffering). This
  could be:
  - Statically specified by the standard (e.g. 10K).
  - A property of the algorithm (in the identity record).
  - A parameter to the interface.
  I am in favour of a low (e.g. 1K) static bound.
* Retain the memory block input and output interface.

How does that sound? I'm still thinking about it.

Ross Williams
ross@spam.ua.oz.au
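[In C, the modified interface Ross proposes might look roughly like the
sketch below. Every name here (cmp_call, cmp_identity, MAX_EXPANSION)
is invented for illustration, and a trivial copy routine stands in for
a real coder; the point is the flush flag plus the static bound on
output expansion.]

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the proposed block interface: a memory-block
   call carrying a flush flag (giving stream semantics) and honouring a
   static bound on how much extra output a call may produce. */

#define MAX_EXPANSION 1024  /* the proposed low (e.g. 1K) static bound */

typedef struct {
    const unsigned char *in;   /* input block in memory */
    size_t in_len;
    unsigned char *out;        /* caller-supplied output buffer */
    size_t out_max;            /* should allow in_len + MAX_EXPANSION */
    size_t out_len;            /* set by the routine */
    int flush;                 /* nonzero: emit all buffered output and
                                  close the coding at this point */
} cmp_call;

/* A trivial "null" compressor obeying the contract: it never produces
   more than in_len + MAX_EXPANSION bytes (here, exactly in_len). */
int cmp_identity(cmp_call *c)
{
    if (c->out_max < c->in_len)
        return -1;             /* caller's buffer too small */
    memcpy(c->out, c->in, c->in_len);
    c->out_len = c->in_len;
    return 0;
}
```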
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/23/91)
In article <873@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
> I'm not convinced. I think that allowing algorithm parameters will
> open up a really disgusting can of worms. I see no harm in having a
> library of exact tuned algorithms.

I do. Consider, for instance, my yabba coder. There's not only the
memory parameter -m, but a blocking test length -z and ``fuzz'' -Z that
affect compression. Twiddling -z and -Z can produce up to 10%
improvements on some types of files; there's no small set of best
choices, and forcing users into fixed values of -z and -Z, let alone
-m, would be insane.

> If a user wants to create another tuning, he can just fiddle with the
> source code for the algorithm until he is happy and then fiddle the
> identity record to create a new algorithm.

Uh-uh. This contradicts your stated goal of having the identification
numbers mean something for comparison purposes. If you want to achieve
that goal, just insist that everyone name all the parameters (in my
case, -m/-z/-Z) when quoting compression results. (It gets even hairier
when you're measuring speed and memory: for yabba, you'd have to quote
MOD and PTRS as well as the machine type.) Don't force parameterized
coders into a parameterless model.

> (Another argument is that many algorithms can be implemented more
> efficiently if their parameters are statically specified).

Hardly.

> The user of the algorithm doesn't have to deal with oodles of
> information to USE an algorithm. Only if they want a non-implemented
> tuning.

One advantage of a program-based interface is that the user is given
options. If he doesn't know about the options or doesn't want to deal
with them, he doesn't use them, but they're always there for
sophisticated users. You could mimic this in your library routine by
passing in an array of options, say of { int, int } pairs, with a
method-dependent meaning.
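[The { int, int } option-pair idea Dan describes might be sketched in C
as below. The option codes (OPT_MEMORY etc.) and the terminator
convention are invented for illustration; a real method would publish
its own codes and defaults.]

```c
#include <stddef.h>

/* An option list is an array of { parameter, value } pairs with a
   method-dependent meaning, terminated by OPT_END. A caller who does
   not care simply passes NULL and gets the defaults. */

enum { OPT_END = 0, OPT_MEMORY, OPT_BLOCKLEN, OPT_FUZZ };

typedef struct { int param; int value; } cmp_option;

/* Scan an option list for a parameter, falling back to a default when
   the caller didn't mention it (or passed no list at all). */
int opt_lookup(const cmp_option *opts, int param, int dflt)
{
    for (; opts && opts->param != OPT_END; opts++)
        if (opts->param == param)
            return opts->value;
    return dflt;
}
```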
[ streams versus blocks ]

I think the real argument here is over whether the compressor should
drive the I/O or vice versa. In modem applications, your standard is
fine: the I/O has to drive the compressor. But in most applications
it's much easier to work the other way round.

> |> DIRECTNESS: A stream interface will force many algorithms to become
> |> involved in unnecessary buffering.
> |Your block interface forces *all* algorithms to become involved in
> |unnecessary blocking. Again it sounds like you're concentrating on
> |zero-delay modem compression. I don't think compressors should be
> |forced into that model.
> The interface does not force the algorithms into blocking, it forces
> the user programs into it.

No, it also forces the algorithms into blocking. Many algorithms don't
want to deal with blocks. They want to deal with streams.

Here, you can notch this up as a separate criticism: You expect
algorithms to have enough special cases that they work well if you give
them blocks of three bytes at a time. But most real LZW-style
compressors work horribly on such short blocks. Most compressors do not
support plaintext or escape codes, and it is unrealistic to expect
otherwise.

> I am wary of a stream interface because it is then impossible to place
> a bound on the amount of output one could get from feeding in a single
> byte of input (during either compression or decompression).

Again it sounds like you're focusing on modems. Why?

---Dan
gudeman@cs.arizona.edu (David Gudeman) (06/23/91)
I have to agree with Dan (readers of comp.lang.misc are probably
experiencing a feeling of unreality right now...) that parameters are
important. Any standard that neglects them will be ignored by everyone
who implements an algorithm with natural parameters.

How about the following: a compression procedure with parameters will
use a global variable for each one (using some naming convention based
on the name of the procedure). The user's program does not have to
refer to the global variable at all, and if it doesn't then some
natural default is used. When you name such a procedure for comparison
to other procedures, any parameters that are not mentioned are assumed
to be the defaults.

One advantage of this for C is that in those cases where you want to
hard-code the parameters you can #define them in the source of the
compression function without changing the function's formal parameters.
(It seemed that Dan doesn't believe there can be an advantage in
hard-coding the parameters, which I find peculiar. There are many
optimizations that can be done with known values over variables.)

As to the stream/block problem, I suggest that it is unreasonable to
try to force exactly _one_ interface on all algorithms and
applications. Obviously there are applications for each, and the
standard should apply to each. In other words, there should be a
standard stream interface _and_ a standard block interface. Let the
application programmer choose whichever interface is most appropriate
and let the compression implementer deal with any problems the given
compression procedure may have with either interface. Add a field to
the identity records that says whether the algorithm works "best" with
streams, blocks, or either.

I'd also like to repeat someone else's suggestion of separating the
compression procedure and decompression procedure. I don't understand
how this affects having a compression method as a first-class function.
So you have two first-class functions for each compression method
instead of one, what's the big deal? For that matter, I'd add a third
function -- the one that returns the identity record. I don't see any
value to having this be part of the compression function itself.
--
David Gudeman
gudeman@cs.arizona.edu
noao!arizona!gudeman
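[Gudeman's global-variable convention plus the three-function split
(compress, decompress, identity record) might look something like this
in C. The coder "foo", its parameter names, and the identity fields are
all invented for illustration, with a stub copy loop standing in for a
real compressor.]

```c
/* Each parameter of a hypothetical coder "foo" lives in a global named
   after it, preset to a natural default; a caller who never touches
   these globals gets the default tuning. */
int foo_memory = 65536;   /* default memory budget */
int foo_fuzz   = 0;       /* default fuzz */

/* Three separate first-class functions per method, as suggested:
   one for compression, one for decompression (omitted here), and one
   returning the identity record. */
typedef struct { const char *name; int prefers_stream; } identity;

const identity *foo_identity(void)
{
    static const identity id = { "foo", 0 };  /* works "best" on blocks */
    return &id;
}

long foo_compress(const unsigned char *in, long n, unsigned char *out)
{
    /* a real coder would consult foo_memory and foo_fuzz here; this
       stub just copies, to keep the sketch self-contained */
    long i;
    for (i = 0; i < n; i++)
        out[i] = in[i];
    return n;   /* number of output bytes */
}
```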
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/23/91)
In article <4496@optima.cs.arizona.edu> gudeman@cs.arizona.edu (David Gudeman) writes:
> How about the following: a compression procedure with parameters will
> use a global variable for each one (using some naming convention based
> on the name of the procedure). The user's program does not have to
> refer to the global variable at all, and if it doesn't then some
> natural default is used.

I expect that Ross will disapprove of this because he can't pass the
set of parameters around as arguments if they're global variables. If
you have an array argument giving any number of { parameter, value }
pairs in any order, this problem disappears.

> (It seemed that Dan doesn't believe there can be an
> advantage in hard coding the parameters, which I find peculiar. There
> are many optimizations that can be done with known values over
> variables.)

It depends. Parameters like hash table size (and hash function!) have
to be hard-coded if you want any efficiency, as do ``parameters'' like
the type you use for storing hash values. But I've never seen a
parameter that affects compression and has to be hard-coded. (You get
about a 0.4% speedup by specifying a single value of ``min'' in yabba,
for example.) Finally, any good language lets you say ``let foo() be a
version of bar() where argument x is 57'' and optimizes that
independently.

> As to the stream/block problem, I suggest that it is unreasonable to
> try to force exactly _one_ interface on all algorithms and
> applications.

Exactly. I imagine modem manufacturers would love this standard, for
example, while UNIX users and programmers would love a program-based
standard. Here's a challenge, Ross: Rewrite compress to conform to your
standard. If you can do that, publish it and explain why it's useful.
I'm always wary of ``standards'' that haven't been given at least two
unrelated implementations.

> I'd also like to repeat someone else's suggestion of separating the
> compression procedure and decompression procedure.
Ditto. ---Dan
gtoal@tharr.UUCP (Graham Toal) (06/23/91)
In article <3406.Jun2302.12.0491@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
:Exactly. I imagine modem manufacturers would love this standard, for
:example, while UNIX users and programmers would love a program-based
:standard. Here's a challenge, Ross: Rewrite compress to conform to your
:standard. If you can do that, publish it and explain why it's useful.
:I'm always wary of ``standards'' that haven't been given at least two
:unrelated implementations.
:
:> I'd also like to repeat someone else's suggestion of separating the
:> compression procedure and decompression procedure.
:
:Ditto.
That's exactly what I was trying to say when I referred rather tactlessly
(apologies, Ross) to 'pretentious twaddle'. Imposed standards just don't
work. Go ahead and implement the ideas you're suggesting, publish them
in source form on the net, and let the market decide. I've a *lot* of
time for Dan because instead of moaning about compress, he went out and
implemented something that works. (Whether it will replace compress is
something I'm sitting back and hopefully watching for...)
Incidentally, I've implemented something much like you describe. For it
to be portable, you have to (yuk - I do it but I hate it) mandate that
none of the data structures used by compression or decompression are
larger than 32k bytes, even if this reduces the compression you get.
And separating out compression and decompression procs is worthwhile.
The biggest thing a standard could do for portability is this:
1) Define the file format as a series of blocks; each block can
be compressed with an arbitrary algorithm to a fixed interface.
2) Define the procedure interface exactly (e.g. decomp(inblock,outblock,
insize, outsize) -> status) and arrange for the encompassing
compressed file TO INCLUDE THIS C CODE so that any decoder can
extract it, link it in, and use it to extract the data.
This might seem a stupid suggestion but it's the only one I can
think of which will allow backwards and forwards compatibility. Otherwise
you get into the pkzip thing of rushing around tracking newer versions
of algorithms and having decompression programs with twenty different
algorithms built in.
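[The fixed per-block interface Graham names might be sketched in C as
below. The status codes and this trivial "stored" (uncompressed-block)
decoder are invented for illustration; the point is that every block in
the file format can carry code conforming to this one signature, so a
decoder never needs twenty built-in algorithms.]

```c
#include <string.h>

/* decomp(inblock, outblock, insize, outsize) -> status, with outsize
   passed by reference: in with the buffer capacity, out with the
   number of bytes actually produced. */

enum { DECOMP_OK = 0, DECOMP_OVERFLOW = 1 };

int decomp(const unsigned char *inblock, unsigned char *outblock,
           long insize, long *outsize)
{
    /* "stored" method: the block was not compressed at all */
    if (insize > *outsize)
        return DECOMP_OVERFLOW;   /* caller's buffer is too small */
    memcpy(outblock, inblock, (size_t)insize);
    *outsize = insize;
    return DECOMP_OK;
}
```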
Graham
PS Ross, I probably wouldn't have shot from the lip in that last post
if you'd titled it 'request for comments' or even 'draft standard'...
--
(* Posted from tharr.uucp - Public Access Unix - +44 (234) 841503 *)