ross@spam.ua.oz.au (Ross Williams) (06/27/91)
I've given up numbering the latest standard criticisms (I can only
count to F :-) and trying to decide whether to topic-stream them and so
on, so I've just bunched them all up and answered them in one go.

Thanks to everyone who has contributed so far. I appreciate the time
and effort and I think that a good standard will result.

Ross Williams
ross@spam.ua.oz.au

----------------------------------------------------------------------
|From: ian@airs.com (Ian Lance Taylor)
|>OK OK I'm being a martinet again. Agreed. The standard has to be more
|>flexible than I have made it about where memory comes from. I am
|>getting uneasy about how the standard is loosening up though. I wanted
|>to place a bound on the stack so that procedures wouldn't cause
|>trouble on machines with a small stack space (e.g. some microprocessor
|>embedded systems).
|
|Your standard seems to be driving at two different goals at once:
|
|   1) Allow programs to use several different compression subroutines.
|
|   2) Require compression subroutines to be portable between
|      machines.
|
|The point you are making above seems to relate only to goal 2. I
|think the real benefit of the standard will be goal 1. It's going to
|be awfully hard to make all compression routines portable to all
|machines anyhow. That seems to me to be, as the C language standard
|would say, a ``quality of implementation'' issue for the compression
|routine itself. Furthermore, an implementation on a machine with lots
|of stack space could use the stack; an implementation on a machine
|without stack space could use some other memory. There shouldn't be a
|requirement that the same compression code work on different machines;
|only that if a different machine is supported, the same functionality
|is provided.
|
|Allowing programs to use different compression routines with a
|standard interface seems to me to be a worthy goal. Specifying
|details about where the memory comes from seems needlessly machine
|specific.

I agree that there are (at least) two separate goals and that I have
not adequately separated them in my mind. I do, however, think that the
standard should address machine portability.

When I framed the standard, I realized that memory procurement was
highly MACHINE and environment specific (e.g. stack vs heap) and also
messy, and so I decided that the only way to allow machine-portable
data compression procedures was to let the user program (which can be
as non-portable as you like) work out where to get memory and what to
do if it is not available. This solution transcended the problem.

The result was a proposed standard that would be totally portable
between machines, subject to the portability of the programming
language, and here I agree that the responsibility is on the programmer
to write quality code. This fact does not negate the overwhelming
portability benefits of forcing the user program to go off and find
memory.

Where I went wrong was on LANGUAGE portability, not machine
portability, for as Jean-loup Gailly pointed out, by requiring memory
to be handed to the data compression procedure, I was making life very
hard for implementations in certain high-level languages and for
explicitly procedurally recursive algorithms.

Suppose that high-level languages could pass around pieces of memory
(which the user program could obtain from the stack or the heap) which
could then be used to make type-safe variable declarations and
procedure invocations. I suspect then that Jean-loup's objections would
vanish and so would the stack/heap issue.
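To make the memory deal concrete, here is a rough C sketch of the kind
of call sequence the proposal has in mind. Every name in it
(cm_compress, CM_WORKMEM) is invented for illustration and is not part
of the proposal:

   /* Rough sketch only: all names are illustrative. */
   #include <stdio.h>
   #include <stdlib.h>

   /* A conformant procedure: all working memory arrives through the
      parameter list, so the procedure itself does not care where the
      memory came from. */
   extern void cm_compress(void *wrk,           /* working memory */
                           unsigned char *in,  long inlen,
                           unsigned char *out, long *outlen);

   #define CM_WORKMEM 16384L      /* advertised by the algorithm */

   int main(void)
   {
      /* The user program, which may be as non-portable as it likes,
         decides where the memory comes from (heap here; a stack
         array or a static block would do just as well) and what to
         do if it cannot be obtained. */
      void *wrk = malloc(CM_WORKMEM);
      if (wrk == NULL) {
         fputs("no memory for compression\n", stderr);
         return 1;
      }
      /* ... fill an input block, then:
         cm_compress(wrk, in, inlen, out, &outlen); */
      free(wrk);
      return 0;
   }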
Thus, tightening up stack memory and other forms of memory was really
my way of disallowing memory leakage into the compression procedure
when a deal on memory was supposed to have been made through the
parameter list. As it stands, my proposed standard does not specify
where memory comes from, only that the user program obtain some. But
the standard trips on the language(/memory) issue.

CONCLUSIONS: The proposed standard as it stands is totally flexible
about where memory is obtained, but it leaves it up to the user
program. However, feeding all the memory in through the parameter list
causes such problems in high-level languages that I agree that
compression procedures should be allowed to get their own memory. Once
this decision is made, the floodgates open and we have to deal with
1) stack vs heap, 2) memory chunk sizes and heap fragmentation, and
3) failure to allocate. Under these conditions, a restriction on the
stack space used (except for the one promised by the algorithm itself)
is too restrictive. Perhaps the algorithm could advertise how much heap
and how much stack it will use.

----------------------------------------------------------------------
|From: ronald@ecl014.UUCP (Ronald van Eijck)
|
|How about this:
|
|You have a standard command 'archive' (or whatever) with a standard syntax
|and a standard file format. In this format the filename etc are stored
|and there is an identification code for every compression scheme.
|Now you have shared libraries that do the actual encoding/decoding. This
|would be a good solution for everybody.
|- The user has to know only one command syntax to do his work.
|- Archive shells do not have to support a lot of different syntaxes.
|- The programmer of a future compression scheme only has to make a library
|  with the encoding/decoding routines.

As far as I can tell, what you are proposing boils down to:

   1) The creation of a user-interface standard.
   2) The creation of a compressed file format standard (including
      algorithm ids and so on).

Both good ideas for general purpose file compression. However, what I
am plugging for at the moment is a standard interface to the actual
algorithms themselves. This is a necessary component of your scheme, as
can be seen from asking the question "How do the algorithms in the
libraries talk to the shell code?" If the interface standard is not
separated from your scheme above, then the data compression algorithms
are forced to become unnecessarily involved in management information
(e.g. storing ids) which will make them harder to extract and place in
tight, specific applications such as in embedded systems.

|Of course this standard code has to be ported to a number of platforms
|but since almost every archive program is ported anyway this is not much
|of a problem.
|
|Identification codes have to be obtained from some central point where all
|the codes are registered.

No need. Just get everyone to generate random n-bit IDs. Make n large
enough so that a collision is extremely unlikely.

----------------------------------------------------------------------
|From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
|
|In article <873@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
|> I'm not convinced. I think that allowing algorithm parameters will
|> open up a really disgusting can of worms. I see no harm in having a
|> library of exact tuned algorithms.
|
|I do.
|
|Consider, for instance, my yabba coder.
|There's not only the memory
|parameter -m, but a blocking test length -z and ``fuzz'' -Z that affect
|compression. Twiddling -z and -Z can produce up to 10% improvements on
|some types of files; there's no small set of best choices, and forcing
|users into fixed values of -z and -Z, let alone -m, would be insane.

The reason I am so heavy on parameters stems, I think, from seeing
dozens of papers in which they are not specified properly. In addition,
there is a need for users not to be confused by parameters. These are
not strong arguments, and I am being swayed on the parameter issue (so
long as there are defaults and the name of the algorithm is attached to
the defaults).

|> If a user wants to create another tuning, he can just fiddle with the
|> source code for the algorithm until he is happy and then fiddle the
|> identity record to create a new algorithm.
|
|Uh-uh. This contradicts your stated goal of having the identification
|numbers mean something for comparison purposes.

Absolutely not, because if the user changed anything he would have to
(according to the standard) generate a new algorithm name and ID.

Actually, I have been thinking about the ID lately and there are many
attributes upon which the ID could be based:

   - Algorithm.
   - Compressed data format (i.e. id says who can decompress).
   - Implementation.
   - Parameter settings.

So I might end up with a composite ID. I'm not sure.

|If you want to achieve that goal, just insist that everyone name all the
|parameters (in my case, -m/-z/-Z) when quoting compression results. (It
|gets even hairier when you're measuring speed and memory: for yabba,
|you'd have to quote MOD and PTRS as well as the machine type.) Don't
|force parameterized coders into a parameterless model.

Well, maybe. But the algorithm designer has to commit himself on
parameters to some extent. SAKDC has 25 parameters and I can dial up a
multitude of adaptivities and memories and different order models.
However, users don't want this sort of choice. With too many
parameters, one is also open to the charge that one has tailored the
parameter values to the data (so-called "overparameterization").

|> (Another argument is that many algorithms can be implemented more
|> efficiently if their parameters are statically specified).
|
|Hardly.

Absolutely. You should read the other critics. I have you telling me
that dynamic parameters are not much less efficient than static ones.
Meanwhile, there's the mob who don't like passing a pointer to
read/write memory for the algorithm to use because they want to avoid
indirect addressing in their code and pull tricks with the address
range.

It is a fact that statically binding parameters allows lots of
optimizations. For example:

   * Elimination of code in the case where a parameter has the value 1
     (e.g. in LZRW1 when I decided to make the hash table of depth 1
     instead of depth n, resulting in the elimination of the code to
     compare the lengths of matches found in the history).
   * Loop unrollings (short ones).
   * Use of shift operations for parameters that can be set to powers
     of two.

And so on. Nevertheless, the fact that static bindings allow higher
speed is no argument alone against ALLOWING dynamic parameters, for an
algorithm author could create a version with parameters along with a
couple of optimized parameterless versions.

My main arguments against parameters are:

   * They tend to confuse users who don't know how to set the
     parameters.
   * They must be algorithm specific.
     It is hard to see how a user program unfamiliar with the
     algorithm could make sense of them.
   * They can overflow and be out of range and all the other things.

|> The user of the algorithm doesn't have to deal with oodles of
|> information to USE an algorithm. Only if they want a non-implemented
|> tuning.
|
|One advantage of a program-based interface is that the user is given
|options. If he doesn't know about the options or doesn't want to deal
|with them, he doesn't use them, but they're always there for
|sophisticated users. You could mimic this in your library routine by
|passing in an array of options, say of { int, int } pairs, with a
|method-dependent meaning.

The idea of having default settings and then allowing the user to
specify otherwise at his own risk doesn't sound so bad so long as:

   * The settings do not affect memory consumption.
   * The compressor codes the settings into the compressed file so
     that the decompressing user doesn't have to know what the
     settings were.
   * People quote the settings they change.

| [ streams versus blocks ]
|
|I think the real argument here is over whether the compressor should
|drive the I/O or vice versa. In modem applications, your standard is
|fine: the I/O has to drive the compressor. But in most applications it's
|much easier to work the other way round.

This may be so, but placing the compressor in charge means that you
have to tell it how to do IO, which is so messy in most programming
languages (requiring Ada-like generic superroutines or C pointers to IO
functions) as to completely destroy the standard.

On the other hand, such ego-centred compression code can nearly always
be recoded into procedure+instance_block, because the context
information nearly always consists of just a PC, a few buffer pointers
and possibly an explicit data stack. I'm not saying that there won't be
a little pain in converting some algorithms over to the standard (all
good standards cause lots of pain :-), but I am sure that there will be
very little loss of efficiency and great benefits.

|> |> DIRECTNESS: A stream interface will force many algorithms to become
|> |> involved in unnecessary buffering.
|> |Your block interface forces *all* algorithms to become involved in
|> |unnecessary blocking. Again it sounds like you're concentrating on
|> |zero-delay modem compression. I don't think compressors should be forced
|> |into that model.
|> The interface does not force the algorithms into blocking, it forces
|> the user programs into it.
|
|No, it also forces the algorithms into blocking. Many algorithms don't
|want to deal with blocks. They want to deal with streams. Here, you can
|notch this up as a separate criticism: You expect algorithms to have
|enough special cases that they work well if you give them blocks of
|three bytes at a time. But most real LZW-style compressors work horribly
|on such short blocks. Most compressors do not support plaintext or
|escape codes, and it is unrealistic to expect otherwise.

You misunderstood me. In my original proposal, I assumed that it was
reasonable to expect user programs to perform enough buffering to be
able to pass largish blocks to the compression procedure. The benefits
of this choice were 1) the user program got to choose the buffer
lengths based on how much memory was floating around, and 2) the
compression procedure would not have to concern itself with buffering
across calls, as it could assume that the user had given it a
"reasonable" sized block.
I never expected the algorithms to work well on very small blocks. My
idea of a small block was 4K.

Following your earlier criticism, I now realize that a block interface
has the flaw of imposing a block structure ON THE DATA because it
forces the user program to add lengths into the compressed data. This
means that the interface will not be able to be used in older stream
compression systems (because of compatibility problems). I didn't mind
imposing a bit of buffering on the user (so as to give the nice clean
memory no-flush-flag interface) but I now realize that it is not
acceptable because the blocking is imposed on the data as well.

I am now in favour of modifying the standard to have a flush flag (see
justification document) and to impose a limit on the buffering somehow.
This would allow any algorithm to be used in block fashion or stream
fashion.

|> I am wary of a stream interface because it is then impossible to place
|> a bound on the amount of output one could get from feeding in a single
|> byte of input (during either compression or decompression).
|
|Again it sounds like you're focusing on modems. Why?

I'm NOT focussing on modems - you're the one who keeps mentioning
them!!! My logic goes something like this:

   - We NEED to make the algorithm a procedure (not a process).
   - We NEED to use memory blocks to feed bytes in and out (although
     the blocks can be part of a stream interface).

(I hope you agree that the alternatives to these are horrific.)

THEN: As the algorithm is writing to an output block of memory, how do
we know how much memory to give it? We could feed in 100K in one-byte
blocks and not get anything out, and then on the next call (in which we
pass another one-byte block) receive 63479 output bytes.

Regardless of whether a block or a stream model is used (and I am now
in agreement with stream), one needs to place a bound on the amount of
buffering the algorithm is allowed to do, or else there is no way to
allocate the output buffers.

Note: Use of memory buffers for an algorithm interface doesn't mean
that the user can't read and write the buffers to files, and with a
stream (but NOT process) interface there is no imposition on the data
format.

----------------------------------------------------------------------
|From: gtoal@tharr.UUCP (Graham Toal)
|
|This rather lengthy document sounds just like my suggestion last month
|for putting a wrapper round compression algorithms with a blocked
|interface - the blocking allowing fast seeking to random points within
|a compressed file.

Your idea is not what my proposed standard is about. My standard merely
seeks to define an interface between data compression algorithms and
the programs that wish to use them. The memory block interface was
suggested merely to simplify the interface to the algorithm, but has
come under fire (and rightfully so) because it does not admit stream
compressors as a special case.

I missed your posting on the random-access blocking interface thing,
but it sounds just like the idea I had in March 1989 to develop an
invisible file compression layer using blocking. I pursued the idea,
developing data structures for the file conversion layer and developing
the fast LZRW1 algorithm (specifically for this purpose -- hence the
fast 68000 implementation), but for commercial reasons it never made it
into product form.

|The basic idea is fine but do we need 500+ lines of pretentious twaddle
|to describe it?

Yes we do. Absolutely. I like pretentious twaddle.
Without hundreds of lines of pretentious twaddle, details get missed
and stuffups occur. In computing, one trivial detail can sink a program
of 10^6 lines of code. As a mobster would say: NO LOOSE ENDS!

----------------------------------------------------------------------
|From: gudeman@cs.arizona.edu (David Gudeman)
|
|I have to agree with Dan (readers of comp.lang.misc are probably
|experiencing a feeling of unreality right now...) that parameters are
|important. Any standard that neglects them will be ignored by
|everyone who implements an algorithm with natural parameters.

Note: I don't read comp.lang.misc.

I am being swayed. But there would have to be default values.

|How about the following: a compression procedure with parameters will
|use a global variable for each one (using some naming convention based
|on the name of the procedure). The user's program does not have to
|refer to the global variable at all, and if it doesn't then some
|natural default is used. When you name such a procedure for
|comparison to other procedures, any parameters that are not mentioned
|are assumed to be the defaults.

I see no reason to introduce global variables. The global variables
will not be accessible to user programs that have linked in dynamically
and know only of the procedure to interface to. See also the next
answer.

|One advantage of this for C is that in those cases where you want to
|hard code the parameters you can #define them in the source of the
|compression function without changing the function's formal
|parameters. (It seemed that Dan doesn't believe there can be an
|advantage in hard coding the parameters, which I find peculiar. There
|are many optimizations that can be done with known values over
|variables.)

If the user has the source code, he can do anything anyway. The real
case for parameters lies in the case of someone who only has an
executable.

|As to the stream/block problem, I suggest that it is unreasonable to
|try to force exactly _one_ interface on all algorithms and
|applications. Obviously there are applications for each and the
|standard should apply to each. In other words, there should be a
|standard stream interface _and_ a standard block interface. Let the
|application programmer choose whichever interface is most appropriate
|and let the compression implementer deal with any problems the given
|compression procedure may have with either interface. Add a field to
|the identity records that says whether the algorithm works "best" with
|streams, blocks, or either.

I think having two interfaces would be a big mistake. And unnecessary
too, as I now believe that one interface can have both block and stream
as special cases (the stream is the more general).

|I'd also like to repeat someone else's suggestion of separating the
|compression procedure and decompression procedure. I don't understand
|how this affects having a compression method as a first-class
|function. So you have two first-class functions for each compression
|method instead of one, what's the big deal? For that matter, I'd add
|a third function -- the one that returns the identity record. I don't
|see any value to having this be part of the compression function
|itself.

The big deal is that having two or three sets of functions instead of
one makes manipulating algorithm procedures as first-class objects that
much more messy. The reduction from n>1 functions to n=1 functions is a
significant one because it means that users no longer need a structure
or record structure for each algorithm, just a pointer.
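In C terms, the point looks something like this (a sketch only; the
signature and the action-code convention are invented for illustration,
not taken from the proposal):

   /* One procedure per algorithm, selected by an action code, so a
      single function pointer is the whole "algorithm object". */
   typedef void (*cm_algorithm)(int action,  /* compress, decompress
                                                or identity request */
                                void *wrk,
                                unsigned char *in,  long inlen,
                                unsigned char *out, long *outlen);

   /* Two conformant algorithms (stub bodies): */
   static void lzrw1_proc(int action, void *wrk,
                          unsigned char *in,  long inlen,
                          unsigned char *out, long *outlen)
   { /* ... a real implementation goes here ... */ }

   static void sakdc_proc(int action, void *wrk,
                          unsigned char *in,  long inlen,
                          unsigned char *out, long *outlen)
   { /* ... */ }

   /* The user program needs no per-algorithm record, just a table
      of pointers: */
   static cm_algorithm algorithms[] = { lzrw1_proc, sakdc_proc };

With separate compression, decompression and identity procedures, each
table entry would have to become a three-field record instead.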
Backing up this argument is that two of the three different calls use
almost identical parameters.

----------------------------------------------------------------------
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)

|In article <4496@optima.cs.arizona.edu> gudeman@cs.arizona.edu (David Gudeman) writes:
|> How about the following: a compression procedure with parameters will
|> use a global variable for each one (using some naming convention based
|> on the name of the procedure). The user's program does not have to
|> refer to the global variable at all, and if it doesn't then some
|> natural default is used.
|
|I expect that Ross will disapprove of this because he can't pass the set
|of parameters around as arguments if they're global variables. If you
|have an array argument giving any number of { parameter, value } pairs
|in any order, this problem disappears.

Agreed. If the standard does allow parameters, they must go through the
front door. What do you think of a string interface? E.g.:

   compress(comp,mem,inblk,outblk,"M=4,K=91,Z=512");

It would require a little parsing by the algorithm (on the
initialization call only) but would solve lots of format problems. The
string format could be semi-standardized. (A parsing sketch appears a
few paragraphs below.)

|> (It seemed that Dan doesn't believe there can be an
|> advantage in hard coding the parameters, which I find peculiar. There
|> are many optimizations that can be done with known values over
|> variables.)
|
|It depends. Parameters like hash table size (and hash function!) have to
|be hard-coded if you want any efficiency, as do ``parameters'' like the
|type you use for storing hash values. But I've never seen a parameter
|that affects compression and has to be hard-coded. (You get about a 0.4%
|speedup by specifying a single value of ``min'' in yabba, for example.)
|Finally, any good language lets you say ``let foo() be a version of
|bar() where argument x is 57'' and optimizes that independently.

1) I don't understand. 2) Nothing HAS to be hard coded. 3) A compiler
can only do so much with static parameters.

|> As to the stream/block problem, I suggest that it is unreasonable to
|> try to force exactly _one_ interface on all algorithms and
|> applications.
|
|Exactly. I imagine modem manufacturers would love this standard, for
|example, while UNIX users and programmers would love a program-based
|standard. Here's a challenge, Ross: Rewrite compress to conform to your
|standard. If you can do that, publish it and explain why it's useful.
|I'm always wary of ``standards'' that haven't been given at least two
|unrelated implementations.

YOU'RE the one with modems on the brain. I just chose the block
interface because it makes the standard simple (avoids superroutines
and other horrors - see earlier). I now agree that the imposition of
the blocking of my standard on the DATA format makes this impossible
(unless it can be assumed that the program is always allowed to
allocate two blocks of memory the size of the file it is about to
compress, in which case it is TRIVIAL).

Note well: If I were allowed to use a slightly different data format
(in which compressed block lengths were inserted in the compressed
data), and allowed a tiny (e.g. 0.00001) decrease in compression
performance (for the flush (not context zap)) at the end of every large
(e.g. 100K) block, then it would be EASY. My point is that the standard
fails here not because of objective inconvenience, but because it fools
with the data format. The revised stream/block version won't.
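Returning to the string interface suggested above: the parsing burden
really is small. Here is a sketch (the parameter names and defaults are
invented for illustration):

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   /* Scan a string like "M=4,K=91,Z=512" for a one-letter parameter
      and return its value, or the given default if it is absent. */
   static long param(const char *s, char name, long deflt)
   {
      while (s && *s) {
         if (s[0] == name && s[1] == '=')
            return atol(s + 2);      /* atol stops at the comma */
         s = strchr(s, ',');         /* skip to the next pair   */
         if (s) s++;
      }
      return deflt;
   }

   int main(void)
   {
      const char *opts = "M=4,K=91,Z=512";
      printf("M=%ld K=%ld Z=%ld Q=%ld\n",
             param(opts, 'M', 1),   param(opts, 'K', 64),
             param(opts, 'Z', 256), param(opts, 'Q', 9));
      /* prints: M=4 K=91 Z=512 Q=9  (Q falls back to its default) */
      return 0;
   }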
I really don't think there would be any problem with implementing UNIX
compress if the flush flag were restored. As I understand it, the LZW
algorithm's context is very simple, consisting of just a table, an
index and possibly one or two coding scalars. (A skeleton of such a
context appears further down.)

As a separate point, I don't see how a program-based standard could be
specified at this level without assuming that languages have processes
or superroutines or procedures as parameters.

|> I'd also like to repeat someone else's suggestion of separating the
|> compression procedure and decompression procedure.
|
|Ditto.

But why? If it were just a static module exporting functionality I
would agree. But this is a situation where programs may wish to keep
arrays of these algorithms. Why mess it all up when the parameter lists
of the calls are nearly the same?

----------------------------------------------------------------------
|From: gtoal@tharr.UUCP (Graham Toal)
|
|In article <3406.Jun2302.12.0491@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|:Exactly. I imagine modem manufacturers would love this standard, for
|:example, while UNIX users and programmers would love a program-based
|:standard. Here's a challenge, Ross: Rewrite compress to conform to your
|:standard. If you can do that, publish it and explain why it's useful.
|:I'm always wary of ``standards'' that haven't been given at least two
|:unrelated implementations.
|:
|:> I'd also like to repeat someone else's suggestion of separating the
|:> compression procedure and decompression procedure.
|:
|:Ditto.
|
|That's exactly what I was trying to say when I referred rather tactlessly
|(apologies, Ross) to 'pretentious twaddle'. Imposed standards just don't
|work. Go ahead and implement the ideas you're suggesting, publish them
|in source form on the net, and let the market decide. I've a *lot* of
|time for Dan because instead of moaning about compress, he went out and
|implemented something that works. (Whether it will replace compress is
|something I'm sitting back and hopefully watching for...)

Imposed standards can and do work all the time (witness Ada, CDs). The
only ones you see failing are the ones that were so brain-dead in the
first place that there was a mass revolt. Most standards have problems,
but at least they give you something to kick off from.

I feel that the criticism stating (effectively) that I am not a DOer is
highly unfair. I COULD have posted to the net moaning about the lack of
a standard, but instead I put in the time to create one. Furthermore, I
didn't just trust my judgement; I posted my standard to the net to be
criticised and modified. The result has been very positive, and with
some of the modifications suggested I think we will have a viable
standard soon.

I don't see how I can "IMPLEMENT" my standard. I get the impression
that you think that in order for me to do that, I would have to
implement a work harness, a validation harness, and create conformant
implementations of dozens of algorithms. Well I'm not going to do that
and I don't think that I should have to. The standard is just something
that will exist, and if people want to ignore it or propose other
standards then they can.

(As it happens, I have been using a standard similar to the one
proposed (see lzrw_headers.h in my ftp archive on
sirius.itd.adelaide.edu in pub/compression) for two years and have
implemented five or six algorithms using it (LZRW1 is such an
algorithm). These algorithms were implemented with the standard in mind
though.)
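To give the LZW claim above some shape, here is a skeleton of the
instance block a conformant compress might carry between calls. The
field names are guesses at the context just described, not code taken
from compress:

   /* Skeleton only: the context that would live in the MEMORY block
      between calls.  All fields are guesses for illustration. */
   #define TABLE_SIZE 69001L         /* a hash table size, say */

   struct lzw_context {
      long prefix;                   /* code currently being grown  */
      long free_code;                /* next unused table entry     */
      int  n_bits;                   /* current code width          */
      int  bit_buf, bits_in_buf;     /* partially assembled byte    */
      long table[TABLE_SIZE];        /* the string table            */
   };

   /* One call per chunk; the context survives between calls.
      flush != 0 means: emit any buffered bits now, costing only a
      tiny amount of compression at the end of each block. */
   extern long lzw_compress(struct lzw_context *ctx,
                            unsigned char *in,  long inlen,
                            unsigned char *out, long outmax,
                            int flush);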
I believe that if the flush flag (see justification document) were
reinstated, it would be very easy to develop a conformant version of
Unix compress, and I don't think I have to demonstrate it. However, if
you find it hard to see how it can be done I might be forced to develop
it.

|Incidentally, I've implemented something much like you describe. For it
|to be portable, you have to (yuk - I do it but I hate it) mandate that
|none of the data structures used by compression or decompression are
|larger than 32k bytes, even if this reduces the compression you get.

My standard proposal does not limit the memory available to a
compression algorithm. Instead it requires the algorithm to specify how
much it needs. If your 32K limit is because of the problem of placing
an upper bound on buffering (so that memory blocks can be used for the
input and output to the algorithm), then you should restate the
condition. There is no need to limit the memory of the algorithm, only
the degree of buffering that it does.

|And separating out compression and decompression procs is worthwhile.

People keep saying this but nobody will tell me why. Is it one of the
ten commandments or something? I realize that it is generally neater to
have three separate procedures, but given the similarity of the
parameter lists and the potential ease of manipulation as a first-class
object I think it is a clear decision.

|The biggest thing a standard could do for portability is this:
|  1) Define the file format as a series of blocks; each block can
|     be compressed with an arbitrary algorithm to a fixed interface.
|  2) Define the procedure interface exactly (eg decomp(inblock,outblock,
|     insize, outsize) -> status) and arrange for the encompassing
|     compressed file TO INCLUDE THIS C CODE so that any decoder can
|     extract it, link it in, and use it to extract the data.
|
|This might seem a stupid suggestion but it's the only one I can
|think of which will allow backwards and forwards compatibility. Otherwise
|you get into the pkzip thing of rushing around tracking newer versions
|of algorithms and having decompression programs with twenty different
|algorithms built in.

This is the most general way of encoding data - as a computer program.
However, it does discriminate against enormous compression algorithms
such as SAKDC. It would have to be a very strict, portable subset of C.

Anyway, as I said in an earlier reply, I am not trying to establish a
high-level archiving/interface/data-format standard; I am just trying
to get a procedural interface standard going, which you admit in the
above that you will need anyway (although if the decoder can compile
and link, why can't it run the code as well? Then the decompression
program could be a full Unix C program.)

|PS Ross, I probably wouldn't have shot from the lip in that last post
|if you'd titled it 'request for comments' or even 'draft standard'...

I have been using "standard", "standard proposal", etc. in an
undisciplined manner. I do not expect it to be adopted instantly. I
just want to develop something that everyone can agree on. I believe
that as soon as the standard is agreed on, new algorithms will pop up
in that format and we can start collecting them and comparing them.

----------------------------------------------------------------------
|From: mjr@hussar.dco.dec.com (Marcus J. Ranum)
|
| Maybe I missed it, but is there any consideration given to some
|kind of magic numbers to help identifying the proper means of decompressing
|compressed output produced by one of the "standard" compressors?
|
|mjr.

Yes. The designer of each algorithm tosses a coin 31 times to arrive at
an identity number which the algorithm then advertises to user
programs, which can then code the number in compressed data. The
standard does not attempt to define a data format (e.g. containing the
number) as it is concerned only with the interface between the
algorithm and the user program.

----------------------------------------------------------------------
--<End of this batch of replies to criticisms>--
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/28/91)
I don't mean to be cruel, Ross, but I suggest that we all ignore your
``standard'' until you have squashed two or three separate compressors
(LZW, AP, and LZRW1, say) into the same mold and published the results.

This is how all standards should work: *Do something first*, and
publish it. Then tell us about the important parts of the interface,
cutting out irrelevant details that might take up a lot of code and
spending time on important restrictions that don't involve any single
section of code. Say ``the routine must do blah'' if you think blah is
necessary for the user. Say ``the user must not do foo'' if you don't
think anyone will ever want foo in routines written to the standard.
Say ``the routine may or may not do bar'' if you did it both ways.
Voila, a standard. None of this goes anywhere unless you first publish
working code.

In article <899@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:

[ memory ]

The obvious solution is for the compression routine to provide its own
memory. This doesn't fit well with I/O-driven compression, though, or
with non-parametrized standards.

[ pointers ]

Face it, Ross: There are a whole lot of compressors which cannot be
written efficiently in Pascal or even Ada. ``An interface that even
fools can use will be used only by fools''; well, by making your
standard wimpier and wimpier so that the routines can be run on wimpier
and wimpier machines in wimpier and wimpier languages, you pretty much
guarantee that it will only be used that way.

If you're going to write the standard before the implementation, you
might as well take a lesson from POSIX: Use pointers where they
obviously belong, don't make design decisions based on how difficult
you think the interface will be to use in Ada, and postpone other
language bindings to the next decade.

> Perhaps the algorithm could advertise how much heap and how much stack
> it will use.

An excellent suggestion. (Naturally, my yabbawhap package includes a
``checkconf'' program which advertises this information, given the
compile parameters.)

> |Of course this standard code has to be ported to a number of platforms
> |but since almost every archive program is ported anyway this is not much
> |of a problem.
> |Identification codes have to be obtained from some central point where all
> |the codes are registered.
> No need. Just get everyone to generate random n-bit IDs. Make n large
> enough so that a collision is extremely unlikely.

Great. Now IDs are almost totally useless. What are you trying to
accomplish with those numbers?

[ parameters ]

> The reason I am so heavy on parameters stems, I think, from seeing
> dozens of papers in which they are not specified properly.

Writing a crippled compression standard is a poor way to express your
dissatisfaction with how papers are written.

> |> If a user wants to create another tuning, he can just fiddle with the
> |> source code for the algorithm until he is happy and then fiddle the
> |> identity record to create a new algorithm.
> |Uh-uh. This contradicts your stated goal of having the identification
> |numbers mean something for comparison purposes.
> Absolutely not, because if the user changed anything he would have to
> (according to the standard) generate a new algorithm name and ID.

Sorry I didn't make this clear. I publish yabba with a default of
-m65533. I give it some ID. Joe Shmoe changes the default to -m300000
and gives it another ID. Sally Shmoe changes the default to -m300000
and gives it yet another ID.
Now if Joe and Sally choose their IDs randomly, how do we tell that
they're using the same method? Or, for that matter, that they've just
changed a parameter in my method?

> |If you want to achieve that goal, just insist that everyone name all the
> |parameters (in my case, -m/-z/-Z) when quoting compression results. (It
> |gets even hairier when you're measuring speed and memory: for yabba,
> |you'd have to quote MOD and PTRS as well as the machine type.) Don't
> |force parameterized coders into a parameterless model.
> Well, maybe. But the algorithm designer has to commit himself on
> parameters to some extent. SAKDC has 25 parameters and I can dial up a
> multitude of adaptivities and memories and different order models.

I fail to see how this matters. The algorithm is parametrized. Period.

> |> (Another argument is that many algorithms can be implemented more
> |> efficiently if their parameters are statically specified).
> |Hardly.
> Absolutely. You should read the other critics. I have you telling me
> that dynamic parameters are not much less efficient than static ones.
> Meanwhile, there's the mob who don't like passing a pointer to
> read/write memory for the algorithm to use because they want to avoid
> indirect addressing in their code and pull tricks with the address
> range.

Right. Remember, we're talking about *can be implemented*, not can be
implemented under your awful set of restrictions. Someone might have a
compression method where he compares n to a parameter, x, in the
innermost loop. Does that mean that the algorithm can be implemented
more efficiently if its parameters are statically specified? No,
because every use of n can be ``flipped'' into x - n, and this can be
compared against 0. (A two-loop illustration appears at the end of this
article.)

Sure, there will exist parameters (like the hash function) where you
just can't do something like this. You still don't have to make those
parameters fixed. You can just tell the compiler to optimize out a
separate version for those parameter values:

   int my_special_hash(h,c)
   int h;
   int c;
   { ... }

   static int generic_routine(input,hash)
   char *input;
   int (*hash)();
   {
     ... all the compression code, parameterized on ``hash'', goes here ...
   }

   static int optimized_routine(input)
   char *input;
   {
   #pragma begin-inline-fun generic_routine
     return generic_routine(input,my_special_hash);
   #pragma end-inline-fun generic_routine
   }

   int user_visible_routine(input,hash)
   char *input;
   int (*hash)();
   {
     if (hash == my_special_hash)
       return optimized_routine(input);
     return generic_routine(input,hash);
   }

Of course, this is oversimplified, and with current C compilers you
have to stick generic_routine inside one big, ugly (but portable) macro
instead of using pragmas. But it works, it avoids code duplication, and
it gives you all the benefits of fixing the parameter to begin with. In
fact, if the compiler inlines something like

   user_visible_routine(myinput,my_special_hash)

in the user code, it will end up *exactly* as efficient as if
my_special_hash had been used everywhere in the first place. The only
disadvantage is the extra space needed for generic_routine, and that
can be avoided if you stick the routines inside a library to begin
with.

I believe the MIPS compilers can do all of the above, including
inlining from ucode libraries. They don't have the #pragma but that's
just syntactic sugar. Even under good old pcc, all you lose for the
parametrization is a comparison and an extra function call on each call
to the routine.

> It is a fact that statically binding parameters allows lots of
> optimizations.
Indeed... I don't complain that something can't be done. I just do it.
Are you so sure now that taking parameters out of the interface allows
better optimization?

> * They tend to confuse users who don't know how to set the
>   parameters.

As long as every parameter has a default, this is moot.

> * They must be algorithm specific.

So what?

> * They can overflow and be out of range and all the other things.

True. Another advantage of a program-based interface is that it can
take the time to check such things.

> The idea of having default settings and then allowing the user to
> specify otherwise at his own risk doesn't sound so bad so long as:
> * The settings do not affect memory consumption.

There you go again. How about you try to fit compress into this model?
If you manage to write one version which doesn't take too much memory
for -b12, but allows -b16, but doesn't have a memory-affecting
parameter, and doesn't allocate memory itself anyway, I'll be very
impressed.

[ compressor driving I/O versus I/O driving compressor ]

> This may be so, but placing the compressor in charge means that you
> have to tell it how to do IO, which is so messy in most programming
> languages (requiring Ada-like generic superroutines or C pointers to
> IO functions) as to completely destroy the standard.

See my general comments near the top. You have that choice: either use
a language which can compete with machine language, or make your
standard so wimpy that nobody will want to use it.

> I'm not saying that there won't be a little pain in converting some
> algorithms over to the standard (all good standards cause lots of pain
> :-), but I am sure that there will be very little loss of efficiency
> and great benefits.

So do it for compress. I want to see something that works just like
compress but uses your standard for an internal interface.

> |> I am wary of a stream interface because it is then impossible to place
> |> a bound on the amount of output one could get from feeding in a single
> |> byte of input (during either compression or decompression).
> |Again it sounds like you're focusing on modems. Why?
> I'm NOT focussing on modems - you're the one who keeps mentioning
> them!!! My logic goes something like this:
>   - We NEED to make the algorithm a procedure (not a process).
>   - We NEED to use memory blocks to feed bytes in and out (although
>     the blocks can be part of a stream interface).
> (I hope you agree that the alternatives to these are horrific.)

No, I don't agree with either one. The argument here is about the
second. You only need a bound on output because you have a block
interface, and unless you wanted to handle modems you could use a
stream interface instead. There's nothing horrific about ignoring
modems and sticking to a stream interface.

> |One advantage of this for C is that in those cases where you want to
> |hard code the parameters you can #define them in the source of the
> |compression function without changing the function's formal
> |parameters. (It seemed that Dan doesn't believe there can be an
> |advantage in hard coding the parameters, which I find peculiar. There
> |are many optimizations that can be done with known values over
> |variables.)
> If the user has the source code, he can do anything anyway. The real
> case for parameters lies in the case of someone who only has an
> executable.

See above. It doesn't matter whether you have the entire source or just
a library. Hard-coding parameters barely speeds anything up.

> Agreed.
> If the standard does allow parameters, they must go through the
> front door. What do you think of a string interface? E.g.:
>    compress(comp,mem,inblk,outblk,"M=4,K=91,Z=512");

Not too bad.

> I really don't think there would be any problem with implementing UNIX
> compress if the flush flag were restored. As I understand it, the LZW
> algorithm's context is very simple, consisting of just a table, an
> index and possibly one or two coding scalars.

Well, don't just understand it. Do it. Then tell us what you did.

> As a separate point, I don't see how a program-based standard could be
> specified at this level without assuming that languages have processes
> or superroutines or procedures as parameters.

Hmmm? What do you mean by that?

> |> I'd also like to repeat someone else's suggestion of separating the
> |> compression procedure and decompression procedure.
> |Ditto.
> But why?

Because if the procedures are in a library, separating them means you
can load only one into a program which needs to do only one thing. AP
coding (as in whap) makes a good example: the compressor is rather
complex and takes quite a bit of code space, but the decompressor is
trivial. A lot of programs just want to read compressed files...

> Imposed standards can and do work all the time (witness Ada, CDs).

Huh? Ada was indeed standardized before it was implemented, and it is a
royal flop. Very few people use it. The CD ``standard'' is just a
description of what people are doing; it works because it was
implemented first.

> Most standards have
> problems, but at least they give you something to kick off from.

No. Most *implementations* have problems---most importantly,
incompleteness---but at least they do something positive. Standards are
negative. They restrict you, and their main failure is being overly
complete.

> I don't see how I can "IMPLEMENT" my standard. I get the impression
> that you think that in order for me to do that, I would have to
> implement a work harness, a validation harness, and create conformant
> implementations of dozens of algorithms.

I don't know about Graham. I think you should just prove that the
established compressors can be recoded into your interface without too
much work.

> (As it happens, I have been using a standard similar to the one
> proposed (see lzrw_headers.h in my ftp archive on
> sirius.itd.adelaide.edu in pub/compression) for two years and have
> implemented five or six algorithms using it (LZRW1 is such an
> algorithm). These algorithms were implemented with the standard in
> mind though.)

Right. Which is why you should choose something like LZW, which does
not put a bound on output (well, 8x or so in theory, but not a fixed
number of bytes), and would have a much smaller user base if it were
not parametrized, and is naturally a stream algorithm.

[ separating compressor and decompressor ]

> Is it one of the ten commandments or something?

Yes.

> I have been using "standard", "standard proposal", etc. in an
> undisciplined manner. I do not expect it to be adopted instantly. I
> just want to develop something that everyone can agree on.

I don't care whether you call it a ``standard'', a ``proposed
standard'', a ``document'', a ``suggestion'', or an ``article''. I am
not going to write code to someone else's interface unless I can see
what its strengths and weaknesses are *in practice*. (Naturally, if you
develop a large enough body of code that takes advantage of the
interface somehow, that's a strong point.)

---Dan
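The ``flip'' described in the article above, spelled out in C (do_work
stands in for the body of the inner loop):

   /* Before: the inner loop compares the counter n against the
      run-time parameter x on every pass. */
   extern void do_work(int n);

   void inner_before(int x)
   {
      int n;
      for (n = 0; n < x; n++)
         do_work(n);
   }

   /* After: iterate on m = x - n and compare against 0, which is as
      cheap as comparing against a compile-time constant; each use
      of n becomes x - m. */
   void inner_after(int x)
   {
      int m;
      for (m = x; m > 0; m--)
         do_work(x - m);
   }

Hard-coding x buys essentially nothing here, which is Dan's point.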
jloup@nocturne.chorus.fr (Jean-Loup Gailly) (06/28/91)
ross=ross@spam.ua.oz.au (Ross Williams)
dan=brnstnd@kramden.acf.nyu.edu (Dan Bernstein)

dan> [ streams versus blocks ]
dan> I think the real argument here is over whether the compressor should
dan> drive the I/O or vice versa. In modem applications, your standard is
dan> fine: the I/O has to drive the compressor. But in most applications it's
dan> much easier to work the other way round.

ross> This may be so, but placing the compressor in charge means that you
ross> have to tell it how to do IO, which is so messy in most programming
ross> languages (requiring Ada-like generic superroutines or C pointers to
ross> IO functions) as to completely destroy the standard.

I agree with Dan. When the compressor drives the I/O, it keeps its
internal state when it outputs a block of compressed data and continues
with the rest of the input data. If the compressor is forced to stop
when the single output buffer is full, it has to save its internal
state explicitly in the MEMORY parameter suggested by Ross. If the
compressor state is in the middle of a deeply nested loop, it becomes
messy to restore this state at the next invocation of the compression
routine. (The same arguments apply to decompression.)

Even for modems, I don't see why the compressor cannot drive the I/O.
This may require some buffering of the data arriving from the modem
before it is compressed, but the OS usually takes care of this. If the
compressor is so slow that it cannot handle the input rate on the
average, you are in trouble anyway. If we ignore the details about
setting the tty modem line correctly and determining the end of the
input data, a command such as

   compress < /dev/ttyxxx > foo

should work. In this example, compress drives the I/O.

You don't even need pointers to IO functions. The
compression/decompression algorithm can directly call functions
readblock/writeblock which would have an interface similar to that of
the Unix read/write calls and would be well specified by the
compression standard. The compressor can easily construct macros
(sorry, inline functions) on top of these functions, similar to the
getc/putc macros. For a modem, readblock would return whatever bytes
have already been received. For many algorithms, this interface is much
easier to use than that proposed by Ross.

Also, Ross justifies his unique function for compression/decompression
by an array of function pointers: "this is a situation where programs
may wish to keep arrays of these algorithms". Either you do or you do
not have pointers to functions, but you cannot have it both ways.

Although I said above that it is preferable to have the compressor
drive the I/O, this does make it more difficult to try multiple
compression algorithms concurrently. If your language does not give you
concurrent threads or coroutines, the easiest solution is to read the
input file once per algorithm. But this is costly for very high speed
algorithms which are IO bound. Also, this is not applicable if the
input data must be read only once (modem or pipe). Dan, do you have a
suggestion for this?

Jean-loup Gailly
jloup@chorus.fr
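A sketch of the readblock/writeblock shape Jean-loup describes. The
exact signatures are assumptions, not a settled specification:

   /* Supplied by the user program and fixed by the standard; the
      shapes are modelled on the Unix read/write calls. */
   extern long readblock(unsigned char *buf, long max);  /* 0 at end */
   extern long writeblock(unsigned char *buf, long len);

   /* A compressor that drives its own I/O: its state lives in
      ordinary locals and loop nesting, because the routine never
      has to return in mid-stream.  The "compression" here is a
      pass-through placeholder. */
   void cm_drive(void)
   {
      unsigned char buf[4096];
      long n;
      while ((n = readblock(buf, (long) sizeof buf)) > 0)
         writeblock(buf, n);      /* a real coder compresses here */
   }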
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/29/91)
In article <11228@chorus.fr> jloup@nocturne.chorus.fr (Jean-Loup Gailly) writes:
> Although I said above that it is preferable to have the compressor
> drive the I/O, this does make it more difficult to try multiple
> compression algorithms concurrently. If your language does not give
> you concurrent threads or coroutines, the easiest solution is to read
> the input file once per algorithm. But this is costly for very high
> speed algorithms which are IO bound. Also, this is not applicable if
> the input data must be read only once (modem or pipe). Dan, do you
> have a suggestion for this?

Well, one advantage of passing in read-write function pointers to the
compression routine is that you can substitute routines to read from
memory instead. So the I/O problem disappears.

(Actually, that may be a bit too optimistic: the UNIX read-write
interface demands an extra copy into the buffer, and doesn't even allow
for truly asynchronous I/O. I've never seen a compressor that wants to
write to an input block, so the copies could, in principle, be avoided.
The read-write interface thus involves up to a 5% loss in speed on a
Convex [trust me, I've measured it!], probably around 15% on a Cray,
and about 2% on a Sun 4. Boo hoo.)

---Dan
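The substitution Dan mentions is only a few lines: a read-shaped
routine that pulls from memory instead of a file (the callback
signature is an assumption carried over from the sketch above):

   #include <string.h>

   static unsigned char *src;       /* position in the in-memory input */
   static long src_left;            /* bytes remaining                 */

   /* Drop-in replacement for a read-style callback. */
   long read_from_memory(unsigned char *buf, long max)
   {
      long n = (src_left < max) ? src_left : max;
      memcpy(buf, src, n);          /* the extra copy lamented above   */
      src += n;
      src_left -= n;
      return n;                     /* 0 signals end of input          */
   }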