jel@wet.UUCP (John Levine) (06/05/91)
Hello, all! I was recently meditating on the hassle of having to deal with all the different file compression formats (.zip, .arc, .tar.Z, .zoo, et cetera), and I have a suggestion. It concerns the implementation of data decompression routines rather than their underlying algorithms.

Different compression algorithms seem to work best for different data. Indeed, I can imagine someone examining a file and compressing it "by hand" to a small fraction of its original size, using common sense, ad hoc techniques, and a priori knowledge about its contents, and doing a better job than a program could.

So, why not format a compressed file as follows? Put the decompressor itself into the header of the compressed file, in a terse but *general* machine language. Since most of the expense of data compression seems to lie in finding patterns in the original data and then coding them as the compressed file, decompression is usually just a matter of following the directions for reconstructing the original file. Fast, in other words. The decompressor header could even be interpreted, though it would be faster to define it in such a way that it could be compiled on the fly, like some so-called machine-independent assemblers. Files big enough to need compression are usually much bigger than a description of their compression format, so there would be little overhead there. In fact, there is a PC compressor--I think it's PKZIP--which can produce compressed files such that to recover the original file you just run the compressed file!

This machine-independent machine-language header would be completely general, perhaps with some special operations such as "return the i'th most frequently occurring English word", according to some standard table. The advantage of doing things with this generality is that there would be a single data compression format, regardless of the technique used for the actual compression. So whether the actual algorithm is LZ or arithmetic or fractal or xyz (which will be invented in late 1998), there is still ONE decompression program you run to recover the original file, whether it's audio or text or graphics or numbers from your physics experiment. An important large static file at some distribution site might even be compressed by hand, using a compression toolkit to analyze the peculiar regularities of that file and tailor a compression scheme to it. Still, as far as the user is concerned it's all in the same format. You can call this format "Chameleon". :-)

So, what's wrong with this idea? Comments, criticism, countersuggestions, improvements all welcome. (A rough sketch of the kind of interpreter and file layout I have in mind follows below my signature.)

jel @ (37 54 N / 122 18 W)
jel@sutro.sfsu.edu  ...  uunet!sun!hoptoad!wet!jel
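
P.S. To make the header idea concrete, here is a rough sketch in C of the one-and-only decompression tool: a tiny virtual-machine interpreter. The compressed file would carry, as its header, a program for this VM, and that program reads the payload and reproduces the original file. Everything below is invented purely for illustration (the "CHAM" magic string, the instruction set, the layout); it is meant to show the shape of the thing, not to be a real design.

/* chameleon.c -- toy sketch of the "Chameleon" idea: the compressed
 * file carries its own decompressor as a program for a tiny, fixed
 * virtual machine, and this one interpreter recovers any such file.
 * The magic string, opcodes, and layout are all made up for
 * illustration; none of this is a real or proposed encoding.
 *
 * Assumed file layout:
 *   4 bytes   magic "CHAM"
 *   2 bytes   program length N (big-endian)
 *   N bytes   VM program (the "decompressor header")
 *   rest      compressed payload, read by the program via the IN op
 */
#include <stdio.h>
#include <string.h>

enum {                   /* made-up instruction set */
    HALT = 0,            /* stop                                    */
    LDI,                 /* LDI r, imm16   r <- 16-bit immediate    */
    IN,                  /* IN  r          r <- next payload byte,  */
                         /*                or 0xFFFF at end of file */
    OUT,                 /* OUT r          write low byte of r      */
    ADD,                 /* ADD r, a, b    r <- (a + b) mod 2^16    */
    SUB,                 /* SUB r, a, b    r <- (a - b) mod 2^16    */
    LD,                  /* LD  r, a       r <- mem[a]              */
    ST,                  /* ST  a, r       mem[a] <- r              */
    JNZ,                 /* JNZ r, addr16  jump if r != 0           */
    JMP                  /* JMP addr16     unconditional jump       */
};

static unsigned char prog[65536];   /* the decompressor "header" program */
static unsigned mem[65536], reg[16], prog_len;

static unsigned imm16(unsigned pc)  /* fetch a 16-bit operand */
{
    return ((unsigned)prog[pc] << 8) | prog[pc + 1];
}

static int run(FILE *in, FILE *out) /* the ONE decompressor: a tiny VM */
{
    unsigned pc = 0;
    int c;

    for (;;) {
        if (pc >= prog_len) return 1;             /* ran off the program */
        switch (prog[pc++]) {
        case HALT: return 0;
        case LDI:  reg[prog[pc] & 15] = imm16(pc + 1); pc += 3; break;
        case IN:   c = getc(in);
                   reg[prog[pc] & 15] = (c == EOF) ? 0xFFFF : (unsigned)c;
                   pc += 1; break;
        case OUT:  putc((int)(reg[prog[pc] & 15] & 0xFF), out); pc += 1; break;
        case ADD:  reg[prog[pc] & 15] =
                     (reg[prog[pc+1] & 15] + reg[prog[pc+2] & 15]) & 0xFFFF;
                   pc += 3; break;
        case SUB:  reg[prog[pc] & 15] =
                     (reg[prog[pc+1] & 15] - reg[prog[pc+2] & 15]) & 0xFFFF;
                   pc += 3; break;
        case LD:   reg[prog[pc] & 15] = mem[reg[prog[pc+1] & 15]]; pc += 2; break;
        case ST:   mem[reg[prog[pc] & 15]] = reg[prog[pc+1] & 15]; pc += 2; break;
        case JNZ:  pc = reg[prog[pc] & 15] ? imm16(pc + 1) : pc + 3; break;
        case JMP:  pc = imm16(pc); break;
        default:   return 1;                      /* bad opcode */
        }
    }
}

int main(int argc, char **argv)
{
    unsigned char magic[4], lenbuf[2];
    FILE *in = (argc > 1) ? fopen(argv[1], "rb") : stdin;

    if (!in || fread(magic, 1, 4, in) != 4 || memcmp(magic, "CHAM", 4) != 0) {
        fprintf(stderr, "not a Chameleon file\n");
        return 1;
    }
    if (fread(lenbuf, 1, 2, in) != 2) {
        fprintf(stderr, "truncated header\n");
        return 1;
    }
    prog_len = ((unsigned)lenbuf[0] << 8) | lenbuf[1];
    if (fread(prog, 1, prog_len, in) != prog_len) {
        fprintf(stderr, "truncated header\n");
        return 1;
    }
    return run(in, stdout);
}

With this (made-up) instruction set, a do-nothing "stored" header is about 20 bytes: IN r0 / LDI r1,0xFFFF / SUB r2,r0,r1 / JNZ r2,emit / HALT / emit: OUT r0 / JMP 0, which just copies the payload straight through until end of file. A real LZ or arithmetic decoder would be a longer program, but still tiny next to any file worth compressing, which is the point about the header overhead above.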