hes@ecsvax.UUCP (Henry Schaffer) (06/05/85)
Data Compression Using Static Huffman Code-Decode Tables (Communications
of the ACM 28:612-616, June 1985) by D. R. McIntyre and M. A. Pechura:

  "Both static and dynamic Huffman coding techniques are applied to test
  data consisting of 530 source programs in four different languages.
  The results indicate that, for small files, a savings of 22-91 percent
  in compression can be achieved by using the static instead of dynamic
  techniques."

-------- my summary ------

It is common to compress (squeeze) programs for storage and transmission
economies.  A common technique is to count the symbol (character)
frequencies in the program file, assign variable-length codes, with more
common symbols getting shorter codes (Huffman coding), encode the
program, and append the encode table for subsequent decoding
(unsqueezing).  This is called dynamic Huffman coding.

Static Huffman coding starts instead with an encode table determined in
advance for programs in general (or perhaps for that particular
programming language), so the table need not be appended to the program.
On average this more than compensates for the less perfect encoding of
each particular program, and for short programs the static method gave
major space savings.  In addition, the static method is considerably
faster, because the accumulation of frequencies and the generation of
the table are omitted.  Source programs without trailing blanks compress
to about 50% of original size, while source program files which include
trailing blanks compress to about 30% of original size.

---------

It is my impression that the typical squeeze programs use the dynamic
Huffman method.  Is this so?  If it is, should we consider changing to
the static method?

--henry schaffer  n c state univ
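For readers who haven't seen Huffman coding spelled out, here is a
minimal sketch in C of the table construction Henry describes:
frequencies go in, one code length per symbol comes out.  The
frequencies, symbols, and function names are illustrative only and are
not taken from the McIntyre/Pechura paper.  The only difference between
the dynamic and static schemes is where the frequency table comes from:
counted from the file being squeezed, or fixed in advance.

/* Minimal sketch of Huffman code construction from symbol frequencies,
 * assuming a 256-symbol byte alphabet.  A dynamic coder would fill
 * weight[] by counting the file being squeezed (and ship the table with
 * the output); a static coder uses a table fixed in advance for
 * "typical" source text.
 */
#include <stdio.h>

#define NSYM  256
#define NNODE (2 * NSYM - 1)

static long weight[NNODE];   /* node weights (frequencies)        */
static int  parent[NNODE];   /* parent index, -1 for a root       */
static int  codelen[NSYM];   /* resulting code length per symbol  */

/* Return index of the lowest-weight live root, or -1 if none. */
static int smallest_root(int nnodes, int skip)
{
    int i, best = -1;
    for (i = 0; i < nnodes; i++)
        if (i != skip && parent[i] == -1 && weight[i] > 0 &&
            (best == -1 || weight[i] < weight[best]))
            best = i;
    return best;
}

/* Build the tree and fill codelen[] from weight[0..NSYM-1]. */
static void build_huffman(void)
{
    int nnodes = NSYM, a, b, s, n;

    for (n = 0; n < NNODE; n++)
        parent[n] = -1;

    for (;;) {
        a = smallest_root(nnodes, -1);
        b = smallest_root(nnodes, a);
        if (a < 0 || b < 0)              /* one root left: done     */
            break;
        weight[nnodes] = weight[a] + weight[b];
        parent[a] = parent[b] = nnodes++;
    }
    for (s = 0; s < NSYM; s++)           /* code length = tree depth */
        for (n = s; parent[n] != -1; n = parent[n])
            codelen[s]++;
}

int main(void)
{
    /* Toy "static" table: pretend these counts came from a large
     * sample of source programs rather than from the file at hand. */
    weight[' '] = 1000; weight['e'] = 400; weight['t'] = 300;
    weight['('] = 120;  weight[')'] = 120; weight['z'] = 5;

    build_huffman();
    printf("' ' -> %d bits, 'z' -> %d bits\n", codelen[' '], codelen['z']);
    return 0;
}

With these toy counts the blank gets a 1-bit code and the rare 'z' a
5-bit code, which is all a Huffman table is: short codes for common
symbols, long codes for rare ones.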
ken@turtlevax.UUCP (Ken Turkowski) (06/07/85)
In article <1414@ecsvax.UUCP> hes@ecsvax.UUCP (Henry Schaffer) writes:
> Source programs without trailing blanks compress [with static Huffman
>coding] to about 50% of original size, while source program files which
>include trailing blanks compress to about 30% of original size.
>---------
> It is my impression that the typical squeeze programs use the
>dynamic Huffman method.  Is this so?  If it is, should we consider
>changing to the static method?

I think you should consider changing to Lempel-Ziv compression (posted
to the net as "compress", version 3.0), which normally gives 70%
compression (30% of original size) to text.  The program is fast, and
adapts to whatever type of data you give it, unlike static Huffman
coding.  It usually produces 90% (!) compression on binary images.
--
Ken Turkowski @ CADLINC, Menlo Park, CA
UUCP: {amd,decwrl,hplabs,nsc,seismo,spar}!turtlevax!ken
ARPA: turtlevax!ken@DECWRL.ARPA
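As a rough illustration of what "adapts to whatever type of data you
give it" means, here is a toy LZW-style encoder in the spirit of
compress.  It builds its string table while it reads the input, so there
is no frequency pass and no table appended to the output; a decoder can
rebuild the same table from the codes.  This is a sketch only: the real
compress uses hashing, variable-width codes, and table resets, none of
which appear here, and all names below are invented.

/* Toy LZW encoder: the string table grows adaptively as input is read.
 * Codes 0..255 stand for single bytes; each new code is an existing
 * code extended by one byte.  Linear search keeps the sketch short.
 */
#include <stdio.h>
#include <string.h>

#define MAXCODES 4096

struct entry { int prefix; int ch; };   /* code = prefix code + one byte */
static struct entry dict[MAXCODES];
static int ncodes;

/* Find the code for (prefix, ch), or -1 if not yet in the table. */
static int lookup(int prefix, int ch)
{
    int i;
    for (i = 256; i < ncodes; i++)
        if (dict[i].prefix == prefix && dict[i].ch == ch)
            return i;
    return -1;
}

static void lzw_encode(const unsigned char *in, int len)
{
    int w, i, k;

    ncodes = 256;                        /* codes 0..255 = single bytes  */
    if (len == 0)
        return;
    w = in[0];
    for (i = 1; i < len; i++) {
        k = lookup(w, in[i]);
        if (k != -1) {
            w = k;                       /* extend the current match     */
        } else {
            printf("%d ", w);            /* emit code for current match  */
            if (ncodes < MAXCODES) {     /* learn the new string         */
                dict[ncodes].prefix = w;
                dict[ncodes].ch = in[i];
                ncodes++;
            }
            w = in[i];
        }
    }
    printf("%d\n", w);
}

int main(void)
{
    const char *s = "TOBEORNOTTOBEORTOBEORNOT";
    lzw_encode((const unsigned char *)s, (int)strlen(s));
    return 0;
}

Repeated strings such as "TOBEOR" end up covered by single codes the
second time they appear, which is why the method does well on text and
on data with long runs, with no prior knowledge of the input.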
sean@ukma.UUCP (Sean Casey) (06/11/85)
In article <784@turtlevax.UUCP> ken@turtlevax.UUCP (Ken Turkowski) writes:
>I think you should consider changing to Lempel-Ziv compression (posted
>to the net as "compress", version 3.0), which normally gives 70%
>compression (30% of original size) to text.  The program is fast, and
>adapts to whatever type of data you give it, unlike static Huffman
>coding.  It usually produces 90% (!) compression on binary images.

WHOA BUDDY!  Lempel-Ziv doesn't do NEARLY that well.  We've been using
it for months, and we've found that text and program sources usually get
about 55-65% compression, while binaries get about 45-55% compression.
Those figures are for the optimal case of compressing a large archive of
files.  As files get smaller, especially as they drop below about 8K in
size, compression worsens.  I seriously doubt that most binaries contain
only 10% unambiguous information, much less that they could be
compressed to that size.
--
- Sean Casey                 UUCP:  {cbosgd,anlams,hasmed}!ukma!sean
- Department of Mathematics  ARPA:  ukma!sean@ANL-MCS.ARPA
- University of Kentucky
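Part of the disagreement over the numbers may simply be the two ways a
ratio gets quoted in this thread: "compressed to X% of original size"
(Henry) versus "X% compression", meaning X% saved (Ken, Sean).  A tiny
sketch with made-up byte counts, just to pin the arithmetic down:

/* Two conventions for quoting a compression ratio.  The sizes below
 * are hypothetical, chosen only to make the two figures easy to read.
 */
#include <stdio.h>

int main(void)
{
    long original = 100000, compressed = 40000;    /* made-up byte counts   */
    double kept  = 100.0 * compressed / original;  /* "to 40% of original"  */
    double saved = 100.0 - kept;                   /* "60% compression"     */

    printf("kept %.0f%% of original size, i.e. %.0f%% compression\n",
           kept, saved);
    return 0;
}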
jordan@ucbvax.ARPA (Jordan Hayes) (06/12/85)
I think the high compression percentage quoted for binary images may be
due to "compressing" un-stripped binary images.  When doing a uuencode
of such binaries, there's a lot of blank space, and sometimes that alone
can reach 30% of an image.

Also, I got a strange output from compress the other day, something
along the lines of

	" -25% compression -- not changed"

Admittedly I was trying to compress an encrypted tar file, but ...

/jordan
-------
ARPA: jordan@berkeley.ARPA
UUCP: ..!ucbvax!jordan
sean@ukma.UUCP (Sean Casey) (06/14/85)
In article <8072@ucbvax.ARPA> jordan@ucbvax.UUCP (Jordan Hayes) writes:
>Also, I got a strange output from compress the other day, something
>along the lines of " -25% compression -- not changed"
>
>Admittedly I was trying to compress an encrypted tar file, but ...

Compress normally refuses to compress a file that would turn out to be
longer after compression than before.  It can be forced to do it anyway
with the -F option.  As documented, compress seems to work on just about
everything except encrypted and already-compressed files.  It's
interesting to note that compressing a file before encryption makes it
much harder to decrypt.
--
- Sean Casey                 UUCP:  {cbosgd,anlams,hasmed}!ukma!sean
- Department of Mathematics  ARPA:  ukma!sean@ANL-MCS.ARPA
- University of Kentucky
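Here is a sketch of the decision Sean describes, with made-up sizes.  It
is not compress's source, just the documented behavior (keep the result
only if it is smaller, unless forced, as with the -F flag), and it shows
how a message like Jordan's "-25% compression" can arise when the output
grows by a quarter.

/* Sketch of the keep-only-if-smaller check.  The message text only
 * mimics the wording quoted in the thread; the real program's output
 * may differ.  "force" stands in for the effect of the -F flag.
 */
#include <stdio.h>

/* Return 1 if the compressed file should replace the original. */
static int should_replace(long insize, long outsize, int force)
{
    double pct = 100.0 * (1.0 - (double)outsize / (double)insize);

    if (outsize >= insize && !force) {
        /* e.g. encrypted input: outsize > insize, so pct goes negative */
        printf("%.0f%% compression -- not changed\n", pct);
        return 0;
    }
    return 1;
}

int main(void)
{
    should_replace(10000, 12500, 0);   /* prints "-25% compression ..." */
    return 0;
}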
jbuck@epicen.UUCP (Joe Buck) (06/15/85)
compress will almost always fail when trying to compress an encrypted
file, at least if the encryption is good.  A good encryption algorithm
removes the correlation between adjacent characters and makes all 256
byte values roughly equally probable, so compress can't do anything with
encrypted data.  You could compress first and then encrypt, though.

Another place where compress fails badly is on sampled speech files,
where two bytes are used to hold each speech sample.  The reason is that
compress can't exploit the situation where numerical values vary slowly;
it can only deal with repeated patterns.  Other algorithms are much more
successful at compressing speech.  Similar problems exist with files
containing floating-point data (unless there are lots of repeated
values).

But it's great for text and for executable images, the latter because
they have lots of zeros and because some instructions are much more
common than others.
--
Joe Buck
Entropic Processing, Inc. (epi)
{ucbvax,ihnp4}!dual!epicen!jbuck
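To make the encrypted-data point concrete, here is a small sketch that
estimates zero-order byte entropy; the sample string and function name
are invented for illustration.  If a good cipher leaves all 256 byte
values about equally likely, this estimate comes out near 8 bits per
byte, leaving nothing for a byte- or pattern-oriented coder to squeeze,
while ordinary text measures well under 8.

/* Zero-order entropy estimate in bits per byte, from observed byte
 * frequencies.  Link with -lm for log().
 */
#include <stdio.h>
#include <math.h>

static double bits_per_byte(const unsigned char *buf, long n)
{
    long count[256] = {0};
    double h = 0.0;
    long i;

    for (i = 0; i < n; i++)
        count[buf[i]]++;
    for (i = 0; i < 256; i++) {
        if (count[i] > 0) {
            double p = (double)count[i] / (double)n;
            h -= p * log(p) / log(2.0);   /* -sum p log2 p */
        }
    }
    return h;
}

int main(void)
{
    unsigned char text[] = "the quick brown fox jumps over the lazy dog ";
    printf("sample text: %.2f bits/byte\n",
           bits_per_byte(text, (long)(sizeof text - 1)));
    return 0;
}

The same reasoning covers Joe's speech example from the other direction:
the raw two-byte samples look nearly patternless to a byte-pattern coder
even though adjacent samples differ only slightly, which is why schemes
that code the differences between samples do so much better there.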