[net.micro] Squeezing program files.

hes@ecsvax.UUCP (Henry Schaffer) (06/05/85)

  Data Compression Using Static Huffman Code-Decode Tables
(Communications of the ACM 28:612-616 June 1985)
by D. R. McIntyre and M. A. Pechura
  "Both static and dynamic Huffman coding techniques are applied to
test data consisting of 530 source programs in four different
languages.  The results indicate that, for small files, a savings of
22-91 percent in compression can be achieved by using the static
instead of dynamic techniques."
-------- my summary ------
  It is common to compress (squeeze) programs for storage and
transmission economies.  A common technique is to count the
symbol (character) frequencies in the program file, assign
variable length codes, with more common symbols getting shorter
codes (Huffman coding), encode the program, and append the
encode table for subsequent decoding (unsqueezing).  This is
called dynamic Huffman coding.
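  For concreteness, here is a rough sketch of the dynamic method
(Python is used purely for illustration; the file name is made up,
the input is assumed non-empty, and the one-symbol-file special case
is skipped):

import heapq
from collections import Counter

def huffman_from_freq(freq):
    """Build a prefix code (symbol -> bit string) from a frequency table."""
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)                # two lightest subtrees
        w2, i, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}   # prefix one with 0,
        merged.update({s: "1" + c for s, c in t2.items()})  # the other with 1
        heapq.heappush(heap, (w1 + w2, i, merged))
    return heap[0][2]

def squeeze_dynamic(data):
    # Count this particular file's symbol frequencies, build the code,
    # encode, and hand back the table too: it must be appended to the
    # output so the unsqueezer can decode.
    table = huffman_from_freq(Counter(data))
    return "".join(table[b] for b in data), table

bits, table = squeeze_dynamic(open("prog.pas", "rb").read())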
  Static Huffman coding proceeds by starting with an encode
table which has been previously determined for programs in
general (or perhaps for that particular programming language.)
Therefore, the table need not be appended to the squeezed file.
On average this more than compensates for the less perfect encoding
of each particular program, and for short programs the static method
yielded major space savings.
  In addition, the static method is considerably faster, because
the accumulation of frequencies and the generation of the table are
omitted.
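  A matching sketch of the static variant, reusing huffman_from_freq
from the sketch above (these frequency numbers are invented for
illustration; the paper's tables were measured from real programs,
per language):

ASSUMED_FREQ = {b: 1 for b in range(256)}             # every byte gets a code
ASSUMED_FREQ.update({ord(" "): 500, ord("e"): 300,
                     ord("t"): 250, ord("\n"): 200})  # assumed-common symbols
STATIC_TABLE = huffman_from_freq(ASSUMED_FREQ)        # built once and shared by
                                                      # squeezer and unsqueezer

def squeeze_static(data):
    # A single pass: no counting, and no table appended to the output.
    return "".join(STATIC_TABLE[b] for b in data)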
  Source programs without trailing blanks compress to about 50%
of original size, while source program files which include trailing
blanks compress to about 30% of original size.
---------
  It is my impression that the typical squeeze programs use the
dynamic Huffman method.  Is this so?  If it is, should we consider
changing to the static method?
--henry schaffer   n c state univ

ken@turtlevax.UUCP (Ken Turkowski) (06/07/85)

In article <1414@ecsvax.UUCP> hes@ecsvax.UUCP (Henry Schaffer) writes:
>  Source programs without trailing blanks compress [with static Huffman
>coding] to about 50% of original size, while source program files which
>include trailing blanks compress to about 30% of original size.
>---------
>  It is my impression that the typical squeeze programs use the
>dynamic Huffman method.  Is this so?  If it is, should we consider
>changing to the static method?

I think you should consider changing to Lempel-Ziv Compression (posted
to the net as "compress", version 3.0), which normally gives 70%
compression (30% of original size) to text.  The program is fast, and
adapts to whatever type of data you give it, unlike static Huffman
coding.  It usually produces 90% (!) compression on binary images.
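For anyone who hasn't run into it, compress implements the LZW flavor
of Lempel-Ziv: it grows a string table on the fly and emits one code
for each longest string already in the table.  A stripped-down sketch
(Python, for illustration only; the real program also uses
variable-width output codes, a table-size limit, and table resets):

def lzw_codes(data):
    table = {bytes([b]): b for b in range(256)}  # start with all single bytes
    out, prefix = [], b""
    for b in data:
        candidate = prefix + bytes([b])
        if candidate in table:
            prefix = candidate                   # keep extending the match
        else:
            out.append(table[prefix])            # emit longest known string
            table[candidate] = len(table)        # learn the new string
            prefix = bytes([b])
    if prefix:
        out.append(table[prefix])
    return out                    # compress writes these as 9- to 16-bit codes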
-- 

Ken Turkowski @ CADLINC, Menlo Park, CA
UUCP: {amd,decwrl,hplabs,nsc,seismo,spar}!turtlevax!ken
ARPA: turtlevax!ken@DECWRL.ARPA

sean@ukma.UUCP (Sean Casey) (06/11/85)

In article <784@turtlevax.UUCP> ken@turtlevax.UUCP (Ken Turkowski) writes:
>I think you should consider changing to Lempel-Ziv Compression (posted
>to the net as "compress", version 3.0), which normally gives 70%
>compression (30% of original size) to text.  The program is fast, and
>adapts to whatever type of data you give it, unlike static Huffman
>coding.  It usually produces 90% (!) compression on binary images.

WHOA BUDDY!

Lempel-Ziv doesn't do NEARLY that well.  We've been using it for
months, and we've found that text and program sources usually get about
55-65% compression, while binaries get about 45-55% compression, and
that is in the best case of compressing a large archive of files.  As
files get smaller, especially as they drop below about 8k in size,
compression worsens.  I seriously doubt that most binaries contain only
10% unambiguous information, much less that they are compressible to
that size.

-- 

-  Sean Casey				UUCP:	{cbosgd,anlams,hasmed}!ukma!sean
-  Department of Mathematics		ARPA:	ukma!sean@ANL-MCS.ARPA	
-  University of Kentucky

jordan@ucbvax.ARPA (Jordan Hayes) (06/12/85)

I think the high compression percentage quoted for binary images may
be due to "compressing" un-stripped binary images. When doing a
uuencode of such binaries, there's a lot of blank space, and sometimes
that alone can reach 30% of an image.

Also, I got a strange output from compress the other day, something
along the lines of " -25% compression -- not changed"

Admittedly I was trying to compress an encrypted tar file, but ...

/jordan
-------
ARPA:	jordan@berkeley.ARPA
UUCP:	..!ucbvax!jordan

sean@ukma.UUCP (Sean Casey) (06/14/85)

In article <8072@ucbvax.ARPA> jordan@ucbvax.UUCP (Jordan Hayes) writes:
>Also, I got a strange output from compress the other day, something
>along the lines of " -25% compression -- not changed"
>
>Admittedly I was trying to compress an encrypted tar file, but ...

Compress normally refuses to compress a file that would turn out to be
longer after compression than before.  It can be forced to do it anyway
with the -F option.

As documented, compress seems to work on just about everything except
encrypted and already-compressed files.

It's interesting to note that compressing a file before encryption
makes it much harder to decrypt.


-- 

-  Sean Casey				UUCP:	{cbosgd,anlams,hasmed}!ukma!sean
-  Department of Mathematics		ARPA:	ukma!sean@ANL-MCS.ARPA	
-  University of Kentucky

jbuck@epicen.UUCP (Joe Buck) (06/15/85)

compress will almost always fail when trying to compress an encrypted file,
at least if the encryption is good. A good encryption algorithm removes
correlation between adjacent characters and makes all 256 bytes roughly
equally probable. So compress can't do anything with encrypted data. You
could compress first and then encrypt, though.
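A quick way to see the point, using Python's zlib as a stand-in
compressor and os.urandom as a stand-in for well-encrypted bytes:

import os, zlib

text = b"the quick brown fox jumps over the lazy dog\n" * 1000
noise = os.urandom(len(text))        # uniformly random, like good ciphertext

print(len(text), "->", len(zlib.compress(text)))    # shrinks dramatically
print(len(noise), "->", len(zlib.compress(noise)))  # comes out slightly larger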

Another place where compress fails badly is on sampled speech files,
where two bytes are used to hold each speech sample. The reason is that
compress can't exploit the situation when numerical values vary slowly;
it can only deal with repeated patterns. Other algorithms are much more
successful at compressing speech.

Similar problems exist on files containing float data (unless there are
lots of repeated values).
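The usual trick for such data is to delta-code it first: differencing
consecutive samples turns a slow drift into lots of small, repetitive
values that a general-purpose compressor can then exploit.  A sketch,
assuming 16-bit little-endian samples in a made-up file, with zlib
standing in for the compressor:

import struct, zlib

raw = open("speech.raw", "rb").read()
raw = raw[:len(raw) // 2 * 2]                    # keep whole 16-bit samples
samples = struct.unpack("<%dH" % (len(raw) // 2), raw)
deltas = [samples[0]] + [(b - a) & 0xFFFF        # differences mod 2**16, so
          for a, b in zip(samples, samples[1:])] # the step is reversible
packed = struct.pack("<%dH" % len(deltas), *deltas)

print(len(zlib.compress(raw)), "vs", len(zlib.compress(packed)), "bytes")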

But it's great for text and executable images, the latter because they
have lots of zeros and because some instructions are much more common
than others.
-- 
Joe Buck		Entropic Processing, Inc. (epi)
	  		{ucbvax,ihnp4}!dual!epicen!jbuck