[comp.binaries.ibm.pc.d] compressing compressed stuff

randy@umn-cs.cs.umn.edu (Randy Orrison) (05/04/88)

This whole conversation has me puzzled.  Witness:

	pkarc:  If compressing a file (by any method) does not result in
		a saving of space, it is stored verbatim in the archive.

	zoo:	putting a zoo archive into another zoo archive results in
		"0%" compression and the file getting only marginally larger
		(index information?)

	compress:	compressing a compressed file (e.g. a zoo archive)
		results only in wasting some time.  A .Z file is not created.
		(my test resulted in "-29.30% compression, file not compressed")

Summary:  all these methods of compression DO NOT COMPRESS if the result
would be larger than the original.  There is NO HARM in compressing a zoo
archive containing a pkarc of a binary file (except for the added index
information at each step).

The ONLY problem I can see in doing this is if the news software that does
compression isn't smart enough to check if the result is larger than the
original.  If it isn't smart enough, it should be.  There's no excuse for
not checking that simple condition.

	-randy
-- 
Randy Orrison, Control Data, Arden Hills, MN		randy@ux.acss.umn.edu
(Anyone got a Unix I can borrow?)   {ihnp4, seismo!rutgers, sun}!umn-cs!randy
The best book on programming for the layman is "Alice in Wonderland";
but that's because it's the best book on anything for the layman.

jerry@oliveb.olivetti.com (Jerry Aguirre) (05/06/88)

In article <5198@umn-cs.cs.umn.edu> randy@umn-cs.UUCP (Randy Orrison) writes:
>The ONLY problem I can see in doing this is if the news software that does
>compression isn't smart enough to check if the result is larger than the
>original.  If it isn't smart enough, it should be.  There's no excuse for
>not checking that simple condition.

Yes, there is an excuse.  The "news software" in question is "sendbatch"
which is just a shell script.  (No, that is not the excuse.)  It winds
up doing a pipe of:

	batch batch_file | compress | uux bla!rnews

Now compress is perfectly capable of telling you whether its output was
compressed or not.  But in this case, by the time it tells you, the input
is already gone!  Checking would require writing the batch and compress
output to files; then you could decide what to send based on whether
compress helped or not.

If you are willing to put up with the extra overhead of creating temp
files, it is trivial to modify the script to compress only when it is
effective.  I guess someone paying lots of money for the transmission
line would find this worthwhile.
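
Something along these lines would do it (a sketch only, not the real
sendbatch; the temp file names and the size check via "wc -c" are mine):

	# sketch only; not the real sendbatch
	TMP=/tmp/batch$$
	batch batch_file > $TMP
	compress < $TMP > $TMP.Z
	if test `wc -c < $TMP.Z` -lt `wc -c < $TMP`
	then
		uux - bla!rnews < $TMP.Z
	else
		uux - bla!rnews < $TMP
	fi
	rm -f $TMP $TMP.Z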

Actually, you could avoid the extra overhead: have batch run the
compress and check the exit status.  Batch already checkpoints itself,
so it could re-create the batch if the compress failed.  Have compress
write its output to the same file system as /usr/spool/uucp, and execute
"uux" with the "-l" option so that it makes a hard link to the file
instead of copying it.  You can then delete the original link to the
temp file.
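
Roughly like this (leaving out the batch checkpointing; the exit status
convention and the uux quoting are whatever your local compress and uucp
happen to use):

	# compress is assumed to exit non-zero when its output would be no
	# smaller; uux -l links the file into the spool instead of copying
	TMP=/usr/spool/uucp/tmp/batch$$		# same file system as the spool
	batch batch_file > $TMP
	if compress $TMP			# leaves $TMP.Z on success
	then
		uux -l "bla!rnews < !$TMP.Z"
		rm -f $TMP.Z
	else
		uux -l "bla!rnews < !$TMP"	# no savings; send it as-is
		rm -f $TMP
	fi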

All this starts to get a little non-portable.....

kent@happym.UUCP (Kent Forschmiedt) (05/07/88)

In article <5198@umn-cs.cs.umn.edu> randy@umn-cs.UUCP (Randy Orrison) writes:
[ says that compress and its cousins know better than to mess with a
  file if it won't get smaller.  Suggests that news software won't
  compress, or shouldn't, if stuff is already compressed, so what's
  the problem? ]


News articles are usually collected into batches before 
transmission. 

Some sites eat long articles, so there is a practical limit of
around 50 or 60k on article size.

So, the problem is:

Since many sites transmit batches that are several times that size,
even the longest article will generally get collected into a file
with other articles.  When that file is compressed before being sent
to another site, an article which is already compressed (say, by 40%)
and uuencoded will still get enough smaller to satisfy compress, but
it will not be as small as if the big article had only been uuencoded.
This is true whether the original is random binary or 6-bit text.
Compressing twice makes it worse.
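
You can see the effect for yourself with something like this (plain
compress and uuencode assumed; "article" stands in for any large binary,
and I'm leaving the batching of several articles together out of it):

	# rough comparison of the two paths
	compress -c article | uuencode article.Z > art.Z.uue
	uuencode article < article > art.uue
	compress -c art.Z.uue > art.Z.uue.Z	# what the batcher would transmit
	compress -c art.uue > art.uue.Z
	ls -l art.Z.uue.Z art.uue.Z		# compare transmitted sizes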

-- 
	Kent Forschmiedt -- kent@happym.UUCP, tikal!camco!happym!kent
	Happy Man Corporation  206-282-9598

linhart@topaz.rutgers.edu (Mike Threepoint) (05/13/88)

randy@umn-cs.UUCP (Randy Orrison) writes:
-=> This whole conversation has me puzzled.  Witness:

-=> 	pkarc:  If compressing a file (by any method) does not result in
-=> 		a saving of space, it is stored verbatim in the archive.

-=> 	zoo:	putting a zoo archive into another zoo archive results in
-=> 		"0%" compression and the file getting only marginally larger
-=> 		(index information?)

-=> 	compress:	compressing a compressed file (e.g. a zoo archive)
-=> 		results only in wasting some time.  A .Z file is not created.
-=> 		(my test resulted in "-29.30% compression, file not compressed")

If compress can't make a given file smaller, it won't compress it.
Really short files also fall into this category.  PKARC and zoo won't
either; they just store the file verbatim with an index header
containing the file name, date/time, CRC, and other vitals.  So they
would both store other archives uncompressed under most circumstances
(sometimes PKARC squeaks out a 2% Huffman on other ARC files).

Nested ARC's occur often in large systems like BBS software, usually
with a batch file to reproduce the directory structure and extract the
enclosed ARC's into the appropriate directories.  I wouldn't expect
anyone to do this in zoo, since the directory structure can be stored
much more simply.  (zoo is a much better choice for these systems, but
it's not the standard and PKARC compresses better.)

-=> Summary:  all these methods of compression DO NOT COMPRESS if the result
-=> would be larger than the original.  There is NO HARM in compressing a zoo
-=> archive containing a pkarc of a binary file (except for the added index
-=> information at each step).

Exactly.  (But see above parenthetic about PKARC.)

-=> The ONLY problem I can see in doing this is if the news software that does
-=> compression isn't smart enough to check if the result is larger than the
-=> original.  If it isn't smart enough, it should be.  There's no excuse for
-=> not checking that simple condition.

As I understand it, the news software generally uses compress, and
compress checks unless -f is specified.  Along those lines, though,
I've seen PKARC crunch a small file at 0% savings.  That's ridiculous.
Just storing it uncompressed would obviously be more efficient (and
faster to extract).
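
For comparison, a typical compress on the Unix side behaves like this
(exact messages vary from version to version):

	compress -v archive.zoo		# declines: output would be larger,
					# archive.zoo is left untouched
	compress -f archive.zoo		# forces archive.zoo.Z even if it grows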
-- 
"...billions and billions..."			| Mike Threepoint (D-ro 3)
			-- not Carl Sagan	| linhart@topaz.rutgers.edu
"...hundreds if not thousands..."		| FidoNet 1:107/513
			-- Pnews		| AT&T +1 (201)878-0937

hyc@math.lsa.umich.edu (Howard Chu) (05/25/88)

Just thought I'd play around a bit and see what all this meant... The following
summarizes a few minutes of messing around with uuencode, compress, and compact
on a Sun 3/260. While I'm only testing a single file, I'm sure it makes a pretty
convincing worst case test...

For reference, compress uses a 16-bit Lempel-Ziv-Welch compression scheme, and
compact uses an adaptive Huffman ("squeeze"-style) algorithm, which doesn't
need to store the decoding tree in the compacted file. This maps almost
directly onto the ARC program, except that ARC performs run-length encoding
on the input data before feeding it to any of its other compression
algorithms. (PKARC doesn't do this, by the way.)
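
Commands along these lines will reproduce a couple of the entries below
(stock compress, compact, and uuencode assumed; only a few of the
combinations are shown):

	cp /vmunix vmunix
	compress -c vmunix > vmunix.Z			# 16 bit LZW
	uuencode vmunix.Z < vmunix.Z > vmunix.Z.uue
	compress -c vmunix.Z.uue > vmunix.Z.uue.Z
	cp vmunix vmunix.tmp ; compact vmunix.tmp	# leaves vmunix.tmp.C
	mv vmunix.tmp.C vmunix.C
	ls -l vmunix*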

582556 May 24 15:28 vmunix              plain binary file
472774 May 24 15:36 vmunix.C		compacted (Huffman squeeze)
661987 May 24 15:46 vmunix.C.uue	compacted, then uuencoded
571778 May 24 15:46 vmunix.C.uue.Z	compacted, uuencoded, compressed
365675 May 24 15:28 vmunix.Z		compressed (16 bit)
358631 May 24 15:39 vmunix.Z.C		compressed, then compacted
502186 May 24 16:01 vmunix.Z.C.uue	compressed, compacted, uuencoded
449395 May 24 16:01 vmunix.Z.C.uue.Z	compressed, compacted, uuencoded, compressed again
512047 May 24 15:30 vmunix.Z.uue	uuencoded after compression
445229 May 24 15:30 vmunix.Z.uue.Z	compressed, uuencoded, compressed again
815678 May 24 15:28 vmunix.uue		uuencoded, no compression
462100 May 24 15:31 vmunix.uue.Z	compressed after uuencoding
460239 May 24 15:43 vmunix.uue.Z.C	uuencoded, compressed, then compacted

A few things worth noting:
  - while the results aren't always dramatic (and they certainly aren't in
  this case), a Huffman Squeeze will always reduce the size of a file
  already compressed by some form of Lempel-Ziv compression.
  - compressing, then uuencoding, is obviously better than just uuencoding.
  - since Lempel-Ziv compression typically yields 50% compression, and
  uuencoding gives about 33% expansion, the result will still be smaller
  than the original file (e.g. a 600k binary compresses to about 300k and
  uuencodes to roughly 400k, still well under the original size).
  - if your news software also tries to perform compression, it's still a
  good idea to compress, then uuencode. Compare:
	445229 May 24 15:30 vmunix.Z.uue.Z
	462100 May 24 15:31 vmunix.uue.Z
  - there is no vmunix.Z.Z or vmunix.C.Z in the above list. Immediately
  recompressing a compressed file is always a bad idea.

Your mileage will vary....
--
  /
 /_ , ,_.                      Howard Chu
/ /(_/(__                University of Michigan
    /           Computing Center          College of LS&A
   '              Unix Project          Information Systems