[net.unix] Compaction Algorithm

rwl@uvacs.UUCP (Ray Lubinsky) (02/25/86)

> > I am in need of a packing algorithm which works better than the
> > PACK Utility of UNIX.  I have also looked at COMPRESS ( developed
> > at University of Utah ).  COMPRESS works great if the number of
> > distinct input byte values is small.  But if the distinct byte values
> > reach 256 ( binary data ), PACK works better than COMPRESS.  With PACK
> > I am getting a saving of 20-25%.  If anybody has an algorithm
> > that would do better in packing "load modules", I would like to
> > know about it.
> 
> Are you sure you have the latest version of "compress"?  I tried "pack"ing
> and "compress"ing "/usr/bin/suntools" (which is a BMF executable image) and
> "compress" did significantly better than "pack" did (it took significantly
> more time doing it, but that's life).  Remember, Lempel-Ziv compression will
> discover frequently-occurring sequences of bytes, and unless your machine
> has all one-byte instructions you're likely to get some multi-byte sequences
> occurring frequently.
> -- 
> 	Guy Harris
> 	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
> 	guy@sun.arpa	(yes, really)

Hold on thar, Babaloo!  If you mean better in terms of *byte* savings, I
imagine ``compress'' could easily do better than ``pack''.  But if you're
talking about *block* savings, I'm dubious that ``pack'' will be much improved
upon.  I don't know about your system, but my Vax running 4.2 BSD permits
internal fragmentation, so it's disk block savings that count.

Now, I'm not entirely familiar with ``compress'', but I can compare with
``compact''.  When I created a file of 2000 bytes (one identical character
per line plus a newline), ``compress'' boasted of > 85% compression, while
pack only claimed 50% compression, but each of the results consumed the same
amount of blocks.  Hence the same effective compression.

Squeezing a few extra bytes out of a file can only be worth it if it results in
reducing by the 1K minimum data block size (2 basic file system blocks).  Is
this often the case?  (On my system, ``pack'' runs considerably faster than
``compact'', so the choice is easy.)
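
To make the block arithmetic concrete, here's a throwaway sketch.  It just
assumes a 4.2 BSD-style file system that allocates in 1K fragments, and the
1000- and 300-byte figures are only the rounded pack and compress results
from the 2000-byte experiment above, not new measurements.

/* frags.c -- rough illustration only: disk use in 1K fragments */
#include <stdio.h>

#define FRAG 1024L                              /* smallest unit of allocation */

static long frags(long bytes)
{
        return (bytes + FRAG - 1) / FRAG;       /* round up */
}

int main(void)
{
        printf("2000 bytes (original)      -> %ld frags\n", frags(2000L));
        printf("1000 bytes (~50%% packed)   -> %ld frags\n", frags(1000L));
        printf(" 300 bytes (~85%% squeezed) -> %ld frags\n", frags(300L));
        return 0;
}

Both results land in a single fragment, so the extra bytes compress saves
buy nothing here.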
-- 

Ray Lubinsky	University of Virginia, Dept. of Computer Science
		UUCP: ...!cbosgd!uvacs!rwl or ...!decvax!mcnc!ncsu!uvacs!rwl

guy@sun.uucp (Guy Harris) (03/01/86)

> Hold on thar, Babaloo!  If you mean better in terms of *byte* savings, I
> imagine ``compress'' could easily do better than ``pack''.  But if you're
> talking about *block* savings, I'm dubious that ``pack'' will be much
> improved upon.  I don't know about your system, but my Vax running 4.2 BSD
> permits internal fragmentation, so it's disk block savings that count.

1) I assume you "don't know about (my) system" because your news reading
program doesn't display the "Organization:" line.  FYI, as it's sent out
from this site, it's "Sun Microsystems, Inc."  I'm sure you can figure out
what system my machine is running from that.

2) Most operating systems do internal fragmentation.  It's hardly specific
to VAXes running 4.2BSD.  Even <SPOILER WARNING> Suns running 4.2BSD do it.
(It's also hardly a question of "permitting" internal fragmentation.  It's
not as if UNIX gives you a choice.  If you don't want internal
fragmentation, you have to put several files together into something like
an "ar" or "tar" archive.)

> Now, I'm not entirely familiar with ``compress'', but I can compare with
> ``compact''.  When I created a file of 2000 bytes (one identical character
> per line plus a newline), ``compress'' boasted of > 85% compression, while
> pack only claimed 50% compression, but each of the results consumed the same
> amount of blocks.  Hence the same effective compression.

Big deal.  A 2000-byte file is two 1K frags.  The person was asking about
"load modules", by which I presume he means executable images.  "/bin/cat"
is 24 1K frags on my system ("'cat -v' considered harmful" flames to
/dev/null, please, I'm just reporting the facts).  They were probably
interested in compressing large executable images, so the internal
fragmentation is probably a very small percentage of the file size, and thus
the savings in blocks is infinitesimally different from the savings in
bytes.

OK, let's terminate the debate with some hard data:

Original file:

 664 -rwxr-xr-x  1 root       671744 Feb 19 12:25 /usr/bin/suntools

"pack":

 520 -rwxr-xr-x  1 guy        519660 Feb 28 18:33 suntools.z

pack: suntools: 22.6% Compression

real	1m2.98s
user	0m46.00s
sys	0m12.91s

"compact":

 520 -rwxr-xr-x  1 guy        519916 Feb 28 18:55 suntools.C

suntools:  Compression :   22.60%

real	16m17.15s
user	12m44.50s
sys	0m15.15s

"compress":

suntools: Compression: 43.18% -- replaced with suntools.Z

real	1m39.90s
user	1m25.65s
sys	0m4.63s

 384 -rwxr-xr-x  1 guy        382395 Feb 28 18:36 suntools.Z

It seems "compress" really does provide a significant improvement on the
*block* usage of the file in question, however "dubious" you may be of those
results.  "compact" and "pack" get results which are infinitesimally
different.
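
If you want to see why the Lempel-Ziv approach keeps winning on binaries,
here is a toy LZW-style coder.  It is emphatically *not* compress itself (no
adaptive code widths, no table clearing, and none of its file format); it
just counts how many fixed 12-bit codes a greedy string table needs, which
makes the "frequently-occurring multi-byte sequences become single codes"
behavior easy to watch on your own files.

/* lzwtoy.c -- toy illustration of LZW-style string-table coding */
#include <stdio.h>

#define MAXCODES 4096                   /* 12-bit code space */

static int prefix[MAXCODES];            /* code of the string minus its last byte */
static int suffix[MAXCODES];            /* last byte of the string */
static int ncodes = 256;                /* codes 0..255 stand for single bytes */

/* return the code for string <p> followed by byte c, or -1 if unknown */
static int lookup(int p, int c)
{
        int i;

        for (i = 256; i < ncodes; i++)  /* linear search: toy only */
                if (prefix[i] == p && suffix[i] == c)
                        return i;
        return -1;
}

int main(int argc, char **argv)
{
        FILE *fp;
        long in, out = 0;
        int c, p, code;

        if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
                fprintf(stderr, "usage: lzwtoy file\n");
                return 1;
        }
        if ((p = getc(fp)) == EOF)
                return 0;                       /* empty file: nothing to do */
        in = 1;
        while ((c = getc(fp)) != EOF) {
                in++;
                if ((code = lookup(p, c)) != -1) {
                        p = code;               /* keep extending the match */
                } else {
                        out++;                  /* emit the code for the match */
                        if (ncodes < MAXCODES) {
                                prefix[ncodes] = p;     /* remember match + 1 byte */
                                suffix[ncodes] = c;
                                ncodes++;
                        }
                        p = c;                  /* start a new match */
                }
        }
        out++;                                  /* code for the final match */
        fclose(fp);
        printf("%ld bytes in, %ld codes out (about %ld bytes at 12 bits/code)\n",
            in, out, (out * 12 + 7) / 8);
        return 0;
}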

BTW, I tried "compress", "compact", and "pack" on a 2048-byte file in the
exact same format you describe.  "compress" reported a 95% compression,
"compact" reported an 81% compression, and "pack" reported an 81%
compression.  All three reduced the file from two frags to one.  I then
tried it on a 2000-byte file in the format you describe, and got the exact
same results as on the 2048-byte file.  Your numbers look flaky.

> Squeezing a few extra bytes out of a file can only be worth it if it results
> in reducing by the 1K minimum data block size (2 basic file system blocks).
> Is this often the case?

Yes, even on 2000-byte files consisting of 1000 identical lines.

> (On my system, ``pack'' runs considerably faster than ``compact'', so the
> choice is easy.)

Since "pack" also gives results which are as good as those "compact" gives,
the choice is *very* easy unless you need something which makes only one
pass over its input (e.g. because it's reading from a pipe), in which case
"pack" won't cut it.  "compress" is only a small amount slower than "pack"
in CPU usage (at least when compared to the speed difference between "pack"
and "compact"), gives much better results than "pack" or "compact", and
makes only one pass over its input.  The only disadvantage is that it tends
to eat virtual (and physical) memory; as Peter Honeyman once put it, "more
than two compresses makes disks dance!"  I don't care, since my machine is a
one-user machine, but on a multi-user machine this may make a difference.
I'm also not sure whether the latest "compress" uses memory that freely.
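
For a rough sense of scale, here's a back-of-the-envelope sketch.  The
per-slot cost and the hashing headroom are guesses of mine for illustration,
not compress's actual table layout, but they show why a handful of 16-bit
compresses running at once can make a shared machine page heavily while the
12-bit variety barely registers.

/* lzwmem.c -- back-of-the-envelope only, NOT compress's real data layout.
 * Model: one hash slot per possible code, each holding a prefix code plus
 * a suffix byte (call it 6 bytes), with about 1/3 extra slots of headroom
 * so the open-address hashing doesn't choke.
 */
#include <stdio.h>

int main(void)
{
        int bits;

        for (bits = 12; bits <= 16; bits++) {
                long codes = 1L << bits;                /* distinct codes */
                long slots = codes + codes / 3;         /* hashing headroom */
                long bytes = slots * 6;                 /* assumed cost per slot */

                printf("%2d-bit codes: %6ld codes, roughly %4ldK of table\n",
                    bits, codes, bytes / 1024);
        }
        return 0;
}

Half a megabyte or so per 16-bit invocation is what makes the disks dance.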
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.arpa	(yes, really)

g-rh@cca.UUCP (Richard Harter) (03/01/86)

Just a note on packing.  Recently people in one of the newsgroups were
speculating about one character predictor packing.  We implemented it
and found that it works as well as pack (for text files).  The idea is
that you make a first pass through the file and determine, for each
character, its most probable successor.  You make a second pass through
the file and record one bit for each character: F if the character is not
followed by its most probable successor, and T if it is.  You also record
the incorrect guesses.  You store the file as a bit array followed by the
wrongly predicted characters.  The point of these shenanigans is that the
unpacking is fast.  In our context the determination of the predicted
characters only has to be done once, so packing is also fast.

Our results are that this style of packing is as efficient as PACK for source code
and English text.  The bit array is an automatic 12.5% penalty.  Prediction
rate is around 50% for source code (depending on language and style --
higher for operating systems that store files with trailing blanks in
fixed records).  Not bad for quick and dirty.  However it is of no use
for object code.
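
For the curious, here is a stripped-down sketch of the scheme, not our
production code: it makes the two passes and reports the hit rate and the
rough packed size (bit array plus missed characters), but doesn't bother
writing the output or storing the successor table, which a general-purpose
packer would have to do.

/* predpack.c -- sketch of one-character predictor packing, two passes */
#include <stdio.h>

int main(int argc, char **argv)
{
        static long count[256][256];    /* successor frequencies (64K entries) */
        unsigned char pred[256];        /* most probable successor of each byte */
        FILE *fp;
        long n = 0, hits = 0, misses = 0;
        int prev, c, i, j;

        if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
                fprintf(stderr, "usage: predpack file\n");
                return 1;
        }

        /* pass 1: tally how often each byte follows each other byte */
        if ((prev = getc(fp)) != EOF) {
                n = 1;
                while ((c = getc(fp)) != EOF) {
                        count[prev][c]++;
                        prev = c;
                        n++;
                }
        }

        /* pick the most probable successor of each byte */
        for (i = 0; i < 256; i++) {
                long best = -1;

                for (j = 0; j < 256; j++)
                        if (count[i][j] > best) {
                                best = count[i][j];
                                pred[i] = (unsigned char)j;
                        }
        }

        /* pass 2: one bit per byte; a miss means the byte must be stored */
        rewind(fp);
        if ((prev = getc(fp)) != EOF)
                while ((c = getc(fp)) != EOF) {
                        if (c == pred[prev])
                                hits++;
                        else
                                misses++;
                        prev = c;
                }
        fclose(fp);

        printf("%ld bytes in, roughly %ld bytes out (%ld%% hit rate)\n",
            n, n / 8 + misses + 1, n > 1 ? hits * 100 / (n - 1) : 0);
        return 0;
}

A 50% hit rate on text gives you the bit array (12.5%) plus half the
characters, which is the same ballpark pack leaves you in.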

	Richard Harter, SMDS Inc.

wrs@pupthy.UUCP (03/01/86)

In article <226@uvacs.UUCP> rwl@uvacs.UUCP (Ray Lubinsky) writes:

(On the relative merits of pack/compact/compress...)

> Hold on thar, Babaloo!  If you mean better in terms of *byte* savings, I
> imagine ``compress'' could easily do better than ``pack''.  But if you're
> talking about *block* savings, I'm dubious that ``pack'' will be much improved
> upon.  I don't know about your system, but my Vax running 4.2 BSD permits
> internal fragmentation, so it's disk block savings that count.
>
> Now, I'm not entirely familiar with ``compress'', but I can compare with
> ``compact''.  When I created a file of 2000 bytes (one identical character
> per line plus a newline), ``compress'' boasted of > 85% compression, while
> pack only claimed 50% compression, but each of the results consumed the same
> amount of blocks.  Hence the same effective compression.
>
> Squeezing a few extra bytes out of a file can only be worth it if it results in
> reducing by the 1K minimum data block size (2 basic file system blocks). ...

The above comments are pretty much true.  Since disk space is allocated in
integral block chunks, reducing the size of a file below the minimal disk
storage size does nothing much for you.  E.g. on our Ridge (4k disk blocks), a
2056 byte file and a 1378 byte (say) compressed version of it both take 4k of
physical disk space.  (Although why you would want to compress a file that
small for anything other than test purposes, I don't know.)

However, if the file size is sufficiently large, the "file blocking" becomes
pretty much irrelevant.  We have been using compress to reduce the size of some
of our data files, which have had sizes up to 100Mb (Yes, Mega-bytes!).

Tests run on our Ridge (disk blocks = 4k) gave the following results:

3Mb data file (text):

    Original:	3956579b ==> 966 blocks
    Compress:	 759914b ==> 186 blocks = 81% compression	 157.0 cpu sec
    Compact:	2002303b ==> 489 blocks = 49% compression	1452.0 cpu sec
    Pack:	1998736b ==> 488 blocks = 49% compression	  98.7 cpu sec

100kb executable (stripped):

    Original:	 186960b ==>  46 blocks	
    Compress:	  87005b ==>  22 blocks = 52% compression	   9.5 cpu sec
    Compact:	 126561b ==>  31 blocks = 33% compression	  88.5 cpu sec
    Pack:	 126437b ==>  31 blocks = 33% compression	   4.9 cpu sec

(I would have tried one of our 100Mb files, but didn't have one around)

------------------------------------------------------------------------
William R. Somsky                          Physics Dept ; Princeton Univ
{ihnp4,princeton}!pupthy!wrs             PO Box 708 ; Princeton NJ 08544

mjl@ritcv.UUCP (Mike Lutz) (03/01/86)

In article <226@uvacs.UUCP> rwl@uvacs.UUCP (Ray Lubinsky) writes:
>I don't know about your system, but my Vax running 4.2 BSD permits
>internal fragmentation, so it's disk block savings that count.

Agreed that block savings are what we want, but wait a bit:

>Now, I'm not entirely familiar with ``compress'', but I can compare with
>``compact''.  When I created a file of 2000 bytes (one identical character
>per line plus a newline), ``compress'' boasted of > 85% compression, while
>pack only claimed 50% compression, but each of the results consumed the same
>amount of blocks.  Hence the same effective compression.

So who goes about compressing 2K files as a matter of course?  With a 1K/8K
file system, your experiment is limited by the granularity of the
file allocation to the point where the results are meaningless.  You're
limited to only 3 possible outcomes:

	0 blocks (if there is no information content; highly unlikely).
	1 block (if compression gets the byte size down to 1 - 1024 bytes).
	2 blocks (if compression gets the byte size down to 1025-2000 bytes).

Basically, you get either 50% compression or 0% compression, no matter what
algorithm you use.

Let's look at a binary file with more meat: libc.a.  The following list
shows the size in blocks of our 4.2 libc.a, and the results of compressing
it using compress, pack, and compact (the latter being a notable CPU
hog):

 112 libc.a
  80 libc.a.pack
  80 libc.a.compact
  64 libc.a.compress

Note that libc.a.compress saves an additional 16 blocks over the two
Huffman-based programs.  Given that compress takes only about 24% more
CPU time than pack, and the compressed file requires only 80% of the
space of the packed file, I choose compress without hesitation for
archival applications (which is why we compress, not pack, our
{net,mod}.sources archives).

* * * * *
On a related note, try encrypting a file (using crypt(1)) followed by
your favorite compression filter, and watch the file expand.  Not
surprising, as crypt spreads the information content evenly over the
entire file, but interesting nonetheless.  Of course, compression followed
by encryption does not suffer from this problem.
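
If you want to see that quantitatively, the little program below computes the
zero-order byte entropy of a file (compile with -lm).  It's my own throwaway,
and zero-order statistics are only part of what pack and compress exploit, but
the contrast is stark: text sits well under 8 bits per byte, while crypt
output (information spread evenly, as noted above) comes out close to 8,
leaving a coder nothing to squeeze.

/* entropy.c -- zero-order byte entropy of a file, in bits per byte */
#include <math.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        long count[256] = {0};
        long n = 0;
        double h = 0.0;
        int c, i;
        FILE *fp;

        if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
                fprintf(stderr, "usage: entropy file\n");
                return 1;
        }
        while ((c = getc(fp)) != EOF) {
                count[c]++;
                n++;
        }
        fclose(fp);

        for (i = 0; i < 256; i++)
                if (count[i] > 0) {
                        double p = (double)count[i] / n;

                        h -= p * log(p) / log(2.0);     /* Shannon, base 2 */
                }
        printf("%.2f bits/byte over %ld bytes\n", h, n);
        return 0;
}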
-- 
Mike Lutz	Rochester Institute of Technology, Rochester NY
UUCP:		{allegra,seismo}!rochester!ritcv!mjl
CSNET:		mjl%rit@csnet-relay.ARPA

randy@chinet.UUCP (Randy Suess) (03/02/86)

In article <226@uvacs.UUCP> rwl@uvacs.UUCP (Ray Lubinsky) writes:
>Hold on thar, Babaloo!  If you mean better in terms of *byte* savings, I
>imagine ``compress'' could easily do better than ``pack''.  

>Squeezing a few extra bytes out of a file can only be worth it if it results in
>reducing by the 1K minimum data block size (2 basic file system blocks

	Ahhh, but transferring a 50-60% compressed file long distance at
1200 baud makes me much happier than transferring one that is only 33% packed.
Every little bit helps.

-- 
.. that's the biz, sweetheart...
Randy Suess
chinet - Public Access UN*X
(312) 545 7535 (h) (312) 283 0559 (system)
..!ihnp4!chinet!randy

jdz@wucec2.UUCP (03/04/86)

 >> > I am in need of a packing algorithm which works better than the
 >> > PACK Utility of UNIX.  I have also looked at COMPRESS ( developed
 >> > at University of Utah ).

 >> Are you sure you have the latest version of "compress"?  I tried "pack"ing
 >> and "compress"ing "/usr/bin/suntools" (which is a BMF executable image) and
 >> "compress" did significantly better than "pack" did (it took significantly
 >> more time doing it, but that's life).

 >Hold on thar, Babaloo!  If you mean better in terms of *byte* savings, I
 >imagine ``compress'' could easily do better than ``pack''.  But if you're
 >talking about *block* savings, I'm dubious that ``pack'' will be much improved
 >upon.  I don't know about your system, but my Vax running 4.2 BSD permits
 >internal fragmentation, so it's disk block savings that count.

 >Now, I'm not entirely familiar with ``compress'', but I can compare with
 >``compact''.  When I created a file of 2000 bytes (one identical character
 >per line plus a newline), ``compress'' boasted of > 85% compression, while
 >pack only claimed 50% compression, but each of the results consumed the same
 >amount of blocks.  Hence the same effective compression.
 >
 >Squeezing a few extra bytes out of a file can only be worth it
 >if it results in
 >reducing by the 1K minimum data block size (2 basic file system blocks).  Is
 >this often the case?  (On my system, ``pack'' runs considerably faster than
 >``compact'', so the choice is easy.)

Ahem. When one compresses a 2 block file, one would not expect savings of
more than one block. Can we get serious here? Compression of 2kbyte files
is a non-problem; compression of 20kb and 200kb files is. 85% of 200 blocks
gives me a savings of 170 blocks (rounding off, folks - don't flame) as
opposed to 50% which saves only 100 blocks.

When transmitting by phone, EVERY BYTE counts. When storing on tape, bytes
may count, depending on the tape program. When storing on disk, blocks and
frags (for 4.2BSD) count. But you need to be considering reasonably sized
files before you start talking about the lack of difference between
compression tools.

Yeah, compress is slower than pack. But the phone line is slower still...

-- 
Jason D. Zions			...!{seismo,cbosgd,ihnp4}!wucs!wucec2!jdz
Box 1045 Washington University
St. Louis MO 63130  USA		(314) 889-6160
Nope, I didn't say nothing. Just random noise.

tweten@ames-nas.arpa (03/04/86)

From: Guy Harris <guy%sun.uucp@BRL.ARPA>

	The only disadvantage is that it tends to eat virtual (and physical)
	memory; as Peter Honeyman once put it, "more than two compresses makes
	disks dance!"  I don't care, since my machine is a one-user machine,
	but on a multi-user machine this may make a difference.  I'm also not
	sure whether the latest "compress" uses memory that freely.

Compress version 4.0 reduced its memory requirement to the point that I
was able to port the full-blown (16-bit compressed codes) version to PC/MS-DOS,
using the Microsoft C compiler.  Compiled for 16-bit codes, it requires about
460 K bytes of user memory on a PC.  Naturally, the 12-bit code compilation
takes much less.

To anticipate any inquiries, source code is available from the Info-IBMPC
public domain source library, in directory <info-ibmpc> at USC-ISIB.ARPA.  By
manipulating compile-time definitions one can compile all the standard
(4.2, System V, etc.) versions from it, in addition to being able to compile
PC/MS-DOS big and small versions.