[comp.os.minix] Compression programs: compress vs lharc vs comic

tgcpwd@rc3.urc.tue.nl (Wim van Dorst) (07/07/90)

Hello *,

I have tested the newly posted lharc and comic and compared them with
the old compress.  I took a large binary file (kermit), a large source
file (three concatenated floppy.c) and an archive file (ar q wmail/*,
thus bin source object doc etc).  I noted down the comparative sizes,
the time it took to compress and to decompress.  Lo and behold:

                    Original       compress        lharc         comic
binary
size (bytes)          63920          50539         34875         34222
size (%)                100             79            55            54
relative (%)                           100            69            68
compr time (sec)                        56           135          1207
decompr time (sec)                      34            66            50

source
size (bytes)         114114          48324         40625         37542
size (%)                100             42            36            33
relative (%)                           100            84            78
compr time (sec)                        71           224          1781
decompr time (sec)                      40            76            65

archive
size (bytes)         151896          95007         64762         61669
size (%)                100             63            43            41
relative (%)                           100            68            65
compr time (sec)                       128           290          2353
decompr time (sec)                      71           121            98

The conclusions are obvious (to me :-):
When you're keen on speed and have plenty of disk media: 
	do not compress at all.
When you're keen on speed, but rather low on disk media: 
	use oldy-but-goody compress 
When you have hardly any space on disk, when you've to use these very
small 360kB floppy for archiving like I, but you do have some computer
time to spare (e.g.in the small hours), like I:
	use comic
When you want a reasonable inbetween:
	use lharc


_My_ choice is definitively comic, but I must say _each_ program 
has its own advantages.


ps. Due to official regulations which I cannot sufficiently influence,
I seem to have trouble sending and receiving mail. I assume though
that the .urc.tue.nl domain should work properly for me now.

         Met vriendelijke groeten, Wim 'Blue Baron' van Dorst
  ------------------------------------------------------------------
  Blue Baron = Wim van Dorst, voice (+31) 074-443937 and 02152-42319
       baron@wiesje.mug.hobby.nl or tgcpwd@eutrc3.urc.tue.nl
                  "This sentence have three erors."

williams@umaxc.weeg.uiowa.edu (Kent Williams) (07/08/90)

I compiled comic on an Encore, and it appears to work.  It is V E R Y slow
compressing tars -- I didn't have the patience to let it run to completion
while I waited.  File sizes are impressive:

-rw-------  1 williams    81920 Jul  7 00:56 comic.tar
-rw-------  1 williams    31119 Jul  7 00:47 comic.tar.Z
-rw-------  1 williams    21696 Jul  7 01:06 comic.tar-X

I suspect the slowness comes in compressing the piles of nulls that
tar pads everything with.

Can the author summarize the algorithm used?  It looks as though too much
time is spent in looking for stuff -- maybe hashing could be used for
one or another step to speed things up.

--
Kent Williams                    'Look, I can understand "teenage mutant ninja 
williams@umaxc.weeg.uiowa.edu    turtles", but I can't understand "mutually 
williams@herky.cs.uiowa.edu      recursive inline functions".' - Paul Chisholm

evans@ditsyda.oz (Bruce.Evans) (07/10/90)

In article <1815@ns-mx.uiowa.edu> williams@umaxc.weeg.uiowa.edu.UUCP (Kent Williams) writes:
>
>Can the author summarize the algorithm used?  It looks as though too much
>time is spent in looking for stuff -- maybe hashing could be used for
>one or another step to speed things up.

Profiling the 16-bit version gave the following (for 18.8 sec to compress
comic.c):

  10.042   10042  53.31%  _memrchr  *******************************************
   4.801    4801  25.48%  _get2pai  *********************
   1.735    1735   9.21%  _getp1    *******
   0.524     524   2.78%  _cdsret   **
   0.238     238   1.26%  _eqlen    *
   ...
  18.093   18093  96.06%  TOTAL IN RANGE
   0.742     742   3.93%  OTHER
  18.835   18835 100.00%  TOTAL

That's with memrchr in assembler!  memrchr is used to find an anchor for a
string lookup. Hashing just on the first char would reduce this step
considerably. I don't know how expensive maintaining the hash table would be.

Here are some times for the compression programs available to me on a 20MHz
386. I/O times are significant for the faster programs (5 to 10 sec).
---
Times for packing the Minix dictionary (407564 bytes).

Program			compression	time	un-time	O/S	compiler
------------------------------------------------------------------------
pkzip1.10 -es		60%		   9	  6	DOS	?
pkpak3.61		59%		   9	  6	DOS	?
compress4.01 -b 13	57%		  14	 12	minix-3	gcc1.36
compress4.01 -b 16	58%		  15	 12	minix-3	gcc1.36
zoo2.01			59%		  20	 17	minix-3	gcc1.37
compress4.01 -b 13	57%		  27	 14	DOS	msc 4.0
compress-minix1.3	57%		  41	 23	minix	cc1.3
compress4.01 -b 13	57%		  45	 23	minix	bcc1.0
pkzip1.01		66%		  48	  6	DOS	?
pak2.10			67%		  52	  9	DOS	?
compress-??? -b 16	58%		  83	 43	DOS	?
lharc1.13c		67%		  89	 13	DOS	?
lharc1.00		67%		 168	 17	minix-3	gcc1.37
lharc1.00		67%		 275	 36	minix	bcc1.0
comic			67%		1504	 20	minix-3	gcc1.37
comic			67%		1782	 34	minix	bcc1.0
-- 
Bruce Evans		evans@ditsyda.syd.dit.csiro.au