[comp.binaries.ibm.pc.d] lharc algorithm

bobmon@iuvax.cs.indiana.edu (RAMontante) (04/11/89)

davidsen@crdos1.UUCP (bill davidsen) <13553@steinmetz.ge.com> :
-I think the discussion of going to zoo as a standard is no longer
-germane. Now that the source for the lharc algorithm is out I would
-expect to see a portable archiver for UNIX/DOS/AmigaDOS/ST500/Mac
-shortly.

I agree.  I am really curious to see how that algorithm stacks up against
16-bit LZW compression as represented by UNIX compress.

Rahul mentioned that the compression code in zoo was a couple of hundred
lines out of the total.  I'd like to see a next-generation archiver with
zoo's versatility and lharc's compression (viewing it as the current
high-water mark).  That should yield the portability "for free", as it were.

[And I propose that it be called BSUJAParc, the BSU-Japan All-Purpose
ARChiver :-]

dhesi@bsu-cs.bsu.edu (Rahul Dhesi) (04/11/89)

In article <19456@iuvax.cs.indiana.edu> bobmon@iuvax.cs.indiana.edu
(RAMontante) writes:
>Rahul mentioned that the compression code in zoo was a couple of hundred
>lines out of the total.  I'd like to see a next-generation archiver with
>zoo's versatility and lharc's compression (viewing it as the current
>high-water mark).  That should yield the portability "for free", as it were.

(Actually about 600 lines out of 9000.)

A steady stream of inquiries about this has begun to trickle in.  To
stall the deluge, I therefore hereby announce that lharc's lzhuff
algorithm *will* be incorporated into zoo.  The new zoo will create
archives with lzhuff compression that will have a different filename
suffix, but it will also optionally create archives (filename suffix
.zoo) extractable by earlier versions of zoo.

The change must come to all implementations of zoo (currently System
V/4.xBSD/Xenix, VAX/VMS, Atari/ST, AmigaDOS, MS-DOS, OS-9, OS/2), so I
need to handle this with great care.  Please avoid speculating about
the time frame, as I am not sure myself.
-- 
Rahul Dhesi <dhesi@bsu-cs.bsu.edu>
UUCP:    ...!{iuvax,pur-ee}!bsu-cs!dhesi

bobmon@iuvax.cs.indiana.edu (RAMontante) (04/11/89)

[ Rahul plans to adopt lzhuf compression in a new version of zoo ]

YAYYYYYYYYY!

1)  I hope Rahul resists the Microsoft-like urge to number this major
    change to zoo 2.01 as merely something like v2.02 ... :-(

2)  How about ".zho" as a suffix?  Or ".zoh" ?

cjeffery@arizona.edu (Clinton Jeffery) (04/11/89)

From an article by bobmon@iuvax.cs.indiana.edu (RAMontante):
> [ Rahul plans to adopt lzhuf compression in a new version of zoo ]
> 2)  How about ".zho" as a suffix?  Or ".zoh" ?

How about ".hoz"?
Which would you rather say:
 "I compressed these files"
or:
 "These files are hosed"?
-- 
| Clint Jeffery, University of Arizona Department of Computer Science
| cjeffery@arizona.edu -or- {noao allegra}!arizona!cjeffery
--

khc@eecea.eece.ksu.edu (Ken Carpenter) (04/12/89)

I have compiled the file lzhuf.c on a 3B1 and a 16 MHz MSDOS PC (using Turbo C).
On the 3B1 the resulting executable could compress, but dumped core when
attempting to expand the compressed file.  I decided to try compressing the
"core" file just dumped.  (It was about 90 Kbytes long.)  At first I thought
I had hung the system, but it turned out just to be an extremely long
compress time.  I compressed the same "core" file using 16-bit "compress"
and it went much more quickly.  Here are the timings on the 3B1:
  time lzhuf e core core.lzh
  input: 90112 bytes
  output: 10045 bytes
  output/input: 0.111
   real  4m10.36s
   user  4m6.85s
   sys   0m0.95s

  time compress -v core
  core: Compression: 87.95% - replaced with core.Z
   real  0m9.55s
   user  0m5.23s
   sys   0m0.96s
(the file core.Z was 10858 bytes long).

I copied the "core" file to the MSDOS PC.  Besides the LZHUF program, 
I have the 16-bit compress program (also compiled with Turbo C) on it.
The LZHUF program on the PC works for both e and d.  The file sizes
when compressed are almost the same as on the 3B1 (but not quite -
suggesting that the lzhuf e output on the 3B1 is bad).  The timings
on the MSDOS 386 PC were:
 For LZHUF e core c1 - 55 seconds
 For COMPRESS core   - 22 seconds
(compress on the PC used the "large" model, and is noticeably slower
than a 12-bit compress using the "small" model.)
While the 4 minutes on the 3B1 to lzhuf e core may be anomalous due to
a bug in the implementation there, it was obvious while watching the
screen indication of the number of bytes read that LZHUF slowed down
a lot as the number of bytes got higher and higher.  Thus it may not
be a good choice for compressing large, highly redundant files.
For comparison, times for compressing "core" on the PC using the other
programs were:
  pkarc a core core  -- 24 sec, core.arc was 12040 bytes long
  zoo -add core core -- 17 sec, core.zoo was 12435 bytes long
--------------------------------
Ken Carpenter -   internet:  khc@eecea.eece.ksu.edu
                  UUCP: rutgers!ksuvax1!eecea!khc

dhw@itivax.iti.org (David H. West) (04/12/89)

In article <19456@iuvax.cs.indiana.edu> bobmon@iuvax.cs.indiana.edu (RAMontante) writes:
>[...]  I am really curious to see how that algorithm stacks up against
>16-bit LZW compression as represented by UNIX compress.

People have posted figures indicating that in compression ratio, lharc 
has somewhat of an edge over pkzip, which may have a slight edge over 
arc.
I haven't done any method-exhaustive tests, but in my (small sample)
experience, 16-bit un*x-style compress beats arc quite substantially.
Example (the .arc file was crunched):

-rw-r-----  1 dhw       1794048 Apr 11 12:53 sbp_v3.tar
-rw-r-----  1 dhw       1071444 Apr 11 13:20 sbp_v3tar.arc
-rw-r-----  1 dhw        916473 Dec 28 10:41 sbp_v3.tar.Z

Someone who has the time should do more extensive tests...

There may be a problem though, in that I've never seen a 16-bit
MSDOS compress (as opposed to uncompress, which was posted a few
weeks ago); there are rumors to the effect that to get a 16-bit
compress to run in 640K, you have to swap pieces of the table to disk, 
which slows execution to a crawl.  A refutation of this, preferably in 
the form of public domain code, would be very welcome.
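
As a rough back-of-the-envelope check (assuming the classic compress(1)
table layout of a long hash table plus a parallel short code table; the
exact sizes are assumptions and vary by version), the tables alone for
16-bit codes come to roughly 400K before any code, stack, or buffers:

    /* Rough memory estimate for a 16-bit LZW compressor's tables,
     * assuming a compress(1)-style layout: a hash table of longs and
     * a parallel code table of unsigned shorts, HSIZE entries each.
     * HSIZE and the entry sizes here are assumptions; ballpark only.
     */
    #include <stdio.h>

    #define HSIZE 69001L    /* hash table size commonly used for 16-bit codes */

    int main(void)
    {
        long htab    = HSIZE * 4L;  /* hashed fcodes, 4-byte longs   */
        long codetab = HSIZE * 2L;  /* assigned codes, 2-byte shorts */
        long total   = htab + codetab;

        printf("compress tables: %ld bytes (about %ldK)\n",
               total, total / 1024L);
        return 0;
    }

That leaves little headroom in 640K once the program, stack, and I/O
buffers are added, which is where the table-swapping rumor comes from.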

-David West            dhw@itivax.iti.org
		       {uunet,rutgers,ames}!sharkey!itivax!dhw
COMPIS, Industrial Technology Institute, PO Box 1485, 
Ann Arbor, MI 48106

cjeffery@arizona.edu (Clinton Jeffery) (04/12/89)

From an article, by khc@eecea.eece.ksu.edu (Ken Carpenter):
complaining about lzhuf.c...
> On the 3B1 the resulting executable could compress, but dumped core when
> attempting to expand the compressed file... [not to mention]
> an extremely long compress time.  I compressed the same "core" file using
> 16bit "compress" and it went much more quickly.

Rather than assume a bug in the implementation, note that it was posted
for TURBO C only--I was able to get it to work fine on a VAX by changing
all the ints and unsigneds to shorts and unsigned shorts.
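
For anyone making the same change, this is the flavor of it (a sketch,
not a diff; freq, prnt, and son are the tree tables in lzhuf.c as
posted, and T and N_CHAR are its own constants):

    /* lzhuf.c was written for a compiler with 16-bit int (Turbo C).
     * On a 32-bit machine, declare the tables with explicitly 16-bit
     * types instead of plain int/unsigned so they behave as they did
     * under Turbo C.
     */
    typedef short          int16;
    typedef unsigned short uint16;

    uint16 freq[T + 1];       /* was: unsigned freq[T + 1]; */
    int16  prnt[T + N_CHAR];  /* was: int prnt[T + N_CHAR]; */
    int16  son[T];            /* was: int son[T];           */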

Don't assume a bug in an implementation when porting to a machine with
a different wordsize!!  The lzh algorithms are obviously superior to
what we have been using.  Your timing complaint is valid with respect
to lzhuf.c but not the lharc program.  Lzhuf.c spends more than half
its encoding time within a single routine, which suggests someone more
clever than I could speed it up without sacrificing portability...
-- 
| Clint Jeffery, University of Arizona Department of Computer Science
| cjeffery@arizona.edu -or- {noao allegra}!arizona!cjeffery
--

nelson@sun.soe.clarkson.edu (Russ Nelson) (04/12/89)

In article <10184@megaron.arizona.edu> cjeffery@arizona.edu (Clinton Jeffery) writes:

   Lzhuf.c spends more than half its encoding time within a single
   routine, which suggests someone more clever than I could speed it
   up without sacrificing portability...

I suppose I'm going to get flamed for this, but I added in-line
assembly language lines to lzhuf.c and managed to improve both
compression and extraction by half.  That is, they both run about 1.5
times faster.  Part of this gain is due to assembly language, and another
part is due to really obvious things like using a 4K setvbuf on input and
output, using fwrite in preference to fputc, unrolling loops, and keeping
inner loops tight so as to give the compiler's optimizer a chance.
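
For the curious, the stdio part of it looks roughly like this (a
sketch, not my actual diff; the FILE pointer and function names are
made up):

    #include <stdio.h>

    /* Give both streams a 4K fully-buffered buffer, and write decoded
     * runs with one fwrite instead of a fputc per byte.
     */
    static char inbuf[4096], outbuf[4096];

    void setup_streams(FILE *infile, FILE *outfile)
    {
        setvbuf(infile,  inbuf,  _IOFBF, sizeof inbuf);
        setvbuf(outfile, outbuf, _IOFBF, sizeof outbuf);
    }

    /* instead of:  for (i = 0; i < len; i++) fputc(text[i], outfile); */
    void put_run(FILE *outfile, unsigned char *text, unsigned len)
    {
        fwrite(text, 1, len, outfile);
    }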

So much of what I did will be useful to everyone, not just those of us
doomed to program under MS-LOSS.

In general, if you want to speed up compression, optimize the inner loop
comparing characters in InsertNode.  This routine is very similar to memcmp
except that it also tells you where the mismatch was.  If you want to speed
up decompression, speed up the hidden inner loop in update().  Look for
that little while statement comparing k against freq.
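
In case it isn't obvious which loop I mean, the comparison in
InsertNode boils down to this (paraphrased, with made-up names):

    /* Like memcmp, except the caller needs to know WHERE the first
     * mismatch occurred, because that position is the match length.
     */
    static int match_prefix(const unsigned char *a, const unsigned char *b,
                            int maxlen)
    {
        int i;

        for (i = 0; i < maxlen; i++)
            if (a[i] != b[i])
                break;
        return i;               /* length of the matching prefix */
    }

Any speedup there (word-at-a-time compares, or assembly on machines
with a string-compare instruction) goes straight to the bottom line.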



--
--russ (nelson@clutx [.bitnet | .clarkson.edu])
America -- Socialism for the rich people, Capitalism for the rest of us.
	- Michael Harrington, Co-Chair, Democratic Socialists of America

george@rebel.UUCP (George M. Sipe) (04/12/89)

From an article by bobmon@iuvax.cs.indiana.edu (RAMontante):
> [ Rahul plans to adopt lzhuf compression in a new version of zoo ]
> 2)  How about ".zho" as a suffix?  Or ".zoh" ?

How about ".arc"?  (Just kidding.)

-- 
George M. Sipe,		       Phone: (404) 447-4731
537 Lakeshore Drive, Berkeley Lake, GA  30136-3035
UUCP: ...!{decvax,linus,rutgers}!gatech!rebel!george

leonard@bucket.UUCP (Leonard Erickson) (04/16/89)

In article <19482@iuvax.cs.indiana.edu> bobmon@iuvax.cs.indiana.edu (RAMontante) writes:
<[ Rahul plans to adopt lzhuf compression in a new version of zoo ]

Considering the usual arguments about using *any* kind of archiver or
compression on files posted to the net, has anybody run tests comparing
text files that have been lharc'ed, uuencoded, and then compressed
against merely compressing the original text files?

I want an *end* to the flames about "bandwidth" when someone posts 
source files in an "arc" file!

(I'd also love to see shar'ed files disappear! I have to extract them on
 a *nix system, and some groups do such stupid things as figuring a
checksum on the output, but figuring it with CR/LF line endings.  Since I
unshar on a *nix system, the checksum never matches... grrr)
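
(To see why the checksum drifts: any sum over the raw bytes changes as
soon as CR/LF becomes plain LF.  A toy illustration, not the real
sum(1) algorithm:

    #include <stdio.h>

    /* toy byte-sum: each stripped CR changes the total by 13 */
    static unsigned long byte_sum(const char *s)
    {
        unsigned long sum = 0;
        while (*s)
            sum += (unsigned char) *s++;
        return sum;
    }

    int main(void)
    {
        printf("with CR/LF: %lu\n", byte_sum("hello\r\n"));
        printf("with LF:    %lu\n", byte_sum("hello\n"));
        return 0;
    }

Each stripped CR shifts the total, so a checksum recorded against
CR/LF output can never match what unshar produces on *nix.)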

-- 
Leonard Erickson		...!tektronix!reed!percival!bucket!leonard
CIS: [70465,203]
"I'm all in favor of keeping dangerous weapons out of the hands of fools.
Let's start with typewriters." -- Solomon Short