[comp.sys.amiga.datacomm] A more memory efficient compress

ben@epmooch.UUCP (Rev. Ben A. Mesander) (01/26/91)

>In article <20680@know.pws.bull.com> C506634@UMCVMB.MISSOURI.EDU (Eric Edwards) writes:
>Does such a beast exist?  The current version of compress (4.0) still
>wants a big chunk of contiguous ram.  On my system, the largest block available
>after I boot up and start a shell is 372k.  This is still not enough!
>
>Surely if lharc and Zip can run under the same conditions
>using the same compression algorithm, compress ought to be able to.  Actually,
>I wouldn't really mind if compress took 550k to run, just so long as it
>doesn't have to be contiguous.
>
>So.  Any pointers?

I've heard the MINIX folks have made a "small" compress. I'm not sure of
its contiguous memory requirements, but it's worth looking into.

I've had another problem with compress. I run AmigaUUCP 1.06D, and when I
get a compressed tar archive, I normally unpack it with:

uncompress <ruru.tar.Z | tar xvf -

But with large archives, say bigger than 600K, this hangs. Right before it
hangs, I see some garbage graphics appear at random on my screen, implying
that a wild pointer has hit chip memory. Has anyone else seen this? If it's
less than about 1.5 megabytes, I can uncompress the file, and then untar
it in separate steps. If it's more than 2 megs, I have to transfer it to
a unix system, unpack it, and rearchive it with zoo. I have a 2.5MB Amiga
1000.

--
| ben@epmooch.UUCP   (Ben Mesander)       | "Cash is more important than |
| ben%servalan.UUCP@uokmax.ecn.uoknor.edu |  your mother." - Al Shugart, |
| !chinet!uokmax!servalan!epmooch!ben     |  CEO, Seagate Technologies   |

C506634@UMCVMB.MISSOURI.EDU (Eric Edwards) (01/27/91)

Does such a beast exist?  The current version of compress (4.0) still
wants a big chunk of contiguous ram.  On my system, the largest block available
after I boot up and start a shell is 372k.  This is still not enough!

Surely if lharc and Zip can run under the same conditions
using the same compression algorithm, compress ought to be able to.  Actually,
I wouldn't really mind if compress took 550k to run, just so long as it
doesn't have to be contiguous.

So.  Any pointers?

Eric Edwards: c506634 @    "The 3090.  Proof that by applying state of the
Inet: umcvmb.missouri.edu   art technology to an obsolete architecture,
Bitnet: umcvmb.bitnet       one can achieve mediocre performance."

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (01/29/91)

 C506634@UMCVMB.MISSOURI.EDU (Eric Edwards) writes:

> Does such a beast exist? The current version of compress (4.0) still
> wants a big chunk of contiguous ram. On my system, the largest block
> available after I boot up and start a shell is 372k. This is still not
> enough!

> Surely if lharc and Zip can run under the same conditions using the
> same compression algorithm, compress ought to be
> able to. Actually, I wouldn't really mind if compress took 550k to
> run, just so long as it doesn't have to be contiguous.

> So. Any pointers?

Well, I used to compress files under Unix and uncompress them on my 512K
A1000 all the time; the secret is the "-b" flag.  When compressing _for_
a small memory system, or when compressing _on_ a small memory system,
use "-b14", "-b13", or even "-b12", until you get a size that works.

I'm a bit fuzzy on this, but I think the ## in "-b##" is the base-two
log of the size of the look-back table that compress uses to find
strings it already knows, so it can emit a short code for them instead
of copying them to the output.

Obviously, the bigger the look-back buffer, the better chance of finding
a really long string match, and so the better the potential compression.
As a result, designing a _big_ buffer into compress is a Good Thing.

However, it turns out that even "-b12" is pretty efficient compared to
the default "-b16" on a Unix system or "-b14" in the Amiga implementation
I use.
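
To put rough numbers on the memory side of that, here's a
back-of-envelope sketch.  This is *not* the real compress source; I'm
assuming about three bytes per table entry, and the real program adds
an open hash table (somewhat larger than 2^bits slots) and I/O buffers
on top of that:

/* Back-of-envelope sketch of how the -b## value bounds compress's
 * string table.  NOT the real compress sources; assumes roughly one
 * 2-byte prefix code and one suffix byte per entry.  The real
 * program uses a somewhat larger open hash table, so its actual
 * footprint is bigger still.
 */
#include <stdio.h>

int main(void)
{
    long bits, entries, approx_bytes;

    for (bits = 9; bits <= 16; bits++) {
        entries = 1L << bits;           /* table entries = 2^bits  */
        approx_bytes = entries * 3L;    /* ~3 bytes per entry      */
        printf("-b%-2ld  %6ld entries  roughly %ldK of table\n",
               bits, entries, approx_bytes / 1024);
    }
    return 0;
}

That's why the 16-bit default ends up wanting a few hundred K once the
real hash table and buffers are counted, while a "-b12" table is
almost nothing.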

If you get a file compressed by someone _else_, your best bet is to do
an uncompress, compress -b14 on your host site before transferring the
data down, just to be on the safe side.

By the way, you've mostly identified the problem you're having: the
Amiga memory is "hunky" from memory manager fragmentation, while a Unix
process gets its own clean 16M of virtual memory in which to allocate
its work buffers.  Naturally compress, being a Unix utility designed
for speed, doesn't take into account that the look-back buffer might
need to be allocated as a linked list of smaller contiguous pieces,
slowing down access and compression speed a lot.  Better to use the
"-b12" flag than to rewrite compress to run more slowly.
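
Just to show what that slower rewrite would look like, here's a purely
hypothetical sketch of a "hunky" table carved into pieces so that no
single allocation has to be contiguous.  Nothing like this is in the
real compress; the chunk size and all the names are made up:

/* Hypothetical "hunky" string table: the entries are split across
 * CHUNK-sized pieces so no single allocation has to be contiguous.
 * Every access pays for an extra divide and index, which is exactly
 * the slowdown mentioned above.  Not real compress code.
 */
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4096L                    /* entries per piece          */

typedef struct {
    long nentries;
    unsigned short **pieces;           /* array of CHUNK-entry hunks */
} hunky_tab;

static hunky_tab *hunky_alloc(long nentries)
{
    long npieces = (nentries + CHUNK - 1) / CHUNK, i;
    hunky_tab *t = malloc(sizeof *t);

    if (t == NULL)
        return NULL;
    t->nentries = nentries;
    t->pieces = malloc(npieces * sizeof *t->pieces);
    if (t->pieces == NULL) {
        free(t);
        return NULL;
    }
    for (i = 0; i < npieces; i++) {
        t->pieces[i] = calloc(CHUNK, sizeof **t->pieces);
        if (t->pieces[i] == NULL)      /* give up; real code would   */
            return NULL;               /* free what it got so far    */
    }
    return t;
}

/* the extra indirection every table reference now costs */
#define HUNKY_GET(t, i)    ((t)->pieces[(i) / CHUNK][(i) % CHUNK])
#define HUNKY_SET(t, i, v) ((t)->pieces[(i) / CHUNK][(i) % CHUNK] = (v))

int main(void)
{
    hunky_tab *t = hunky_alloc(1L << 16);   /* a -b16 sized table */

    if (t == NULL)
        return 1;
    HUNKY_SET(t, 40000L, 123);
    printf("entry 40000 = %d\n", HUNKY_GET(t, 40000L));
    return 0;
}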

Oh, yeah, the "-b" flag isn't needed to uncompress the data; the buffer
size is recorded in the header of the compressed data file.
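
You can even peek at it yourself.  As I remember the format (so treat
this little checker as a sketch from memory, not gospel), a .Z file
starts with the two magic bytes 1F 9D, and the low five bits of the
third byte are the "-b" value it was made with:

/* Report the -b value a .Z file was compressed with.  Based on my
 * recollection of the format: magic bytes 0x1F 0x9D, then a byte
 * whose low five bits are maxbits and whose high bit flags block-
 * compression mode.  Verify against your own sources.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *fp;
    int b0, b1, b2;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file.Z\n", argv[0]);
        return 1;
    }
    if ((fp = fopen(argv[1], "rb")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    b0 = getc(fp);
    b1 = getc(fp);
    b2 = getc(fp);
    fclose(fp);

    if (b0 != 0x1F || b1 != 0x9D || b2 == EOF) {
        fprintf(stderr, "%s: not a compress'ed file\n", argv[1]);
        return 1;
    }
    printf("%s: compressed with -b%d%s\n", argv[1], b2 & 0x1F,
           (b2 & 0x80) ? " (block mode)" : "");
    return 0;
}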

As to lharc and zip, I don't know whether they inherently use smaller
buffers than the compress default (probably, though, since both had
their origins within MS-DOS's 640K address space), though obviously they
have to use buffers at least as big as the ones on the system that
created the archive you are unpacking. It is alternatively possible (much
less likely) that they know how to do "hunky" buffers.

In general, your "using the same algorithm" ignores the fact that the
compress algorithm has a scaling factor controlled by the "-b" flag,
and so is really a family of algorithms with different buffer needs,
and that's where the magic is that makes things work.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (01/29/91)

ben@epmooch.UUCP (Rev. Ben A. Mesander) writes:

> I've had another problem with compress. I run AmigaUUCP 1.06D, and
> when I get a compressed tar archive, I normally unpack it with:

> uncompress <ruru.tar.Z | tar xvf -

> But with large archives, say bigger than 600K, this hangs. Right
> before it hangs, I see some garbage graphics appear at random on my
> screen, implying that a wild pointer has hit chip memory. Has anyone
> else seen this? If it's less than about 1.5 megabytes, I can
> uncompress the file, and then untar it in separate steps. If it's more
> than 2 megs, I have to transfer it to a unix system, unpack it, and
> rearchive it with zoo. I have a 2.5MB Amiga 1000.

See my parallel answer elsewhere in this thread for more detail, but one
thing I forgot to mention there, which may explain your observation, is
that the compress algorithm grows its look-back buffer dynamically, doing
the equivalent of "-b9", then ..., then "-b12", then "-b13", then "-b14",
and so on as it goes along. If a buffer grow fails and the failure isn't
checked, then stuff could get overwritten.
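
Roughly like this, if memory serves (a sketch of the idea only, not
the real sources): the encoder starts with 9-bit codes and widens them
every time the next free table slot no longer fits, up to the "-b"
ceiling:

/* Sketch of how the code width creeps up as the string table fills:
 * start at 9 bits and add a bit every time the next free code no
 * longer fits, until the -b## ceiling (maxbits) is reached.
 * Illustration only, not the real compress sources.
 */
#include <stdio.h>

#define FIRST_CODE 257L         /* 256 literals plus a clear code   */

int main(void)
{
    int maxbits = 16;           /* the -b value                     */
    int n_bits = 9;             /* current output code width        */
    long free_ent = FIRST_CODE; /* next unused table slot           */
    long added;

    for (added = 0; added < (1L << maxbits); added++) {
        if (free_ent >= (1L << n_bits) && n_bits < maxbits) {
            n_bits++;           /* table doubled: widen the codes   */
            printf("after %ld strings: %d-bit codes, room for %ld\n",
                   added, n_bits, 1L << n_bits);
        }
        if (free_ent < (1L << maxbits))
            free_ent++;         /* pretend we added another string  */
    }
    return 0;
}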

For small files, the look-back buffer apparently doesn't get big enough
to cause problems; for big (enough) ones it does. [My memory of the
compress algorithm makes "vague" an understatement, but I think the
buffer works with a compressed version of the data, rather than the raw
bytes, so the point in an input file where it overflows and tries to
double in size again may be input data dependent.]

I'm a bit surprised it hasn't caused worse problems than you describe,
though; a wild pointer is always a joy to behold.  ;-)

As I recommended in my other answer, _always_ doing an uncompress, and
then a compress -b12, before downloading, should take care of your
problem, since the "-b" flag controls the _maximum_ look-back buffer
size that will be used to compress, and thus be required to uncompress,
the data.

Either because my "the look-back buffer" is really a family of buffers,
or because of a lot of pointer overhead per data byte, I seem to
remember that "-b12" corresponds to 64Kbytes of buffer space, with
subsequent higher "-b" values doubling that.

[Oh, yes; in another sense, the ## in "-b##" is the length in bits of
the back-pointers being written to the output data, but that also
corresponds to the span of look-back table such a pointer can reach,
and thus the table size that pointer requires.]

Compression algorithms are fascinating; I could watch other people
write them all day.  Maybe some day I'll be well enough to resume
work on mine.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>

limonce@pilot.njin.net (Tom Limoncelli) (01/29/91)

In article <1991Jan28.192438.14385@zorch.SF-Bay.ORG> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:

> Compression algorithms are fascinating; I could watch other people
> write them all day.  Maybe some day I'll be well enough to resume
> work on mine.

I was thinking the other day that it's possible to beat LZ compression by
an order of magnitude by methods that just aren't as fast.  LZ trades
size for super speed.

A good tokenizing compression could be a killer for Usenet news, but
you'd need a new token table for each class of newsgroup (sex, social,
technical, etc.).

I wonder if you could do a statistical analysis of each newsgroup,
figure out which ones have similar tables, and compress those using
their tables, sources and binary newsgroups using LZ compress, etc.

In fact, the token tables could be dynamic.  We could create something
like comp.mail.maps (news.compress.tables?) which would constantly be
posting the tables generated for last month's newsflow.
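
Something like this toy, I mean, where the "table" is just the most
common words seen in a group last month and the compressor swaps each
one for a single high byte.  Strictly a late-night sketch; the word
list, the token values, and the assumption that news is 7-bit ASCII
are all made up:

/* Toy "tokenizing" compressor for the idea above: replace the most
 * common words of a newsgroup with single bytes above 0x7F.  The
 * word list, the 0x80 token base, and the 7-bit ASCII assumption
 * are invented for illustration; nothing like this exists.
 */
#include <stdio.h>
#include <string.h>

static const char *table[] = {      /* "last month's" frequent words */
    "compress", "Amiga", "memory", "buffer", "the", "and"
};
#define NTOKENS (sizeof table / sizeof table[0])

int main(void)
{
    char word[256];
    int i, matched;

    /* read whitespace-delimited words; emit a token byte or the word */
    while (scanf("%255s", word) == 1) {
        matched = 0;
        for (i = 0; i < (int)NTOKENS; i++) {
            if (strcmp(word, table[i]) == 0) {
                putchar(0x80 + i);      /* one byte instead of a word */
                matched = 1;
                break;
            }
        }
        if (!matched)
            fputs(word, stdout);
        putchar(' ');                   /* crude: whitespace collapses */
    }
    return 0;
}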

I've got to stop writing email late at night.  This almost makes sense.

Tom
-- 
One thousand, one hundred, seventy five people died of AIDS
last week.  Did someone mention a war in Iraq?

C506634@UMCVMB.MISSOURI.EDU (Eric Edwards) (01/30/91)

In Message-ID: <1991Jan28.190108.13993@zorch.SF-Bay.ORG>
          xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) said:
>
>If you get a file compressed by someone _else_, your best bet is to do
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     The only case I'm really interested in.....
>an uncompress, compress -b14 on your host site before transferring the
>data down, just to be on the safe side.
Not practical.  UMCVMB is not a unix box.  It's an IBM 3090-170J with 7-bit
dialups.  If I have to muck with a file on the mainframe I'll uncompress it
and download the resulting text file after converting to EBCDIC.  It's
faster but still a lot of hassle.

Really though, if I have a lot of files to transfer I take the mainframe out
of the loop and ftp directly to a PC.  There's no opportunity to massage the
files first.

>Better to use the "-b12" flag than to rewrite compress to run more slowly.

Great idea, Kent!  Now, how do we go about convincing all the archive operators
on the Internet to compress with "-b12"?

>As to lharc and zip, I don't know whether they inherently use smaller
>buffers than the compress default (probably, though, since both had

The point is, they achieve compression as good as or better than compress,
with smaller memory requirements.

>In general, your "using the same algorithm" ignores the fact that the
>compress algorithm has a scaling factor controlled by the "-b" flag,

Which is almost useless information since almost all compressed files are
done with 16 bit compression.

On a more productive note, I've made a little progress in getting compress
to run on my system.  By running nofastmem at the beginning and end of my
startup-sequence I am able to force most memory allocations into chip RAM.
This leaves a 450k block of fast RAM after boot up, enough to run compress.
It doesn't take very long, though, for this block to get broken up.  So, in
general, I have to reboot if I want to run compress.

Eric Edwards: c506634 @    "The 3090.  Proof that by applying state of the
Inet: umcvmb.missouri.edu   art technology to an obsolete architecture,
Bitnet: umcvmb.bitnet       one can achieve mediocre performance."