[comp.protocols.tcp-ip] Message compression

teb@SATURN.ACC.COM (Tom Benkart) (04/03/91)

In applications involving file transfers (FTP) over low speed lines,
has anyone investigated any type of data compression on a message
by message basis?  I'm familiar with the header compression
algorithms, but don't know of any data compression algorithms.
The pay-off might not be very high, but any compression could
be useful.

Tom Benkart
ACC

solensky@animal.clearpoint.com (Frank T. Solensky) (04/04/91)

In article <9104031437.AA14650@saturn.acc.com> teb@SATURN.ACC.COM (Tom Benkart) writes:
   In applications involving file transfers (FTP) over low speed lines,
   has anyone investigated any type of data compression on a message
   by message basis?  I'm familiar with the header compression
   algorithms, but don't know of any data compression algorithms.
   The pay-off might not be very high, but any compression could
   be useful.

One of the problems with a number of data compression algorithms
(eg: Lempel-Ziv encoding, the one used by the Unix 'compress' command)
is that they need to be able to look at the entire data stream before
being able to compress any part of it.  In this case, it would be about
the same as running compress yourself and then FTPing the resulting file.

Another approach would be to use something like an adaptive Huffman code
(the details of which I know nothing about).  The '/usr/old/compact' command
on Sun's Release 4.1 takes this approach by encoding the data as it is being
read in and sending it out immediately.  The man page lists the following
reference (and also indicates that the command will NOT be distributed or
supported in future releases):
	Gallager, Robert G., "Variations on a Theme of Huffman,"
	IEEE Transactions on Information Theory, vol. IT-24, no. 6,
	November 1978, pp. 668-674.
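
To make the streaming property concrete (and assuming nothing about what
'compact' actually does internally), here is a deliberately naive C sketch
of adaptive Huffman coding: it keeps running byte counts and rebuilds the
Huffman tree from scratch after every symbol instead of using Gallager's
incremental update, so it is slow, but it shows that each byte's code can
go out the moment the byte is read, with no pre-pass over the data.

	/* Rebuilds the Huffman tree from running byte counts after every
	   symbol (a naive stand-in for Gallager's incremental update).
	   Codes are printed as '0'/'1' characters rather than packed bits. */
	#include <stdio.h>
	#include <stdlib.h>

	#define NSYM 256

	static long freq[NSYM];

	struct node { long w; int sym; struct node *l, *r; };

	static struct node *mknode(long w, int sym, struct node *l, struct node *r)
	{
	    struct node *p = malloc(sizeof *p);
	    p->w = w; p->sym = sym; p->l = l; p->r = r;
	    return p;
	}

	/* Huffman tree over all byte values, weighted count+1 so every byte
	   always has a code.  O(n^2) pair selection -- slow, but short. */
	static struct node *build(void)
	{
	    struct node *v[NSYM];
	    int n, i;

	    for (n = 0; n < NSYM; n++)
	        v[n] = mknode(freq[n] + 1, n, NULL, NULL);
	    while (n > 1) {
	        int a = 0, b = 1;           /* indices of the two lightest nodes */
	        if (v[b]->w < v[a]->w) { a = 1; b = 0; }
	        for (i = 2; i < n; i++) {
	            if (v[i]->w < v[a]->w)      { b = a; a = i; }
	            else if (v[i]->w < v[b]->w) { b = i; }
	        }
	        v[a] = mknode(v[a]->w + v[b]->w, -1, v[a], v[b]);
	        v[b] = v[--n];              /* drop slot b, shrink the list */
	    }
	    return v[0];
	}

	/* write sym's codeword into buf; returns its length, -1 if not found */
	static int code_for(struct node *t, int sym, char *buf, int depth)
	{
	    int d;

	    if (t->l == NULL) {             /* leaf */
	        if (t->sym != sym) return -1;
	        buf[depth] = '\0';
	        return depth;
	    }
	    buf[depth] = '0';
	    if ((d = code_for(t->l, sym, buf, depth + 1)) >= 0) return d;
	    buf[depth] = '1';
	    return code_for(t->r, sym, buf, depth + 1);
	}

	static void freetree(struct node *t)
	{
	    if (t == NULL) return;
	    freetree(t->l); freetree(t->r); free(t);
	}

	int main(void)
	{
	    char buf[NSYM + 1];
	    int c;

	    while ((c = getchar()) != EOF) {
	        struct node *t = build();
	        code_for(t, c, buf, 0);
	        fputs(buf, stdout);         /* the code goes out immediately */
	        putchar('\n');
	        freetree(t);
	        freq[c]++;                  /* adapt only after coding, so a
	                                       decoder can stay in lock-step */
	    }
	    return 0;
	}

A decoder would keep the same counts and rebuild the same tree, so both
ends stay synchronized without either ever seeing the data in advance.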

I believe that the Applications Area of the Internet Engineering Task Force
is planning to look into improvements to FTP such as this at some point in
the near future, but I'm afraid I don't know what the time frames or immediate
plans are.


--
				-- Frank Solensky / Clearpoint Research Corp.
Red Sox magic number: 163

henry@zoo.toronto.edu (Henry Spencer) (04/05/91)

In article <SOLENSKY.91Apr4114120@animal.clearpoint.com> solensky@animal.clearpoint.com (Frank T. Solensky) writes:
>One of the problems with a number of data compression algorithms
>(eg: Lempel-Ziv encoding, the one used by the Unix 'compress' command)
>is that they need to be able to look at the entire data stream before
>being able to compress any part of it...

Telebit would be very surprised to hear this; all their modems do
Lempel-Ziv compression on the fly.  No, LZ does *not* need to look at
the entire data stream before being able to compress any part of it.
-- 
"The stories one hears about putting up | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 are all true."  -D. Harrison|  henry@zoo.toronto.edu  utzoo!henry

solensky@animal.clearpoint.com (Frank T. Solensky) (04/05/91)

In article <1991Apr4.180239.12442@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
   Telebit would be very surprised to hear this; all their modems do
   Lempel-Ziv compression on the fly.  No, LZ does *not* need to look at
   the entire data stream before being able to compress any part of it.

Faux pas, mea culpa, *sigh*!  Looking at the man page again, I see that
the "compress" command does indeed use _adaptive_ LZ coding.  The main
point I wanted to make was that the IETF is, at least, aware of the desire
for compression within FTP.

--
				-- Frank Solensky / Clearpoint Research Corp.
Red Sox magic number: 163

ww0n+@ANDREW.CMU.EDU (Walter Lloyd Wimer III) (04/05/91)

> Excerpts from internet.tcp-ip: 4-Apr-91 Re: Message compression Frank T.
> Solensky@ucsd.e (1666)

> One of the problems with a number of data compression algorithms
> (eg: Lempel-Ziv encoding, the one used by the Unix 'compress' command)
> is that they need to be able to look at the entire data stream before
> being able to compress any part of it.  In this case, it would be about
> the same as running compress yourself and then FTPing the resulting file.


From empirical evidence, I don't believe this is true.  The 'compress'
program can read from a pipe.  I've used this feature to create a
compressed tar file of a directory tree in a single step:

	tar -cf - somedirectory | compress -c > somedirectory.tar.Z

Granted, it could be buffering quite a bit of data in virtual memory,
but I doubt it was buffering the 90 megabytes worth of data from one
particular tar I remember.  (My system only has 64 megs of swap space...)

Recently, there was also an excellent posting to the comp.sources.unix
newsgroup concerning compression techniques.  It included a draft of a
paper on the workings of various compression algorithms, including a
newly-invented variant of Lempel-Ziv which the author seems to claim is
free of patent (he calls it "Y-coding").  While I know and understand
very little about compression techniques, a brief reading of the paper
seems to suggest that these compression techniques work quite well even
applied to (relatively small) finite-sized chunks of data.  An
implementation of the new algorithm is included in the posting.


Walt Wimer
Network Development
Carnegie Mellon University

dcarr@hobbit.gandalf.ca (Dave Carr) (04/05/91)

In <1991Apr4.180239.12442@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

>In article <SOLENSKY.91Apr4114120@animal.clearpoint.com> solensky@animal.clearpoint.com (Frank T. Solensky) writes:
>>One of the problems with a number of data compression algorithms
>>(eg: Lempel-Ziv encoding, the one used by the Unix 'compress' command)
>>is that they need to be able to look at the entire data stream before
>>being able to compress any part of it...

>Telebit would be very surprised to hear this; all their modems do
>Lempel-Ziv compression on the fly.  No, LZ does *not* need to look at
>the entire data stream before being able to compress any part of it.
>-- 
I agree with Henry.  LZW-type algorithms can be made to adapt very quickly
by adjusting the number of characters learned after each match.  For example,
V.42 bis data compression has the number of characters learned as a settable
parameter (N8, I believe).
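
For concreteness, here is a minimal sketch of a classic LZW encoder in C,
closer in spirit to 'compress' than to V.42 bis; the point where a new
dictionary entry is added after each match is exactly the "learning" step
being discussed.  To keep it short, it prints codes in decimal, one per
line, instead of packing bits, and searches the dictionary linearly rather
than hashing.

	/* Minimal streaming LZW encoder: a code is emitted as soon as each
	   longest match ends, and one new string is learned per match --
	   nothing needs the whole input in advance. */
	#include <stdio.h>

	#define MAXCODES 4096               /* 12-bit code space */

	static int pref[MAXCODES];          /* entry = (prefix code, suffix byte) */
	static int suff[MAXCODES];
	static int ncodes = 256;            /* 0..255 are the literal bytes */

	static int find(int p, int c)       /* linear search; real coders hash */
	{
	    int i;
	    for (i = 256; i < ncodes; i++)
	        if (pref[i] == p && suff[i] == c)
	            return i;
	    return -1;
	}

	int main(void)
	{
	    int c, w = getchar();           /* w = code for the match so far */

	    if (w == EOF) return 0;
	    while ((c = getchar()) != EOF) {
	        int wc = find(w, c);
	        if (wc >= 0) {
	            w = wc;                 /* current match grows by one byte */
	        } else {
	            printf("%d\n", w);      /* emit the match immediately */
	            if (ncodes < MAXCODES) {    /* "learn" one new string */
	                pref[ncodes] = w;
	                suff[ncodes] = c;
	                ncodes++;
	            }
	            w = c;                  /* restart the match at this byte */
	        }
	    }
	    printf("%d\n", w);              /* flush the final match */
	    return 0;
	}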

As for speed, LZW can be made very fast.  After all, it was designed with
disk controller applications in mind.

The main problem with LZW is expanding uncompressible (or precompressed) data.
Again, this is addressed by V.42 bis (but not by standard LZW).

If the compressor can handle the uncompressible cases without expansion (or
at least minimize the expansion), then TCP/IP compression is a winning solution.
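
As a sketch of what such a guard might look like (a hypothetical fragment,
not anything out of the V.42 bis specification): compress each segment's
data independently and fall back to sending it raw, plus a one-byte flag,
whenever the compressor fails to shrink it, so the worst case costs one
byte rather than a blow-up.

	/* Hypothetical anti-expansion guard.  'compress_fn' stands in for
	   whatever compressor is in use (LZW, splay trees, ...). */
	#include <stddef.h>
	#include <string.h>

	enum { SEG_RAW = 0, SEG_COMPRESSED = 1 };

	/* Must write at most 'outmax' bytes and return the compressed length,
	   or any value >= inlen if the data will not fit (did not shrink). */
	typedef size_t (*compress_fn)(const unsigned char *in, size_t inlen,
	                              unsigned char *out, size_t outmax);

	/* 'out' needs room for inlen + 1 bytes; returns the bytes written. */
	size_t encode_segment(compress_fn compress,
	                      const unsigned char *in, size_t inlen,
	                      unsigned char *out)
	{
	    size_t clen = compress(in, inlen, out + 1, inlen);

	    if (clen < inlen) {
	        out[0] = SEG_COMPRESSED;
	        return clen + 1;
	    }
	    out[0] = SEG_RAW;           /* didn't shrink: ship it untouched */
	    memcpy(out + 1, in, inlen);
	    return inlen + 1;
	}

The receiver just checks the flag byte and either runs the matching
decompressor or copies the payload straight through.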

Couple this with Van Jacobson's header compression (or, better, the one which we
will soon debut) and you can get substantial compression.  I won't quote
compression rates and spoil the marketing people's fun.

BTW Henry, does Telebit use LZW, LZSS, or LZH?
 

maniac@howlin.cs.unlv.edu (Eric J. Schwertfeger) (04/08/91)

In article <obz08jq00WCpE4ddc4@andrew.cmu.edu>, ww0n+@ANDREW.CMU.EDU (Walter Lloyd Wimer III) writes:

) > is that they need to be able to look at the entire data stream before
) > being able to compress any part of it.  In this case, it would be about
) > the same as running compress yourself and then FTPing the resulting file.
) 
) 
) From empirical evidence, I don't believe this is true.

	It isn't.  I've written LZW compression code before, and it is highly
stream-oriented.  The hash table is created and updated on both ends on the
fly, with only one special case.
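
For the curious, the one special case is a code that refers to the entry
the encoder created while processing the very string being decoded (the
"KwKwK" case).  A minimal decoder matching the encoder sketched earlier in
the thread (one decimal code per line; a real implementation would unpack
bits and use a hash table) might look like this:

	/* Matching decoder for the decimal-code LZW encoder sketch above.
	   The single special case is a code not yet in the table: it can
	   only be the previous string plus that string's own first byte. */
	#include <stdio.h>

	#define MAXCODES 4096

	static int pref[MAXCODES], suff[MAXCODES];
	static int ncodes = 256;                /* 0..255 are literal bytes */

	static int expand(int code, unsigned char *buf)   /* returns length */
	{
	    int n = 0, i;

	    while (code >= 256) {
	        buf[n++] = (unsigned char)suff[code];
	        code = pref[code];
	    }
	    buf[n++] = (unsigned char)code;
	    for (i = 0; i < n / 2; i++) {       /* bytes came out reversed */
	        unsigned char t = buf[i]; buf[i] = buf[n-1-i]; buf[n-1-i] = t;
	    }
	    return n;
	}

	int main(void)
	{
	    unsigned char buf[MAXCODES];
	    int code, prev = -1, n;

	    while (scanf("%d", &code) == 1) {
	        if (code < ncodes) {
	            n = expand(code, buf);
	        } else {                        /* the special case (KwKwK) */
	            n = expand(prev, buf);
	            buf[n++] = buf[0];
	        }
	        fwrite(buf, 1, n, stdout);
	        if (prev >= 0 && ncodes < MAXCODES) {
	            pref[ncodes] = prev;        /* same entry the encoder made */
	            suff[ncodes] = buf[0];
	            ncodes++;
	        }
	        prev = code;
	    }
	    return 0;
	}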

-- 
Eric J. Schwertfeger, maniac@jimi.cs.unlv.edu

bkc@OMNIGATE.CLARKSON.EDU (Brad Clements) (04/08/91)

Regarding compressing TCP packets for serial transmission: I've developed an
extension to Jacobson's TCP header compression routines that will compress
the TCP data portion using splay-tree compression.  This routine uses
'state trees', with a variable number of trees being assigned to each
active TCP connection (depending on the major data direction of the stream).

The routine dynamically moves state trees to the connections where they do the
most good, and can do so without entirely resetting the trees of existing
connections that gain or lose trees.

For example, I can support 32 state trees in about 32K of memory.  This
gives me 5-to-1 compression on 3278-type data streams, and 3- or 4-to-1 on
standard text.

The advantages of using state trees were twofold:

	1. Low memory overhead in the kernel.
	2. The ability to dynamically reconfigure memory usage (in this case,
	   moving trees between connections as needed) without having to dump
	   'all learned patterns' each time; a rough sketch follows this list.
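
Purely as a hypothetical illustration of the bookkeeping in point 2 (this
is not the code described above, and the splay trees themselves are
elided): a fixed pool of state-tree slots, each owned by at most one
connection, where reassigning a slot resets only that slot's learned state
and leaves the connection's remaining trees untouched.

	#include <string.h>

	#define NTREES 32

	struct state_tree {
	    int owner;                  /* connection id, or -1 if unassigned */
	    /* ... splay-tree nodes for this slot would live here ... */
	};

	static struct state_tree pool[NTREES];

	static void reset_tree(struct state_tree *t)
	{
	    memset(t, 0, sizeof *t);    /* forget only this tree's patterns */
	    t->owner = -1;
	}

	/* Take one tree away from 'from_conn' and give it to 'to_conn';
	   returns the slot index, or -1 if from_conn had none to spare. */
	static int reassign_tree(int from_conn, int to_conn)
	{
	    int i;

	    for (i = 0; i < NTREES; i++) {
	        if (pool[i].owner == from_conn) {
	            reset_tree(&pool[i]);
	            pool[i].owner = to_conn;
	            return i;
	        }
	    }
	    return -1;
	}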

The routine automatically detects uncompressible data (for FTPs of
already-compressed files, for example), and so on.

Anyway, this is all quite experimental, runs on top of PPP for SunOS,
and isn't yet available for general consumption (please don't ask for the
code yet).

I know that compressing modems are becoming more prevalent these days, but
this cheap solution gives my 2400 baud modem 5600 bps of throughput
(best case).  I have not had a chance to compare this method with
compression of the entire data stream within the modem.  However, I think that
even using splay trees (which are not as effective as adaptive LZW), I get
better compression on serial links carrying several different TCP
sessions, since the compression state information is tracked per TCP
connection, and the TCP header data (already 'compressed' by VJC) is not
examined during the compression process.

Of course, when packets are dropped, VJ compression gives the splay routines
enough information to have both the compressor and decompressor reset their
trees.


| Brad Clements          bkc@omnigate.clarkson.edu        bkc@clutx.bitnet 
| Sr. Network Engineer   Clarkson University              (315)268-2292