[comp.dcom.modems] Merits of compression

tgl@ZOG.CS.CMU.EDU (Tom Lane) (04/17/88)

In article <4435@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> It's sad but true that the bottleneck in sending ASCII data between
> systems through a Telebit modem is getting to be the 19200 max speed on
> the serial cable.  Telebit really should support 38,400 baud.

Actually, the bottleneck is getting to be the computer system's serial
data transfer rate; see below.  I seriously doubt that many Unix systems
could sustain a 38kbps transfer through a serial port, especially if
you wanted to get any other work done at the same time.

> Jerry Aguirre mentioned that compressing data on the host burns less
> CPU cycles too; this would be because running "compress" over say 500K
> of data is cheaper than squirting 188K (the data saved) out the serial
> port.  The inner loop of compress must be a *lot* smaller than the
> inner loop of the serial drivers on his system, though this might not
> be true on all systems.

The way I understand it, the problem is that the serial drivers don't *have*
an inner loop; they're interrupt-per-character, and the context swap into
the kernel, for each byte, is what kills you.  Since most serial interfaces
can't do DMA or buffer a reasonable number of input characters, there isn't
any good way to improve on the situation.

The basis of the problem is that both the serial interface hardware and
the drivers were designed around the assumption of a terminal at the
other end.  The better ones can go pretty fast in the output direction,
but high-speed input just wasn't thought of.  So the bottleneck for
news transfer is usually the receiving system, not the sender.

Incidentally, LZW compression *is* pretty damn quick, and decompression is
even quicker.  It was designed for hardware implementation at megahertz
throughput rates, with not a lot of hardware except the table memories.
Welch's original article talks about implementing it in a hard disk
controller, with on-the-fly compression at disk I/O transfer rates...

I doubt that compress is going to be the bottleneck any time soon,
particularly at the receiving end.

Note that this implies that compression in the modem isn't a win; in fact,
it's a major loss if the modem tries to compress already-compressed data.
(At least, it is if both are using the same compression method.  In the
case of LZW, you are likely to see as much as 50% expansion if you try
compressing compressed data.)

-- 
				tom lane
Internet: tgl@zog.cs.cmu.edu
UUCP: <your favorite internet/arpanet gateway>!zog.cs.cmu.edu!tgl
BITNET: tgl%zog.cs.cmu.edu@cmuccvma

rroot@edm.UUCP (uucp) (04/19/88)

From article <1441@pt.cs.cmu.edu>, by tgl@ZOG.CS.CMU.EDU (Tom Lane):
> In article <4435@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
>> It's sad but true that the bottleneck in sending ASCII data between
>> systems through a Telebit modem is getting to be the 19200 max speed on
>> the serial cable.  Telebit really should support 38,400 baud.

> The way I understand it, the problem is that the serial drivers don't *have*
> an inner loop; they're interrupt-per-character, and the context swap into
> the kernel, for each byte, is what kills you.  Since most serial interfaces
Note that this only applies to MOST systems.  There are, in fact, some that
can handle block I/O.  I believe some of the Convergent I/O boards fit
in this category, since I have seen the box I'm on sustain a rate in the
high 8000 bps range running UUCP at 9600 baud.  (I've never seen it run
at 19200, though.)
  If the processor is set up for DMA, then having the modem do the
compression SHOULD be a win.  I think that would be the case here.
  PC-type boxes/boards, because they aren't generally designed for
heavy multi-user work, are more likely to have real problems with I/O
interrupts.
-- 
-------------
 Stephen Samuel 
  {ihnp4,ubc-vision,vax135}!alberta!edm!steve
  or userzxcv@uqv-mts.bitnet

work@dragos.UUCP (Dragos Ruiu) (04/19/88)

In article <1441@pt.cs.cmu.edu>, tgl@ZOG.CS.CMU.EDU (Tom Lane) writes:
[Discussion on whether compression in the modem is a win, mostly with news
feeds]
> 
> Note that this implies that compression in the modem isn't a win; in fact,
> it's a major loss if the modem tries to compress already-compressed data.
> (At least, it is if both are using the same compression method.  In the
> case of LZW, you are likely to see as much as 50% expansion if you try
> compressing compressed data.)

   It is very simple to keep track of your on-the-fly compression rate with
   LZW. I would assume that as soon as it sees negative compression for X
   packets, the 'Blazer would shut off compression for a while or send the
   next packet in plain text, only resuming compression when a packet
   that gives a positive compression ratio comes along.

   This would seem straightforward to implement. Can anyone enlighten us as
   to whether Telebit does this or not?

   BTW, the UNIX compress utility will not replace files that give negative
   compression ratios. It might also be interesting to hear how Telebit
   chose to implement LZW, i.e. when they flush their code table and whether
   they use variable-length bit strings when compression is turned on.
    
   How about it? Does anyone know? Are you allowed to tell?

   Someone out there with a 'Blazer might test the compression shut off by
   timing the compressed transfer of two files, one compressed and one in
   its original format.

-- 
Dragos Ruiu   ruiu@dragos.UUCP
        ...alberta!dragos!ruiu   "cat ansi.c | grep -v noalias >proper.c"

tgl@ZOG.CS.CMU.EDU (Tom Lane) (04/20/88)

In article <405@dragos.UUCP>, work@dragos.UUCP (Dragos Ruiu) writes:
>    It is very simple to keep track of your on the fly compression rate with
>    LZW. I would assume that as soon as it sees negative compression for X
>    packets, the 'Blazer would shut off compression for a while or send the 
>    next packet in plain text, only resuming the compression when a packet
>    that gives a positive compression ratio hits.

One of the few bad features of LZW is that it is very hard to turn it on and
off "on the fly" like this.  The problem is that the compression and
decompression symbol tables must be kept in sync, or decompression will be
impossible.  Compressing a chunk of text nearly always gives rise to new
entries in the table, so you cannot easily compress a packet and then decide
not to send it in compressed form.  (The decompressor must be run against
the compressed packet in order to update its own table.)  There are ways
around this if you have oodles of memory and computing power (remembering
and undoing the table changes, for instance), but I doubt that Telebit's
little 68000 is twiddling its thumbs much...

A simple way of doing it would be to flush both tables whenever you decided
that compression should be temporarily disabled.  You then couldn't turn
compression back on until you had seen some number N of packets that showed
adequate compression *starting from an empty table*.  I suspect that this
would hurt your overall results badly, but I don't have any data to prove it.
Anybody know of any actual research that's been done along this line?

>    It might also be interesting to hear how Telebit
>    chose to implement LZW, i.e. when they flush their code table and if they
>    use variable length bit-strings when compression is turned on.

I'd be interested to hear that too.

-- 
				tom lane
Internet: tgl@zog.cs.cmu.edu
UUCP: <your favorite internet/arpanet gateway>!zog.cs.cmu.edu!tgl
BITNET: tgl%zog.cs.cmu.edu@cmuccvma

work@dragos.UUCP (Dragos Ruiu) (04/22/88)

In article <1473@pt.cs.cmu.edu>, tgl@ZOG.CS.CMU.EDU (Tom Lane) writes:
> 
> One of the few bad features of LZW is that it is very hard to turn it on and
> off "on the fly" like this.  The problem is that the compression and
> decompression symbol tables must be kept in sync, or decompression will be
> impossible.  Compressing a chunk of text nearly always gives rise to new
> entries in the table, so you cannot easily compress a packet and then decide
> not to send it in compressed form.  (The decompressor must be run against
> the compressed packet in order to update its own table.) 
[edit]
> A simple way of doing it would be to flush both tables whenever you decided
> that compression should be temporarily disabled.  You then couldn't turn
> compression back on until you had seen some number N of packets that showed
> adequate compression *starting from an empty table*.  I suspect that this
> would hurt your overall results badly, but I don't have any data to prove it.
> Anybody know of any actual research that's been done along this line?

  In his original paper (IEEE Computer, June 1984), Welch says that LZW
  has a 5K adaptation zone (he was using fixed 12-bit codes). My personal
  experience has been that this is a conservative estimate. In my compressed
  packet machine experiment we found that on average it took about 1.5-3K
  before compression started to show a win with an empty table. This surprised
  us, but we didn't have time to research it thoroughly, so these are all
  ballpark guesstimates.  This was using a fixed 12-bit code with the average
  files kicking around a Un*x system. A variable-bit-length implementation
  should find the 'win zone' much quicker.
   
  If you flushed the tables when compression started going awry, you would
  keep compressing but ignore the compressed output. When the
  compression ratio goes back up to something you like, you flush
  the tables again and tell the other side to start decompressing once more.
  Using Welch's 5K figure, you lose 5K finding out that the compression
  rate has gone back to an acceptable level, and you lose another 5K before
  your compression adapts back to a good level. So in this worst-case
  scenario you send 10K of data that could have been compressed. I think
  that in real-world use you would lose a maximum of about 4K of data that
  could have been compressed.
   
  On the plus side, though, you gain automagic compression shutoff, and the
  benefit in throughput when your data stream sometimes contains
  pre-compressed data is a *big* win. To get this win you would have to
  keep track of the compression ratios of the last N packets, and only
  turn off compression when a history of badly compressed packets shows
  up. If your history window is too small, you may take too many 4K 'hits',
  making your compression less effective. But when a xxxK news batch
  goes through, the modem is going to have many chances to realize that
  compression is not a good idea.
  
  I'll agree with Tom that perhaps asking this of Telebit might be too much,
  as the problem is more difficult than an off-the-cuff analysis would
  indicate.  Maybe they can do it later.... (And maybe they can do V.32 and
  FAX and maybe the Usenet deal will come back and maybe I'll someday afford
  one :-)

-- 
Dragos Ruiu   ruiu@dragos.UUCP
        ...alberta!dragos!ruiu   "cat ansi.c | grep -v noalias >proper.c"

John_M@spectrix.UUCP (John Macdonald) (04/26/88)

In article <406@dragos.UUCP> work@dragos.UUCP (Dragos Ruiu) writes:
[Tom Lane explains that the LZW compression and decompression tables must
be kept in sync, so you cannot compress a packet and then decide not to
send it in compressed form; Dragos Ruiu discusses flushing the tables when
compression goes awry, the roughly 5K cost of re-adapting from an empty
table, and tracking the compression ratios of the last N packets]

There is a good solution to this!  For each chunk, do the compression.
If the result is smaller, use it; otherwise send the original.  Precede
each chunk with a one-bit flag to show whether it is compressed.
(In a modem you can get away with sending a one-bit flag, but even if that's
too much bother, a whole byte isn't very expensive.)  In both cases, you
update the tables as if you did the compression.

In the receiver, you read and check the flag.  If it indicates compression,
then uncompress as usual.  If not, *COMPRESS* the incoming data to keep
the tables up-to-date, but keep the uncompressed data to pass on and throw
away the compressed data.

For example:

original text:  ABCD EFGH

compress(ABCD,T1) = (MNO,T2),  compress(EFGH,T2) = (PQRST,T3)

source sends:   yMNO nEFGH    (y and n are the flags)

receiver gets this and does:
uncompress(MNO,T1) = (ABCD,T2), compress_and_ignore(EFGH,T2) = (EFGH,T3)

receiver outputs: ABCD EFGH
-- 
John Macdonald   UUCP:    {mnetor,utzoo}             !spectrix!jmm

tgl@ZOG.CS.CMU.EDU (Tom Lane) (04/26/88)

In article <580@spectrix.UUCP>, John_M@spectrix.UUCP (John Macdonald) writes:
> There is a good solution to this!  For each chunk, do the compression.
> If the result is smaller, use, otherwise send the original. Precede
> this choice with a one-bit flag to show whether the next chunk is compressed.
> [...] In both cases, you
> update the tables as if you did the compression.
> 
> In the receiver, you read and check the flag.  If it indicates compression,
> then uncompress as usual.  If not, *COMPRESS* the incoming data to keep
> the tables up-to-date, but keep the uncompressed data to pass on and throw
> away the compressed data.

Yeah, that's the obvious thing to do.  The trouble is that compression
and decompression tables are not the same thing; if you think about it,
they perform reverse functions, so it's not surprising that they need
to be indexed differently.  For LZW compression like the TrailBlazer
uses, the compression table is usually a hash table indexed by
<previous symbol, current character>; while the decompression table can
be a simple array indexed by <current symbol>.

I haven't come up with any better way of doing this idea than having
the receiver maintain BOTH compression and decompression tables,
updating both tables whenever a packet is received (using either the
compression or decompression algorithm, as required).  This is a
nontrivial hit in CPU terms, but the place where it really hurts you is
memory space.  A compression table for 12-bit-symbol LZW (TB algorithm)
is 24KB or more, depending on what kind of hash table load factor
you're willing to tolerate (= more CPU hit...).  A TrailBlazer hasn't
got that kind of memory to spare, I would imagine.

On the other hand, if you are talking compression/decompression in the
mainframe rather than the modem, and you've got oodles of memory space
and CPU cycles to play with, you could do it.  I guess that's another
argument for my original point (which was that the modem is the wrong
place to do the compression).

Anybody know anything about cheap ways to build doubly-indexed tables??
You could become the next initial in LZW!
PS: to really deserve that initial you should also figure out a really
good yet cheap algorithm for deciding when to flush the tables...
I seem to recall some discussion of that in the COMPRESS mailing list,
but I don't think it was ever solved satisfactorily.

-- 
				tom lane
Internet: tgl@zog.cs.cmu.edu
UUCP: <your favorite internet/arpanet gateway>!zog.cs.cmu.edu!tgl
BITNET: tgl%zog.cs.cmu.edu@cmuccvma

work@dragos.UUCP (Dragos Ruiu) (04/28/88)

In article <1529@pt.cs.cmu.edu>, tgl@ZOG.CS.CMU.EDU (Tom Lane) writes:
> In article <580@spectrix.UUCP>, John_M@spectrix.UUCP (John Macdonald) writes:
> [Talks about compressing incoming uncompressed data to keep LZW tables good]

> Yeah, that's the obvious thing to do.  The trouble is that compression
> and decompression tables are not the same thing; if you think about it,
> they perform reverse functions, so it's not surprising that they need
> to be indexed differently.  For LZW compression like the TrailBlazer
> uses, the compression table is usually a hash table indexed by
> <previous symbol, current character>; while the decompression table can
> be a simple array indexed by <current symbol>.
> 
> I haven't come up with any better way of doing this idea than having
> the receiver maintain BOTH compression and decompression tables,
> updating both tables whenever a packet is received (using either the
> compression or decompression algorithm, as required).  This is a
> nontrivial hit in CPU terms, but the place where it really hurts you is
> memory space.  A compression table for 12-bit-symbol LZW (TB algorithm)
> is 24KB or more, depending on what kind of hash table load factor
> you're willing to tolerate (= more CPU hit...).  A TrailBlazer hasn't
> got that kind of memory to spare, I would imagine.

 First, a couple of points:
   -When you decide to turn compression off, keeping the tables updated is
   probably of little significance, because the codes being built are
   useless.

   -Last I heard, the Telebit was reputed to have 512K of RAM.

 You can keep a compression table and a decompression table simultaneously.
 Simply have the hash table point to entries in the (de)compression table
 instead of storing the code and the follow character itself. (This saves
 you memory even if you don't care about keeping a decompression table.)
 The disadvantage of this approach is that you take an extra indirection
 hit on every hash table lookup: the hash table gives you the entry number,
 which you must then look up in the entry table.

 A minimal memory setup for bi-directional compression is about 32K:

     10K decompression table:  4096 * (1.5-byte code + 1-byte follow)
     10K compression entry table: "                                "
     12K hash table: 8192 * 1.5-byte compression table entry numbers

 This gives 50% hash table occupancy.

 And if you wanted to do 'compression' on the decompressed input like
 John Macdonald suggested, you would have to add another 12K hash table
 that points to entries in the decompression table.
-- 
Dragos Ruiu   ruiu@dragos.UUCP    "One can certainly imagine the myriad of
        ...alberta!dragos!ruiu     uses for a hand-held iguana maker." -Hobbes