tgl@ZOG.CS.CMU.EDU (Tom Lane) (04/17/88)
In article <4435@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> It's sad but true that the bottleneck in sending ASCII data between
> systems through a Telebit modem is getting to be the 19200 max speed on
> the serial cable.  Telebit really should support 38,400 baud.

Actually, the bottleneck is getting to be the computer system's serial data
transfer rate; see below.  I seriously doubt that many Unix systems could
sustain a 38kbps transfer through a serial port, especially if you wanted
to get any other work done at the same time.

> Jerry Aguirre mentioned that compressing data on the host burns less
> CPU cycles too; this would be because running "compress" over say 500K
> of data is cheaper than squirting 188K (the data saved) out the serial
> port.  The inner loop of compress must be a *lot* smaller than the
> inner loop of the serial drivers on his system, though this might not
> be true on all systems.

The way I understand it, the problem is that the serial drivers don't
*have* an inner loop; they're interrupt-per-character, and the context swap
into the kernel, for each byte, is what kills you.  Since most serial
interfaces can't do DMA or buffer a reasonable number of input characters,
there isn't any good way to improve on the situation.

The basis of the problem is that both the serial interface hardware and the
drivers were designed around the assumption of a terminal at the other end.
The better ones can go pretty fast in the output direction, but high-speed
input just wasn't thought of.  So the bottleneck for news transfer is
usually the receiving system, not the sender.

Incidentally, LZW compression *is* pretty damn quick, and decompression is
even quicker.  It was designed for hardware implementation at megahertz
throughput rates, with not a lot of hardware except the table memories.
Welch's original article talks about implementing it in a hard disk
controller, with on-the-fly compression at disk I/O transfer rates...
I doubt that compress is going to be the bottleneck any time soon,
particularly at the receiving end.

Note that this implies that compression in the modem isn't a win; in fact,
it's a major loss if the modem tries to compress already-compressed data.
(At least, it is if both are using the same compression method.  In the
case of LZW, you are likely to see as much as 50% expansion if you try
compressing compressed data.)

-- 
				tom lane
Internet: tgl@zog.cs.cmu.edu
UUCP: <your favorite internet/arpanet gateway>!zog.cs.cmu.edu!tgl
BITNET: tgl%zog.cs.cmu.edu@cmuccvma
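Lane's 50% expansion figure is easy to reproduce with a toy coder.  The
sketch below is a minimal fixed-12-bit LZW compressor (an illustration of
the general technique only, not the TrailBlazer's actual implementation);
random bytes stand in for already-compressed data, since good compressor
output is close to random.

```python
import random

def lzw_compress(data, max_code=4096):
    """Toy LZW with fixed-width 12-bit codes: emit the code for the
    longest known prefix, then add that prefix plus one byte to the table."""
    table = {bytes([i]): i for i in range(256)}   # roots: all single bytes
    next_code, w, codes = 256, b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc                       # keep extending the current match
        else:
            codes.append(table[w])       # emit code for longest match
            if next_code < max_code:     # grow the string table
                table[wc] = next_code
                next_code += 1
            w = bytes([byte])
    if w:
        codes.append(table[w])
    return codes

def compressed_size(data):
    return len(lzw_compress(data)) * 1.5  # fixed 12-bit codes = 1.5 bytes each

text = b"the quick brown fox jumps over the lazy dog " * 100
random.seed(0)
noise = bytes(random.randrange(256) for _ in range(len(text)))

print(compressed_size(text) / len(text))    # well below 1.0 on plain text
print(compressed_size(noise) / len(noise))  # near 1.5 on random bytes
```

On incompressible input almost every emitted code covers a single byte, and
at 1.5 bytes per code that is roughly the 50% expansion described above.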
rroot@edm.UUCP (uucp) (04/19/88)
From article <1441@pt.cs.cmu.edu>, by tgl@ZOG.CS.CMU.EDU (Tom Lane):
> In article <4435@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
>> It's sad but true that the bottleneck in sending ASCII data between
>> systems through a Telebit modem is getting to be the 19200 max speed on
>> the serial cable.  Telebit really should support 38,400 baud.
> The way I understand it, the problem is that the serial drivers don't *have*
> an inner loop; they're interrupt-per-character, and the context swap into
> the kernel, for each byte, is what kills you.  Since most serial interfaces

Note that this only applies to MOST systems.  There are, in fact, some that
can handle block I/O.  I believe some of the Convergent I/O boards fit in
this category, since I have seen the box I'm on sustain a baud rate in the
high 8000's running UUCP at 9600 baud.  (I've never seen it run at 19200,
though.)

If the processor is set up for DMA, then having the modem do the
compression SHOULD be a win.  I think that would be the case here.
PC-type boxes/boards, because they aren't generally designed for heavy
multi-user work, are more likely to have real problems with I/O interrupts.

-- 
-------------
 Stephen Samuel		{ihnp4,ubc-vision,vax135}!alberta!edm!steve
  or userzxcv@uqv-mts.bitnet
work@dragos.UUCP (Dragos Ruiu) (04/19/88)
In article <1441@pt.cs.cmu.edu>, tgl@ZOG.CS.CMU.EDU (Tom Lane) writes:
[Discussion on whether compression in the modem is a win, mostly with news feeds]
> Note that this implies that compression in the modem isn't a win; in fact,
> it's a major loss if the modem tries to compress already-compressed data.
> (At least, it is if both are using the same compression method.  In the
> case of LZW, you are likely to see as much as 50% expansion if you try
> compressing compressed data.)

It is very simple to keep track of your on-the-fly compression rate with
LZW.  I would assume that as soon as it sees negative compression for X
packets, the 'Blazer would shut off compression for a while or send the
next packet in plain text, only resuming the compression when a packet
that gives a positive compression ratio hits.  This would seem
straightforward to implement.  Can anyone enlighten us as to whether
Telebit does this or not?  BTW, the UNIX compress utility will not replace
files that give negative compression ratios.

It might also be interesting to hear how Telebit chose to implement LZW,
i.e. when they flush their code table and if they use variable-length
bit-strings when compression is turned on.  How about it?  Does someone
know?  Are you allowed to tell?

Someone out there with a 'Blazer might test the compression shut-off by
timing the transfer of two files, one compressed and one in its original
format.

-- 
Dragos Ruiu	ruiu@dragos.UUCP
		...alberta!dragos!ruiu	"cat ansi.c | grep -v noalias >proper.c"
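The shut-off rule proposed here is just bookkeeping over recent packet
sizes.  A sketch of that decision logic follows; the class name, the
threshold, and the reset rule are all guesses at the idea in the post, not
Telebit's actual firmware, and it deliberately ignores the table
synchronization question (which the thread takes up next).

```python
class CompressionGovernor:
    """Hypothetical heuristic: disable compression after `bad_limit`
    consecutive packets that expanded; re-enable on the first packet
    that would have compressed."""

    def __init__(self, bad_limit=4):
        self.bad_limit = bad_limit
        self.bad_run = 0          # consecutive packets with negative compression
        self.enabled = True

    def observe(self, raw_len, compressed_len):
        """Record one packet's result; return whether to compress next."""
        if compressed_len >= raw_len:      # negative compression
            self.bad_run += 1
            if self.bad_run >= self.bad_limit:
                self.enabled = False
        else:                              # a packet compressed: resume
            self.bad_run = 0
            self.enabled = True
        return self.enabled
```

Feeding it a run of expanded packets turns compression off; one packet
showing a positive ratio turns it back on.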
tgl@ZOG.CS.CMU.EDU (Tom Lane) (04/20/88)
In article <405@dragos.UUCP>, work@dragos.UUCP (Dragos Ruiu) writes:
> It is very simple to keep track of your on the fly compression rate with
> LZW.  I would assume that as soon as it sees negative compression for X
> packets, the 'Blazer would shut off compression for a while or send the
> next packet in plain text, only resuming the compression when a packet
> that gives a positive compression ratio hits.

One of the few bad features of LZW is that it is very hard to turn it on
and off "on the fly" like this.  The problem is that the compression and
decompression symbol tables must be kept in sync, or decompression will be
impossible.  Compressing a chunk of text nearly always gives rise to new
entries in the table, so you cannot easily compress a packet and then
decide not to send it in compressed form.  (The decompressor must be run
against the compressed packet in order to update its own table.)  There are
ways around this if you have oodles of memory and computing power
(remembering and undoing the table changes, for instance), but I doubt that
Telebit's little 68000 is twiddling its thumbs much...

A simple way of doing it would be to flush both tables whenever you decided
that compression should be temporarily disabled.  You then couldn't turn
compression back on until you had seen some number N of packets that showed
adequate compression *starting from an empty table*.  I suspect that this
would hurt your overall results badly, but I don't have any data to prove
it.  Anybody know of any actual research that's been done along this line?

> It might also be interesting to hear how Telebit
> chose to implement LZW, i.e. when they flush their code table and if they
> use variable length bit-strings when compression is turned on.

I'd be interested to hear that too.

-- 
				tom lane
Internet: tgl@zog.cs.cmu.edu
UUCP: <your favorite internet/arpanet gateway>!zog.cs.cmu.edu!tgl
BITNET: tgl%zog.cs.cmu.edu@cmuccvma
work@dragos.UUCP (Dragos Ruiu) (04/22/88)
In article <1473@pt.cs.cmu.edu>, tgl@ZOG.CS.CMU.EDU (Tom Lane) writes:
> One of the few bad features of LZW is that it is very hard to turn it on and
> off "on the fly" like this.  The problem is that the compression and
> decompression symbol tables must be kept in sync, or decompression will be
> impossible.  Compressing a chunk of text nearly always gives rise to new
> entries in the table, so you cannot easily compress a packet and then decide
> not to send it in compressed form.  (The decompressor must be run against
> the compressed packet in order to update its own table.)
[edit]
> A simple way of doing it would be to flush both tables whenever you decided
> that compression should be temporarily disabled.  You then couldn't turn
> compression back on until you had seen some number N of packets that showed
> adequate compression *starting from an empty table*.  I suspect that this
> would hurt your overall results badly, but I don't have any data to prove it.
> Anybody know of any actual research that's been done along this line?

From Welch's original paper (IEEE Computer, June 1984): he says that LZW
has a 5K adaptation zone (he was using fixed 12-bit codes).  My personal
experience has been that this is a conservative estimate.  On my compressed
packet machine experiment we found that on average it took about 1.5-3K
before compression started to show a win with an empty table.  This
surprised us, but we didn't have time to research it thoroughly, so these
are all ballpark guesstimates.  This was using a fixed 12-bit code with the
average files kicking around a Un*x system.  A variable-bit implementation
should find the 'win zone' much quicker.

If you flushed the tables when compression started going awry, you would
keep compressing but ignore the compressed output.  When the compression
ratio goes back up again to something you like, you flush the tables again
and tell the other side to start decompressing once more.
Using Welch's 5K figure, you lose 5K to find out the initial compression
rate has gone back to an acceptable level, and you lose another 5K before
your compression adapts back to a good level.  So in this worst-case
scenario you send 10K of data that could have been compressed.  I think
that real-world use would see you lose a max of about 4K of data that could
have been compressed.

On the plus side, though, you gain automagic compression shutoff, and the
benefit in throughput when your data stream sometimes contains
pre-compressed data is a *big* win.  To get this win you would have to keep
track of the compression ratios of the last N packets, and only turn off
compression when a history of badly compressed packets shows up.  If your
history window is too small, you may take too many 4K 'hits', making your
compression less effective.  But when a xxxK news batch goes through, the
modem is going to have many chances to realize compression is not a good
idea.

I'll agree with Tom that perhaps asking this of Telebit might be too much,
as the problem is more difficult than the off-the-cuff analysis would
indicate.  Maybe they can do it later...  (And maybe they can do V.32 and
FAX, and maybe the Usenet deal will come back, and maybe I'll someday
afford one :-)

-- 
Dragos Ruiu	ruiu@dragos.UUCP
		...alberta!dragos!ruiu	"cat ansi.c | grep -v noalias >proper.c"
John_M@spectrix.UUCP (John Macdonald) (04/26/88)
In article <406@dragos.UUCP> work@dragos.UUCP (Dragos Ruiu) writes:
:In article <1473@pt.cs.cmu.edu>, tgl@ZOG.CS.CMU.EDU (Tom Lane) writes:
:> One of the few bad features of LZW is that it is very hard to turn it on and
:> off "on the fly" like this.  The problem is that the compression and
:> decompression symbol tables must be kept in sync, or decompression will be
:> impossible.  Compressing a chunk of text nearly always gives rise to new
:> entries in the table, so you cannot easily compress a packet and then decide
:> not to send it in compressed form.  (The decompressor must be run against
:> the compressed packet in order to update its own table.)
:[edit]
: If you flushed the tables when compression started going awry, you would
: keep compressing but ignoring the compressed output.  When the
: compression ratio goes back up again to something you like, you flush
: the tables again and tell the other side to start decompressing once more.
:[edit]

There is a good solution to this!  For each chunk, do the compression.
If the result is smaller, use it; otherwise send the original.  Precede
this choice with a one-bit flag to show whether the next chunk is
compressed.
(In a modem you can get away with sending a one-bit flag, but even if
that's too much bother, a whole byte isn't very expensive.)  In both
cases, you update the tables as if you did the compression.

In the receiver, you read and check the flag.  If it indicates compression,
then uncompress as usual.  If not, *COMPRESS* the incoming data to keep
the tables up-to-date, but keep the uncompressed data to pass on and throw
away the compressed data.

For example:

    original text:  ABCD EFGH
    compress(ABCD,T1) = (MNO,T2), compress(EFGH,T2) = (PQRST,T3)

    source sends:   yMNO nEFGH      (y and n are the flags)

    receiver gets this and does:
        uncompress(MNO,T1) = (ABCD,T2)
        compress_and_ignore(EFGH,T2) = (EFGH,T3)

    receiver outputs: ABCD EFGH

-- 
John Macdonald		UUCP: {mnetor,utzoo}!spectrix!jmm
tgl@ZOG.CS.CMU.EDU (Tom Lane) (04/26/88)
In article <580@spectrix.UUCP>, John_M@spectrix.UUCP (John Macdonald) writes:
> There is a good solution to this!  For each chunk, do the compression.
> If the result is smaller, use it; otherwise send the original.  Precede
> this choice with a one-bit flag to show whether the next chunk is compressed.
> [...]  In both cases, you
> update the tables as if you did the compression.
>
> In the receiver, you read and check the flag.  If it indicates compression,
> then uncompress as usual.  If not, *COMPRESS* the incoming data to keep
> the tables up-to-date, but keep the uncompressed data to pass on and throw
> away the compressed data.

Yeah, that's the obvious thing to do.  The trouble is that compression
and decompression tables are not the same thing; if you think about it,
they perform reverse functions, so it's not surprising that they need
to be indexed differently.  For LZW compression like the TrailBlazer
uses, the compression table is usually a hash table indexed by
<previous symbol, current character>; while the decompression table can
be a simple array indexed by <current symbol>.

I haven't come up with any better way of doing this idea than having
the receiver maintain BOTH compression and decompression tables,
updating both tables whenever a packet is received (using either the
compression or decompression algorithm, as required).  This is a
nontrivial hit in CPU terms, but the place where it really hurts you is
memory space.  A compression table for 12-bit-symbol LZW (TB algorithm)
is 24KB or more, depending on what kind of hash table load factor
you're willing to tolerate (= more CPU hit...).  A TrailBlazer hasn't
got that kind of memory to spare, I would imagine.

On the other hand, if you are talking compression/decompression in the
mainframe rather than the modem, and you've got oodles of memory space
and CPU cycles to play with, you could do it.
I guess that's another argument for my original point (which was that the
modem is the wrong place to do the compression).

Anybody know anything about cheap ways to build doubly-indexed tables??
You could become the next initial in LZW!

PS: to really deserve that initial you should also figure out a really
good yet cheap algorithm for deciding when to flush the tables...  I seem
to recall some discussion of that in the COMPRESS mailing list, but I
don't think it was ever solved satisfactorily.

-- 
				tom lane
Internet: tgl@zog.cs.cmu.edu
UUCP: <your favorite internet/arpanet gateway>!zog.cs.cmu.edu!tgl
BITNET: tgl%zog.cs.cmu.edu@cmuccvma
work@dragos.UUCP (Dragos Ruiu) (04/28/88)
In article <1529@pt.cs.cmu.edu>, tgl@ZOG.CS.CMU.EDU (Tom Lane) writes:
> In article <580@spectrix.UUCP>, John_M@spectrix.UUCP (John Macdonald) writes:
> [Talks about compressing incoming uncompressed data to keep LZW tables good]
> Yeah, that's the obvious thing to do.  The trouble is that compression
> and decompression tables are not the same thing; if you think about it,
> they perform reverse functions, so it's not surprising that they need
> to be indexed differently.  For LZW compression like the TrailBlazer
> uses, the compression table is usually a hash table indexed by
> <previous symbol, current character>; while the decompression table can
> be a simple array indexed by <current symbol>.
>
> I haven't come up with any better way of doing this idea than having
> the receiver maintain BOTH compression and decompression tables,
> updating both tables whenever a packet is received (using either the
> compression or decompression algorithm, as required).  This is a
> nontrivial hit in CPU terms, but the place where it really hurts you is
> memory space.  A compression table for 12-bit-symbol LZW (TB algorithm)
> is 24KB or more, depending on what kind of hash table load factor
> you're willing to tolerate (= more CPU hit...).  A TrailBlazer hasn't
> got that kind of memory to spare, I would imagine.

First, a couple of points:

 - When you decide to turn compression off, keeping the tables updated is
   probably of little significance, because the codes being built are
   useless.
 - Last I heard, the Telebit was reputed to have 512K of RAM.

You can keep a compression table and a decompression table simultaneously.
Simply have the hash table indicate entries in the (de)compression table
instead of keeping the code and the follow.  (This saves you memory even
if you don't care about keeping a decompression table.)
The disadvantage of this approach is that you take an extra indirection hit
on every hash table lookup: the hash table gives you the entry number,
which you must then look up in the entry table.

A minimal memory setup for bi-directional compression is about 32K:

    10K  decompression table:     4096 * (1.5-byte code + 1-byte follow)
    10K  compression entry table:   "         "
    12K  hash table:              8192 * 1.5-byte compression table entry numbers

This has a 50% hash table occupancy.  And if you wanted to do 'compression'
on the decompressed input like John Macdonald suggested, you would have to
add another 12K hash table that points to entries in the decompression
table.

-- 
Dragos Ruiu	ruiu@dragos.UUCP	"One can certainly imagine the myriad of
		...alberta!dragos!ruiu	 uses for a hand-held iguana maker." -Hobbes
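The shared entry-table layout above can be sketched directly: each entry
for a code >= 256 records only its prefix code and follow byte, the hash
index maps (prefix, follow) to an entry number for the compression
direction, and decompression walks the prefix chain and reverses it.  This
is an illustrative Python model of the data structure (names and a Python
dict standing in for the 12K open-addressed hash are my own choices, not
Telebit's).

```python
class SharedTable:
    """One LZW string table serving both directions.  Per the 32K budget
    above: 4096 entries * (1.5-byte code + 1-byte follow) ~= 10K for the
    entry table, plus 8192 hash slots * 1.5 bytes ~= 12K for the index."""
    MAX_CODE = 4096

    def __init__(self):
        self.prefix = []     # entry n stores its prefix code at prefix[n-256]
        self.follow = []     # ...and its final byte at follow[n-256]
        self.index = {}      # the "hash table": (prefix, follow) -> code

    def add(self, prefix_code, follow_byte):
        """Create the entry for string(prefix_code) + follow_byte."""
        code = 256 + len(self.prefix)
        if code < self.MAX_CODE:
            self.prefix.append(prefix_code)
            self.follow.append(follow_byte)
            self.index[(prefix_code, follow_byte)] = code
        return code

    def lookup(self, prefix_code, follow_byte):
        """Compression direction: hash probe, then one extra indirection
        into the entry table (here the indirection is implicit)."""
        return self.index.get((prefix_code, follow_byte))

    def expand(self, code):
        """Decompression direction: walk the prefix chain and reverse."""
        out = bytearray()
        while code >= 256:
            out.append(self.follow[code - 256])
            code = self.prefix[code - 256]
        out.append(code)                  # chain root is a literal byte
        return bytes(out[::-1])
```

Building "ab" and then "abc" on top of it shows both directions answering
from the same entries, which is the memory saving being claimed.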