klemets@sics.se (Anders Klemets) (10/03/90)
Archive-name: lzw-nos/02-Oct-90 Original-posting-by: klemets@sics.se (Anders Klemets) Original-subject: LZW compression implementation available Archive-site: sics.se [192.16.123.90] Archive-directory: /archive/packet/ka9q/nos Reposted-by: emv@math.lsa.umich.edu (Edward Vielmetti) This message is to announce the availability of version 3 of my LZW compression protocol. The current implementation is for the KA9Q Internet Package (NOS) where it can operate on TCP, NET/ROM and AX.25 connections. But it should be possible to port this implementation to other software, such as dedicated BBS'es. The compression algorithm works similar to the way GIF files are compressed. The major difference between straight LZW and GIF is that GIF uses a variable size for the codewords. My implementation also uses variable size codewords. The compression routine looks at data as 8 bit words and begins coding input strings as 9 bit codewords. As the codewords get used up, it increases the size to 10 bits, and so on. When the maximum codeword has been reached, it clears the code table and restarts with 9 bit long codewords. It is possible for an application in NOS to specify the maximum number of bits to use for codewords, and hence, what the maximum codeword will be. The valid range is 9 to 16 bits. At 16 bits, there can be 65535 different codewords. That might not seem like very much at a first glance, but at that point NOS will use 192 kbytes to keep the code table, which should make things pretty tight. I have tried to optimize my implementation in a memory efficient way, that will however make the compression somewhat slow. But since it is intended for 1200 Baud AX.25 links or SLIP links, I guess it will not matter all that much. For those who are concerned about processing speed, it is possible to specify an alternate code table algorithm. It computes faster, but needs more memory. A code table using the "compact" algorithm may grow to about (2^bits * 3) bytes, where 'bits' is the maximum number of bits a codeword may use. The "fast" algorithm may have to use up to (2^bits * 5) bytes. Which algorithm to use, and how to set the 'bits' limit, is a policy decision. A larger 'bits' value increases the efficiency of the compression algorithm, but it uses more memory and takes longer to compute. The "fast" coding algorithm reduces the computing time, but needs even more memory. LZW processes groups of characters when encoding. This is no problem when compressing files, but on a network connection there are times when this approach is not practical. For instance, when you press the return key, you probably want what you have typed to be sent immediately. But LZW wants to hold on to the data because it needs to know what you intend to type after you have pressed the return key, so that it can compress what you typed more efficiently. I have solved this by interrupting coding whenever the output buffer is flushed (by pressing the return key for instance) and signaling this with a special flush codeword. LZW is somewhat inefficient on interactive terminal traffic, since there is lots of pressing of the return key. It works better for batch traffic, such as mail forwarding, where packets are longer than 80 characters. It is especially inefficient when one operates in remote echo mode ("full duplex") because then every coded character has to be followed by the flush codeword. This is particular mode of operation makes my LZW two or three times less efficient than without "compression." There is also a special codeword ZCC which will cause an immediate clearing of the recievers code table and setting of the current bit size to 9 bits. This could be used to preempt low priority transfers when NOS needs more memory for other tasks. The LZW implementation sits in the socket library, and works on top of any stream based socket, such as TCP, AX.25 or NETROM. All the application needs to do is to make the following call: lzwinit(s,bits,mode); where 's' is the socket descriptor. 'bits' is an integer between 9 and 16 specifying the maximum number of bits to use for each codeword. The compression code table will always use this or a lower size, but 'bits' should only be regarded as a recommendation with respect to the decompression code table. 'mode' specifies the compression algorithm and should be either LZWCOMPACT or LZWFAST. Data should be sent and recevied with usprintf(), usputc(), recvchar(), recvline(), etc. Calls to send(), send_mbuf(), recv() or recv_mbuf() will bypass the coding, which obviously will cause strange effects. One of the better uses for the LZW compression is to transfer mail. It should be noted that a connection for transfering mail might be left connected even while there is no mail to transfer, to save the code table. It would be very inefficient to transfer only one single piece of mail for each connection, since the LZW code table has to be rebuilt each time. In the case of transfering mail with SMTP, I see two different approaches. One is to have a special ZSMTP server that runs on a different TCP port. The other is to have the client connect without using compression, and then issue a special command, such as "XLZW." If it is accepted, transfering of compressed mail can begin. However, some SMTP server implementations panic when they get an unknown command and close the connection, instead of sending a "500 command unrecognized" response. The source for the LZW implementation is available with ftp from sics.se. The filename is archive/packet/ka9q/nos/lzw.arc. The files in this archive are supposed to be merged with a recent version of the NOS source code which is available from thumper.bellcore.com. The archive contains no applications using LZW however, modifying Telnet or SMTP to use the compression is left as an exercise for the reader. Anders, SM0RGV klemets@sics.se