[comp.archives] [rec.ham-radio.packet] LZW compression implementation available

klemets@sics.se (Anders Klemets) (10/03/90)

Archive-name: lzw-nos/02-Oct-90
Original-posting-by: klemets@sics.se (Anders Klemets)
Original-subject: LZW compression implementation available
Archive-site: sics.se [192.16.123.90]
Archive-directory: /archive/packet/ka9q/nos
Reposted-by: emv@math.lsa.umich.edu (Edward Vielmetti)

This message is to announce the availability of version 3 of my LZW
compression protocol. The current implementation is for the KA9Q
Internet Package (NOS) where it can operate on TCP, NET/ROM and AX.25
connections. But it should be possible to port this implementation to
other software, such as dedicated BBS'es.

The compression algorithm works similar to the way GIF files are
compressed. The major difference between straight LZW and GIF is that
GIF uses a variable size for the codewords. My implementation also
uses variable size codewords.

The compression routine looks at data as 8 bit words and begins coding
input strings as 9 bit codewords. As the codewords get used up, it
increases the size to 10 bits, and so on. When the maximum codeword
has been reached, it clears the code table and restarts with 9 bit
long codewords. 

It is possible for an application in NOS to specify the maximum number
of bits to use for codewords, and hence, what the maximum codeword
will be. The valid range is 9 to 16 bits. At 16 bits, there can be
65535 different codewords. That might not seem like very much at a
first glance, but at that point NOS will use 192 kbytes to keep the
code table, which should make things pretty tight.

I have tried to optimize my implementation in a memory efficient way,
that will however make the compression somewhat slow. But since it is
intended for 1200 Baud AX.25 links or SLIP links, I guess it will not
matter all that much. For those who are concerned about processing
speed, it is possible to specify an alternate code table algorithm.
It computes faster, but needs more memory. A code table using the
"compact" algorithm may grow to about (2^bits * 3) bytes, where 'bits'
is the maximum number of bits a codeword may use. The "fast" algorithm
may have to use up to (2^bits * 5) bytes.

Which algorithm to use, and how to set the 'bits' limit, is a policy
decision. A larger 'bits' value increases the efficiency of the
compression algorithm, but it uses more memory and takes longer to
compute. The "fast" coding algorithm reduces the computing time, but
needs even more memory.

LZW processes groups of characters when encoding. This is no problem
when compressing files, but on a network connection there are times
when this approach is not practical. For instance, when you press the
return key, you probably want what you have typed to be sent
immediately. But LZW wants to hold on to the data because it needs to
know what you intend to type after you have pressed the return key, so
that it can compress what you typed more efficiently.
I have solved this by interrupting coding whenever the output buffer
is flushed (by pressing the return key for instance) and signaling this
with a special flush codeword.

LZW is somewhat inefficient on interactive terminal traffic, since
there is lots of pressing of the return key. It works better for batch
traffic, such as mail forwarding, where packets are longer than 80
characters.
It is especially inefficient when one operates in remote echo mode
("full duplex") because then every coded character has to be followed
by the flush codeword. This is particular mode of operation makes my
LZW two or three times less efficient than without "compression."

There is also a special codeword ZCC which will cause an immediate
clearing of the recievers code table and setting of the current bit
size to 9 bits. This could be used to preempt low priority transfers
when NOS needs more memory for other tasks.

The LZW implementation sits in the socket library, and works on top of
any stream based socket, such as TCP, AX.25 or NETROM.
All the application needs to do is to make the following call:

		 lzwinit(s,bits,mode);

where 's' is the socket descriptor. 'bits' is an integer between 9 and
16 specifying the maximum number of bits to use for each codeword.
The compression code table will always use this or a lower size, but
'bits' should only be regarded as a recommendation with respect to the
decompression code table. 'mode' specifies the compression algorithm
and should be either LZWCOMPACT or LZWFAST. Data should be sent and
recevied with usprintf(), usputc(), recvchar(), recvline(), etc. Calls
to send(), send_mbuf(), recv() or recv_mbuf() will bypass the coding,
which obviously will cause strange effects.

One of the better uses for the LZW compression is to transfer mail. It
should be noted that a connection for transfering mail might be left
connected even while there is no mail to transfer, to save the code
table. It would be very inefficient to transfer only one single piece
of mail for each connection, since the LZW code table has to be
rebuilt each time.

In the case of transfering mail with SMTP, I see two different
approaches. One is to have a special ZSMTP server that runs on a
different TCP port. The other is to have the client connect without
using compression, and then issue a special command, such as "XLZW."
If it is accepted, transfering of compressed mail can begin. However,
some SMTP server implementations panic when they get an unknown
command and close the connection, instead of sending a "500 command
unrecognized" response.

The source for the LZW implementation is available with ftp from
sics.se. The filename is archive/packet/ka9q/nos/lzw.arc. The files in
this archive are supposed to be merged with a recent version of the
NOS source code which is available from thumper.bellcore.com.
The archive contains no applications using LZW however, modifying
Telnet or SMTP to use the compression is left as an exercise for the
reader.

Anders, SM0RGV
klemets@sics.se