[net.news] Compression routines to reduce mail & news costs

rhorn@infinet.UUCP (Rob Horn) (06/23/86)

I am preparing a set of compression routines aimed at reducing
the CPU and disk requirements of news and mail transfer.  They
will be submitted to mod.sources in late July.
 
These routines are based upon the observation that both news and
mail are formed of two parts:
    <header>  -  which *must* be modified by each relay site
    <body>    -  which can be transferred as is.
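
For example, a short article (contents made up for illustration)
is just header lines, a blank line, and then the body:

    Newsgroups: net.news
    From: someone@somewhere.UUCP
    Subject: an example

    The body starts after the first blank line and can travel from
    the poster to the final reader byte-for-byte unchanged.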
 
The present compression scheme compresses many messages at one
time, and all of them must be decompressed at each site so that
the headers can be processed.  The new routines allow very fast
compression and decompression of the headers alone.  The body can
be compressed once at the sending site and left untouched until
it reaches its destination.  This lets the queues stay compressed
and keeps compression costs down.
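
To make the intended relay step concrete, here is a minimal
sketch in C.  It is my own illustration, not code from the
forthcoming package: the header handling below is a plain copy
standing in for the pre-set header decompress/edit/recompress
step, and the body is passed through exactly as received.

    #include <stdio.h>

    /* Relay sketch: expand and edit only the header, pass the body
     * through untouched (still compressed if the sender compressed
     * it).  The header "compression" here is a plain line-by-line
     * copy; the real routines would replace it. */
    int main(void)
    {
        char line[1024];
        int c;

        /* Header: ends at the first blank line. */
        while (fgets(line, sizeof line, stdin) != NULL) {
            if (line[0] == '\n') {
                /* Relay edits (e.g. updating Path:) would be made
                 * here, between decompressing and recompressing. */
                fputs(line, stdout);
                break;
            }
            fputs(line, stdout);
        }

        /* Body: copied byte-for-byte, never decompressed here. */
        while ((c = getchar()) != EOF)
            putchar(c);
        return 0;
    }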
 
The problem posed by this approach is that the L-Z algorithm of
``compress'' is effective only on large files (5,000+ bytes).  On
very small files - like headers and short messages - it does very
poorly because there is not enough data from which to build its
compression strings.  Similarly, ``compact'' is adaptive and is
also very ineffective on small files.  Only a technique that is
pre-set for specific file characteristics can give good results
on small files.  I am preparing two pre-set algorithms: one for
headers (assuming RFC headers plus the Usenet extensions) and one
for English text.
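
As an illustration of what ``pre-set'' means here, consider the
toy encoder below.  It is my own sketch, not Rob's algorithm: the
table is fixed in advance and shared by encoder and decoder, so
nothing has to be learned from a short input, and each match
costs a single output byte.

    #include <stdio.h>
    #include <string.h>

    /* Toy pre-set encoder: a table fixed in advance replaces the
     * adaptive dictionary L-Z would have to build from the input
     * itself.  Matches become single bytes with the high bit set,
     * which cannot collide with literals only because the input is
     * restricted to 7-bit ASCII. */
    static const char *table[] = {
        "Newsgroups: ", "Subject: ", "References: ", "Re: ",
        " the ", " and ", "ing ", "tion",
    };
    #define NCODES ((int)(sizeof table / sizeof table[0]))

    int main(void)
    {
        char buf[8192];
        char *p;
        int i, n;

        n = (int)fread(buf, 1, sizeof buf - 1, stdin);
        buf[n] = '\0';
        for (p = buf; *p != '\0'; ) {
            for (i = 0; i < NCODES; i++) {
                int len = (int)strlen(table[i]);
                if (strncmp(p, table[i], len) == 0) {
                    putchar(0x80 | i);     /* one byte per match   */
                    p += len;
                    break;
                }
            }
            if (i == NCODES)
                putchar(*p++);             /* literal, copied as is */
        }
        return 0;
    }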
 
I anticipate that these routines will be used in a system that
prepends compression information to the mail or news block,
identifying the compression method used for the header and for
the contents.  This will let senders of large messages,
especially source code and encoded binaries, continue to use the
L-Z ``compress''.  My routines will help somewhat on source and
not at all on binaries, and on any large message L-Z will give
superior compression because it adapts to the message's
characteristics.
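
Purely as a guess at what such a prefix might look like (the real
format is Rob's to define), a relaying site could be handed
something like

    #compressed  hdr=preset-hdr  body=lz-compress
    <compressed header block>
    <compressed body block>

where the first line tells each relay which decompressor to apply
to the header, and tells the destination which one, if any, to
apply to the body.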
 
The routines that I am preparing provide the following
compressions:
 
  headers  -  30-50% compression (rather sensitive to the amount
                of subject and route information)
  English  -  30-40% compression
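
(Taking ``N% compression'' to mean the output is N% smaller than
the input: a 600-byte header compressed 40% goes out as roughly
360 bytes, and a 2,000-byte English body compressed 35% as
roughly 1,300 bytes.  The figures are illustrative only.)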
 
They are also tolerant of transmission garbles, in that a
destroyed byte damages only a finite (usually small) stretch of
the file; after that, decompression recovers and continues
without error.  They do impose one major restriction - the input
must be limited to 7-bit ASCII.  Removing that restriction would
have a major impact on both speed and portability.
 
They will be in three forms:
   -  Internal string compress/decompress
   -  I/O routines with interfaces very similar to getc and putc
   -  compress/decompress filters.
There will be some means of adjusting the compression algorithm,
which is driven by tables of common strings, common letter pairs,
and a few special cases.
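
To make that list concrete, here is one guess at the shapes the
three forms might take.  The names and signatures are mine,
offered only as a sketch of the kind of interface, not the actual
one:

    #include <stdio.h>

    /* 1. Internal string compress/decompress: caller supplies both
     *    buffers; the return value here is the output length. */
    int str_squeeze(const char *in, int inlen, char *out, int outmax);
    int str_expand (const char *in, int inlen, char *out, int outmax);

    /* 2. I/O routines patterned after getc and putc, reading or
     *    writing compressed data through an opaque stream handle. */
    typedef struct sqstream SQSTREAM;
    SQSTREAM *sq_open (FILE *fp, const char *mode);   /* "r" or "w"  */
    int       sq_getc (SQSTREAM *s);                  /* like getc() */
    int       sq_putc (int c, SQSTREAM *s);           /* like putc() */
    int       sq_close(SQSTREAM *s);

    /* 3. Stand-alone filters, usable anywhere a pipe fits:
     *        squeeze   < article.hdr > article.hdr.z
     *        unsqueeze < article.hdr.z > article.hdr              */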
 
Coordinating compression algorithms like this will be *very
hard*.  In many cases the compression control information is
larger than the file itself, so sending it along with the file
defeats the purpose.  I see no easy way out.  If there are
changes that could be made to help this process, please let me
know.
 
For more details, or to comment on the design and interfaces,
please send mail.
-- 
				Rob  Horn
	UUCP:	...{decvax, seismo!harvard}!wanginst!infinet!rhorn
	Snail:	Infinet,  40 High St., North Andover, MA