rhorn@infinet.UUCP (Rob Horn) (06/23/86)
I am preparing a set of compression routines aimed at reducing the CPU
and disk requirements for news and mail transfer.  They will be
submitted to mod.sources in late July.  These routines are based on the
observation that both news and mail are formed of two parts:

	<header> - which *must* be modified by each relay site
	<body>   - which can be transferred as is

The present compression system compresses many messages at one time,
and these must all be decompressed at each site so that the headers
can be processed.  The new routines allow very fast compression and
decompression of headers.  The body can be compressed at the sending
site and left alone until it reaches its destination.  This lets the
queues remain compressed and keeps down the compression costs.

The problem posed by this approach is that the L-Z algorithm of
``compress'' is effective only when compressing large files (5,000+
bytes).  For very small files - like headers and short messages - it
does very poorly because it has insufficient data to build its
compression strings.  Similarly, ``compact'' is also adaptive and very
ineffective on small files.  Only a technique that is pre-set for
specific file characteristics can give good results on small files.
I am preparing two pre-set algorithms: one for headers, assuming
RFC + Usenet extensions, and one for English text.

I anticipate that these routines will be used in a system that
prepends compression information to the mail or news block,
identifying the compression method used for both header and contents.
This will allow senders of large messages, especially source code and
encoded binaries, to continue to use the L-Z ``compress''.  My
routines will help somewhat on source and not at all on binaries.
And on any large message L-Z will provide superior compression,
because it does adapt to the message characteristics.
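To make the idea concrete, here is a minimal sketch of the kind of
table-driven, pre-set coding described above.  The table entries, byte
codes, and routine names are all invented for illustration - they are
not the actual routines.  Because the input is restricted to 7-bit
ASCII, the high bit of an output byte is free to mark a one-byte code
standing for a common string or letter pair from a pre-set table:

```c
/* Sketch only: a made-up table of common header strings and letter
 * pairs.  Each table entry is replaced by a single byte with the
 * high bit set; all other bytes pass through as literals. */
#include <string.h>

static const char *table[] = {
    /* common header strings */
    "From: ", "Newsgroups: ", "Subject: ", "Path: ",
    /* common letter pairs */
    "th", "he", "in", "er", "an", "re"
};
#define NCODES (sizeof table / sizeof table[0])

/* Encode 7-bit ASCII src into dst; returns the output length.
 * The longest-match scan is deliberately naive. */
int tbl_encode(const char *src, unsigned char *dst)
{
    int n = 0;
    while (*src) {
        size_t i, best = NCODES, bestlen = 0;
        for (i = 0; i < NCODES; i++) {
            size_t len = strlen(table[i]);
            if (len > bestlen && strncmp(src, table[i], len) == 0) {
                best = i;
                bestlen = len;
            }
        }
        if (best < NCODES) {
            dst[n++] = (unsigned char)(0x80 + best); /* one code byte */
            src += bestlen;
        } else {
            dst[n++] = (unsigned char)*src++;        /* literal byte */
        }
    }
    dst[n] = 0;
    return n;
}

/* Decode is a table lookup.  A garbled byte can damage only its own
 * literal or table entry, so decoding resynchronizes on the very
 * next byte. */
int tbl_decode(const unsigned char *src, char *out)
{
    int n = 0;
    for (; *src; src++) {
        if (*src & 0x80) {
            const char *s = table[*src - 0x80];
            strcpy(out + n, s);
            n += strlen(s);
        } else {
            out[n++] = (char)*src;
        }
    }
    out[n] = 0;
    return n;
}
```

A pre-set table like this costs nothing to transmit, which is why it
can win on small files where an adaptive method has no data to learn
from - and also why keeping the tables coordinated between sites is
the hard part.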
The routines that I am preparing provide the following compression:

	headers - 30-50% compression (rather sensitive to the amount
		  of subject and route information)
	English - 30-40% compression

They are also insensitive to transmission garbles: a destroyed byte
destroys only a finite (usually small) stretch of the file, after
which decompression resumes without errors.  They do impose one major
restriction - their input must be limited to 7-bit ASCII.  Removing
this restriction would have a major impact on speed and portability.

They will come in three forms:

	- internal string compress/decompress routines
	- I/O routines with interfaces very similar to getc and putc
	- compress/decompress filters

There will be a means of adjusting some of the compression algorithm.
It is driven by tables of common strings, common letter pairs, and a
few special cases.  Coordinating compression algorithms like this
will be *very hard*.  In many cases the compression control
information is larger than the file, so it defeats the purpose to
send it along with the file.  I see no easy way out.  If there are
changes that could be made to help this process, please let me know.

For more details, or to comment on the design and interfaces, please
send mail.
-- 
Rob Horn
UUCP:	...{decvax, seismo!harvard}!wanginst!infinet!rhorn
Snail:	Infinet, 40 High St., North Andover, MA