EIBEN@dec-marlboro.ARPA (Bernie Eiben - LDP Workstations) (06/09/86)
Since somebody asked about the differences between Huffman encoding and ARC [or what's the difference between a cherry and an apple pie] - I decided to give my 'two cents' worth. There are LONG articles and dissertations floating around -- but who reads them...

A mini-introduction into ARC

ARC is in some ways comparable to LSWEEP or LU in that it is a PACKAGING method. Files with extension ARC contain a 'marker', followed by file information, file-data, file information, file-data etc. {a rough sketch of the 'file information' layout follows below}. ARC's packaging method guarantees NO GROWTH during storage, i.e. file contents are analyzed before storage and stored either

  1. AS IS {typically files in the 1 to 200 byte range}
  2. with repeat-compression {same range as above}
  3. using Huffman encoding of 8-bit bytes {sometimes executables}
  4. using Lempel-Ziv-Welch encoding {all others}

ARC frees the user from 'worrying' about storage mechanisms and delivers practically all needed services {extract, store, list, type, check, execute and re-compress using the 'latest' compression technique}. ARC is 'downward' compatible. It is currently heavily used in the MSDOS/PCDOS world, although usage in RCPM systems is starting with the availability of a fast DE-ARCer {a CP/M version of ARC is 'in the works' by Bob Freed}. ARC belongs in the category of "Share-ware" or "Free-ware" - it is copyrighted by System Enhancement Associates {source language C, system MSDOS} - UnARC was written by Bob Freed for the Public Domain {source language assembler, system CP/M}.

A mini comparison of Huffman Encoding and Lempel-Ziv-Welch {LZW} techniques

Huffman Encoding expresses each storage unit as a variable-length pointer into a frequency-ordered tree. Compression is achieved by choosing a 'native' storage unit {where repetitions are bound to occur} and {on the average} expressing the more frequent storage units with shorter pointers [although less frequent units may be represented by longer pointers]. The encoding process needs 'two passes', i.e. first reading all units {under CP/M and MSDOS 8-bit bytes} to build the frequency-ordered tree {also called the 'dictionary'}, then translating all units into their respective pointer values. Original filename, dictionary and pointer values are stored - by convention the SECOND character of the filename extension is changed to Q - a reminder of a 'squeezed' file.

LZW expresses strings of 8-bit bytes by pointers into an 'ordered' string-table. The rules for 'constructing' the table are reversible, so that Compressor and De-Compressor can 'build' their tables 'on-the-fly'. LZW is 'one-pass', although the achieved speed is VERY dependent on language implementation and available physical memory [in general more than 90% of the time is spent in 'hashing' and table searching]. Although early implementations of LZW seemed to need more than 64K of physical memory, current enhancements make a maximum of 2**11 table entries sufficient to handle all cases. State-of-the-art implementations check the 'compression ratio' on the fly - and rebuild the table if the ratio degrades beyond a minimum, or rebuild the table on table overflow. {Small C sketches of both techniques follow below.}

Typical Huffman compression ratios hover around 33% {the compressed file is 66% of the original, whereby 'text' typically compresses a little better, and 'executables' less}. Typical LZW compression ratios average 55% - the highest compression is achieved with pixel information {values of 90% are typical} - followed by 'text' at 50% and executables around 20%.
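For the curious, here is roughly what the per-file header looks like {a sketch from memory - the field names are mine and the SEA sources are the real reference for order and widths}:

/*
 * Rough sketch {from memory} of the 'file information' block that
 * precedes each piece of file-data inside an ARC file - check the
 * SEA sources for the authoritative field order and widths.  A real
 * de-ARCer reads this byte-by-byte; a C struct would get padded.
 */
struct arc_entry {
    unsigned char marker;       /* archive marker byte               */
    unsigned char method;       /* 0 = end of archive, low values =
                                   stored / repeat-compressed /
                                   Huffman 'squeezed', higher values =
                                   LZW 'crunched'                     */
    char          name[13];     /* original filename, NUL-terminated */
    unsigned long compsize;     /* bytes of file-data that follow    */
    unsigned int  date, time;   /* MSDOS date and time stamps        */
    unsigned int  crc;          /* 16-bit CRC of the original data   */
    unsigned long origsize;     /* size before compression           */
};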
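To make the 'two-pass' Huffman idea concrete, here is a small C sketch {an illustration only - NOT the code SQ or ARC actually use, and all names are mine}. Pass one counts the byte frequencies and builds the tree by repeatedly merging the two least-frequent nodes; a real squeezer would then re-read the file and emit the variable-length pointers. This version just prints each byte's pointer length and the projected data size:

#include <stdio.h>

#define SYMS 256

static struct node { long freq; int left, right, live; } t[2 * SYMS];
static int len[SYMS];                  /* pointer length per byte value */

static void walk(int n, int depth)     /* leaf depth = pointer length */
{
    if (t[n].left < 0) { len[n] = depth ? depth : 1; return; }
    walk(t[n].left,  depth + 1);
    walk(t[n].right, depth + 1);
}

int main(int argc, char **argv)
{
    int i, c, a, b, nnodes = SYMS, root = -1;
    long bits = 0;
    FILE *fp;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: huff file\n");
        return 1;
    }

    /* pass one: frequency count {'building the dictionary'} */
    while ((c = getc(fp)) != EOF)
        t[c].freq++;
    fclose(fp);
    for (i = 0; i < SYMS; i++) {
        t[i].left = t[i].right = -1;   /* the bytes are the leaves */
        t[i].live = (t[i].freq > 0);
    }

    /* merge the two least-frequent live nodes until one is left */
    for (;;) {
        a = b = -1;
        for (i = 0; i < nnodes; i++) {
            if (!t[i].live) continue;
            if (a < 0 || t[i].freq < t[a].freq) { b = a; a = i; }
            else if (b < 0 || t[i].freq < t[b].freq) b = i;
        }
        if (b < 0) { root = a; break; }    /* single node = the root */
        t[a].live = t[b].live = 0;
        t[nnodes].freq = t[a].freq + t[b].freq;
        t[nnodes].left = a;
        t[nnodes].right = b;
        t[nnodes].live = 1;
        nnodes++;
    }

    /* frequent bytes sit near the root, i.e. get short pointers */
    if (root >= 0) walk(root, 0);
    for (i = 0; i < SYMS; i++)
        if (t[i].freq) {
            printf("byte %3d: count %6ld, pointer length %2d bits\n",
                   i, t[i].freq, len[i]);
            bits += t[i].freq * len[i];
        }
    printf("projected data size: %ld bytes {plus the dictionary}\n",
           (bits + 7) / 8);
    return 0;
}

Run over a text file, the common letters end up with the short pointers - which is where the compression comes from.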
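And the 'one-pass' table building for LZW, again as a bare-bones illustration {real implementations add hashing, variable-width output codes and the rebuild-on-ratio-drop logic mentioned above; names are mine and codes are written as decimal text just to show the stream of table pointers}:

#include <stdio.h>

#define TABSIZE 2048            /* the 2**11 entries mentioned above */

static int prefix[TABSIZE];     /* code of the string minus its last byte */
static int suffix[TABSIZE];     /* last byte of the string                */
static int nextcode = 256;      /* codes 0..255 are the single bytes      */

/* linear search; real implementations hash {hence the '90% of time'} */
static int lookup(int w, int c)
{
    int i;
    for (i = 256; i < nextcode; i++)
        if (prefix[i] == w && suffix[i] == c)
            return i;
    return -1;
}

int main(void)
{
    int c, w, code;
    long in, out = 0;

    if ((w = getchar()) == EOF)
        return 0;                       /* empty input */
    for (in = 1; (c = getchar()) != EOF; in++) {
        if ((code = lookup(w, c)) >= 0) {
            w = code;                   /* string w+c is already known */
        } else {
            printf("%d ", w);           /* emit pointer into the table */
            out++;
            if (nextcode < TABSIZE) {   /* remember w+c for later      */
                prefix[nextcode] = w;
                suffix[nextcode] = c;
                nextcode++;
            }   /* else: this is where the table rebuild would happen  */
            w = c;
        }
    }
    printf("%d\n", w);
    out++;
    fprintf(stderr, "%ld bytes in, %ld table pointers out\n", in, out);
    return 0;
}

Note that no dictionary is stored with the output - the de-compressor applies the same 'remember prefix plus byte' rule to every code it reads and so rebuilds an identical table 'on-the-fly'.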
Although the original 'paper' on LZW suggested implementations between the CPU and peripheral devices [terminals, disk-drives, mag-tapes], current usage encompasses file compression {Unix COMPRESS, MSDOS ARC, CP/M UNArc}, high-speed proprietary MODEM protocols {"LZW in SILICON"} and 'picture transmission' at 1200 baud.

Rgds, Bernie