EIBEN@dec-marlboro.ARPA (Bernie Eiben - LDP Workstations) (06/09/86)
Since somebody asked about the differences between Huffman encoding and ARC [or what's the difference between a cherry and an apple pie] - I decided to give my 'two cents' worth. There are LONG articles and dissertations floating around -- but who reads them...

A mini-introduction into ARC

ARC is in some ways comparable to LSWEEP or LU in that it is a PACKAGING method. Files with extension ARC contain a 'marker', followed by file information, file-data, file information, file-data etc. {a rough sketch of the 'file information' layout follows below}. ARC's packaging method guarantees NO GROWTH during storage, i.e. file contents are analyzed before storage and stored either

  1. AS IS {typically files in the 1 to 200 byte range}
  2. with repeat-compression {same range as above}
  3. using Huffman encoding of 8-bit bytes {sometimes executables}
  4. using Lempel-Ziv-Welch encoding {all others}

ARC frees the user from 'worrying' about storage mechanisms and delivers practically all needed services {extract, store, list, type, check, execute and re-compress using the 'latest' compression technique}. ARC is 'downward' compatible. It is currently heavily used in the MSDOS/PCDOS world, although usage in RCPM systems is starting with the availability of a fast DE-ARCer {a CP/M version of ARC is 'in the works' by Bob Freed}. ARC belongs in the category of "Share-ware" or "Free-ware" - it is copyrighted by System Enhancement Associates {source language C, system MSDOS} - UnARC was written by Bob Freed for the Public Domain {source language assembler, system CP/M}.

A mini comparison of Huffman Encoding and Lempel-Ziv-Welch {LZW} techniques

Huffman Encoding expresses each storage unit as a variable-length pointer into a frequency-ordered tree. Compression is achieved by choosing a 'native' storage unit {where repetitions are bound to occur} and {on the average} expressing the more frequent storage units with shorter pointers [although less frequent units may be represented by longer pointers]. The encoding process needs 'two passes', i.e. first reading all units {under CP/M and MSDOS 8-bit bytes} to build the frequency-ordered tree {also called the 'dictionary'}, then translating all units into their respective pointer values. Original filename, dictionary and pointer values are stored - by convention the SECOND character of the filename extension is changed to Q - a reminder of a 'squeezed' file.

LZW expresses strings of 8-bit bytes by pointers into an 'ordered' string-table. The rules for 'constructing' the table are reversible, so that Compressor and De-Compressor can 'build' their tables 'on-the-fly'. LZW is 'one-pass', although the achieved speed is VERY dependent on language implementation and available physical memory [in general more than 90% of the time is spent in 'hashing' and table searching]. Although early implementations of LZW seemed to need more than 64K of physical memory, current enhancements make a maximum of 2**11 table entries sufficient to handle all cases. State-of-the-art implementations check the 'compression ratio' on the fly - and rebuild the table if the ratio degrades beyond a minimum, or rebuild the table on table overflow. {Small C sketches of both techniques follow below.}

Typical Huffman compression ratios hover around 33% {the compressed file is 66% of the original, whereby 'text' typically compresses a little better, and 'executables' less}. Typical LZW compression ratios average 55% - the highest compression is achieved with pixel information {values of 90% are typical} - followed by 'text' at 50% and executables around 20%.
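For the curious, here is roughly what the per-file header looks like {a sketch from memory - the field names are mine and the SEA sources are the real reference for order and widths}:

/*
 * Rough sketch {from memory} of the 'file information' block that
 * precedes each piece of file-data inside an ARC file - check the
 * SEA sources for the authoritative field order and widths.  A real
 * de-ARCer reads this byte-by-byte; a C struct would get padded.
 */
struct arc_entry {
    unsigned char marker;       /* archive marker byte               */
    unsigned char method;       /* 0 = end of archive, low values =
                                   stored / repeat-compressed /
                                   Huffman 'squeezed', higher values =
                                   LZW 'crunched'                     */
    char          name[13];     /* original filename, NUL-terminated */
    unsigned long compsize;     /* bytes of file-data that follow    */
    unsigned int  date, time;   /* MSDOS date and time stamps        */
    unsigned int  crc;          /* 16-bit CRC of the original data   */
    unsigned long origsize;     /* size before compression           */
};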
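To make the 'two-pass' Huffman idea concrete, here is a small C sketch {an illustration only - NOT the code SQ or ARC actually use, and all names are mine}. Pass one counts the byte frequencies and builds the tree by repeatedly merging the two least-frequent nodes; a real squeezer would then re-read the file and emit the variable-length pointers. This version just prints each byte's pointer length and the projected data size:

#include <stdio.h>

#define SYMS 256

static struct node { long freq; int left, right, live; } t[2 * SYMS];
static int len[SYMS];                  /* pointer length per byte value */

static void walk(int n, int depth)     /* leaf depth = pointer length */
{
    if (t[n].left < 0) { len[n] = depth ? depth : 1; return; }
    walk(t[n].left,  depth + 1);
    walk(t[n].right, depth + 1);
}

int main(int argc, char **argv)
{
    int i, c, a, b, nnodes = SYMS, root = -1;
    long bits = 0;
    FILE *fp;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: huff file\n");
        return 1;
    }

    /* pass one: frequency count {'building the dictionary'} */
    while ((c = getc(fp)) != EOF)
        t[c].freq++;
    fclose(fp);
    for (i = 0; i < SYMS; i++) {
        t[i].left = t[i].right = -1;   /* the bytes are the leaves */
        t[i].live = (t[i].freq > 0);
    }

    /* merge the two least-frequent live nodes until one is left */
    for (;;) {
        a = b = -1;
        for (i = 0; i < nnodes; i++) {
            if (!t[i].live) continue;
            if (a < 0 || t[i].freq < t[a].freq) { b = a; a = i; }
            else if (b < 0 || t[i].freq < t[b].freq) b = i;
        }
        if (b < 0) { root = a; break; }    /* single node = the root */
        t[a].live = t[b].live = 0;
        t[nnodes].freq = t[a].freq + t[b].freq;
        t[nnodes].left = a;
        t[nnodes].right = b;
        t[nnodes].live = 1;
        nnodes++;
    }

    /* frequent bytes sit near the root, i.e. get short pointers */
    if (root >= 0) walk(root, 0);
    for (i = 0; i < SYMS; i++)
        if (t[i].freq) {
            printf("byte %3d: count %6ld, pointer length %2d bits\n",
                   i, t[i].freq, len[i]);
            bits += t[i].freq * len[i];
        }
    printf("projected data size: %ld bytes {plus the dictionary}\n",
           (bits + 7) / 8);
    return 0;
}

Run over a text file, the common letters end up with the short pointers - which is where the compression comes from.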
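And the 'one-pass' table building for LZW, again as a bare-bones illustration {real implementations add hashing, variable-width output codes and the rebuild-on-ratio-drop logic mentioned above; names are mine and codes are written as decimal text just to show the stream of table pointers}:

#include <stdio.h>

#define TABSIZE 2048            /* the 2**11 entries mentioned above */

static int prefix[TABSIZE];     /* code of the string minus its last byte */
static int suffix[TABSIZE];     /* last byte of the string                */
static int nextcode = 256;      /* codes 0..255 are the single bytes      */

/* linear search; real implementations hash {hence the '90% of time'} */
static int lookup(int w, int c)
{
    int i;
    for (i = 256; i < nextcode; i++)
        if (prefix[i] == w && suffix[i] == c)
            return i;
    return -1;
}

int main(void)
{
    int c, w, code;
    long in, out = 0;

    if ((w = getchar()) == EOF)
        return 0;                       /* empty input */
    for (in = 1; (c = getchar()) != EOF; in++) {
        if ((code = lookup(w, c)) >= 0) {
            w = code;                   /* string w+c is already known */
        } else {
            printf("%d ", w);           /* emit pointer into the table */
            out++;
            if (nextcode < TABSIZE) {   /* remember w+c for later      */
                prefix[nextcode] = w;
                suffix[nextcode] = c;
                nextcode++;
            }   /* else: this is where the table rebuild would happen  */
            w = c;
        }
    }
    printf("%d\n", w);
    out++;
    fprintf(stderr, "%ld bytes in, %ld table pointers out\n", in, out);
    return 0;
}

Note that no dictionary is stored with the output - the de-compressor applies the same 'remember prefix plus byte' rule to every code it reads and so rebuilds an identical table 'on-the-fly'.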
Although the original 'paper' on LZW suggested implementations between the CPU and peripheral devices [terminals, disk-drives, mag-tapes], current usage encompasses file compression {Unix COMPRESS, MSDOS ARC, CP/M UNArc}, high-speed proprietary MODEM protocols {"LZW in SILICON"} and 'picture transmission' at 1200 baud.

Rgds, Bernie