W8SDZ@SIMTEL20.ARPA (Keith Petersen) (12/07/86)
Here is another old DOC file from the original release of CRUNCH/UNCRUNCH, DETAILS.DOC. --Keith ***************************************************************** * * * LZW "Cruncher" Data Compressor Utility * * LZW "Un-Cruncher" Data Decompressor Utility * * * * Z-80 Only, CP/M 2.2+ * * v1.0 3/30/86 * * -Steven Greenberg * ***************************************************************** This document is intended to supplement the accompanying ABSTRACT.DOC for those interested in some more technical details. As mentioned, the cruncher is based an Kent Williams' imp- lementation of the Lempel / Zev algorithm. For further infor- mation on the algorithm itself, I refer you to his public domain file LZW2COM.LBR which contains a description of the technique and an actual implementation written in "C" source. In order to make a practical stand alone "cruncher" that was easy to use, especially for those already familiar with squeez- ers, some header information had to be included in the resulting "crunched" file (eg. the filename of the original file, etc.). I have defined a header based on the time tested squeezed file format, with some necessary changes and a few additions. The ad- ditions are mostly to insure that files crunched now will always be un-crunchable with future versions of the uncruncher, no mat- ter what possible enhancements are made. Those familiar with the MS-DOS ARC.xxx program have probably seen this idea in action. More on this later. Another slight problem with LZWCOM & LZWUNC had to do with the question of termination. When the input file was exhausted during compression, it was unlikely the output file was on a sector boundary. No matter what the rest of the final output sector was padded with ("1A"'s were used), the uncruncher would try to uncrunch those bytes (since all data is conceivably val- id). This resulted in occasional extra sectors of garbage following an otherwise properly decoded file. While this did not usually cause a problem, it was certainly not desirable. I have chosen to handle the termination problem the same way it was handled with squeezed files; by dedicating a unique code to represent EOF (End Of Field). By only allowing 4095 instead of 4096 different codes (not a major shortcoming), code 000 can become a dedicated EOF. As soon as it is encountered on the input file, the decoding process is known to be complete. For those who are interested, the exact code put out by CRUNCH can be duplicated by the "C" program LZWCOM if table entry zero "artifi- cially" flagged as "used" (before initializing the table). That insures that the code will never come up, except when manually inserted at the end of file. The other functional difference from LZWCOM involves repeat byte coding. CRUNCH converts the "physical input stream" into a "logical input stream", which is then handed to the cruncher. The conversion takes 3 or more contiguous occurrences of the same byte and encodes them as <byte> "90H" <count> where "count" is the number of "additional" occurrences of <byte> (ie total occur- rences -1). 90H itself is encoded as "90H" ,"00". This scheme is identical to that used in standard squeezing. Crunching requires only one pass through the input file, while squeezing requires two. While this is one of its signif- icant advantages, it does complicate the problem of including a checksum, if one is desired, in the header of the result file (since the value is not known until everything is done). A bad solution is to close the finished output file, re-open it, insert the checksum, and close it again. A good solution is to put the checksum at the end of the output file right after the EOF. And that's where it is. With all this in mind, herein follows a specification for the format of a crunched file. --------------------------------------------- ID FIELD: Bytes 0 and 1 are always 076H and 0FEH, respec- tively. This identifies the file as "crunched". FILENAME: The filename field starts at byte 2. It is a field of variable length, terminated by a zero byte. The field contains the filename as ASCII chars, including an ASCII "." immediately preceding the filename's extension. Less than eight characters may precede the "."; there is no necessity to pad the filename with blanks. Additional characters after the 3rd exten- sion character but before the zero byte specifically are allowed and will be ignored by the current uncruncher. This allows an area of unlimited size for date stamping, or other miscellaneous information which a future cruncher or application program might want to insert, for use or display by some uncrunching program. By skipping over these bytes now, future incompatibilities are eliminated. Following the zero byte are the following 4 bytes, in order: REFERENCE REVISION LEVEL: 1 byte } SIGNIFICANT REVISION LEVEL: 1 byte } described later ERROR DETECTION TYPE: 1 byte } SPARE: 1 byte } CRUNCHED OUTPUT: After the SPARE byte, the actual crunched output finally begins. The crunched output is a series of 12-bit codes in "natural" order. (Every other 12-bit code starts on a byte boundary and includes the 4 ms bits of the next byte. The "odd" codes start in the middle of a byte and include the whole following byte as the remaining 8 ls bits). A 12-bit code of 000 is an EOF, as explained above. If the EOF code itself ends in the middle of a byte, an additional 4 bits of zero are padded on to get back on a byte boundary for the checksum. CHECKSUM: The next two bytes are the 16-bit checksum, least significant byte first. The checksum is the modulo 2^16 sum of all the bytes as input from the physical input stream, prior to repeat byte encoding (or, in the case of uncrunching, as output to the physical output stream, after repeat byte decoding). REMAINDER OF THE SECTOR: The remaining bytes in the sector following the checksum are irrelevant. CRUNCH fills them with "1A"'s. --------------------------------------------- These are the four bytes not fully described above: "Reference Revision Level": The program/revision level of the program that performed the crunch operation. This byte is put in for general reference only. The current value is "10" (hex). "Significant Revision Level": If the value of this byte in a crunched data file exceeds the value contained within the un- crunching program, the message "File requires newer revision of program" will be displayed. If changes or enhancements are ever made to CRUNCH which are significant enough to actually output an incompatible file, the information in this byte will allow a new revision of UNCR to be compatible with all existing data files, old or new. The error message gets displayed only if someone tries to uncrunch a new file with an old uncruncher which doesn't know about the "future" format yet. Current value is "10" (hex). "Error Detection Type": If this value is non-zero, the cur- rent uncruncher will not examine the checksum or give an error associated with it. This will permit a CRC type (or no error checking) value to be used if circumstances warrant it. The cur- rent UNCR program is always checking for "illegal" codes, which are ones which have not been defined by previous codes. If any are encountered, the message "Invalid Crunched File" is disp- layed. This inherent self-checking probably precludes the neces- sity of more advanced error checking. "Spare": The SPARE byte is a spare byte. - Steven Greenberg 30 March 1986
W8SDZ@SIMTEL20.ARPA (Keith Petersen) (12/07/86)
The original LZWCOM program referred to in Steven Greenberg's description of his CRUNCH/UNCRUNCH program is available from SIMTEL20 as: Filename Type Bytes CRC Directory PD:<CPM.SQUSQ> LZW.LBR.1 BINARY 50816 0D23H This LBR contains the following files: COMMLZW.C LZW.C LZW.DQC LZW.SUB LZWCOM.COM LZWUNC.C LZWUNC.COM Please remember that Greenberg has pointed out some shortcomings and problems with the original LZWCOM approach. However, this should be a good starting place for making a Unix C version of Greenberg's enhanced cruncher. The LZW.DOC file says that the C programs can be compiled on Unix and should be reasonably portable. Please keep me posted on your progress. I need a C version myself for use here on SIMTEL20 to maintain the crunched files. --Keith