[comp.os.cpm] Towards a UNIX equivalent for crunch 2.3

W8SDZ@SIMTEL20.ARPA (Keith Petersen) (12/07/86)

Here is another old DOC file from the original release of
CRUNCH/UNCRUNCH, DETAILS.DOC.

--Keith

*****************************************************************
*                                                               *
*            LZW "Cruncher"   Data Compressor Utility           *
*           LZW "Un-Cruncher" Data Decompressor Utility         *
*                                                               *
*                     Z-80 Only, CP/M 2.2+                      *
*                         v1.0  3/30/86                         *
*                                            -Steven Greenberg  *
*****************************************************************


     This  document  is intended to supplement  the  accompanying 
ABSTRACT.DOC for those interested in some more technical details.


     As  mentioned, the cruncher is based an Kent Williams'  imp-
lementation  of the Lempel / Zev algorithm.  For  further  infor-
mation on the algorithm itself, I refer you to his public  domain 
file  LZW2COM.LBR which contains a description of  the  technique 
and an actual implementation written in "C" source.

     In order to make a practical stand alone "cruncher" that was 
easy  to use, especially for those already familiar with  squeez-
ers, some header information had to be included in the  resulting 
"crunched" file (eg. the filename of the original file, etc.).  I 
have  defined  a header based on the time  tested  squeezed  file 
format, with some necessary changes and a few additions.  The ad-
ditions are mostly to insure that files crunched now will  always 
be un-crunchable with future versions of the uncruncher, no  mat-
ter what possible enhancements are made.  Those familiar with the 
MS-DOS  ARC.xxx program have probably seen this idea  in  action.  
More on this later.

     Another  slight problem with LZWCOM & LZWUNC had to do  with 
the  question of termination.  When the input file was  exhausted 
during  compression,  it was unlikely the output file  was  on  a 
sector  boundary.   No matter what the rest of the  final  output 
sector  was padded with ("1A"'s were used), the uncruncher  would 
try  to uncrunch those bytes (since all data is conceivably  val-
id).   This  resulted  in occasional  extra  sectors  of  garbage 
following an otherwise properly decoded file.  While this did not 
usually cause a problem, it was certainly not desirable.

     I have chosen to handle the termination problem the same way 
it was handled with squeezed files;  by dedicating a unique  code 
to  represent EOF (End Of Field).  By only allowing 4095  instead 
of  4096 different codes (not a major shortcoming), code 000  can 
become  a  dedicated EOF.  As soon as it is  encountered  on  the 
input  file, the decoding process is known to be  complete.   For 
those who are interested, the exact code put out by CRUNCH can be 
duplicated by the "C" program LZWCOM if table entry zero "artifi-
cially" flagged as "used" (before initializing the table).   That 
insures  that the code will never come up, except  when  manually 
inserted at the end of file.

     The other functional difference from LZWCOM involves  repeat 
byte coding.  CRUNCH converts the "physical input stream" into  a 
"logical  input  stream", which is then handed to  the  cruncher.  
The conversion takes 3 or more contiguous occurrences of the same 
byte  and encodes them as <byte> "90H" <count> where  "count"  is 
the number of "additional" occurrences of <byte> (ie total occur-
rences  -1).  90H itself is encoded as "90H" ,"00".  This  scheme 
is identical to that used in standard squeezing.

     Crunching  requires  only one pass through the  input  file, 
while  squeezing requires two.  While this is one of its  signif-
icant  advantages, it does complicate the problem of including  a 
checksum,  if  one is desired, in the header of the  result  file 
(since  the value is not known until everything is done).  A  bad 
solution is to close the finished output file, re-open it, insert 
the checksum, and close it again.  A good solution is to put  the 
checksum at the end of the output file right after the EOF.   And 
that's  where  it is.  With all this in mind,  herein  follows  a 
specification for the format of a crunched file.

          ---------------------------------------------

     ID  FIELD:  Bytes 0 and 1 are always 076H and 0FEH,  respec-
tively.  This identifies the file as "crunched".

     FILENAME:  The  filename field starts at byte 2.   It  is  a 
field  of variable length, terminated by a zero byte.  The  field 
contains  the  filename as ASCII chars, including  an  ASCII  "." 
immediately preceding the filename's extension.  Less than  eight 
characters may precede the ".";  there is no necessity to pad the 
filename with blanks.  Additional characters after the 3rd exten-
sion character but before the zero byte specifically are  allowed 
and  will be ignored by the current uncruncher.  This  allows  an 
area of unlimited size for date stamping, or other  miscellaneous 
information which a future cruncher or application program  might 
want  to insert, for use or display by some uncrunching  program.  
By  skipping over these bytes now, future  incompatibilities  are 
eliminated.

     Following the zero byte are the following 4 bytes, in order:

     REFERENCE REVISION LEVEL:   1 byte  }
     SIGNIFICANT REVISION LEVEL: 1 byte  } described later
     ERROR DETECTION TYPE:       1 byte  }
     SPARE:                      1 byte  }

     CRUNCHED  OUTPUT: After the SPARE byte, the actual  crunched 
output finally begins.  The crunched output is a series of 12-bit 
codes  in "natural" order.  (Every other 12-bit code starts on  a 
byte  boundary and includes the 4 ms bits of the next byte.   The 
"odd"  codes start in the middle of a byte and include the  whole 
following byte as the remaining 8 ls bits).  A 12-bit code of 000 
is  an EOF, as explained above.  If the EOF code itself  ends  in 
the middle of a byte, an additional 4 bits of zero are padded  on 
to get back on a byte boundary for the checksum. 

     CHECKSUM: The next two bytes are the 16-bit checksum,  least 
significant  byte first.  The checksum is the modulo 2^16 sum  of 
all  the bytes as input from the physical input stream, prior  to 
repeat  byte encoding (or, in the case of uncrunching, as  output 
to the physical output stream, after repeat byte decoding).

     REMAINDER  OF THE SECTOR: The remaining bytes in the  sector 
following  the checksum are irrelevant.  CRUNCH fills  them  with 
"1A"'s.

          ---------------------------------------------

     These are the four bytes not fully described above:

     "Reference  Revision Level":  The program/revision level  of 
the  program that performed the crunch operation.  This  byte  is 
put  in  for general reference only. The current  value  is  "10" 
(hex).

     "Significant Revision Level":  If the value of this byte  in 
a  crunched data file exceeds the value contained within the  un-
crunching  program, the message "File requires newer revision  of 
program" will be displayed.  If changes or enhancements are  ever 
made to CRUNCH which are significant enough to actually output an 
incompatible file, the information in this byte will allow a  new 
revision  of UNCR to be compatible with all existing data  files, 
old  or  new.  The error message gets displayed only  if  someone 
tries to uncrunch a new file with an old uncruncher which doesn't 
know about the "future" format yet. Current value is "10" (hex).

     "Error Detection Type": If this value is non-zero, the  cur-
rent  uncruncher will not examine the checksum or give  an  error 
associated  with  it.  This will permit a CRC type (or  no  error 
checking) value to be used if circumstances warrant it. The  cur-
rent  UNCR program is always checking for "illegal" codes,  which 
are  ones which have not been defined by previous codes.  If  any 
are  encountered,  the message "Invalid Crunched File"  is  disp-
layed. This inherent self-checking probably precludes the  neces-
sity of more advanced error checking.

     "Spare":  The SPARE byte is a spare byte.

                                             - Steven Greenberg
                                               30 March 1986

W8SDZ@SIMTEL20.ARPA (Keith Petersen) (12/07/86)

The original LZWCOM program referred to in Steven Greenberg's
description of his CRUNCH/UNCRUNCH program is available from
SIMTEL20 as:

Filename			Type	 Bytes	 CRC

Directory PD:<CPM.SQUSQ>
LZW.LBR.1			BINARY	 50816  0D23H

This LBR contains the following files:

 COMMLZW.C
 LZW.C
 LZW.DQC
 LZW.SUB
 LZWCOM.COM
 LZWUNC.C
 LZWUNC.COM

Please remember that Greenberg has pointed out some shortcomings and
problems with the original LZWCOM approach.  However, this should be a
good starting place for making a Unix C version of Greenberg's
enhanced cruncher.  The LZW.DOC file says that the C programs can be
compiled on Unix and should be reasonably portable.

Please keep me posted on your progress.  I need a C version myself for
use here on SIMTEL20 to maintain the crunched files.

--Keith