[comp.sys.ibm.pc] Squeeze compression utility

jack.bzoza@canremote.uucp (JACK BZOZA) (07/30/89)

To: jasmith@watcgl.waterloo.edu (Jeff Smith)


je>1) Has anybody heard of SQUEEZE? (I'm not a PC user, so I haven't
je>heard of it:-)

SQ is a data compression utility which has hung on from the old CP/M
days.  It implements fairly sophisticated compression techniques
(Huffman encoding?) which have since been surpassed by today's standard
utilities such as PKARC, PKZIP, and LHARC.

It is still around in MSDOS form.  Its companion unsqueezer is called
USQ.

I have USQ in MSDOS format (including source code).  SQ is on our local
BBS, and each file is about 12K.  I'd be happy to upload them to you if
you can tell me how.  (I'm new to this USENET stuff and have no FTP
access - I access through our local BBS.)  Can I UUENCODE them and
include them as text in a message here?  Or via Email?

Be <C>hatting with you.....
---
 * QDeLuxe 1.10 #168

jack.bzoza@canremote.uucp (JACK BZOZA) (07/31/89)

je>A friend of mine has recently come across some compressed
je>files that he says has been packed using a utility called
je>SQUEEZE. Apparently, it leaves an extension with a 'Q' in it.
je>I am not CERTAIN that isn't just the .hqx from a binhex
je>compression, but he seems to be quite certain about squeeze.

Here is a summary of SQueeze and USQueeze which I found on a local BBS
and am posting for all interested parties.

Data Compression Techniques Using SQ and USQ

Mark DiVecchio San Diego IBM PC Users Group

In any computer system, efficient management of the file storage space is
important.  The two programs SQ and USQ reduce the size of data files by using
the Huffman Data Compression Algorithm.

A file is composed of a sequence of bytes.  A byte is composed of 8 bits.  That
byte can have a decimal value between 0 and 255.  A typical text file, like a C
language program, is composed of the ASCII character set (decimal 0 to 127).
This character set only requires seven of the eight bits in a byte.  Just think
-- if there were a way to use that extra bit for another character, you could
reduce the size of the file by 12.5%.  For those of you who use upper case
characters only, you use only about 100 of the 256 possible codes, so seven
bits per character would still be enough; the unneeded bits could be put to
work for other characters.  What if you could encode the most frequently used
characters in the file with only one bit?  For example, if you could store the
letter "e" (the most commonly used letter in the English language) using only
one bit :  "1", you could save 87.5% of the space that the "e" would normally
take if it were stored as the ASCII "0110 0101" binary.

The answer to these questions lies in the SQ and USQ programs.

The SQ program uses the Huffman coding technique:  it counts the frequency of
use of each of the 256 possible byte patterns, and it then assigns each
character a translation to a bit string.  All of these bit strings are placed
end to end and written onto a disk file.  The encoding information is also put
in the file, since the USQ program needs to know the character distribution
of the original file.

The USQ program reads in the encoding information and then reads in the encoded
file.  It is a simple matter to scan the encoded file and produce an output
file which is identical to the file that SQ started with.

         Huffman Coding Technique

This is by far the most popular encoding technique in use today.  The Huffman
encoding replaces fixed bit characters with variable length bit strings.  The
length of the bit string is roughly inversely proportional to the frequency of
occurrence of the character.  For those of you inclined to such symbolism:

    Length of bit string  ~=  -log2(character probability)

The implementation of the algorithm which we will discuss encodes fixed bit
strings of length 8.

This algorithm requires two passes through the input file.  The first pass
builds a table of 256 entries showing the frequency of each occurrence of each
of the 256 possible values for a byte of information.

Once the counting is complete, the algorithm must decide which bit strings to
associate with each of the 256 characters that were found in the file.  Note
that if a particular byte value was never used, no string association is
needed.

The second pass through the input file converts each byte into its encoded
string.  Remember that when the output file is created, the information used
for encoding must also be written on the file for use by the decoding program.

The decoding program reads the encoding information from the file and then
starts reading the bit strings.  As soon as enough bits are read to interpret a
character, that character is written onto the final output file.  See the next
two sections on how SQ and USQ actually implement this.
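
To make the two passes concrete, here is a sketch of a generic Huffman code
builder in Python.  It is an illustration of the algorithm as described, not
SQ's actual implementation (SQ is written in C and uses its own on-disk tree
format):

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """First pass: count byte frequencies; then build the code table.

    A sketch assuming non-empty input.  Returns a dict mapping each byte
    value that occurs in `data` to its bit string.
    """
    freq = Counter(data)  # table of occurrence counts (pass one)
    # Each heap entry: (count, tiebreaker, tree), where a tree is either
    # a byte value (a leaf) or a (left, right) pair (an interior node).
    heap = [(n, b, b) for b, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                 # degenerate file: one distinct byte
        return {heap[0][2]: "0"}
    tick = 256                         # unique tiebreaker for merged nodes
    while len(heap) > 1:               # repeatedly merge the two rarest
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, tick, (t1, t2)))
        tick += 1
    codes = {}
    def walk(tree, prefix=""):         # path from root to leaf = bit string
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2])
    return codes
```

On "ABRACADABRA" this assigns "A" a one bit code and the rarer letters three
bits each, 23 bits in all before the table overhead.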

Even though this article has primarily addressed ASCII input files, there is
nothing which restricts this algorithm to ASCII.  It will work on binary files
(.COM or .EXE) as well.  But since the length of the encoded bit string grows
as the frequency of occurrence of each 8 bit byte falls, a binary file may not
compress very much.  This is because a binary file
most likely has a uniform distribution over the 256 values in a byte.  A
machine language program is not like the English language where the letter "e"
is used far more than other letters.  If the distribution is uniform, the
encoded bit strings will all be the same length and the encoded file could be
longer than the original (because of the encoding information on the front of
the file).  All of this has to be qualified, because machine language programs
tend to use a lot of "MOV" instructions and have a lot of bytes of zeros so
that encoding .COM and .EXE files does save some disk space.

                SQ

The SQ program is an example of the Huffman algorithm.


The first thing that SQ does is read through the input file and create a
distribution array for the 256 possible characters.  This array contains counts
of the number of occurrences of each of the 256 characters.  The program counts
these values in a 16 bit number.  It makes sure that, if you are encoding a big
file, counts do not exceed a 16 bit value.  This is highly unlikely, but it
must be accounted for.

At the same time, SQ removes strings of identical characters and replaces them
with the ASCII character DLE followed by a character count of 2-255.  SQ
replaces the ASCII DLE with the pair of characters:  DLE DLE.  This is not
related to the Huffman algorithm but just serves to compress the file a little
more.

Once SQ has scanned the input file, it creates a binary tree structure
containing this frequency occurrence information.  The most frequently
occurring characters have the shortest path from the root to the node, and the
least frequently occurring characters have the longest path.  For example, if
your file were:

ABRACADABRA (a very simple and magical example)

The table of frequency of occurrences would be:

 Letter         # of Occurrences
 ------         ---------------
    A                 5
    B                 2
    C                 1
    D                 1
    R                 2
    all the rest      0

Since the letter "A" occurs most often, it should have the shortest encoded bit
string.  The letters "C" and "D" should have the longest.  The other characters
which don't appear in the input file don't need to be considered.

SQ would create a binary tree to represent this information.  The tree might
look something like this (for purposes of discussion only):

      root    <---Computer trees are
     /    \      always upside down!
   1       0   <-- This is called a
 /       /   \        node.
A      1       0
     /       /   \  <--This is called a
   B       1       0    branch.
         /       /   \
       C       1       0
             /           \
           D               R  <-This
                                is a
                                leaf


From this our encoded bit strings which are kept in a translation table would
be:

 Table Entry  Character  Binary
 -----------  ---------  ------
      1           A      1
      2           B      01
      3           C      001
      4           D      0001
      5           R      0000


The output file would be:

   A B  R    A C   A D    A B  R    A
   ----------------------------------
   1 01 0000 1 001 1 0001 1 01 0000 1
   (binary)

   A1 31 A1
   (hex)

We have reduced the size of your file from eleven bytes to three bytes for a
73% savings.  For this simple example, though, we aren't really ahead, since
we must put the binary tree encoding information onto the file as well, so
the file actually grew.  But consider a file with the word ABRACADABRA
repeated 100,000 times.  Now the encoding information is going to be a very,
very small percentage of the output file and the file will shrink
tremendously.
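
For the curious, you can check the worked example mechanically.  This little
Python sketch (not SQ itself) hard-codes the translation table above and
packs the bit strings into bytes:

```python
# Translation table taken from the worked ABRACADABRA tree above.
CODES = {"A": "1", "B": "01", "C": "001", "D": "0001", "R": "0000"}

def squeeze_demo(text: str) -> bytes:
    """Place the bit strings end to end and pack them into bytes."""
    bits = "".join(CODES[ch] for ch in text)
    bits += "0" * (-len(bits) % 8)   # pad the final byte with zeros
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

Running squeeze_demo("ABRACADABRA") produces the three bytes A1 31 A1, just
as worked out by hand above.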

SQ opens the output file and writes out the binary tree information.  Then SQ
rewinds the input file and rereads it from the beginning.  As it reads each
character, it looks into the translation table and outputs the corresponding
bit string.


SQ is a little more complicated than what I have outlined since it must operate
in the real world of hardware, but this is a fairly complete description of the
algorithm.

             USQ

The USQ program is very straightforward.  It reads in the encoding information
written out by SQ and builds the identical binary tree that SQ used to encode
the file.

USQ then reads the input file as if it were a string of bits.  Starting at the
root of the tree, it traverses one branch of the tree with each input bit.  If
it has reached a leaf, it has a character and that character is written to the
output file.  USQ then starts at the root again with the next bit from the
input file.
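
In sketch form, the decoding loop looks like this (Python again, for
illustration only; the codes dictionary and the count argument stand in for
the tree and the bookkeeping that USQ actually reads from the file header):

```python
def unsqueeze_demo(data: bytes, codes: dict, count: int) -> str:
    """Walk the bit stream, emitting a character at each leaf.

    `codes` maps characters to bit strings; `count` is how many
    characters to decode, so padding bits in the last byte are ignored.
    """
    decode = {v: k for k, v in codes.items()}   # bit string -> character
    bits = "".join(f"{byte:08b}" for byte in data)
    out, pending = [], ""
    for bit in bits:
        pending += bit            # descend one branch per input bit
        if pending in decode:     # reached a leaf: we have a character
            out.append(decode[pending])
            pending = ""
            if len(out) == count:
                break
    return "".join(out)
```

Feeding it the three bytes A1 31 A1 and the table from the example recovers
ABRACADABRA exactly.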


       What does it all mean?

Now that we understand the algorithm and a little about how the SQ and USQ
programs work, we can use that knowledge to help us run our systems a little
more efficiently.  (So all of this theory is worth something after all!).

1.  Files must be above a threshold size, or else the output file will be
longer than the input file because of the decoding information put at the
beginning of the compressed data.  We don't know the exact size of the
threshold because the encoding binary tree information depends on the
distribution of the characters in a file.  At least we know to check the size
of the encoded file after we run SQ to make sure our file didn't grow.

2.  Some files will not compress well if they have a uniform distribution of
byte values, for example, .COM or .EXE files.  This is because of the way SQ
builds the tree.  Remember that bytes with the same frequency of occurrence are
at the same depth (usually) in the tree.  So if all of the bytes have the same
depth, the output strings are all the same length.

3.  SQ reads the input file twice.  If you can, use a RAM disk at least for the
input file and for both files if you have the room.  The next best case is to
use two floppy drives, one for input and one for output.  This will cause a lot
of disk starts and stops but not much head movement.  Worst case is to use one
floppy drive for both input and output.  This will cause a lot of head movement
as the programs alternate between the input and output files.

Other Compression Techniques

RUN-LENGTH ENCODING

Run-length encoding is a technique whereby sequences of identical bytes are
replaced by the repeated byte and a byte count.  As you might guess, this
method is effective only on very specialized files.  One good candidate is a
screen display buffer.  A screen is made up mostly of "spaces".  A completely
blank line could be reduced from 80 bytes of spaces to one space followed by a
value of 80.  To go from 80 bytes down to two bytes is a savings of almost 98%.
You might guess that for text files or binary files, this technique does not
work well at all.
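
A minimal sketch of the idea in Python (a generic byte-and-count scheme of my
own framing, not any particular utility's file format):

```python
def rle_encode(data: bytes) -> bytes:
    """Replace each run of identical bytes with a (byte, count) pair.

    Counts are capped at 255 so each pair fits in two bytes.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1                     # extend the current run
        out += bytes([data[i], j - i]) # emit the byte and its count
        i = j
    return bytes(out)
```

An 80 byte blank line collapses to two bytes, but non-repetitive data doubles
in size under this scheme, which is why it pays off only on specialized files.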

ADAPTIVE COMPRESSION

This technique replaces strings of characters with a code.  For example, the
string "ABRACADABRA" would be replaced by a single code.  Typical algorithms
use a 12 bit code.  The algorithm is unique in that it requires only a single
pass through the input file, with its dictionary built as the encoding takes
place.  The current incarnation of this procedure is called the LZW method
(after its co-inventors A. Lempel, J. Ziv, and T. Welch).  This algorithm
claims a savings of 66% on machine language files and up to 83% on COBOL
files.
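
The heart of LZW fits in a few lines.  This Python sketch uses an unbounded
dictionary for clarity; the 12 bit variants mentioned above cap or reset the
table when it fills:

```python
def lzw_encode(data: bytes) -> list:
    """One-pass LZW: grow a dictionary of strings as the file is read."""
    table = {bytes([i]): i for i in range(256)}  # seed with single bytes
    prefix, out = b"", []
    for byte in data:
        candidate = prefix + bytes([byte])
        if candidate in table:
            prefix = candidate             # keep extending the match
        else:
            out.append(table[prefix])      # emit code for longest match
            table[candidate] = len(table)  # add the new string
            prefix = bytes([byte])
    if prefix:
        out.append(table[prefix])          # flush the final match
    return out
```

Encoding "ABRACADABRA" this way emits nine codes for eleven input bytes; the
dictionary entries built for "AB" and "RA" are reused on their second
occurrence.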

Other Reading

If you are interested in reading more about data compression techniques, you
may be interested in these articles:

H.K. Reghbati, "An Overview of Data Compression Techniques," Computer
Magazine, Vol. 14, No. 4, April 1981, pp. 71-76.

T.A. Welch, "A Technique for High-Performance Data Compression," Computer
Magazine, Vol. 17, No. 6, June 1984, pp. 8-19.

J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data
Compression," IEEE Transactions on Information Theory, Vol. IT-23, No. 3,
May 1977, pp. 337-343.


END OF MESSAGE

 
---
 * QDeLuxe 1.10 #168  Be <C>hatting with you....