[comp.compression] How to code to a text file.

ross@spam.ua.oz.au (Ross Williams) (03/23/91)

Title: PAINLESS HARD-WIRED ARITHMETIC CODING.
Subtitle: How to Code White Noise into Text.
Author: Ross Williams.
EMail: ross@spam.ua.oz.au
Date: 23-Mar-1991.

(Abstract: This short article describes a simple way to code a stream
of random bits into a stream of text characters.)


Text files are a very desirable way to represent information. Although
they waste a little space, they seem to be the only things that can be
transferred between computers without hassles. For this reason, I have
considered the implications of using text files as a medium for
storing compressed data.

I assume that only printable bytes [32,126] will be included in the
text file, except for the end of line byte (10), which will appear at
regular intervals to keep the line length down. Some programs (such as
Kermit) have difficulty with long lines (e.g. >2K long). It is also
prudent to exclude blanks from the range as some overenthusiastic
programs may convert runs of blanks into TAB characters or delete
leading or trailing blanks.

In the end, we are left with a stream of bytes in the range [33,126].
This is 94 values, a particularly inconvenient range as it is not a
power of two. In fact, it is not even close to a power of two, being
almost exactly between 64 and 128. It misses out by 2 on being exactly
half way and has lots of bits turned on (94 = 64+16+8+4+2 = 1011110 in binary).
This makes it a very inconvenient range to code a stream of random
bits (such as might be generated by a fast LZ technique). As might be
expected, all the low powers of 94 are unpleasant numbers too. All in
all, a very unpleasant range to code into.

My first reaction was to say "No problems, let's slam in an arithmetic
coder and everything will be OK" which indeed is what I decided on.
However, your typical arithmetic coder ranges in implementation
complexity somewhere between debugging AI programs in assembler and
simulating a Connection Machine on a TRS-80. And they can be slow too.
However, by hard wiring the arithmetic coding, the implementation can
be much less painful than for a full arithmetic coder, which is why
I'm bothering to tell you all this.

To hardwire the arithmetic coding, form a mapping from the range
[0,63] (6 bits exactly) onto the range [0,93] such that 34 of the 64
values map onto one value of the range [0,93] and 30 of the 64 values
map onto two values. It doesn't matter exactly what this mapping is.
Probably the simplest mapping is [0,33]-->[0,33] and [34,63]-->[34,93]
(with 34-->(34,35), 35-->(36,37) etc). If desired, this mapping could
be implemented using a lookup table.

Now a completely random bit stream arising from a standard compression
algorithm (e.g. unix compress or some other LZ algorithm), can be
coded as follows. Take the next six bits of the data and map them. If
the six-bit number maps to a single number, that number is
transmitted. If the six-bit number maps to a pair of numbers, the next
input bit is used to determine which of the pair will be transmitted.
The procedure is reversed at the receiving end.
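
A sketch of this procedure (in Python, using the simple mapping above;
the handling of a final partial group of bits is left out):

```python
# Hard-wired coder: [0,33] --> [0,33] (single values),
# [34,63] --> pairs (34,35), (36,37), ... as described above.

def encode(bits):
    """Turn a list of 0/1 bits into symbols in [0,93]."""
    out, i = [], 0
    while i + 6 <= len(bits):
        v = int("".join(str(b) for b in bits[i:i+6]), 2)
        i += 6
        if v < 34:
            out.append(v)                  # maps to a single symbol
        elif i < len(bits):
            sel = bits[i]; i += 1          # next bit picks one of the pair
            out.append(34 + 2 * (v - 34) + sel)
        else:
            break                          # partial group: padding not shown
    return out

def decode(symbols):
    """Recover the bit stream from symbols in [0,93]."""
    bits = []
    for s in symbols:
        if s < 34:
            v, extra = s, []
        else:
            v = 34 + (s - 34) // 2         # which six-bit value it came from
            extra = [(s - 34) % 2]         # the bit that chose within the pair
        bits += [int(b) for b in format(v, "06b")]
        bits += extra
    return bits
```

Adding 33 to each output symbol gives bytes in the range [33,126],
ready for the text file.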

The efficiency of this scheme can be calculated as follows. The amount
of information in a 94 symbol range is log_2(94)=6.555 bits. If the
input bit stream is truly random (as it should be from a perfect
compression algorithm), 6 bits will be transferred per text character
34/64 of the time, and 7 bits will be transferred per text character
30/64 of the time. This makes (34/64)*6+(30/64)*7=6.469 bits. This is
an efficiency of 6.469/6.555=0.987 which is quite good.
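
The arithmetic above can be checked directly:

```python
import math

info = math.log2(94)               # capacity of one text symbol: ~6.555 bits
avg  = (34/64) * 6 + (30/64) * 7   # average input bits consumed per symbol
eff  = avg / info                  # coding efficiency

print(round(avg, 3), round(eff, 3))   # 6.469 0.987
```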

This scheme should be reasonably fast to implement, the main cost
being all the bit manipulation. If you want it to go faster you can
hard-wire the eight bit-alignment configurations into your program so
as to deal with each alignment situation optimally.

I have not actually implemented this scheme, but am likely to sometime
in the next few months as part of a bigger project.

I hope someone out there finds this simple scheme useful,

Keep your entropy low,

Ross Williams
ross@spam.ua.oz.au

rickert@mp.cs.niu.edu (Neil Rickert) (03/23/91)

In article <550@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
>
>(Abstract: This short article describes a simple way to code a stream
>of random bits into a stream of text characters.)
>
 I think you would do better with Huffman coding.  Or, more precisely, with
Huffman decoding.

 Take your 94 printable characters (any other number of characters will work).
Determine a fixed Huffman encoding for data that uses these characters
uniformly.  Announce the encoding.

 To encode a bit stream, do the appropriate Huffman decoding, resulting in
a byte stream using these 94 characters.  To reproduce the bit stream, do the
Huffman encoding.  The nice part is that it allows bit streams to be
transparently sent from an ASCII site to an EBCDIC site.  The Huffman table
at each site reflects the character set actually used.  (Of course use only
characters for which the ASCII to EBCDIC conversion is universally agreed to).
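
A sketch of this idea (my construction, not Rickert's: for 94
equiprobable symbols the Kraft equality n6/64 + n7/128 = 1 with
n6 + n7 = 94 gives 34 six-bit and 60 seven-bit codewords, so a
canonical fixed code can stand in for the Huffman table):

```python
def build_code():
    """Canonical fixed code: 34 six-bit and 60 seven-bit codewords."""
    codes, c = {}, 0
    for s in range(34):                  # 34 codewords of length 6
        codes[s] = format(c, "06b"); c += 1
    c <<= 1                              # canonical step down to length 7
    for s in range(34, 94):              # 60 codewords of length 7
        codes[s] = format(c, "07b"); c += 1
    return codes

CODES = build_code()
DECODE = {v: k for k, v in CODES.items()}

def bits_to_text(bits):
    """'Huffman-decode' a bit string into printable characters [33,126]."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:
            out.append(chr(33 + DECODE[buf]))
            buf = ""
    return "".join(out)   # a trailing partial codeword needs padding handling

def text_to_bits(text):
    """'Huffman-encode' the characters back into the original bit string."""
    return "".join(CODES[ord(ch) - 33] for ch in text)
```

Note that the 34/60 split is exactly the mapping of Williams' scheme;
the two proposals come to the same thing for a uniform symbol
distribution.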

 Some care is probably needed to add padding bits if need be, and remove them
on reconstruction of the bitstream, but a leading bit count could handle
this.

 Of course, this method only works if the Huffman table to be used is
agreed to by sender and receiver.

-- 
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
  Neil W. Rickert, Computer Science               <rickert@cs.niu.edu>
  Northern Illinois Univ.
  DeKalb, IL 60115                                   +1-815-753-6940

jaffer@zurich.ai.mit.edu (Aubrey Jaffer) (03/24/91)

I would also very much like to be able to encode to printable
characters.  My reasoning went very much like yours except that it
makes sense to encode the length-offset pairs of a LZ scheme directly
into pairs of printable characters by use of a table.  If 93
characters are available, for instance, then 93^2 = (31x3)^2, which
can be split into a range of 31^2 = 961 for the offset and 3^2 = 9
for the length field.  These ranges are reasonable for encoding as I
understand it.

By eliminating the extra translation step, the only penalty is the
size of the translation table (about 64k).
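
For what it's worth, the base-93 split can also be computed
arithmetically instead of stored as a table (a sketch; the exact
offset/length ranges and the character set starting at '!' are my
assumptions, not Jaffer's):

```python
CHARS = [chr(33 + i) for i in range(93)]   # 93 printable chars, '!' .. '}'

def pair_to_chars(offset, length):
    """Map (offset, length), offset in [0,960] and length in [0,8],
    to two printable characters via a base-93 split of the index."""
    assert 0 <= offset < 961 and 0 <= length < 9
    index = offset * 9 + length            # index in [0, 8648] = [0, 93^2 - 1]
    return CHARS[index // 93] + CHARS[index % 93]

def chars_to_pair(two):
    """Recover the (offset, length) pair from two printable characters."""
    index = (ord(two[0]) - 33) * 93 + (ord(two[1]) - 33)
    return divmod(index, 9)                # (offset, length)
```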

brad@looking.on.ca (Brad Templeton) (03/25/91)

If the encoded data is not random data (ie. the output of a compressor) then
consider the ABE encoding.   You can find abe in the comp.sources.misc
archives, and on uunet in ~ftp/clarinet/abe.tar.Z, I think.

ABE works with two versions of the safe character set.  The more restrictive
one is EBCDIC proof.

It splits the bytes into 3 or 4 sets, and uses shift characters to change sets.
If possible, it puts the most common bytes in the first (default) set.  On
text files, for example, ABE encodes with almost no expansion, and the encoding
is readable!   This also applies to binaries, as it always tries to encode the
safe characters as themselves when it can.

If the bytes are totally random, ABE is more expansive than it should be,
but still not too far off what uuencode does.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

d88-jwa@alv.nada.kth.se (Jon W{tte) (03/25/91)

In article <> brad@looking.on.ca (Brad Templeton) writes:

   If the bytes are totally random, ABE is more expansive than it should be,
   but still not too far off what uuencode does.

But uuencode ain't very good... (see recent post about atob)

						h+@nada.kth.se
						Jon W{tte
--
"The IM-IV file manager chapter documents zillions of calls, all of which
seem to do almost the same thing and none of which seem to do what I want
them to do."  --  Juri Munkki in comp.sys.mac.programmer