ross@spam.ua.oz.au (Ross Williams) (03/23/91)
Title: PAINLESS HARD-WIRED ARITHMETIC CODING
Subtitle: How to Code White Noise into Text
Author: Ross Williams
EMail: ross@spam.ua.oz.au
Date: 23-Mar-1991

(Abstract: This short article describes a simple way to code a stream of random bits into a stream of text characters.)

Text files are a very desirable way to represent information. Although they waste a little space, they seem to be the only things that can be transferred between computers without hassles. For this reason, I have considered the implications of using text files as a medium for storing compressed data.

I assume that only printable bytes [32,126] will be included in the text file, except for the end-of-line byte (10), which will appear at regular intervals to keep the line length down. Some programs (such as Kermit) have difficulty with long lines (e.g. >2K long). It is also prudent to exclude blanks from the range, as some overenthusiastic programs may convert runs of blanks into TAB characters or delete leading or trailing blanks.

In the end, we are left with a stream of bytes in the range [33,126]. This is 94 values, a particularly inconvenient range as it is not a power of two. In fact, it is not even close to a power of two, being almost exactly between 64 and 128. It misses out by 2 on being exactly halfway and has lots of bits turned on (94 = 64+16+8+4+2 = 1011110 in binary). This makes it a very inconvenient range into which to code a stream of random bits (such as might be generated by a fast LZ technique). As might be expected, all the low powers of 94 are unpleasant numbers too. All in all, a very unpleasant range to code into.

My first reaction was to say "No problems, let's slam in an arithmetic coder and everything will be OK", which indeed is what I decided on. However, your typical arithmetic coder ranges in implementation complexity somewhere between debugging AI programs in assembler and simulating a Connection Machine on a TRS-80. And they can be slow too.
However, by hard-wiring the arithmetic coding, the implementation can be much less painful than for a full arithmetic coder, which is why I'm bothering to tell you all this.

To hard-wire the arithmetic coding, form a mapping from the range [0,63] (6 bits exactly) onto the range [0,93] such that 34 of the 64 values map onto one value of the range [0,93] and 30 of the 64 values map onto two values. It doesn't matter exactly what this mapping is. Probably the simplest mapping is [0,33]-->[0,33] and [34,63]-->[34,93] (with 34-->(34,35), 35-->(36,37), etc.). If desired, this mapping could be implemented using a lookup table.

Now a completely random bit stream arising from a standard compression algorithm (e.g. Unix compress or some other LZ algorithm) can be coded as follows. Take the next six bits of the data and map them. If the six-bit number maps to a single number, that number is transmitted. If the six-bit number maps to a pair of numbers, the next input bit is used to determine which of the pair will be transmitted. The procedure is reversed at the receiving end.

The efficiency of this scheme can be calculated as follows. The amount of information in a 94-symbol range is log_2(94) = 6.555 bits. If the input bit stream is truly random (as it should be from a perfect compression algorithm), 6 bits will be transferred per text character 34/64 of the time, and 7 bits will be transferred per text character 30/64 of the time. This makes (34/64)*6 + (30/64)*7 = 6.469 bits per character. This is an efficiency of 6.469/6.555 = 0.987, which is quite good.

This scheme will be reasonably fast to implement, the main problem being all the bit manipulation. If you want it to go fast, you can hard-wire the eight bit-alignment configurations into your program so as to deal with each alignment situation optimally. I have not actually implemented this scheme, but am likely to sometime in the next few months as part of a bigger project.
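For concreteness, the mapping and the six-bits-at-a-time procedure above can be sketched in Python. This is only an illustrative sketch, not Williams's code: the article leaves end-of-stream framing open, so zero-padding of the final group is an assumption here.

```python
# Sketch of the hard-wired arithmetic coder described above.
# 6-bit values [0,33] map to one output symbol; values [34,63] map to a
# pair of symbols, with one extra input bit selecting which of the pair.
# Output symbols are in [0,93]; add 33 to get the printable byte.

def encode(bits):
    """bits: sequence of 0/1 ints -> list of symbols in [0, 93]."""
    out = []
    i, n = 0, len(bits)
    while i < n:
        chunk = list(bits[i:i + 6])
        chunk += [0] * (6 - len(chunk))       # zero-pad the last group (assumption)
        i += 6
        v = 0
        for b in chunk:
            v = (v << 1) | b
        if v < 34:                            # 34 values -> one symbol
            out.append(v)
        else:                                 # 30 values -> a pair of symbols
            extra = bits[i] if i < n else 0   # one more bit picks which of the pair
            i += 1
            out.append(34 + 2 * (v - 34) + extra)
    return out

def decode(symbols):
    """Inverse of encode, up to trailing zero padding."""
    bits = []
    for s in symbols:
        if s < 34:
            v, extra = s, None
        else:
            v = 34 + (s - 34) // 2            # recover the 6-bit value
            extra = (s - 34) % 2              # and the selecting bit
        bits.extend((v >> k) & 1 for k in range(5, -1, -1))
        if extra is not None:
            bits.append(extra)
    return bits
```

A receiver would strip the padding using an agreed length prefix or similar; that detail is outside the sketch.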
I hope someone out there finds this simple scheme useful.

Keep your entropy low,
Ross Williams
ross@spam.ua.oz.au
rickert@mp.cs.niu.edu (Neil Rickert) (03/23/91)
In article <550@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
>
>(Abstract: This short article describes a simple way to code a stream
>of random bits into a stream of text characters.)
>
I think you would do better with Huffman coding. Or, more precisely, with Huffman decoding.

Take your 94 printable characters (any other number of characters will work). Determine a fixed Huffman encoding for data that uses these characters uniformly. Announce the encoding. To encode a bit stream, do the appropriate Huffman decoding, resulting in a byte stream using these 94 characters. To reproduce the bit stream, do the Huffman encoding.

The nice part is that it allows bit streams to be transparently sent from an ASCII site to an EBCDIC site. The Huffman table at each site reflects the character set actually used. (Of course, use only characters for which the ASCII-to-EBCDIC conversion is universally agreed upon.)

Some care is probably needed to add padding bits if need be, and remove them on reconstruction of the bit stream, but a leading bit count could handle this. Of course, this method only works if the Huffman table to be used is agreed to by sender and receiver.
--
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Neil W. Rickert, Computer Science               <rickert@cs.niu.edu>
Northern Illinois Univ.
DeKalb, IL 60115                                   +1-815-753-6940
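As a sanity check on this idea, here is a sketch (illustrative only; `huffman_lengths` is a hypothetical helper, not code from the post) of a textbook Huffman construction over 94 equally likely characters. It assigns 6-bit codes to 34 characters and 7-bit codes to the other 60, which is exactly the same 6/7-bit split as the hard-wired mapping in the parent article.

```python
import heapq

def huffman_lengths(weights):
    """Code lengths from a textbook Huffman construction."""
    # Each heap entry is (total weight, list of symbol indices in the subtree).
    heap = [(w, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, s1 = heapq.heappop(heap)
        w2, s2 = heapq.heappop(heap)
        for i in s1 + s2:        # every symbol in the merged subtree
            lengths[i] += 1      # moves one level deeper
        heapq.heappush(heap, (w1 + w2, s1 + s2))
    return lengths

# 94 equally likely printable characters: 34 get 6-bit codes, 60 get 7-bit codes
lens = huffman_lengths([1] * 94)
```

Since the code is complete (the Kraft sum is exactly 1), every possible bit stream "decodes" to some character sequence, which is what makes the decode-to-encode trick work.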
jaffer@zurich.ai.mit.edu (Aubrey Jaffer) (03/24/91)
I would also very much like to be able to encode to printable characters. My reasoning went very much like yours, except that it makes sense to encode the length-offset pairs of an LZ scheme directly into pairs of printable characters by use of a table.

If 93 characters are available, for instance, 93^2 = (31*3)^2, which can be split into a range of 31^2 = 961 values for the offset and 3^2 = 9 values for the length field. These ranges are reasonable for encoding as I understand it. By eliminating the extra translation step, the only penalty is the size of the translation table (about 64K).
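The pairing can also be sketched with direct arithmetic instead of a table (a sketch under stated assumptions: the byte offset of 33 for the first printable character and the function names are mine, and a real LZ coder would fix its own field ranges):

```python
# Pack an LZ (offset, length) pair into two printable characters drawn
# from a 93-symbol alphabet, as suggested above:
# 961 offsets * 9 lengths = 8649 = 93**2, so the packing is exact.

BASE = 33  # first printable byte used -- an assumption for illustration

def pack(offset, length):
    """offset in [0, 961), length in [0, 9) -> two printable characters."""
    index = offset * 9 + length            # 0 <= index < 93**2
    return chr(BASE + index // 93), chr(BASE + index % 93)

def unpack(c1, c2):
    """Inverse of pack."""
    index = (ord(c1) - BASE) * 93 + (ord(c2) - BASE)
    return index // 9, index % 9
```

A lookup table indexed by the pair would trade this arithmetic for memory, which is the roughly 64K table mentioned above.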
brad@looking.on.ca (Brad Templeton) (03/25/91)
If the data to be encoded is not random (i.e. not the output of a compressor), then consider the ABE encoding. You can find abe in the comp.sources.misc archives, and on uunet in ~ftp/clarinet/abe.tar.Z, I think.

ABE works with two versions of the safe character set. The more restrictive one is EBCDIC-proof. It splits the bytes into 3 or 4 sets, and uses shift characters to change sets. If possible, it puts the most common bytes in the first (default) set.

On text files, for example, ABE encodes with almost no expansion, and the encoding is readable! This also applies to binaries, as it always tries to encode the safe characters as themselves when it can. If the bytes are totally random, ABE is more expansive than it should be, but still not too far off what uuencode does.
--
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473
d88-jwa@alv.nada.kth.se (Jon W{tte) (03/25/91)
In article <> brad@looking.on.ca (Brad Templeton) writes:
>If the bytes are totally random, ABE is more expansive than it should be,
>but still not too far off what uuencode does.
But uuencode ain't very good... (see recent post about atob)
h+@nada.kth.se
Jon W{tte
--
"The IM-IV file manager chapter documents zillions of calls, all of which
seem to do almost the same thing and none of which seem to do what I want
them to do." -- Juri Munkki in comp.sys.mac.programmer