[net.micro] Data compression and information theory

norman@lasspvax.UUCP (Norman Ramsey) (07/31/85)

I have been thinking for some time about data compression, both for
archiving and for transmission over a serial link. Forgive me if I post
this in both net.math and net.micro; I'm interested both in the
information-theoretic aspects and in the software.
 
I have read somewhere that the information content of English text (such
as one might read in the New York Times) is roughly one bit per character.
I have also worked out (it is fairly simple) that a single-character 
Huffman encoding of English text requires about five bits per character.
One assumes that larger savings could be realized by using digraph or
trigraph Huffman encoding, but such schemes rapidly become unwieldy for
a computer implementation (unless I am missing something): a digraph
code table over letters plus space already runs to 27^2 = 729 entries,
and a trigraph table to 27^3 = 19,683.
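
To make the five-bit figure concrete, here is a rough sketch of the
calculation in Python. The frequency table is a made-up stand-in for
real corpus statistics, and it covers only a reduced alphabet:

    # Build a single-character Huffman code from letter frequencies and
    # report the average code length in bits per character.  The
    # frequencies below are illustrative, not from a real corpus.
    import heapq

    freqs = {' ': 0.19, 'e': 0.10, 't': 0.07, 'a': 0.06, 'o': 0.06,
             'i': 0.06, 'n': 0.06, 's': 0.05, 'h': 0.05, 'r': 0.05,
             'd': 0.04, 'l': 0.03, 'u': 0.02, 'c': 0.02, 'm': 0.02,
             'w': 0.02, 'f': 0.02, 'g': 0.02, 'y': 0.02, 'p': 0.02,
             'b': 0.01, 'v': 0.01, 'k': 0.01}

    # Each heap entry: (weight, tiebreaker, {symbol: code length so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside them
        # moves one level deeper, i.e. gains one code bit.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: n + 1 for s, n in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1

    lengths = heap[0][2]
    avg = sum(freqs[s] * lengths[s] for s in freqs)
    print("average code length: %.2f bits/char" % avg)

With only 23 symbols this prints an average of around 4 bits per
character; a full character set (capitals, digits, punctuation) pushes
the figure toward five.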

So what can we do to get closer to that 1-bit-per-character theoretical
limit?
 
 
On a more computational note: wouldn't it be nice if we could compress
text to be sent over a serial line? If I could write a TTY driver that
would encode everything sent by a host machine to my home computer, I
could read the news many times faster than I can now (which, at 300
baud, makes quite a difference).
 
What can people tell me about data compression or encoding schemes other
than Huffman encoding?
 
What are the possibilities for implementing such a scheme for
mainframe->micro communication and speeding up throughput by a factor
of four?
 
 
Norman
...allegra!cornell!lasspvax!norman
 
P.S. I am new to the net, so if I've goofed, send me some mail and let me
know.

joel@peora.UUCP (Joel Upchurch) (07/31/85)

>What can people tell me about data compression or encoding schemes other
>than Huffman encoding?

One way to do this is to build a dictionary of 50,000 or so words;
you can then compress the text by representing each word as a half-word
(a 16-bit index, since 2^16 = 65,536 slots covers the dictionary)
giving its position in the list. You can squeeze out more compression
by putting frequently used phrases in the dictionary too, but that
might increase compression time. If you assume the average word runs
6 characters plus a space, a 16-bit index works out to about 2.3 bits
per character, i.e. between 2 and 3.
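
A quick sketch of the idea in Python (the eight-word list here is a toy
stand-in for the real 50,000-word dictionary):

    # Word-index compression sketch: each word becomes a 16-bit index
    # into a dictionary shared by compressor and decompressor.
    import struct

    dictionary = ["the", "quick", "brown", "fox", "jumps", "over",
                  "lazy", "dog"]
    index = {w: i for i, w in enumerate(dictionary)}

    def compress(text):
        # One unsigned 16-bit index per word; spaces are implicit.
        return b"".join(struct.pack(">H", index[w]) for w in text.split())

    def decompress(data):
        ids = struct.unpack(">%dH" % (len(data) // 2), data)
        return " ".join(dictionary[i] for i in ids)

    msg = "the quick brown fox jumps over the lazy dog"
    packed = compress(msg)
    assert decompress(packed) == msg
    print("%d chars -> %d bytes (%.1f bits/char)"
          % (len(msg), len(packed), 8.0 * len(packed) / len(msg)))

A real scheme would also need an escape convention for words that
aren't in the dictionary (proper names, typos, numbers), which eats
into the savings.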

I think I read about this idea a long time ago in one of Wayne Green's
editorials in Kilobaud.

If you have a machine with enough memory to keep the dictionary in
memory, then compression and decompression of the text should be
very rapid.

I read an announcement from, I think, Sperry, for a machine that
could do extremely rapid text retrieval and key-word searches.
From the description I think they may be using a scheme similar to
this. As you can see, key-word searches of the compressed text would
be extremely rapid with this scheme.

This approach wouldn't be very useful for sending data over serial
lines unless both ends had the same dictionary.

dmt@mtgzz.UUCP (d.m.tutelman) (08/02/85)

> I have read somewhere that the information content of English text (such
> as one might read in the New York Times) is roughly one bit per character.
> I have also worked out (it is fairly simple) that a single-character 
> Huffman encoding of English text requires about five bits per character.
> One assumes that larger savings could be realized by using digraph or
> trigraph Huffman encoding, but such schemes rapidly become unwieldy for
> a computer implementation (unless I am missing something).

The one-bit-per-character assertion comes from an old classic paper.
(Don't have a reference handy, but I believe it's by Claude Shannon
himself, published in BSTJ in the 1950s or even '40s.)
What he did was have human beings "guess" the next letter in
meaningful (i.e., not nonsense) English sentences. "Space" was included
as a character.  He then computed the probability distribution
of the number of guesses required to correctly guess the next
letter.  The entropy of that distribution was a smidgen over one bit
(i.e., people usually guessed right on the first try).
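
To make the arithmetic concrete, here is that entropy computation in
Python for a made-up guess distribution (the probabilities are invented
for illustration; they are not Shannon's data):

    # Entropy of a hypothetical guess-count distribution: P(right on
    # the first guess) = 0.80, on the second = 0.10, and so on.
    import math

    p = [0.80, 0.10, 0.05, 0.03, 0.02]
    H = -sum(q * math.log(q, 2) for q in p)
    print("entropy: %.2f bits per letter" % H)   # prints about 1.07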

How do you use this result in a "real" encoder?  First, you build a
"guesser" that knows as much about the language and human experience
as the humans in Shannon's experiment. (Obviously that's the 
HARD PART.)  Feed it the string and count the number of guesses
it takes.  Huffman-code this number, based on the performance
distribution of the "guesser".  If the guesser is as good as Shannon's
subjects were, it'll be about one bit per letter.
The decoder is the inverse operation, using the identical algorithm
in the "guesser".  (If there is ANY difference in the number of
guesses, the reconstruction will quickly go out of sync.)
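
A toy version of the whole loop, in Python. The "guesser" here just
ranks letters by overall frequency, ignoring context entirely, so it is
nowhere near one bit; it only shows the encoder/decoder symmetry:

    # The model is a fixed guessing order shared by both ends (my
    # invention, not Shannon's).  The encoder emits the rank of the true
    # character in the guess list; the decoder replays the identical
    # guesses, so the ranks resolve back into text.
    GUESS_ORDER = " etaoinshrdlucmwfgypbvkjxqz"

    def encode(text):
        return [GUESS_ORDER.index(c) for c in text]

    def decode(ranks):
        return "".join(GUESS_ORDER[r] for r in ranks)

    ranks = encode("the cat sat")
    assert decode(ranks) == "the cat sat"
    # The rank stream is heavily skewed toward small values, so a
    # Huffman code over the ranks compresses well; a smarter guesser
    # makes the skew (and hence the compression) stronger.
    print(ranks)

If the two guessers disagreed on even one guess, the first mismatch
would scramble everything after it; that is the synchronization problem
noted above.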

Implementation is still a far-off AI research project.

			Dave Tutelman
			Physical - AT&T Information Systems
				   Holmdel, NJ 07733
			Logical  - ...ihnp4!mtuxo!mtgzz!dmt
			Audible  - (201)-834-2895