[net.math] Data compression and information theory

norman@lasspvax.UUCP (Norman Ramsey) (07/31/85)

I have been thinking for some time about data compression, both for
archiving and for transmission over a serial link. Forgive me for posting
this to both net.math and net.micro; I'm interested in both the
information-theoretic aspects and the software.
 
I have read somewhere that the information content of English text (such
as one might read in the New York Times) is roughly one bit per character.
I have also worked out (it is fairly simple) that a single-character 
Huffman encoding of English text requires about five bits per character.
One assumes that larger savings could be realized by using digraph or
trigraph Huffman encoding, but such schemes rapidly become unwieldy for
a computer implementation (unless I am missing something).
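
For anyone who wants to check that five-bit figure, here is a rough sketch
in C that builds a single-character Huffman code over 26 letters plus space
and prints the average code length. The frequency table is a crude
approximation typed in from memory, not a measurement; with the space
included it comes out around four bits, and a fuller alphabet (capitals,
digits, punctuation) pushes it up toward five.

/*
 * Single-character Huffman code over 27 symbols (a-z plus space).
 * The frequencies are rough guesses (percent), NOT measured data.
 * Builds the tree with a simple pick-the-two-lightest loop and prints
 * the frequency-weighted average code length in bits per character.
 */
#include <stdio.h>

#define NSYM  27
#define NNODE (2 * NSYM - 1)

int main(void)
{
    static double freq[NSYM] = {        /* a-z, then space */
        6.5, 1.2, 2.2, 3.4, 10.2, 1.8, 1.6, 4.9, 5.6, 0.1,
        0.6, 3.2, 1.9, 5.4,  6.0, 1.5, 0.1, 4.8, 5.1, 7.3,
        2.2, 0.8, 1.9, 0.1,  1.6, 0.1, 18.0
    };
    double weight[NNODE];
    int parent[NNODE], alive[NNODE];
    int i, nnodes = NSYM;
    double total = 0.0, avg = 0.0;

    for (i = 0; i < NSYM; i++) {
        weight[i] = freq[i];
        alive[i] = 1;
        parent[i] = -1;
        total += freq[i];
    }

    /* repeatedly merge the two lightest live nodes into a new node */
    while (nnodes < NNODE) {
        int a = -1, b = -1;
        for (i = 0; i < nnodes; i++) {
            if (!alive[i])
                continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        weight[nnodes] = weight[a] + weight[b];
        parent[a] = parent[b] = nnodes;
        alive[a] = alive[b] = 0;
        alive[nnodes] = 1;
        parent[nnodes] = -1;
        nnodes++;
    }

    /* a symbol's code length is its depth in the tree */
    for (i = 0; i < NSYM; i++) {
        int depth = 0, j = i;
        while (parent[j] >= 0) { j = parent[j]; depth++; }
        avg += freq[i] * depth;
    }
    printf("average code length: %.2f bits per character\n", avg / total);
    return 0;
}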

So what can we do to get closer to that 1-bit-per-character theoretical
limit?
 
 
On a more computational note: wouldn't it be nice if we could compress
text to be sent over a serial line? If I could write a TTY driver that
would encode everything sent by a host machine to my home computer, I
could read the news many times faster than I can now (which, at 300
baud, makes quite a difference).
 
What can people tell me about data compression or encoding schemes other
than Huffman encoding?
 
What are the possibilities for implementing such a scheme for mainframe->
micro communication and speeding up throughput by a factor of four?
 
 
Norman
...allegra!cornell!lasspvax!norman
 
P.S. I am new to the net so if I've goofed send me some mail and let me 
know.

joel@peora.UUCP (Joel Upchurch) (07/31/85)

>What can people tell me about data compression or encoding schemes other
>than Huffman encoding?

One way to do this is to build a dictionary of 50,000 or so words; you can
then compress the text by replacing each word with a half-word (16-bit)
index giving its position in the dictionary. You can also achieve more
compression by putting frequently used phrases in the dictionary too, but
that might increase compression time. If you assume that the average word
has 6 characters, 16 bits per word works out to between 2 and 3 bits per
character.
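
As a concrete (if much simplified) sketch of the idea, here is a toy C
program. It assumes a hypothetical sorted word list in a file called
"words.txt" with at most 65,536 entries, so that an index fits in a 16-bit
half-word, and it ignores capitalization, punctuation, phrases, and words
that aren't in the dictionary (a real coder would need an escape code for
those).

/*
 * Toy dictionary coder: reads words on stdin and writes each one as a
 * two-byte index into the dictionary loaded from "words.txt" (one word
 * per line, sorted in plain byte order).  Unknown words are dropped.
 */
#include <stdio.h>
#include <string.h>

#define MAXWORDS 65536
#define MAXLEN   32

static char dict[MAXWORDS][MAXLEN];
static int ndict;

/* binary search; returns the word's index, or -1 if it isn't listed */
static int lookup(const char *w)
{
    int lo = 0, hi = ndict - 1;

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(w, dict[mid]);

        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}

int main(void)
{
    FILE *fp = fopen("words.txt", "r");
    char w[MAXLEN];

    while (fp && ndict < MAXWORDS && fscanf(fp, "%31s", dict[ndict]) == 1)
        ndict++;

    while (scanf("%31s", w) == 1) {
        int i = lookup(w);

        if (i < 0)
            continue;           /* not in the dictionary */
        putchar(i >> 8);        /* high byte of the index */
        putchar(i & 0xff);      /* low byte of the index  */
    }
    return 0;
}

Decompression is just the reverse lookup: read two bytes, index into the
same table, and print the word.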

I think I read about this idea a long time ago in one of Wayne Green's
editorials in Kilobaud.

If you have a machine with enough memory to keep the dictionary in
memory, then compression and decompression of the text should be
very rapid.

I read an announcement from (I think) Sperry for a machine that could do
extremely rapid text retrievals and key-word searches.
From the description I think they may be using a scheme similar to
this. As you can see, key-word searches of the compressed text would
be extremely rapid with this scheme.

This approach wouldn't be very useful for sending data over serial
lines unless both ends had the same dictionary.

dmt@mtgzz.UUCP (d.m.tutelman) (08/02/85)

> I have read somewhere that the information content of English text (such
> as one might read in the New York Times) is roughly one bit per character.
> I have also worked out (it is fairly simple) that a single-character 
> Huffman encoding of English text requires about five bits per character.
> One assumes that larger savings could be realized by using digraph or
> trigraph Huffman encoding, but such schemes rapidly become unwieldy for
> a computer implementation (unless I am missing something).

The one-bit-per-character assertion comes from an old classic paper.
(Don't have a reference handy, but I believe it's by Claude Shannon
himself, published in BSTJ in the 1950s or even '40s.)
What he did was have human beings "guess" the next letter in
meaningful (i.e. - not nonsense) English sentences. "Space" was included
as a character.  He then computed the probability distribution
of the number of guesses required to correctly guess the next
letter.  The entropy of that distribution was a smidgen over one bit
(i.e.- people usually guessed right the first try).  
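
To pin that down: the entropy in question is H = - sum over k of
p(k) * log2 p(k), where p(k) is the probability that the k-th guess is the
right one.  Here is a tiny C computation; the distribution below is invented
for illustration (it is NOT Shannon's data), just shaped so that most
characters are guessed on the first try, and it comes out a little over one
bit.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* p[k] = made-up fraction of characters guessed right on try k+1 */
    static double p[] = { 0.82, 0.07, 0.04, 0.025, 0.015, 0.01, 0.01, 0.01 };
    int k, n = sizeof p / sizeof p[0];
    double h = 0.0;

    for (k = 0; k < n; k++)
        if (p[k] > 0.0)
            h -= p[k] * log(p[k]) / log(2.0);   /* log2 via natural log */

    printf("entropy = %.2f bits per character\n", h);
    return 0;
}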

How do you use this result in a "real" encoder?  First, you build a
"guesser" that knows as much about the language and human experience
as the humans in Shannon's experiment. (Obviously that's the 
HARD PART.)  Feed it the string and count the number of guesses
it takes.  Huffman-code this number, based on the performance
distribution of the "guesser".  If the guesser is as good as Shannon's
subjects were, it'll be about one bit per letter.
The decoder is the inverse operation, using the identical algorithm
in the "guesser".  (If there is ANY difference in the number of
guesses, the reconstruction will quickly go out of sync.)

Implementation is still a far-off AI research project.

			Dave Tutelman
			Physical - AT&T Information Systems
				   Holmdel, NJ 07733
			Logical  - ...ihnp4!mtuxo!mtgzz!dmt
			Audible  - (201)-834-2895

bjpt@mulga.OZ (Benjamin Thompson) (08/06/85)

In article <1010@mtgzz.UUCP> dmt@mtgzz.UUCP (d.m.tutelman) writes:
>The one-bit-per-character assertion comes from an old classic paper.
>(Don't have a reference handy, but I believe it's by Claude Shannon
>himself, published in BSTJ in the 1950s or even '40s.)
>What he did was have human beings "guess" the next letter in
>meaningful (i.e. - not nonsense) English sentences. "Space" was included
>as a character.
But how much of each sentence was made available before guesses were made?
I have two sentence beginnings for you to guess the next letter of:
1) "A" and 2) "".  Compared to a simple Huffman encoding (about five bits
per character), I don't expect very many people to get it right within
five guesses.

Are guesses really desirable in a compression system anyway?  There is
no-one to say whether the guess was right or wrong ...

An obvious argument against one-bit-per-character goes something like this :
The average word has (say) five characters, which would imply that its
information content can be represented with 5 bits.  This in turn would
imply that there are around 2^5, or 32, valid words.  Rather limited.
This is my interpretation of what one-bit-per-character means; if I have
missed something, please correct me.


Ben Thompson				seismo!munnari!bjpt

franka@mmintl.UUCP (Frank Adams) (08/07/85)

In article <854@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>I have two sentence beginnings for you to guess the next letter of:
>1) "A" and 2) "".  In comparison to simple Huffman encodings, I don't expect
>very many people to get it right within 5 guesses.

1) blank; "n"; "r"; "t"; "f"

2) "T"; "A"; "I"; "W"; "F"

These won't cover 90% of the sentences beginning as indicated, but I think
they will get a clear majority.  And, of course, these are among the hardest
cases.  Try, for example, to guess the next letter in these sentences:
1) "This is th" and 2) "Where are you g".

>Are guesses really desirable in a compression system anyway?  There is
>no-one to say whether the guess was right or wrong ...

Sure there is; there is the actual document.  You have a "guessing"
algorithm, which always produces the same guesses for the same text.

>An obvious argument against one-bit-per-character goes something like this :
>The average word has (say) five characters, which would imply that its
>information content can be represented with 5 bits.  This in turn would
>imply that there are around 2^5, or 32, valid words.  Rather limited.
>This is my interpretation of what one-bit-per-character means; if I have
>missed something, please correct me.

Yes, you have missed quite a bit.  First of all, you aren't justified in
using the average word size.  A better estimate would be that most words
are not more than about eight letters, so the total number is 2^9-1, or 511.
Again, this is not the number of valid words, but the typical number of
words to cover a majority of the possibilities in a particular context.
This seems quite plausible to me.

All this doesn't prove that one bit per letter is the appropriate estimate,
but it is much more plausible than your analysis would suggest.

bjpt@mulga.OZ (Benjamin Thompson) (08/09/85)

In article <570@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>>Are guesses really desirable in a compression system anyway?  There is
>>no-one to say whether the guess was right or wrong ...
>
>Sure there is; there is the actual document.  You have a "guessing"
>algorithm, which always produces the same guesses for the same text.

The idea of compression is usually to save space (although it doesn't always
work that way - my supervisor was testing out a coding scheme and managed to
achieve a 600% increase!  That happened because she was writing each output
bit as a whole character rather than as a bit ...).  As such, keeping the
original document around is a) against the rules and b) not always possible
(e.g. remote file transfer).  Guessing is not generally desirable in
re-construction.

However, after talking with Stavros Macrakis (seismo!harvard!macrakis) about
this, my view of what Shannon was doing has changed a bit.  It is effectively
finding probabilities of certain characters in certain contexts (I claim
this is basically the probabilities needed in Huffman-like codings, although
I don't think Stavros believes me).  What his subjects were doing is clearly
not feasible for a work-every-time compression system.

>>An obvious argument against one-bit-per-character goes something like this :
>>The average word has (say) five characters, which would imply that its
>>information content can be represented with 5 bits.  This in turn would
>>imply that there are around 2^5, or 32, valid words.  Rather limited.
>>This is my interpretation of what one-bit-per-character means; if I have
>>missed something, please correct me.
>
>Yes, you have missed quite a bit.  First of all, you aren't justified in
>using the average word size.  A better estimate would be that most words
>are not more than about eight letters, so the total number is 2^9-1, or 511.
>Again, this is not the number of valid words, but the typical number of
>words to cover a majority of the possibilities in a particular context.
>This seems quite plausible to me.
>
>All this doesn't prove that one bit per letter is the appropriate estimate,
>but it is much more plausible than your analysis would suggest.

I have to admit my analysis was a bit tongue in cheek.  Frank's estimate
is probably more reasonable, but I still wouldn't want to be restricted
to just 511 words.  Stavros mentioned a system operating at 2 bits per
character (which allows heaps of words).  I find this plausible.

franka@mmintl.UUCP (Frank Adams) (08/13/85)

In article <858@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>In article <570@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>>>Are guesses really desirable in a compression system anyway?  There is
>>>no-one to say whether the guess was right or wrong ...
>>
>>Sure there is; there is the actual document.  You have a "guessing"
>>algorithm, which always produces the same guesses for the same text.
>
>The idea of compression is usually to save space [...]
>  As such, keeping the original document
>around is a) against the rules and b) not always possible (e.g. remote file
>transfer).  Guessing is not generally desirable in re-construction.

Let me clarify what a data compression algorithm based on "guessing"
might look like.  The basic encoding loop looks like:

	for each character in source:
		while (character != guess())
			output '1' bit;
		output '0' bit;

The guess function looks only at the document up to but not including the
current character, and returns the most likely next character the first time,
then the next most likely, etc.

The expansion algorithm is as follows:

	until end of file:
		while (next bit read is '1')
			guess();
		output guess();

Note that *any* compression algorithm will sometimes make the result longer
than the input.
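
To make those loops concrete, here is a throwaway C version of the encoder
with the dumbest possible guess(): a fixed ranking of 27 characters by rough
overall frequency, ignoring context entirely.  The ranking string is my own
rough guess, the output bits are written as '0'/'1' characters (so this
version expands the text rather than compressing it), and anything outside
the little alphabet is simply dropped.  A real coder would pack the bits
eight to a byte and, more to the point, would need a context-sensitive
guesser to get anywhere near one bit per character.

#include <stdio.h>
#include <string.h>

/* guess order: most likely character first (space, then e, t, a, ...);
 * this ranking is a rough guess, not a measured frequency table */
static const char rank[] = " etaoinshrdlcumwfgypbvkjxqz";

int main(void)
{
    int c;

    while ((c = getchar()) != EOF) {
        const char *p;
        int k;

        if (c == '\n')
            c = ' ';                /* treat newline as a space */
        if (c >= 'A' && c <= 'Z')
            c = c - 'A' + 'a';      /* fold capitals, for simplicity */
        p = strchr(rank, c);
        if (p == NULL)
            continue;               /* outside the 27-character alphabet */

        /* one '1' bit per wrong guess, then a '0' for the right one */
        for (k = p - rank; k > 0; k--)
            putchar('1');
        putchar('0');
    }
    putchar('\n');
    return 0;
}

The matching expander just counts '1' bits up to the next '0' and prints
rank[count].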

>However, after talking with Stavros Macrakis (seismo!harvard!macrakis) about
>this, my view of what Shannon was doing has changed a bit.  It is effectively
>finding probabilities of certain characters in certain contexts (I claim
>this is basically the probabilities needed in Huffman-like codings, although
>I don't think Stavros believes me).  What his subjects were doing is clearly
>not feasible for a work-every-time compression system.

Yes, this is basically the probabilities needed in Huffman-like codings.
My sample algorithm assumes most-likely=1/2, next-most-likely=1/4, etc.
(As such, it is on average two bits per character.  A better result would
require encoding multiple following characters with a single bit in some
instances.)
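
A quick check of that two-bit claim: if the k-th guess really had probability
2^-k, the encoding loop above would spend k bits on it, so the expected
length is the sum over k of k * 2^-k, which is exactly 2 bits:

#include <stdio.h>

int main(void)
{
    double expect = 0.0, p = 0.5;
    int k;

    for (k = 1; k <= 30; k++) {   /* 30 terms is plenty for this sum */
        expect += k * p;          /* k bits, with probability 2^-k   */
        p /= 2.0;
    }
    printf("expected bits per character = %f\n", expect);
    return 0;
}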

What his subjects were doing is not feasible for a compression system;
the point is that it suggests that an optimal algorithm might be able
to achieve a one-bit-per-character average.  How to guess what comes
next is the next question, to which there is as yet no answer.

> but I still wouldn't want to be restricted
>to just 511 words.

You are still missing a key point.  You aren't restricted to ~500 words.
It's just that, in any context (preceding portion of the document), there
is a list of about 500 words, such that most of the time, your next word
will be on the list.  You *can* use any word (or even a nonsense string
like "fhdias") that you want.

norman@lasspvax.UUCP (Norman Ramsey) (08/13/85)

In article <854@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>In article <1010@mtgzz.UUCP> dmt@mtgzz.UUCP (d.m.tutelman) writes:
>>The one-bit-per-character assertion comes from an old classic paper.
>>(Don't have a reference handy, but I believe it's by Claude Shannon
>>himself, published in BSTJ in the 1950s or even '40s.)

What Shannon claims in that paper is that the *redundancy* of English, as
measured by a variety of methods (one of which was character guessing) is
roughly 50%. This means that the amount of information carried in English
text is roughly 50% of the maximum possible with the same alphabet.
Shannon's alphabet was 26 letters plus word space, so the maximum is
log2(27), or about 4.75 bits per character, and a rough calculation
says about 2.4 bits per character in English.  If you use six-letter words I
think you'll find this gives you an adequate number of words (thirty
thousand or so).

As for the number of preceding characters we are allowed to see before we
guess: properly it's an infinite number, since the information content is a
property of a statistical ensemble of strings, and we hope everything is
ergodic so that instead of an infinite number of strings we can think about
an infinite length string. Of course this just means a string whose length
gets large, and I think in this context one can do very well with eight or
so characters.

If you want to do some quantitative measurements I think I posted something
about this earlier; you could actually look at substrings of length n, and
see how rapidly the information per character converges as n gets large.
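
Here is a small C sketch of that measurement.  It reads text on stdin, folds
everything down to a 27-character alphabet (letters plus space), counts
substrings of length 1, 2 and 3, and prints the block entropy per character
H(n)/n for each n.  The numbers converge slowly and you need a lot of text
before the trigraph figure means much, so treat them as upper bounds on the
information per character rather than the true value.

#include <stdio.h>
#include <ctype.h>
#include <math.h>

#define A 27                            /* 26 letters plus space */

static long count1[A], count2[A*A], count3[A*A*A];

/* H = - sum p log2 p over the non-empty cells of a count table */
static double entropy(long *cnt, long ncells, long total)
{
    double h = 0.0;
    long i;

    for (i = 0; i < ncells; i++)
        if (cnt[i] > 0) {
            double p = (double)cnt[i] / total;
            h -= p * log(p) / log(2.0);
        }
    return h;
}

int main(void)
{
    int c, s, prev1 = -1, prev2 = -1;
    long n1 = 0, n2 = 0, n3 = 0;

    while ((c = getchar()) != EOF) {
        s = isalpha(c) ? tolower(c) - 'a' : 26;
        count1[s]++; n1++;
        if (prev1 >= 0) { count2[prev1 * A + s]++; n2++; }
        if (prev2 >= 0) { count3[(prev2 * A + prev1) * A + s]++; n3++; }
        prev2 = prev1;
        prev1 = s;
    }
    if (n1 == 0)
        return 0;
    printf("H(1)/1 = %.2f bits per character\n", entropy(count1, A, n1));
    if (n2 > 0)
        printf("H(2)/2 = %.2f bits per character\n",
               entropy(count2, A * A, n2) / 2.0);
    if (n3 > 0)
        printf("H(3)/3 = %.2f bits per character\n",
               entropy(count3, A * A * A, n3) / 3.0);
    return 0;
}
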
-- 
Norman Ramsey

ARPA: norman@lasspvax  -- or --  norman%lasspvax@cu-arpa.cs.cornell.edu
UUCP: {ihnp4,allegra,...}!cornell!lasspvax!norman
BITNET: (in desperation only) ZSYJARTJ at CORNELLA
US Mail: Dept Physics, Clark Hall, Cornell University, Ithaca, New York 14853
Telephone: (607)-256-3944 (work)    (607)-272-7750 (home)

        Never eat anything with a shelf life of more than ten years

macrakis@harvard.ARPA (Stavros Macrakis) (08/14/85)

I do not consider myself to have been correctly quoted in this
discussion.  I don't care to go into it.
	-s