[net.math] Theoretical limits on data compression

norman@lasspvax.UUCP (Norman Ramsey) (08/07/85)

I've been reading up on my information theory (in Shannon & Weaver, _A
Mathematical Theory of Communication_) and thought I would share some of it
with the net. Shannon defines a quantity you can call entropy or uncertainty
or information (which I prefer), which is measured in bits. The information
transmitted in a message of length N is

   - sum on i  of  p sub i  log2 p sub i

where the sum is over all possible messages of length N and p sub i is the
probability of observing the ith message. (The log is base 2.) If you
divide by N you get a quantity Shannon calls G sub N, which is an *estimate*
of the information carried per character. It's pretty easy to show that
G sub N is a monotonic (decreasing) function of N and that (for a nice,
well-behaved ergodic process) it converges to H, which is the actual
information per character present in the (statistical ensemble of) messages.
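
As a toy example (probabilities made up purely for illustration): take
messages of length 1 over a two-symbol alphabet with probabilities 3/4 and
1/4. Then

   - (3/4 log2 3/4  +  1/4 log2 1/4)  =  0.811 bits

so G sub 1 is 0.811 bits per character, against a maximum of 1 bit for two
equally likely symbols.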

There is a slightly better estimator Shannon calls F sub N:

   F sub N = N G sub N - (N-1) G sub N-1

Both of these give an upper bound on the information content.
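
For instance (numbers invented just to show the arithmetic), if counting
gave G sub 1 = 4.1 bits and G sub 2 = 3.6 bits, then

   F sub 2 = 2(3.6) - 1(4.1) = 3.1 bits

which is a tighter bound than the 3.6 bits from G sub 2 alone.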



As far as the information content of English goes, Shannon states that
according to several different measurements, the *redundancy* of English is
about 50%. If I understand his definition correctly, that means that the
information content of English is about half the maximum information per
character that could be sent *using the same alphabet*. Shannon's alphabet is
26 letters and the word space, or about four and three-quarters bits. So two
and a half bits per character for English seems more on the money than the
one bit I mentioned earlier.
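
Spelling out the arithmetic (taking redundancy to mean 1 - H/Hmax): Hmax is
log2 27, or about 4.75 bits per character, so a redundancy of 0.5 gives
H = 0.5 * 4.75, roughly 2.4 bits per character.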

It would be very interesting to write some programs to estimate the entropy
contained in source code, English, etcetera. It's pretty easy to count single
characters (G sub 1), not too hard to count digrams (G sub 2), but beyond
that things get complicated pretty fast. I would like to hear about some
schemes for trying to get decent estimates for, say, up to G sub 8. One
problem is that very rare sequences won't show up without counting a very
large number of cases (divergence is no problem, since p log p goes to zero
as p goes to zero). My guess is that the LZ compression scheme could be
adapted to give a pretty good estimate of the entropy. A rough sketch of the
counting part follows.
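
Here is a rough sketch in C of the single-character and digram counting
(G sub 1 and G sub 2, plus F sub 2). It treats every byte value as a symbol
(my own simplification: no case folding or whitespace collapsing) and does
nothing about the rare-sequence problem above.

   /* Count single characters and digrams on stdin; print the G1, G2,
      and F2 entropy estimates in bits per character. */
   #include <stdio.h>
   #include <math.h>

   static long single[256];
   static long pair[256][256];

   int main(void)
   {
       int c, prev = -1, i, j;
       long n = 0, npairs = 0;
       double g1 = 0.0, h2 = 0.0, p;

       while ((c = getchar()) != EOF) {
           single[c]++; n++;
           if (prev >= 0) { pair[prev][c]++; npairs++; }
           prev = c;
       }
       if (n == 0 || npairs == 0) return 1;

       for (i = 0; i < 256; i++)
           if (single[i] > 0) {
               p = (double) single[i] / n;
               g1 -= p * log(p) / log(2.0);        /* -p log2 p */
           }
       for (i = 0; i < 256; i++)
           for (j = 0; j < 256; j++)
               if (pair[i][j] > 0) {
                   p = (double) pair[i][j] / npairs;
                   h2 -= p * log(p) / log(2.0);
               }

       printf("G1 = %f bits/character\n", g1);
       printf("G2 = %f bits/character\n", h2 / 2.0);  /* digram entropy over 2 */
       printf("F2 = %f bits/character\n", h2 - g1);   /* F2 = 2 G2 - 1 G1 */
       return 0;
   }

An LZ-based estimate would be a different program entirely; the rough idea
would be that the compressed length in bits, divided by the number of input
characters, gives another upper bound on the entropy per character.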

Ideas, anyone?


-- 
Norman Ramsey
ARPA: norman@lasspvax  -- or --  norman%lasspvax@cu-arpa.cs.cornell.edu
UUCP: {ihnp4,allegra,...}!cornell!lasspvax!norman
BITNET: (in desperation only) ZSYJARTJ at CORNELLA
        Never eat anything with a shelf life of more than ten years