norman@lasspvax.UUCP (Norman Ramsey) (08/07/85)
I've been reading up my information theory (in Shannon & Weaver, _A Mathematical Theory of Communication_) and thought I would share things with the net.

Shannon defines a quantity you can call entropy or uncertainty or information (which I prefer), which is measured in bits. The information in a message of length N is

	- (sum over i of p_i log2 p_i)

where the sum is over all possible messages of length N and p_i is the probability of observing the ith message. (The log is base 2.) If you divide by N you get a quantity Shannon calls G_N, which is an *estimate* of the information carried per character. It's pretty easy to show that G_N is a monotonic (decreasing) function of N and that (for a nice well-behaved ergodic process) it converges to H, which is the actual information per character present in the (statistical ensemble of) messages. There is a slightly better estimator Shannon calls F_N:

	F_N = N G_N - (N-1) G_{N-1}

Both of these give an upper bound on the information content.

As far as the information content of English goes, Shannon states that according to several different measurements, the *redundancy* of English is about 50%. If I understand his definition correctly, that means that the information content of English is about half the maximum information per character that could be sent *using the same alphabet*. Shannon's alphabet is 26 letters plus word space, or about four and three-quarters bits per character (log2 27). So two and a half bits per character for English seems more on the money than the one bit I mentioned earlier.

It would be very interesting to write some programs to estimate the entropy contained in source code, English, etcetera. It's pretty easy to count single characters (G_1), not too hard to count digrams (G_2), but beyond that things get complicated pretty fast. I would like to hear about some schemes for trying to get decent estimates for, say, up to G_8.
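The counting experiment described above is easy to sketch in Python (not in the original post; the function name `g_n` and the sample text are my own). It slides a window of N characters over the text, turns the block counts into probabilities, and computes -(1/N) sum p log2 p, which is the G_N estimator defined earlier:

```python
from collections import Counter
from math import log2

def g_n(text, n):
    """Estimate Shannon's G_N: the entropy of N-character blocks, divided by N.

    Counts all overlapping N-character blocks in `text`, converts the
    counts to empirical probabilities, and returns -(1/N) * sum p log2 p.
    """
    blocks = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    return h / n

# A repeated pangram stands in for a long English sample.
sample = "the quick brown fox jumps over the lazy dog " * 50
print(g_n(sample, 1))   # G_1, bits per character
print(g_n(sample, 2))   # G_2, which should not exceed G_1
```

This makes the practical problem visible: the table for G_N has up to 27^N entries, so by G_8 almost every block that occurs is rare, and the empirical probabilities are badly undersampled unless the corpus is enormous.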
One problem is that very rare sequences won't show up except by counting a very large number of cases (divergence is no problem, since p log p goes to zero as p goes to zero). My guess is that the LZ compression scheme could be adapted to give a pretty good estimate of the entropy. Ideas, anyone?
-- 
Norman Ramsey
ARPA: norman@lasspvax -- or -- norman%lasspvax@cu-arpa.cs.cornell.edu
UUCP: {ihnp4,allegra,...}!cornell!lasspvax!norman
BITNET: (in desperation only) ZSYJARTJ at CORNELLA

Never eat anything with a shelf life of more than ten years
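The LZ idea above can be sketched crudely (my own illustration, not from the post) by using an off-the-shelf Lempel-Ziv-family compressor as the estimator: since an LZ compressor's output length approaches the source entropy on long inputs, compressed bits divided by input characters gives an upper bound on bits per character, loose by the compressor's overhead. Here Python's `zlib` (DEFLATE, an LZ77 variant) stands in for the LZ scheme:

```python
import zlib

def lz_entropy_estimate(text):
    """Crude upper bound on entropy in bits per character.

    Compresses the text with zlib (an LZ77-family compressor) and
    returns compressed bits divided by the number of input characters.
    The result over-estimates H by the compressor's header and
    dictionary overhead, which shrinks as the input grows.
    """
    data = text.encode("ascii")
    compressed = zlib.compress(data, 9)
    return 8 * len(compressed) / len(data)

# Highly redundant text should come out well under 1 bit per character.
print(lz_entropy_estimate("the quick brown fox jumps over the lazy dog " * 200))
```

Unlike the block-counting approach, this needs no 27^N table, which is why compression-based estimates scale to long-range structure that digram counts can't see.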