[net.misc] Perils of Nutrasweet: digits of precision

inc@fluke.UUCP (Pritchard Z. Malevolence) (08/19/85)

Now, Dear Friends, I don't want to get into the "moderator" business, and
this posting is certainly NOT going to be a collector's item, but isn't it
interesting that one's expectations of bizarreness here make nearly anything
one sees or says seem bizarre? It can be fun to read other newsgroups after
this one, because you take a different perspective along with you. For
example, here's what I saw on net.math tonight:


Article 1398 of 1404, Wed 07:49.
Subject: Re: The Perils of Nutrasweet: digits of precision
From: <name and org variables deleted for some odd reason>
Newsgroups: net.med, net.math
(17 lines) More? [ynq]

[I had to 'q' out of that one...this controversy has been raging for *weeks*]

Anyway, I did snatch some actual postings under the Subject: Data
Compression and Information Theory.  Now in my opinion, we *NEED* data
compression.  If this newsgroup isn't the most likely candidate ever for a
virtual memory, then hell, I'll eat my shorts.  Now it may seem bizarre to
you, but these guys are making *sensible* postings:


In article <names and organizations deleted for some odd reason> writes:
>I have two sentence beginnings for you to guess the next letter of:
>1) "A" and 2) "".  In comparison to simple Huffman encodings, I don't expect
>very many people to get it right within 5 guesses.

1) blank; "n"; "r"; "t"; "f"

2) "T"; "A"; "I"; "W"; "F"

These won't cover 90% of the sentences beginning as indicated, but I think
they will get a clear majority.  And, of course, these are among the hardest
cases.  Try, for example, to guess the next letter in these sentences:
1) "This is th" and 2) "Where are you g".

>Are guesses really desirable in a compression system anyway?  There is
>no one to say whether the guess was right or wrong ...

Sure there is; there is the actual document.  You have a "guessing"
algorithm, which always produces the same guesses for the same text.
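
(And the whole trick can be made mechanical, which is the point.  Here's a
little sketch of my own -- the fixed guess list below is my invention, a
stand-in for a real context-sensitive predictor -- showing that when
encoder and decoder share the same deterministic guesser, the guess ranks
alone rebuild the document exactly:)

    # Shared, fixed guess order (a real system would rank by context).
    GUESS_ORDER = " etaoinshrdlcumwfgypbvkjxqz"

    def guesses(context):
        # Deterministic: same context in, same guess sequence out.
        return GUESS_ORDER

    def encode(text):
        # For each character, record how many guesses it takes to hit it.
        return [guesses(text[:i]).index(ch) + 1
                for i, ch in enumerate(text)]

    def decode(ranks):
        # Replay the identical guesses; rank r means "the r-th guess".
        out = ""
        for r in ranks:
            out += guesses(out)[r - 1]
        return out

    original = "there is the actual document"
    assert decode(encode(original)) == original

(The ranks cluster near 1 for English-like text, and *that* skew is what a
Huffman or arithmetic coder then squeezes.)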

>An obvious argument against one-bit-per-character goes something like this :
>The average word has (say) five characters, which would imply that its
>information content can be represented with 5 bits.  This in turn would
>imply that there are around 2^5, or 32, valid words.  Rather limited.
>This is my interpretation of what one-bit-per-character means; if I have
>missed something, please correct me.

Yes, you have missed quite a bit.  First of all, you aren't justified in
using the average word size.  A better estimate would be that most words
are not more than about eight letters; at one bit per letter, that allows
on the order of 2^9-1, or 511, distinct words (summing 2^n over lengths
n = 0 through 8).  Again, this is not the number of valid words, but the
typical number of words needed to cover a majority of the possibilities
in a particular context.  This seems quite plausible to me.


(Me too.)

(And so it goes...these guys are forging the way to our future, Mates!  They
are the avant-garde -- right out there in front battling the slugs with
salt-pellet machine guns, gopping year-old whipped cream all over the enemy.
The data in this newsgroup should be *SEVERELY* compressed!  Here's some
more of the discussion on how to do it:)

>All this doesn't prove that one bit per letter is the appropriate estimate,
>but it is much more plausible than your analysis would suggest.

I have to admit my analysis was a bit tongue in cheek.  <deleted>'s estimate
is probably more reasonable, but I still wouldn't want to be restricted
to just 511 words.  Stavros mentioned a system operating on 2 bits per
character (heaps of words).  I find this plausible.

(So do I.)
(But he goes on:)

What <deleted> claims in that paper is that the *redundancy* of English, as
measured by a variety of methods (one of which was character guessing), is
roughly 50%.  This means that the amount of information carried in English
text is roughly 50% of the maximum possible with the same alphabet.
Shannon's alphabet was 26 letters plus word space, so a rough calculation
says about 2.4 bits per character in English.  If you use six-letter words I
think you'll find this gives you an adequate number of words (thirty
thousand or so).
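
(I couldn't resist checking the arithmetic, using nothing but the figures
he quotes -- 27 symbols, 50% redundancy, six-letter words:)

    from math import log2

    max_bits = log2(27)           # about 4.75 bits/char, zero redundancy
    real_bits = 0.5 * max_bits    # about 2.4 bits/char at 50% redundancy
    words = 2 ** (6 * real_bits)  # distinguishable six-letter words

    print(f"{real_bits:.2f} bits/char, {words:,.0f} words")
    # -> 2.38 bits/char, 19,683 words

(Call it twenty thousand -- the same ballpark as his thirty, and plenty
more than 511.)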

As for the number of preceding characters we are allowed to see before we
guess, properly it is infinite, since the information content is a
property of a statistical ensemble of strings, and we hope everything is
ergodic so that instead of an infinite number of strings we can think
about a single infinitely long string.  Of course this just means a string
whose length gets large, and I think in this context one can do very well
with eight or so characters.

(Well, *this* one certainly can!)

If you want to do some quantitative measurements, I think I posted
something about this earlier; you could actually look at substrings of
length n and see how rapidly the information per character converges as
n gets large.
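
(For the terminally curious, here's roughly what such a measurement might
look like -- my own rough cut in Python; H(n)/n is the per-character
entropy of length-n substrings, and the sample text below is a placeholder,
far too short and repetitive for honest statistics:)

    from collections import Counter
    from math import log2

    def bits_per_char(text, n):
        # Entropy of the n-gram distribution, divided by n.
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(grams.values())
        h = -sum(c / total * log2(c / total) for c in grams.values())
        return h / n

    text = "the quick brown fox jumps over the lazy dog " * 50
    for n in range(1, 9):
        print(n, round(bits_per_char(text, n), 3))

(Feed it a few hundred kilobytes of real prose and watch the numbers sag
toward the one-or-two bits under dispute.)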

(and he has the *cutest* signoff, to wit:)

        Never eat anything with a shelf life of more than ten years

(Good advice.  This now well-established principle was first introduced to
the network at large right here, by yours truly, NET-DOT-BIZARRO.  This is
not to say that others mayn't have had the same experiences with
foodstuffs, but they almost certainly didn't pay as much attention to those
things as the high-caliber types who read net.dot.brassiere.  I think we
should get a well-deserved cuff on the cheek for that, don't you agree?)



-- 

/s/ Pritchard Z. Malevolence,
    Assistant Dispatcher,
    Ace Trucking and Bugle