inc@fluke.UUCP (Pritchard Z. Malevolence) (08/19/85)
Now, Dear Friends, I don't want to get into the "moderator" business, and this posting is certainly NOT going to be a collector's item, but isn't it interesting that one's expectations of bizarreness here make nearly anything one sees or says seem bizarre?  It can be fun to read other newsgroups after this one, because you take a different perspective along with you.  For example, here's what I saw on net.math tonight:

    Article 1398 of 1404, Wed 07:49.
    Subject: Re: The Perils of Nutrasweet: digits of precision
    From: <name and org variables deleted for some odd reason>
    Newsgroups: net.med, net.math
    (17 lines) More? [ynq]

[I had to 'q' out of that one...this controversy has been raging for *weeks*]

Anyway, I did snatch onto some actual postings from the Subject: Data Compression and Information Theory.  Now in my opinion, we *NEED* data compression.  If this newsgroup isn't the most likely candidate ever for a virtual memory, then hell, I'll eat my shorts.  Now it may seem bizarre to you, but these guys are making \sensible\ postings:

In article <names and organizations deleted for some odd reason> writes:

>I have two sentence beginnings for you to guess the next letter of:
>1) "A" and 2) "".  In comparison to simple Huffman encodings, I don't
>expect very many people to get it right within 5 guesses.

1) blank; "n"; "r"; "t"; "f"
2) "T"; "A"; "I"; "W"; "F"

These won't cover 90% of the sentences beginning as indicated, but I think they will get a clear majority.  And, of course, these are among the hardest cases.  Try, for example, to guess the next letter in these sentences: 1) "This is th" and 2) "Where are you g".

>Are guesses really desirable in a compression system anyway?  There is
>no one to say whether the guess was right or wrong ...

Sure there is: there is the actual document.  You have a "guessing" algorithm which always produces the same guesses for the same text.

>An obvious argument against one-bit-per-character goes something like
>this: The average word has (say) five characters, which would imply that
>its information content can be represented with 5 bits.  This in turn
>would imply that there are around 2^5, or 32, valid words.  Rather
>limited.  This is my interpretation of what one-bit-per-character means;
>if I have missed something, please correct me.

Yes, you have missed quite a bit.  First of all, you aren't justified in using the average word size.  A better estimate would be that most words are not more than about eight letters, so the total number (summing 2^k over word lengths up to eight) is 2^9-1, or 511.  Again, this is not the number of valid words, but the typical number of words needed to cover a majority of the possibilities in a particular context.  This seems quite plausible to me.

(Me too.)

(And so it goes...these guys are forging the way to our future, Mates!  They are the avant garde -- right out there in front battling the slugs with salt-pellet machine guns, gopping year-old whipped cream all over the enemy.  The data in this newsgroup should be *SEVERELY* compressed!  Here's some more of the discussion on how to do it:)

>All this doesn't prove that one bit per letter is the appropriate
>estimate, but it is much more plausible than your analysis would suggest.

I have to admit my analysis was a bit tongue in cheek.  <deleted>'s estimate is probably more reasonable, but I still wouldn't want to be restricted to just 511 words.  Stavros mentioned a system operating on 2 bits per character (heaps of words).  I find this plausible.

(So do I.  And for you fellow travelers who like to see these things in the flesh, a couple of little sketches follow below.)
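(First, that deterministic-guesser argument above really does cash out in a few lines of code.  A minimal sketch only, in Python: the encoder and decoder share one guessing algorithm, so the encoder records just the RANK of the true next character and the decoder replays the guesses.  The frequency-order guessing rule and the 27-character alphabet are my own stand-ins -- the posters never specify theirs:)

    # Shared, deterministic guessing order: crude fixed letter frequency.
    RANKED = list(" etaoinshrdlcumwfgypbvkjxqz")   # space + 26 letters

    def guesses(context):
        # A real system would condition on `context`; this sketch doesn't.
        return RANKED

    def encode(text):
        ranks, context = [], ""
        for ch in text.lower():                 # assumes the 27-char alphabet
            ranks.append(guesses(context).index(ch))
            context += ch
        return ranks

    def decode(ranks):
        context = ""
        for r in ranks:
            context += guesses(context)[r]      # replay the same guesses
        return context

    ranks = encode("this is the")
    assert decode(ranks) == "this is the"
    print(ranks)   # [2, 8, 5, 7, 0, 5, 7, 0, 2, 8, 1]

(Note how the low ranks dominate on real English; that lopsided rank stream is what a Huffman code then squeezes.)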
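(Second, the word-count arithmetic in the exchange checks out, more or less.  The "up to eight letters" reading of that 2^9-1 figure is my own inference -- it's the sum of 2^k over word lengths zero through eight:)

    # Back-of-the-envelope check of the word counts quoted above.
    print(2**5)                          # 32: 1 bit/char, average 5-letter word
    print(sum(2**k for k in range(9)))   # 511 = 2^9 - 1: words up to 8 letters
    print(sum(4**k for k in range(9)))   # 87381: 2 bits/char -- heaps indeed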
(But back to net.math, where our man goes on:)

What <deleted> claims in that paper is that the *redundancy* of English, as measured by a variety of methods (one of which was character guessing), is roughly 50%.  This means that the amount of information carried in English text is roughly 50% of the maximum possible with the same alphabet.  Shannon's alphabet was 26 letters plus word space, so a rough calculation says about 2.4 bits per character in English.  If you use six-letter words I think you'll find this gives you an adequate number of words (thirty thousand or so).

As far as the number of preceding characters we are allowed to see before we guess, properly it's an infinite number, since the information content is a property of a statistical ensemble of strings, and we hope everything is ergodic, so that instead of an infinite number of strings we can think about an infinitely long string.  Of course this just means a string whose length gets large, and I think in this context one can do very well with eight or so characters.

(Well, *this* one certainly can!)

If you want to do some quantitative measurements, I think I posted something about this earlier; you could actually look at substrings of length n, and see how rapidly the information per character converges as n gets large.

(and he has the *cutest* signoff, to wit:)

    Never eat anything with a shelf life of more than ten years

(Good advice.  This now well-established principle was first introduced to the network at large right here, by yours truly, NET-DOT-BIZARRO.  This is not to say that others mayn't have had the same experiences with foodstuffs, but they almost certainly didn't pay as much attention to those things as the high-caliber types who read net.dot.brassiere.  I think we should get a well-deserved cuff on the cheek for that, don't you agree?)

--
/s/ Pritchard Z. Malevolence, Assistant Dispatcher, Ace Trucking and Bugle
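P.S.  For those of you following along with slide rules: that 2.4 figure is just half the capacity of Shannon's 27-symbol alphabet, and a few more lines of Python confirm the six-letter-word count lands in the right ballpark (my arithmetic, not his):

    import math

    max_bits = math.log2(27)            # 26 letters + word space: ~4.75 bits/char
    info_bits = 0.5 * max_bits          # ~50% redundancy leaves ~2.4 bits/char
    print(round(info_bits, 1))          # 2.4
    print(round(2 ** (6 * info_bits)))  # 19683 six-letter words --
                                        # "thirty thousand or so"-ish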
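P.P.S.  And if you want to run his convergence experiment yourself, here is one crude way: take the entropy of the length-n substrings and divide by n, watching the number fall as n grows.  The estimator and the file name are mine, not his, and you need a healthy pile of text before the n=8 figure means anything:

    import math
    from collections import Counter

    def bits_per_char(text, n):
        # Entropy of length-n substrings of `text`, per character.
        grams = Counter(text[i:i+n] for i in range(len(text) - n + 1))
        total = sum(grams.values())
        return -sum(c/total * math.log2(c/total) for c in grams.values()) / n

    sample = open("some_english_text.txt").read().lower()   # any big text file
    for n in (1, 2, 4, 8):
        print(n, round(bits_per_char(sample, n), 2))   # drifts down toward ~2.4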