norman@lasspvax.UUCP (Norman Ramsey) (07/31/85)
I have been thinking for some time about data compression, both for archiving and for transmission over a serial link. Forgive me if I post this both in net.math and net.micro; I'm interested both in information-theoretic aspects and in software.

I have read somewhere that the information content of English text (such as one might read in the New York Times) is roughly one bit per character. I have also worked out (it is fairly simple) that a single-character Huffman encoding of English text requires about five bits per character. One assumes that larger savings could be realized by using digraph or trigraph Huffman encoding, but such schemes rapidly become unwieldy for a computer implementation (unless I am missing something). So what can we do to get closer to that 1-bit-per-character theoretical limit?

On a more computational note: wouldn't it be nice if we could compress text to be sent over a serial line? If I could write a TTY driver that would encode everything sent by a host machine to my home computer, I could read the news many times faster than I can now (which, at 300 baud, makes quite a difference).

What can people tell me about data compression or encoding schemes other than Huffman encoding? What are the possibilities for implementing such a scheme for mainframe->micro communication and speeding up throughput by a factor of four?

Norman
...allegra!cornell!lasspvax!norman

P.S. I am new to the net, so if I've goofed send me some mail and let me know.
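The five-bits-per-character figure is easy to check for yourself. Here is a small sketch (my own, under my own assumptions; the exact average depends on the sample text you feed it) that builds a single-character Huffman code from a sample string and reports the average code length:

```python
# Sketch: single-character Huffman code lengths from observed counts.
import heapq
from collections import Counter

def huffman_lengths(text):
    """Return {char: code_length_in_bits} for a Huffman code built from text."""
    counts = Counter(text)
    # Heap entries: (weight, tiebreak, {char: depth_so_far}).
    heap = [(n, i, {c: 0}) for i, (c, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {c: depth + 1 for c, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

sample = "the quick brown fox jumps over the lazy dog " * 10
lengths = huffman_lengths(sample)
counts = Counter(sample)
avg = sum(counts[c] * lengths[c] for c in counts) / len(sample)
print(f"average code length: {avg:.2f} bits/char")
```

On realistic English text (26 letters plus space, with real frequencies) the average comes out in the four-to-five-bit range Norman describes.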
joel@peora.UUCP (Joel Upchurch) (07/31/85)
>What can people tell me about data compression or encoding schemes other
>than Huffman encoding?

One way to do this is to build a dictionary of 50,000 or so words; then you can compress the text by representing each word as a half-word which stands for the nth word in the dictionary. You can also achieve more compression by putting frequently used phrases in the dictionary too, but that might increase compression time. If you assume that the average word has 6 characters, that gives you between 2 and 3 bits per character. I think I read about this idea a long time ago in one of Wayne Green's editorials in Kilobaud.

If you have a machine with enough memory to keep the dictionary in memory, then compression and decompression of the text should be very rapid. I read an announcement from, I think, Sperry, for a machine that could do extremely rapid text retrievals and key-word searches. From the description I think they may be using a scheme similar to this. As you can see, key-word searches of the compressed text would be extremely rapid with this scheme.

This approach wouldn't be very useful for sending data over serial lines unless both ends had the same dictionary.
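A toy version of Joel's scheme fits in a few lines (a sketch of mine, not anything Sperry published; the eight-word dictionary stands in for the 50,000-word one). Each word becomes a 16-bit index, so a 6-letter word plus its trailing space (7 characters) shrinks to 2 bytes, about 2.3 bits per character:

```python
# Sketch: word-dictionary compression, one 16-bit index per word.
# Both ends must be built from the same word list, as the post notes.
import struct

def make_dictionary(words):
    """Return (word -> index) and (index -> word) tables."""
    return {w: i for i, w in enumerate(words)}, list(words)

def compress(text, word_to_ix):
    # Assumes every word of `text` appears in the dictionary.
    return b"".join(struct.pack(">H", word_to_ix[w]) for w in text.split())

def decompress(data, ix_to_word):
    indices = struct.unpack(">%dH" % (len(data) // 2), data)
    return " ".join(ix_to_word[i] for i in indices)

words = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
w2i, i2w = make_dictionary(words)
blob = compress("the quick brown fox jumps over the lazy dog", w2i)
print(len(blob), "bytes")  # 9 words -> 18 bytes
```

A real system needs an escape mechanism for words not in the dictionary, which is where the scheme's neatness starts to erode.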
dmt@mtgzz.UUCP (d.m.tutelman) (08/02/85)
> I have read somewhere that the information content of English text (such
> as one might read in the New York Times) is roughly one bit per character.
> I have also worked out (it is fairly simple) that a single-character
> Huffman encoding of English text requires about five bits per character.
> One assumes that larger savings could be realized by using digraph or
> trigraph Huffman encoding, but such schemes rapidly become unwieldy for
> a computer implementation (unless I am missing something).

The one-bit-per-character assertion comes from an old classic paper. (Don't have a reference handy, but I believe it's by Claude Shannon himself, published in BSTJ in the 1950s or even '40s.) What he did was have human beings "guess" the next letter in meaningful (i.e., not nonsense) English sentences. "Space" was included as a character. He then computed the probability distribution of the number of guesses required to correctly guess the next letter. The entropy of that distribution was a smidgen over one bit (i.e., people usually guessed right on the first try).

How do you use this result in a "real" encoder? First, you build a "guesser" that knows as much about the language and human experience as the humans in Shannon's experiment. (Obviously that's the HARD PART.) Feed it the string and count the number of guesses it takes. Huffman-code this number, based on the performance distribution of the "guesser". If the guesser is as good as Shannon's subjects were, it'll be about one bit per letter. The decoder is the inverse operation, using the identical algorithm in the "guesser". (If there is ANY difference in the number of guesses, the reconstruction will quickly go out of sync.) Implementation is still a far-off AI research project.

Dave Tutelman
Physical - AT&T Information Systems, Holmdel, NJ 07733
Logical  - ...ihnp4!mtuxo!mtgzz!dmt
Audible  - (201)-834-2895
bjpt@mulga.OZ (Benjamin Thompson) (08/06/85)
In article <1010@mtgzz.UUCP> dmt@mtgzz.UUCP (d.m.tutelman) writes:
>The one-bit-per-character assertion comes from an old classic paper.
>(Don't have a reference handy, but I believe it's by Claude Shannon
>himself, published in BSTJ in the 1950s or even '40s.)
>What he did was have human beings "guess" the next letter in
>meaningful (i.e., not nonsense) English sentences. "Space" was included
>as a character.

But how much of each sentence was made available before guesses were made? I have two sentence beginnings for you to guess the next letter of: 1) "A" and 2) "". In comparison with simple Huffman encodings, I don't expect very many people to get it right within 5 guesses.

Are guesses really desirable in a compression system anyway? There's no-one to say whether the guess was right or wrong ...

An obvious argument against one-bit-per-character goes something like this: the average word has (say) five characters, which would imply that its information content can be represented with 5 bits. This in turn would imply that there are around 2^5, or 32, valid words. Rather limited. This is my interpretation of what one-bit-per-character means; if I have missed something, please correct me.

Ben Thompson
seismo!munnari!bjpt
franka@mmintl.UUCP (Frank Adams) (08/07/85)
In article <854@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>I have two sentence beginnings for you to guess the next letter of:
>1) "A" and 2) "". In comparison to simple Huffman encodings, I don't expect
>very many people to get it right within 5 guesses.

1) blank; "n"; "r"; "t"; "f"
2) "T"; "A"; "I"; "W"; "F"

These won't cover 90% of the sentences beginning as indicated, but I think they will get a clear majority. And, of course, these are among the hardest cases. Try, for example, to guess the next letter in these sentences: 1) "This is th" and 2) "Where are you g".

>Are guesses really desirable in a compression system anyway? There's
>no-one to say whether the guess was right or wrong ...

Sure there is; there is the actual document. You have a "guessing" algorithm, which always produces the same guesses for the same text.

>An obvious argument against one-bit-per-character goes something like this:
>the average word has (say) five characters, which would imply that its
>information content can be represented with 5 bits. This in turn would
>imply that there are around 2^5, or 32, valid words. Rather limited.
>This is my interpretation of what one-bit-per-character means; if I have
>missed something, please correct me.

Yes, you have missed quite a bit. First of all, you aren't justified in using the average word size. A better estimate would be that most words are not more than about eight letters, so the total number is 2^9-1, or 511. Again, this is not the number of valid words, but the typical number of words needed to cover a majority of the possibilities in a particular context. This seems quite plausible to me.

All this doesn't prove that one bit per letter is the appropriate estimate, but it is much more plausible than your analysis would suggest.
bjpt@mulga.OZ (Benjamin Thompson) (08/09/85)
In article <570@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>>Are guesses really desirable in a compression system anyway? There's
>>no-one to say whether the guess was right or wrong ...
>
>Sure there is; there is the actual document. You have a "guessing"
>algorithm, which always produces the same guesses for the same text.

The idea of compression is usually to save space (although it doesn't always work that way - my supervisor was testing out a coding scheme and managed to achieve a 600% increase! This occurred because she was outputting bits as characters rather than as bits ...). As such, keeping the original document around is a) against the rules and b) not always possible (e.g. remote file transfer). Guessing is not generally desirable in re-construction.

However, after talking with Stavros Macrakis (seismo!harvard!macrakis) about this, my view of what Shannon was doing has changed a bit. It is effectively finding the probabilities of certain characters in certain contexts (I claim these are basically the probabilities needed in Huffman-like codings, although I don't think Stavros believes me). What his subjects were doing is clearly not feasible for a work-every-time compression system.

>>An obvious argument against one-bit-per-character goes something like this:
>>the average word has (say) five characters, which would imply that its
>>information content can be represented with 5 bits. This in turn would
>>imply that there are around 2^5, or 32, valid words. Rather limited.
>>This is my interpretation of what one-bit-per-character means; if I have
>>missed something, please correct me.
>
>Yes, you have missed quite a bit. First of all, you aren't justified in
>using the average word size. A better estimate would be that most words
>are not more than about eight letters, so the total number is 2^9-1, or 511.
>Again, this is not the number of valid words, but the typical number of
>words to cover a majority of the possibilities in a particular context.
>This seems quite plausible to me.
>
>All this doesn't prove that one bit per letter is the appropriate estimate,
>but it is much more plausible than your analysis would suggest.

I have to admit my analysis was a bit tongue in cheek. Frank's estimate is probably more reasonable, but I still wouldn't want to be restricted to just 511 words. Stavros mentioned a system operating on 2 bits per character (heaps of words). I find this plausible.
franka@mmintl.UUCP (Frank Adams) (08/13/85)
In article <858@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>In article <570@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>>>Are guesses really desirable in a compression system anyway? There's
>>>no-one to say whether the guess was right or wrong ...
>>
>>Sure there is; there is the actual document. You have a "guessing"
>>algorithm, which always produces the same guesses for the same text.
>
>The idea of compression is usually to save space [...]
>As such, keeping the original document
>around is a) against the rules and b) not always possible (e.g. remote file
>transfer). Guessing is not generally desirable in re-construction.

Let me clarify what a data compression algorithm based on "guessing" might look like. The basic encoding loop looks like:

    for each character in source
        while (character != guess())
            output '1' bit;
        output '0' bit;

The guess function looks only at the document up to but not including the current character, and returns the most likely next character the first time it is called, then the next most likely, and so on. The expansion algorithm is as follows:

    until end of file
        while (bit '1' read)
            guess();
        output guess();

Note that *any* compression algorithm will sometimes make the result longer than the input.

>However, after talking with Stavros Macrakis (seismo!harvard!macrakis) about
>this, my view of what Shannon was doing has changed a bit. It is effectively
>finding probabilities of certain characters in certain contexts (I claim
>this is basically the probabilities needed in Huffman-like codings, although
>I don't think Stavros believes me). What his subjects were doing is clearly
>not feasible for a work-every-time compression system.

Yes, this is basically the probabilities needed in Huffman-like codings. My sample algorithm assumes most-likely = 1/2, next-most-likely = 1/4, etc. (As such, it averages two bits per character. A better result would require encoding multiple following characters with a single bit in some instances.)

What his subjects were doing is not feasible for a compression system; the point is that it suggests that an optimal algorithm might be able to achieve a one-bit-per-character average. How to guess what comes next is the next question, to which there is as yet no answer.

>but I still wouldn't want to be restricted
>to just 511 words.

You are still missing a key point. You aren't restricted to ~500 words. It's just that, in any context (preceding portion of the document), there is a list of about 500 words such that, most of the time, your next word will be on the list. You *can* use any word (or fhdias) that you want.
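Frank's two loops can be made concrete. The sketch below is mine, not his: the "guesser" is a deliberately naive placeholder that proposes printable-ASCII characters in order of their frequency in the text seen so far (ties broken by code point). Because encoder and decoder run the identical guesser on the identical prefix, the unary bit counts reconstruct the text exactly, as the post requires:

```python
# Runnable sketch of the guess-count codec: one '1' bit per wrong
# guess, then a '0' bit, for each character of the source.
from collections import Counter

ALPHABET = [chr(c) for c in range(32, 127)]  # printable ASCII

def guess_order(prefix):
    """Candidate characters, most plausible first, given the text so far."""
    counts = Counter(prefix)
    return sorted(ALPHABET, key=lambda c: (-counts[c], c))

def encode(text):
    bits = []
    for i, ch in enumerate(text):
        # The guesser sees only the document up to, not including, position i.
        for candidate in guess_order(text[:i]):
            if candidate == ch:
                bits.append("0")
                break
            bits.append("1")
    return "".join(bits)

def decode(bits):
    out = []
    it = iter(bits)
    while True:
        order = iter(guess_order("".join(out)))
        try:
            b = next(it)
        except StopIteration:
            return "".join(out)  # clean end of stream
        while b == "1":
            next(order)          # skip one rejected guess
            b = next(it)
        out.append(next(order))  # '0' accepts the current guess

msg = "this is the message"
coded = encode(msg)
assert decode(coded) == msg
print(len(coded), "bits, versus", len(msg) * 8, "raw bits")
```

With this weak guesser the output can easily be longer than the input on short or unusual text, which illustrates Frank's closing remark; a guesser as good as Shannon's subjects would average near one bit per character.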
norman@lasspvax.UUCP (Norman Ramsey) (08/13/85)
In article <854@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>In article <1010@mtgzz.UUCP> dmt@mtgzz.UUCP (d.m.tutelman) writes:
>>The one-bit-per-character assertion comes from an old classic paper.
>>(Don't have a reference handy, but I believe it's by Claude Shannon
>>himself, published in BSTJ in the 1950s or even '40s.)

What Shannon claims in that paper is that the *redundancy* of English, as measured by a variety of methods (one of which was character guessing), is roughly 50%. This means that the amount of information carried in English text is roughly 50% of the maximum possible with the same alphabet. Shannon's alphabet was 26 letters plus word space, so a rough calculation says about 2.4 bits per character in English. If you use six-letter words I think you'll find this gives you an adequate number of words (thirty thousand or so).

As for the number of preceding characters we are allowed to see before we guess, properly it's an infinite number, since the information content is a property of a statistical ensemble of strings, and we hope everything is ergodic so that instead of an infinite number of strings we can think about an infinite-length string. Of course this just means a string whose length gets large, and I think in this context one can do very well with eight or so characters. If you want to do some quantitative measurements (I think I posted something about this earlier), you could actually look at substrings of length n and see how rapidly the information per character converges as n gets large.
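The rough calculation above is short enough to write out (the six-letter-word step is my reading of Norman's arithmetic, not something he spelled out):

```python
# 27-symbol alphabet (26 letters + space): at most log2(27) bits/char.
# 50% redundancy leaves about half of that as actual information.
import math

max_bits = math.log2(27)        # ~4.75 bits/char maximum
info_bits = 0.5 * max_bits      # ~2.38 bits/char of information
word_bits = 6 * info_bits       # a six-letter word: ~14.3 bits
vocab = 2 ** word_bits          # ~20,000-30,000 equally likely words
print(f"{max_bits:.2f} max, {info_bits:.2f} info bits/char, "
      f"vocabulary ~{vocab:,.0f} words")
```

That vocabulary size is what makes the 2.4-bits figure feel "adequate", in contrast to the 32-word corner the 1-bit estimate seemed to paint.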
--
Norman Ramsey
ARPA:   norman@lasspvax -- or -- norman%lasspvax@cu-arpa.cs.cornell.edu
UUCP:   {ihnp4,allegra,...}!cornell!lasspvax!norman
BITNET: (in desperation only) ZSYJARTJ at CORNELLA
US Mail: Dept Physics, Clark Hall, Cornell University, Ithaca, New York 14853
Telephone: (607)-256-3944 (work)  (607)-272-7750 (home)

Never eat anything with a shelf life of more than ten years
macrakis@harvard.ARPA (Stavros Macrakis) (08/14/85)
I do not consider myself to have been correctly quoted in this discussion. I don't care to go into it. -s