norman@lasspvax.UUCP (Norman Ramsey) (07/31/85)
I have been thinking for some time about data compression, both for archiving and for transmission over a serial link. Forgive me if I post this in both net.math and net.micro, but I'm interested in both the information-theoretic aspects and the software.

I have read somewhere that the information content of English text (such as one might read in the New York Times) is roughly one bit per character. I have also worked out (it is fairly simple) that a single-character Huffman encoding of English text requires about five bits per character. One assumes that larger savings could be realized by using digraph or trigraph Huffman encoding, but such schemes rapidly become unwieldy in a computer implementation (unless I am missing something). So what can we do to get closer to that 1-bit-per-character theoretical limit?

On a more computational note: wouldn't it be nice if we could compress text to be sent over a serial line? If I could write a TTY driver that would encode everything sent by a host machine to my home computer, I could read the news many times faster than I can now (which, at 300 baud, makes quite a difference).

What can people tell me about data compression or encoding schemes other than Huffman encoding? What are the possibilities for implementing such a scheme for mainframe -> micro communication and speeding up throughput by a factor of four?

Norman
...allegra!cornell!lasspvax!norman

P.S. I am new to the net, so if I've goofed send me some mail and let me know.
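[To make the five-bits-per-character figure concrete, here is a small sketch that builds a single-character Huffman code from an approximate English letter-frequency table and reports its average code length. The frequency values are illustrative assumptions, not from the post; with the space character included, the average tends to land closer to four bits, and the exact figure depends on the table used.]

```python
import heapq

# Approximate English letter frequencies (percent), space included.
# These numbers are illustrative assumptions, not measured from any corpus here.
freqs = {
    ' ': 18.0, 'e': 10.2, 't': 7.5, 'a': 6.5, 'o': 6.2, 'i': 5.7,
    'n': 5.7, 's': 5.3, 'h': 5.0, 'r': 4.8, 'd': 3.5, 'l': 3.3,
    'u': 2.3, 'c': 2.2, 'm': 2.0, 'w': 1.9, 'f': 1.8, 'g': 1.6,
    'y': 1.6, 'p': 1.5, 'b': 1.3, 'v': 0.8, 'k': 0.6, 'x': 0.15,
    'j': 0.12, 'q': 0.10, 'z': 0.07,
}

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    # Heap entries are (weight, tiebreak, {symbol: depth so far});
    # the integer tiebreak keeps tuple comparison from reaching the dict.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # merge the two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

lengths = huffman_code_lengths(freqs)
total = sum(freqs.values())
avg = sum(freqs[s] * lengths[s] for s in freqs) / total
print(f"average Huffman code length: {avg:.2f} bits/char")
```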
joel@peora.UUCP (Joel Upchurch) (07/31/85)
> What can people tell me about data compression or encoding schemes other
> than Huffman encoding?

One way to do this is to build a dictionary of 50,000 or so words; then you can compress the text by representing each word as a half-word that stands for the nth word in the dictionary. You can also achieve more compression by putting frequently used phrases in the dictionary too, but that might increase compression time. If you assume that the average word has 6 characters, that gives you between 2 and 3 bits per character. I think I read about this idea a long time ago in one of Wayne Green's editorials in Kilobaud.

If you have a machine with enough memory to keep the dictionary in memory, then compression and decompression of the text should be very rapid. I read an announcement from, I think, Sperry, for a machine that could do extremely rapid text retrievals and key-word searches. From the description I think they may be using a scheme similar to this. As you can see, key-word searches of the compressed text would be extremely rapid with this scheme.

This approach wouldn't be very useful for sending data over serial lines unless both ends had the same dictionary.
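[A minimal sketch of the scheme described above: both ends share the same word list, and each word in the text is replaced by a 16-bit half-word index. The tiny eight-word dictionary here is hypothetical; a real one would hold around 50,000 words, which still fits in a 16-bit index. At 2 bytes per ~6-character word this works out to between 2 and 3 bits per character, as the post says.]

```python
import struct

# Hypothetical shared dictionary; a real one would have ~50,000 entries.
dictionary = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
index = {w: i for i, w in enumerate(dictionary)}

def compress(text):
    """Pack each word as a big-endian unsigned 16-bit index (a 'half-word')."""
    return b"".join(struct.pack(">H", index[w]) for w in text.split())

def decompress(data):
    """Unpack the indices and look the words back up in the shared dictionary."""
    ids = struct.unpack(f">{len(data) // 2}H", data)
    return " ".join(dictionary[i] for i in ids)

msg = "the quick brown fox jumps over the lazy dog"
packed = compress(msg)
print(len(packed), "bytes compressed vs", len(msg), "bytes of text")
```

Note that a key-word search can run directly on the packed indices (compare 16-bit values instead of strings), which is presumably why the retrieval machine mentioned above would be fast.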
dmt@mtgzz.UUCP (d.m.tutelman) (08/02/85)
> I have read somewhere that the information content of English text (such
> as one might read in the New York Times) is roughly one bit per character.
> I have also worked out (it is fairly simple) that a single-character
> Huffman encoding of English text requires about five bits per character.
> One assumes that larger savings could be realized by using digraph or
> trigraph Huffman encoding, but such schemes rapidly become unwieldy for
> a computer implementation (unless I am missing something).

The one-bit-per-character assertion comes from an old classic paper. (Don't have a reference handy, but I believe it's by Claude Shannon himself, published in BSTJ in the 1950s or even '40s.) What he did was have human beings "guess" the next letter in meaningful (i.e., not nonsense) English sentences. "Space" was included as a character. He then computed the probability distribution of the number of guesses required to correctly guess the next letter. The entropy of that distribution was a smidgen over one bit (i.e., people usually guessed right on the first try).

How do you use this result in a "real" encoder? First, you build a "guesser" that knows as much about the language and human experience as the humans in Shannon's experiment. (Obviously that's the HARD PART.) Feed it the string and count the number of guesses it takes. Huffman-code this number, based on the performance distribution of the "guesser". If the guesser is as good as Shannon's subjects were, it'll be about one bit per letter.

The decoder is the inverse operation, using the identical algorithm in the "guesser". (If there is ANY difference in the number of guesses, the reconstruction will quickly go out of sync.) Implementation is still a far-off AI research project.

Dave Tutelman
Physical - AT&T Information Systems, Holmdel, NJ 07733
Logical  - ...ihnp4!mtuxo!mtgzz!dmt
Audible  - (201) 834-2895
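[The mechanics of the guess-count scheme can be sketched in a few lines. The "guesser" below is deliberately trivial: it always guesses letters in one fixed frequency order (an assumption, roughly "etaoin shrdlu" plus space) and ignores context entirely, so it is nowhere near Shannon's human subjects. The point of the sketch is only the encode/decode symmetry: because the decoder reruns the identical guesser, the rank stream alone reconstructs the text, and a smarter guesser would skew the ranks toward zero, ready for Huffman coding.]

```python
# Fixed guessing order: an assumed frequency ranking, space first.
GUESS_ORDER = " etaoinshrdlucmwfgypbvkxjqz"

def guesser(context):
    """Return letters in the order the guesser would try them next.
    A real guesser would condition on the context seen so far;
    this toy one ignores it completely."""
    return GUESS_ORDER

def encode(text):
    """Replace each letter by how many guesses the guesser needs (0-based rank)."""
    return [guesser(text[:i]).index(ch) for i, ch in enumerate(text)]

def decode(ranks):
    """Rerun the identical guesser; any divergence would desynchronize it."""
    out = ""
    for r in ranks:
        out += guesser(out)[r]
    return out

ranks = encode("the cat sat")
print(ranks)                      # small numbers dominate, even with this dumb guesser
assert decode(ranks) == "the cat sat"
```

Huffman-coding the rank stream (as in the steps above) would then spend few bits on the frequent small ranks; with a guesser as good as Shannon's subjects, that lands near one bit per letter.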