uzun@pnet01.cts.com (Roger Uzun) (10/02/89)
I am working on a Huffman encoding program and, since the first 256 bit patterns used by Huffman for the most common bytes in a file are constant, I was wondering if anyone had them and could post them for me. I do not want to create a tree for these values, since they are constant, and I imagine someone has already done this.

Thanks much,
-Roger

UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com
uzun@pnet01.cts.com (Roger Uzun) (10/03/89)
Of course I realize that the Huffman tree for a file that has 256 unique elements is not optimal for files that have fewer than that, but I am trying to simplify things a bit. So I should have asked for the Huffman bit patterns and bit counts for the constant encoding tree that would be used on files that have 256 unique elements.

-Roger Uzun

UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com
morgan@camtwh.UUCP (Morgan W. Jones) (10/03/89)
In article <467@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>Of course I realize that the Huffman tree for a file that has
>256 unique elements is not optimal for files that have fewer than that,
>but I am trying to simplify things a bit.
>So I should have asked for the Huffman bit patterns and bit counts
>for the constant encoding tree that would be used on files that
>have 256 unique elements.

You realize, of course, that a Huffman tree generated for a binary file would be rather useless for a text file (or at least you'd get very little compression). If you wanted to do a fixed compression in one pass without determining the type of the file, you might want to initialize the character counts to all ones (rather than all zeros), count the number of occurrences in the first two blocks (1024 bytes) of the file, build your tree and compression table (which is very easy if you know anything at all about trees), and then compress the file.
--
Morgan W. Jones (morgan@camtwh)
Orama Incorporated, Toronto, Canada. (416) 369-5088
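A minimal C sketch of the setup Morgan suggests. The file handling here is an assumption for illustration; only the all-ones initialization and the 1024-byte counting pass come from his post:

#include <stdio.h>

int main(int argc, char **argv)
{
    unsigned long count[256];
    unsigned char buf[1024];
    size_t n, i;
    FILE *fp;

    if (argc < 2 || (fp = fopen(argv[1], "rb")) == NULL)
        return 1;
    for (i = 0; i < 256; i++)
        count[i] = 1;                   /* all ones, not all zeros */
    n = fread(buf, 1, sizeof buf, fp);  /* first two blocks = 1024 bytes */
    for (i = 0; i < n; i++)
        count[buf[i]]++;
    /* ... build the Huffman tree and code table from count[], then
       rewind and compress the whole file with that fixed table ... */
    fclose(fp);
    return 0;
}

Seeding every count with 1 means no byte value ever has probability zero, so the tree built from the first two blocks can still encode any byte that turns up later in the file.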
ccplumb@rose.waterloo.edu (Colin Plumb) (10/04/89)
In article <467@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>So I should have asked for the Huffman bit patterns and bit counts
>for the constant encoding tree that would be used on files that
>have 256 unique elements.

Er... one of us is confused. Huffman encoding relies on knowing the static distribution of however many values you're trying to represent. Without these probabilities, you can't create a Huffman encoding.

Supposing you do have a set of probabilities, p_0, p_1, p_2, ..., p_n: to get an optimal Huffman tree, sort them, take the two least likely elements of the list, and create a common parent for them. The probability of this parent is the sum of the probabilities of its children, and the encodings of its children are the encoding of the parent followed by "0" and "1". Remove the two child nodes from the set of probabilities and add the parent, preserving the sorting. Repeat, removing the two least likely elements and replacing them with a node whose probability is the sum, until you have only one node left. One choice carries 0 bits of information, and thus the encoding of this final node is 0 bits long. If all the probabilities are equal, you'll find this produces equal-length encodings.

Suppose we have 4 symbols, {A, B, C, D}, with probabilities {.4, .3, .2, .1}. Take the two least likely symbols, C and D, and make a parent for them (call it CD) with probability .1+.2=.3. The encoding of C is CD0 and the encoding of D is CD1. The list is now {A, B, CD} (or it could be {A, CD, B}) with probabilities {.4, .3, .3}. Again merge the least likely two elements, producing node BCD with probability .6 and making the encoding of B = BCD0 and CD = BCD1. The list is now {BCD, A} with probabilities {.6, .4}, and the final merge gives a list {ABCD} with probability {1}. BCD = ABCD0 and A = ABCD1. The encodings work out to be:

A = 1
B = 00
C = 010
D = 011

Of course, the choice of which child got "0" and which got "1" was arbitrary, so we could make it

A = 0
B = 10
C = 110
D = 111

if we like. The average length of a symbol is 1*.4 + 2*.3 + 3*.2 + 3*.1 = .4+.6+.9 = 1.9 bits, a slight improvement over the 2 bits it would take using a fixed-length code.

If we give the scheme more symbols to play with, we get better results. Say we group the symbols into pairs, giving 16 symbols:

AA = 0.16   AB = 0.12   AC = 0.08   AD = 0.04
BA = 0.12   BB = 0.09   BC = 0.06   BD = 0.03
CA = 0.08   CB = 0.06   CC = 0.04   CD = 0.02
DA = 0.04   DB = 0.03   DC = 0.02   DD = 0.01

This produces the following encodings (the exact codes depend on the order in which you list equal elements, but they're all equally efficient):

AA = 001    * .16 = .48
AB = 100    * .12 = .36
AC = 0001   * .08 = .32
AD = 1101   * .04 = .16
BA = 101    * .12 = .36
BB = 111    * .09 = .27
BC = 0110   * .06 = .24
BD = 01011  * .03 = .15
CA = 0100   * .08 = .32
CB = 0111   * .06 = .24
CC = 00001  * .04 = .20
CD = 11001  * .02 = .10
DA = 00000  * .04 = .20
DB = 11000  * .03 = .15
DC = 010100 * .02 = .12
DD = 010101 * .01 = .06
     Total    1.00   3.73

This figure of 3.73/2 = 1.865 bits per symbol is better. Note that I picked a difficult-to-compress probability distribution; the Shannon lower bound on the number of bits per symbol, assuming all symbols are independent, is 1.846439344... bits per symbol. Note also that English text has much more skewed probabilities, and also has strong inter-symbol dependencies. Perfect compression of English text would use between 1.5 and 2 bits per character.

Anyway, now do you know what you want? I'm still not quite sure, but I have tried to answer the common questions.
--
	-Colin
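The merge procedure Colin describes is mechanical enough to sketch in C. This is an illustration, not code from the post: it builds the tree for the {A, B, C, D} example and prints one optimal code set. The 0/1 assignments may differ from the listing above (the choice is arbitrary, as he notes), but the code lengths come out the same:

#include <stdio.h>
#include <string.h>

#define NSYM 4

struct node {
    double prob;
    int left, right;            /* child node indices, -1 for a leaf */
};

static struct node nodes[2 * NSYM - 1];
static char codes[NSYM][NSYM + 1];

/* Walk the finished tree, appending '0' for left and '1' for right. */
static void assign(int n, char *prefix)
{
    size_t len = strlen(prefix);

    if (nodes[n].left < 0) {            /* leaf: record its code */
        strcpy(codes[n], prefix);
        return;
    }
    prefix[len] = '0'; prefix[len + 1] = '\0';
    assign(nodes[n].left, prefix);
    prefix[len] = '1';
    assign(nodes[n].right, prefix);
    prefix[len] = '\0';
}

int main(void)
{
    double p[NSYM] = { 0.4, 0.3, 0.2, 0.1 };
    int live[NSYM];                     /* roots of unmerged subtrees */
    int nlive = NSYM, next = NSYM, i;
    char prefix[NSYM + 1] = "";

    for (i = 0; i < NSYM; i++) {
        nodes[i].prob = p[i];
        nodes[i].left = nodes[i].right = -1;
        live[i] = i;
    }
    while (nlive > 1) {                 /* merge the two least likely */
        int lo1 = 0, lo2 = 1, a, b;
        if (nodes[live[1]].prob < nodes[live[0]].prob) { lo1 = 1; lo2 = 0; }
        for (i = 2; i < nlive; i++) {
            if (nodes[live[i]].prob < nodes[live[lo1]].prob) {
                lo2 = lo1; lo1 = i;
            } else if (nodes[live[i]].prob < nodes[live[lo2]].prob) {
                lo2 = i;
            }
        }
        nodes[next].prob  = nodes[live[lo1]].prob + nodes[live[lo2]].prob;
        nodes[next].left  = live[lo1];  /* which child gets 0 is arbitrary */
        nodes[next].right = live[lo2];
        a = lo1 < lo2 ? lo1 : lo2;
        b = lo1 < lo2 ? lo2 : lo1;
        live[a] = next++;               /* the new parent replaces one child */
        live[b] = live[--nlive];        /* ... and the other slot is freed */
    }
    assign(live[0], prefix);            /* live[0] is now the root */
    for (i = 0; i < NSYM; i++)
        printf("%c = %s\n", 'A' + i, codes[i]);
    return 0;
}

For {.4, .3, .2, .1} this prints codes of lengths 1, 2, 3, 3, giving the same 1.9 bits/symbol average as the worked example.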
trantow@csd4.csd.uwm.edu (Jerry J Trantow) (10/04/89)
In article <467@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>Of course I realize that the Huffman tree for a file that has
>256 unique elements is not optimal for files that have fewer than that,
>but I am trying to simplify things a bit.
>So I should have asked for the Huffman bit patterns and bit counts
>for the constant encoding tree that would be used on files that
>have 256 unique elements.
>-Roger uzun

Roger, I'm sure you realize that the bit count equals the depth of the node (depth = the number of links on the path between root and node). If you have 256 unique elements and a "constant encoding" tree, I assume they have equal probabilities? If they have equal probabilities, they are all on the same level. I'm not really sure if this is what you mean.

I'm currently doing Huffman compression on 8SVX files. It is not that tough to form the tree. My only concern is getting the decoding to run fast.
uzun@pnet01.cts.com (Roger Uzun) (10/04/89)
Well, I did not say that they occur with equal frequency. In fact, if there are 256 unique bytes in the file, 255 may occur only once, or there may be an equal distribution, but the bit patterns and bit lengths for the 1st...256th most common bytes will be constant in any case. I am finding it difficult to easily create a table of bit patterns and their bit lengths from a file. I thought if I had the 256-unique-byte case as a constant, things would be less than optimal but easy to do. Do you have any suggestions on an algorithm for creating a data structure that holds the bit patterns/bit lengths for a given file?

Thanks,
-Roger

UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com
uzun@pnet01.cts.com (Roger Uzun) (10/04/89)
What I was saying was that the bit patterns/bit lengths of the codes used to represent the tokens in a file, where there are 256 unique tokens, are constant and will work for all files. That is the table I wanted.
-Roger

It is true that the bit patterns and bit lengths used to represent tokens vary depending on the number of unique tokens in a file, but if that number is assumed to be 256, you have no problem.
-Roger

UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com
trantow@csd4.csd.uwm.edu (Jerry J Trantow) (10/05/89)
In article <476@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>do. Do you have any suggestions on an algorithm for creating
>a data structure that holds the bit patterns/bit lengths for a given
>file?
>-Roger

Create a tree with the normal Huffman order (lowest probabilities deepest) and have a field in each node to hold the code. Starting at the root, shift a 1 into the code of the left child and a 0 into the code of the right child. Continue this until all the leaves have been assigned a code. If you want, you can have a field with the bit length, but you can also keep track of the depth, which equals the bit length. If you implement this correctly, every child node will have the code of its parent shifted left one bit, with a 1 added if it is the left child and a 0 if it is the right child. I hope this is what you wanted.
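A small C sketch of that scheme. The node layout is hypothetical (nothing from the post specifies it), but the shifting rule is exactly as described, with the left child taking a 1 and the right child a 0:

#include <stddef.h>   /* for NULL */

/* Hypothetical node layout: two children plus the code fields Jerry
   mentions.  Depth doubles as the bit length of the code. */
struct hnode {
    struct hnode *left, *right;   /* NULL for a leaf */
    unsigned long code;           /* code bits; the first branch taken
                                     sits in the highest of the low
                                     'len' bits */
    int len;                      /* bit length == depth of the node */
};

/* Fill in code/len for every node by walking down from the root:
   each child gets its parent's code shifted left one bit, with the
   new low bit 1 for the left child and 0 for the right child. */
void assign_codes(struct hnode *n, unsigned long code, int len)
{
    if (n == NULL)
        return;
    n->code = code;
    n->len = len;
    assign_codes(n->left,  (code << 1) | 1, len + 1);
    assign_codes(n->right, (code << 1),     len + 1);
}

Typical use would be assign_codes(root, 0, 0); afterwards each leaf's code/len pair is what you emit to the output stream for that token.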
mwm@eris.berkeley.edu (Mike (I'll think of something yet) Meyer) (10/05/89)
In article <476@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
<
<Well, I did not say that they occur with equal frequency. In fact, if there
<are 256 unique bytes in the file, 255 may occur only once, or there may
<be an equal distribution, but the bit patterns and bit lengths for the
<1st...256th most common bytes will be constant in any case.
This is false. If the most common byte occurs more than 50% of the time,
it will get a one-bit encoding. If the 256 codes occur evenly enough,
every code will be eight bits long.
<mike
--
Es brillig war. Die schlichte Toven Mike Meyer
Wirrten und wimmelten in Waben; mwm@berkeley.edu
Und aller-mumsige Burggoven ucbvax!mwm
Die mohmem Rath' ausgraben. mwm@ucbjade.BITNET
jms@tardis.Tymnet.COM (Joe Smith) (10/05/89)
In article <477@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>What I was saying was that the bit patterns/bit lengths of the codes used
>to represent the tokens in a file, where there are 256 unique tokens, are
>constant and will work for all files.
 ^^^^^^^^

No, they're not. If you have 8-bit bytes (256 possible tokens), they can be encoded as 3 bits each if only 8 of the possible 256 tokens are actually used and they have equal probability. If one token is used 99% of the time and the remaining 255 tokens have an equal distribution, then Huffman encoding will use a single bit for the majority token and about 9 bits for all the rest. There's no way you can call that a constant.

After reading all the previous followups, I *still* don't understand what you are asking for. Here is my guess as to what you are asking:

Q1) Given a file that is several thousand bytes long and uses 8-bit bytes, what is the best Huffman tree for encoding it?
A1) Using a "constant tree" will never be optimal. You have to analyze the data, count the number of times each character occurs (how many A's, how many B's, etc.), and then build the tree.

Q2) If the file is very short, such as 256 bytes or less in length, can a constant tree be used?
A2) I say "no"; you still have to analyze the data. I'd say the table you are asking for cannot exist, by definition.

>It is true that the bit patterns and bit lengths used to represent tokens
>vary depending on the number of unique tokens in a file, but if that number
>is assumed to be 256, you have no problem.

Non sequitur. I don't understand that at all.
--
Joe Smith (408)922-6220  | SMTP: JMS@F74.TYMNET.COM or jms@tymix.tymnet.com
McDonnell Douglas FSCO   | UUCP: ...!{ames,pyramid}!oliveb!tymix!tardis!jms
PO Box 49019, MS-D21     | PDP-10 support: My car's license plate is "POPJ P,"
San Jose, CA 95161-9019  | narrator.device: "I didn't say that, my Amiga did!"
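A quick sanity check on Joe's skewed example, as a C one-liner. The 1-bit and roughly-9-bit figures are his; the arithmetic below just shows how far the result is from any "constant" 8-bit code:

#include <stdio.h>

int main(void)
{
    /* One token at 99%; the other 255 share 1% at ~9 bits each. */
    double avg = 0.99 * 1.0 + 0.01 * 9.0;
    printf("about %.2f bits/token, versus 8 for a fixed code\n", avg);
    return 0;
}

This prints about 1.08 bits/token: the same 256-symbol alphabet, but a radically different code than the near-uniform case, which is the whole point.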
uzun@pnet01.cts.com (Roger Uzun) (10/05/89)
I had misunderstood Huffman encoding, then. I thought that regardless of the actual frequency of a token in a file, the bit pattern and bit length used to encode that token depended only on its frequency relative to the other tokens in the file. Say file A has 10 unique tokens in it and the most probable token occurs 90% of the time. File B also has 10 unique tokens in it, and its most probable token occurs 30% of the time. It was my understanding of Huffman that the bit patterns used to encode the most probable token in each of these cases would be the same. Apparently you are saying this is not true; perhaps I am describing a "subset" of Huffman encoding that would use constant trees and, of course, be less efficient. Basically, in my scheme of things, any file that has all 256 possible bytes represented in it (the file contains 256 unique tokens) would have the same bit patterns/bit lengths used to encode its tokens. The tokens associated with the bit patterns/bit lengths would vary depending on their relative probabilities in each file, but the encoding bit patterns/bit lengths would remain the same between the two files. Hmmm, I thought Huffman considered only the RELATIVE probabilities of tokens in a file; perhaps you are thinking of a more advanced algorithm based on Huffman encoding.

-Roger Uzun

UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com
mwm@eris.berkeley.edu (Mike (I'll think of something yet) Meyer) (10/06/89)
In article <486@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
<I had misunderstood Huffman encoding, then. I thought that regardless of
<the actual frequency of a token in a file, the bit pattern and bit length
<used to encode that token depended only on its frequency relative to
<the other tokens in the file.
Correct, you had misunderstood the encoding.
Out of curiosity, what's the application (apologies if you stated this
before and I missed it)? There may still be a suitable encoding
scheme for it.
<mike
--
Don't tell me how to live my life Mike Meyer
Don't tell me what to do mwm@berkeley.edu
Repression is always brought about ucbvax!mwm
by people with politics and attitudes like you. mwm@ucbjade.BITNET