[comp.sys.amiga.tech] huffman encoding

uzun@pnet01.cts.com (Roger Uzun) (10/02/89)

I am working on a Huffman encoding program and, since the 1st 256
bit patterns used by huffman for the most common bytes in a file
are constant, I was wondering if anyone had them and could post them
for me?  I do not want to create a tree for these values since they
are constant and I imagine someone has already done this.
Thanks Much

-Roger


UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com

uzun@pnet01.cts.com (Roger Uzun) (10/03/89)

Of course i realize that the huffman tree for a file that has
256 unique elements is not optimal for files that have less than that,
but I am trying to simplify things a bit.
 
So I should have asked for the huffman bit patterns and bit counts
for the constant encoding tree that would be used on files that
have 256 unique elements.

-Roger uzun

UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com

morgan@camtwh.UUCP (Morgan W. Jones) (10/03/89)

In article <467@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>Of course i realize that the huffman tree for a file that has
>256 unique elements is not optimal for files that have less than that,
>but I am trying to simplify things a bit.
>So I should have asked for the huffman bit patterns and bit counts
>for the constant encoding tree that would be used on files that
>have 256 unique elements.

You realize of course that a huffman tree generated for a binary file
would be rather useless for a text file (or at least you'd get very
little compression).  If you wanted to do a fixed compression in one
pass without determining the type of the file, you might want to
initialize the char count to all ones (rather than all zeros) and
count the number of occurrences in the first two blocks (1024 bytes) of
the file, build your tree and compression table (which is very easy if
you know anything at all about trees), and then compress the file.
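If it helps, here is a rough sketch in C of that counting pass; the
count[] array, the sample_counts() routine, and the 1024-byte sample
size are just illustrative, not from any particular program:

#include <stdio.h>

long count[256];

/* Start every byte's count at 1 so no symbol ends up with zero
 * probability, then add in the first 1024 bytes of the file before
 * building the tree.  Rewind so compression starts from the top.   */
void sample_counts(FILE *fp)
{
    unsigned char buf[1024];
    size_t got, i;

    for (i = 0; i < 256; i++)
        count[i] = 1;                /* never zero, tree stays full  */

    got = fread(buf, 1, sizeof buf, fp);
    for (i = 0; i < got; i++)
        count[buf[i]]++;             /* frequencies from the sample  */

    rewind(fp);
}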
-- 
Morgan W. Jones					(morgan@camtwh)
Orama Incorporated, Toronto, Canada.		(416) 369-5088

ccplumb@rose.waterloo.edu (Colin Plumb) (10/04/89)

In article <467@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>So I should have asked for the huffman bit patterns and bit counts
>for the constant encoding tree that would be used on files that
>have 256 unique elements.

Er... one of us is confused.  Huffman encoding relies on knowing the
static distribution of however many values you're trying to represent.
Without these probabilities, you can't create a Huffman encoding.

Supposing you do have a set of probabilities, p_0, p_1, p_2,...,p_n,
to get an optimal Huffman tree, sort them, take the two least likely
elements of the list, and create a common parent for them.  The probability
of this parent is the sum of the probabilities of its children, and the
encodings of its children are the encodings of the parent followed by "0"
and "1".  Remove the two child nodes from the set of probabilities and add
the parent, preserving the sorting.  Repeat, removing the two least likely
elements and replacing them with a node whose probability is the sum, until
you have only one node left.  1 choice is 0 bits of information and thus
the encoding of this node is 0 bits long.
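Here's a rough sketch in C of that merging procedure, using an array of
parent pointers instead of an explicit tree; the node layout, the names,
and the four example probabilities (the same {A, B, C, D} distribution
used below) are purely illustrative:

#include <stdio.h>

#define NSYM  4                     /* number of leaf symbols         */
#define NNODE (2*NSYM - 1)          /* leaves plus internal nodes     */

struct node {
    double prob;                    /* probability (or raw count)     */
    int    parent;                  /* index of parent, -1 if none    */
};

int main(void)
{
    static const double p[NSYM] = { 0.4, 0.3, 0.2, 0.1 };
    struct node n[NNODE];
    int used = NSYM, i, j;

    for (i = 0; i < NSYM; i++) { n[i].prob = p[i]; n[i].parent = -1; }

    /* Repeatedly merge the two least likely nodes that have no parent
     * yet; the new parent's probability is the sum of its children's. */
    while (used < NNODE) {
        int lo1 = -1, lo2 = -1;
        for (i = 0; i < used; i++) {
            if (n[i].parent != -1) continue;
            if (lo1 < 0 || n[i].prob < n[lo1].prob) { lo2 = lo1; lo1 = i; }
            else if (lo2 < 0 || n[i].prob < n[lo2].prob) lo2 = i;
        }
        n[used].prob = n[lo1].prob + n[lo2].prob;
        n[used].parent = -1;
        n[lo1].parent = n[lo2].parent = used;
        used++;
    }

    /* A leaf's code length is its depth: count parent links up to the
     * root.  The bit patterns follow from handing "0" to one child and
     * "1" to the other at each merge.                                 */
    for (i = 0; i < NSYM; i++) {
        int len = 0;
        for (j = i; n[j].parent != -1; j = n[j].parent) len++;
        printf("symbol %c: %d bits\n", 'A' + i, len);
    }
    return 0;
}

For the {.4, .3, .2, .1} case this prints lengths 1, 2, 3, 3, matching
the encodings worked out below.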

If all the probabilities are equal, you'll find this produces equal-length
encodings.  Suppose we have 4 symbols, {A, B, C, D}, with probabilities
{.4, .3, .2, .1}.

Take the two least likely symbols, C and D, and make a parent for them
(call it CD) with probability .1+.2=.3.  The encoding of C is CD0 and
the encoding of D is CD1.  The list is now {A, B, CD} (or it could be
{A, CD, B}) with probabilities {.4, .3, .3}.  Again merge the least likely
two elements, producing node BCD with probability .6 and making the encoding
of B = BCD0, CD = BCD1.  The list is now {BCD, A} with probabilities {.6, .4}
and the final merge gives a list {ABCD} with probability {1}.  BCD = ABCD0
and A = ABCD1.

The encodings work out to be:
A = 1
B = 00
C = 010
D = 011

Of course, the choice of which child got "0" and which got "1" was arbitrary,
so we could make it
A = 0
B = 10
C = 110
D = 111
if we like.

The average length of a symbol is 1*.4 + 2*.3 + 3*.2 + 3*.1 = .4+.6+.9 = 1.9,
a slight improvement over the 2 bits it would take using a fixed-length code.
If we give the scheme more symbols to play with, we get better results.

Say we group the symbols into pairs, giving 16 symbols:

AA = 0.16
AB = 0.12
AC = 0.08
AD = 0.04
BA = 0.12
BB = 0.09
BC = 0.06
BD = 0.03
CA = 0.08
CB = 0.06
CC = 0.04
CD = 0.02
DA = 0.04
DB = 0.03
DC = 0.02
DD = 0.01

This produces the encodings (it depends on the order in which you list equal
elements, but they're all equally efficient):

AA = 001	* .16 = .48
AB = 100	* .12 = .36
AC = 0001	* .08 = .32
AD = 1101	* .04 = .16
BA = 101	* .12 = .36
BB = 111	* .09 = .27
BC = 0110	* .06 = .24
BD = 01011	* .03 = .15
CA = 0100	* .08 = .32
CB = 0111	* .06 = .24
CC = 00001	* .04 = .20
CD = 11001	* .02 = .10
DA = 00000	* .04 = .20
DB = 11000	* .03 = .15
DC = 010100	* .02 = .12
DD = 010101	* .01 = .06
Total		 1.00  3.73

This figure of 3.73/2 = 1.865 bits per symbol is better.  Note that I picked
a difficult-to-compress probability distribution.  The Shannon theorem lower
bound on the number of bits per symbol, assuming all symbols are
independent, is 1.846439344... bits per symbol.  Note that English text has
much more skewed probabilities, and also has strong inter-symbol dependencies.
Perfect compression of English text would use between 1.5 and 2 bits per
character.
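For reference, that lower bound is just the entropy of the one-symbol
distribution (a standard calculation, spelled out here for convenience):

H = -(.4*log2(.4) + .3*log2(.3) + .2*log2(.2) + .1*log2(.1))
  =     .529      +  .521       +  .464       +  .332
  = 1.846 bits per symbol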

Anyway, now do you know what you want?  I'm still not quite sure, but have
tried to answer the common questions.
-- 
	-Colin

trantow@csd4.csd.uwm.edu (Jerry J Trantow) (10/04/89)

In article <467@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>Of course i realize that the huffman tree for a file that has
>256 unique elements is not optimal for files that have less than that,
>but I am trying to simplify things a bit.
>So I should have asked for the huffman bit patterns and bit counts
>for the constant encoding tree that would be used on files that
>have 256 unique elements.
>-Roger uzun
 
Roger, I'm sure you realize that the bit count equals the depth of the
node. (depth = #paths between root and node)  If you have 256 unique
elements and a "constant encoding" tree, I assume they have equal
probabilities?  If they have equal probabilities they are all on the 
same level.  I'm not really sure if this is what you mean.
I'm currently doing huffman compression on 8SVX files.  It is not that
tough to form the tree.  My only concern is getting the decoding to
run fast.

uzun@pnet01.cts.com (Roger Uzun) (10/04/89)

Well, I did not say that they occur with equal frequency.  In fact if there
are 256 unique bytes in the file 255 may occur only once or there may
be equal distribution but the bit patterns and bit lengths for the
1st...256th most common bytes will be constant in any case.  I am
finding it difficult to easily create a table of bit patterns and bit
lengths from a file.  I thought if I had the 256 unique
byte case as a constant, things would be less than optimal but easy to
do.  Do you have any suggestions on an algorithm for creating
a data structure that has the bit patterns/bit lengths for a given
file?
Thanks
-Roger

UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com

uzun@pnet01.cts.com (Roger Uzun) (10/04/89)

What I was saying was that the bit patterns/bit lengths of the codes used
to represent the tokens in a file with 256 unique tokens are constant
and will work for all files.  That is the table I wanted.
-Roger

It is true that the bit patterns and bit lengths used to represent tokens
vary depending on the # of unique tokens in a file, but if that # is assumed
to be 256, you have no problem.
-Roger

UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com

trantow@csd4.csd.uwm.edu (Jerry J Trantow) (10/05/89)

In article <476@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>do.  Do you have any suggestions on an algorithm for creating
>a data structure that has the bit patterns/bit lengths for a given
>file?
>-Roger
>
Create a tree with the normal Huffman order (lowest probabilities deepest).
Have a field in the node to hold the code.  Starting at the root, shift a 1
into the code of the left child and a 0 into the code of the right child.
Continue this until all the leaves have been assigned a code.  If you want
you can have a field with the bit length, but you can also just keep track
of the depth, which == bit length.  If you implement this correctly,
every child node will have the code of its parent shifted left one bit,
with a 1 appended if it is the left child and a 0 appended if it is the
right child.
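A short sketch in C of that walk (the hnode layout and the assign_codes()
name are just illustrative):

struct hnode {
    struct hnode *left, *right;   /* both NULL for a leaf            */
    unsigned long code;           /* bit pattern, filled in below    */
    int           len;            /* bit length == depth of the node */
};

/* Each child gets its parent's code shifted left one bit, with a 1
 * appended for the left child and a 0 for the right child.          */
void assign_codes(struct hnode *n, unsigned long code, int len)
{
    if (n == NULL)
        return;
    n->code = code;
    n->len  = len;
    assign_codes(n->left,  (code << 1) | 1, len + 1);
    assign_codes(n->right, (code << 1),     len + 1);
}

Call it as assign_codes(root, 0, 0); the leaves then hold the bit
patterns and bit lengths.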

I hope this is what you wanted.

mwm@eris.berkeley.edu (Mike (I'll think of something yet) Meyer) (10/05/89)

In article <476@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
<
<Well, I did not say that they occur with equal frequency.  In fact if there
<are 256 unique bytes in the file 255 may occur only once or there may
<be equal distribution but the bit patterns and bit lengths for the
<1st...256th most common bytes will be constant in any case.

This is false. If the first most common byte occurs > 50% of the time,
it will have a one-bit encoding. If the 256 codes occur evenly enough,
every code will be eight bits long.

	<mike



--
Es brillig war. Die schlichte Toven			Mike Meyer
Wirrten und wimmelten in Waben;				mwm@berkeley.edu
Und aller-mumsige Burggoven				ucbvax!mwm
Die mohmem Rath' ausgraben.				mwm@ucbjade.BITNET

jms@tardis.Tymnet.COM (Joe Smith) (10/05/89)

In article <477@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>What I was saying was that the bit patterns/bit lengths of the codes used
>to represent the tokens in a file where there are 256 unique tokens is a
>constant and will work for all files.
 ^^^^^^^   No, it's not.

If you have 8-bit bytes (256 possible tokens), they can be encoded as
3 bits each, if only 8 of the possible 256 tokens are actually used and
have equal probability.  If one token is used 99% of the time and the
remaining 255 tokens have equal distribution, then huffman encoding will
use a single bit for the majority token, and 9 bits for all the rest.
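(To spell out the 99% case: the 255 rare tokens all hang below the
root's other branch, and since 2^8 = 256 >= 255 each of them needs at
most 8 more bits below that branch, i.e. at most 9 bits total, while
the common token keeps the single one-bit code.)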

There's no way you can call that a constant.

After reading all the previous followups, I *still* don't understand
what you are asking for.  Here is my guess as to what you are asking:

Q1) Given a file that is several thousand bytes long and uses 8-bit bytes,
    what is the best huffman tree for encoding it?

A1) Using a "constant tree" will never be optimal; you have to analyze the
    data, count the number of times each character occurs (how many A's,
    how many B's, etc.), and then build the tree.

Q2) If the file is very short, such as 256 bytes or less in length, can
    a constant tree be used?

A2) I say "no", you still have to analyze the data.

I'd say the table you are asking for cannot exist, by definition.

>It is true that the bit patterns and bit lengths used to represent tokens
>varies depending on # of unique tokens in a file, but if that # is assumed
>to be 256, you have no problem.

Non sequitur.  I don't understand that at all.


-- 
Joe Smith (408)922-6220 | SMTP: JMS@F74.TYMNET.COM or jms@tymix.tymnet.com
McDonnell Douglas FSCO  | UUCP: ...!{ames,pyramid}!oliveb!tymix!tardis!jms
PO Box 49019, MS-D21    | PDP-10 support: My car's license plate is "POPJ P,"
San Jose, CA 95161-9019 | narrator.device: "I didn't say that, my Amiga did!"

uzun@pnet01.cts.com (Roger Uzun) (10/05/89)

I had misunderstood huffman encoding then.  I thought that regardless of
the actual frequency of a token in a file, the bit pattern and bit length
used to encode that token was dependent only on its relative frequency to
other tokens in the file.  Say file A has 10 unique tokens in it and
the most probable token occurs 90% of the time.  File B also has 10 unique
tokens in it and the most probable token occurs 30% of the time.  It was
my understanding of huffman that the bit patterns used to encode the
most probable token in each of these cases would be the same. 
Apparently you are saying this is not true, perhaps I am describing
a "subset" of huffman encoding that would use constant trees, and
of course be less efficient.  Basically in my scheme of things any file
that has all 256 possible bytes represented in it (the file contains
256 unique tokens) would have the same bit patterns/bit lengths used
to encode its tokens.  The tokens associated with the bit patterns/bit
lengths would vary depending on their relative probabilities in that one
file, but the encoding bit patterns/bit lengths would remain the same
between the 2 files.
HMMM I thought huffman only considered the RELATIVE probabilities of
tokens in a file, perhaps you are thinking of a more advanced algorithm
based on huffman encoding.
-Roger Uzun

UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!uzun
ARPA: crash!pnet01!uzun@nosc.mil
INET: uzun@pnet01.cts.com

mwm@eris.berkeley.edu (Mike (I'll think of something yet) Meyer) (10/06/89)

In article <486@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
<I had misunderstood huffman encoding then.  I thought that regardless of
<the actual frequency of a token in a file, the bit pattern and bit length
<used to encode that token was dependent only on its relative frequency to
<other tokens in the file.

Correct, you had misunderstood the encoding.

Out of curiosity, what's the application (apologies if you stated this
before and I missed it)? There may still be a suitable encoding
scheme for it.

	<mike

--
Don't tell me how to live my life			Mike Meyer
Don't tell me what to do				mwm@berkeley.edu
Repression is always brought about			ucbvax!mwm
by people with politics and attitudes like you.		mwm@ucbjade.BITNET