[comp.compression] Greedy weirdness...

d5kwedb@dtek.chalmers.se (Kristian Wedberg) (06/11/91)

Not to be outdone, I've invented my very own dictionary based compression
algorithm. It uses the greedy algorithm for parsing the input, which
according to the gurus should set back compression by some 5 or 10 percent
due to sub-optimal choice of strings. What I don't understand is why
there is an optimal maximum for the string lengths! Put differently,
for a given source text, my model produces the lowest entropy with
a maximum string length of about a dozen characters, and the entropy
increases again when I allow longer strings.
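
For concreteness, this is roughly what I mean by greedy parsing with a
maximum string length. The little fixed word list (and the MAXLEN value)
is just a stand-in for the real, adaptively built dictionary:

/* Sketch of greedy (longest-match) parsing with a maximum string
 * length.  The word list is made up purely for illustration; the
 * real compressor of course builds its dictionary as it goes. */
#include <stdio.h>
#include <string.h>

#define MAXLEN 12   /* the ~dozen-character limit mentioned above */

static const char *dict[] = { "th", "the", "the ", "er", "ing" };
#define NDICT (sizeof dict / sizeof dict[0])

static int in_dictionary(const char *s, int len)
{
    size_t i;
    for (i = 0; i < NDICT; i++)
        if ((int)strlen(dict[i]) == len && strncmp(dict[i], s, len) == 0)
            return 1;
    return 0;
}

int main(void)
{
    const char *src = "the theory of the thing";
    int n = (int)strlen(src), pos = 0;

    while (pos < n) {
        int best = 1, len;              /* single characters always code */
        for (len = 2; len <= MAXLEN && pos + len <= n; len++)
            if (in_dictionary(src + pos, len))
                best = len;             /* keep the longest match */
        printf("phrase: \"%.*s\"\n", best, src + pos);
        pos += best;
    }
    return 0;
}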

Now, assuming the longer strings are coded with their appropriate
probabilities (as I believe they are :-), it seems strange that longer
strings could worsen compression. Intuitively, the presence of longer
strings should, ON AVERAGE, improve results. Why? A longer string could
be many times less probable and still be coded in fewer bits than two
strings of half the size. Methinks, that is. Unfortunately, my
experiments don't bear this out. So, what am I doing wrong? Ideas,
flames, pointers et al. accepted.
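
To put numbers on that intuition: under ideal coding a string of
probability p costs about -log2(p) bits. The probabilities below are
invented just for the arithmetic:

/* Back-of-the-envelope cost comparison under ideal coding: a string
 * with probability p costs about -log2(p) bits.  The probabilities
 * below are made up purely for illustration. */
#include <stdio.h>
#include <math.h>

static double bits(double p) { return -log(p) / log(2.0); }

int main(void)
{
    double p_long = 1.0 / 512;   /* one 12-character string    */
    double p_half = 1.0 / 32;    /* each of two 6-char strings */

    printf("long string : %4.1f bits\n", bits(p_long));         /*  9.0 */
    printf("two halves  : %4.1f bits\n", 2.0 * bits(p_half));   /* 10.0 */
    return 0;
}

Here the long string is sixteen times less probable than either half,
yet it still codes one bit shorter.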

	k ki kit kitt kitte eom

dcarr@hobbit.gandalf.ca (Dave Carr) (06/11/91)

In <1991Jun10.230518.29430@mathrt0.math.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:

>Not to be outdone, I've invented my very own dictionary based compression
>.... What I don't understand is why there is an optimal maximum for the
>string lengths.

It would depend on your update model.  If it's standard LZW, one character
at a time, then I am puzzled too.  Long strings can only get into the
dictionary if the same string minus its last character has already been used.
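
To spell out "one character at a time": the new entry is always a phrase
that was just used plus the next input character, roughly like this (table
size arbitrary):

/* Sketch of an LZW-style update step: each new table entry is a
 * phrase that was actually used plus one more character, so every
 * prefix of a long entry must have been matched at some point. */
struct entry {
    int  prefix_code;   /* code of the phrase being extended */
    char c;             /* the single character appended     */
};

static struct entry table[4096];
static int next_code = 256;        /* codes 0..255 are single characters */

static void lzw_update(int used_code, char next_char)
{
    if (next_code < 4096) {
        table[next_code].prefix_code = used_code;
        table[next_code].c = next_char;
        next_code++;
    }
}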

If, however, your update was done by concatenating 2 strings (a la
V.42 bis), then the longer strings may never get used, but they still
take up space in the dictionary.
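
By "concatenating 2 strings" I mean an update roughly like this (sizes
arbitrary), where the two most recent phrases are glued together:

/* Sketch of a concatenation-style update: the new entry is the
 * previous phrase glued to the current one.  Sizes are arbitrary. */
#include <string.h>

#define MAXCODES  4096
#define MAXPHRASE 64

static char dict_str[MAXCODES][MAXPHRASE];
static int  next_code = 256;

static void concat_update(const char *prev, const char *cur)
{
    if (next_code < MAXCODES &&
        strlen(prev) + strlen(cur) < MAXPHRASE) {
        strcpy(dict_str[next_code], prev);
        strcat(dict_str[next_code], cur);
        next_code++;
    }
}

Entries built that way can get very long very quickly, so they mostly
sit in the table eating code space without ever being matched again.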