d5kwedb@dtek.chalmers.se (Kristian Wedberg) (06/11/91)
Not to be outdone, I've invented my very own dictionary-based compression algorithm. It uses the greedy algorithm for parsing the input, which according to the gurus should set back compression by some 5 or 10 percent due to sub-optimal choice of strings.

What I don't understand is why there is an optimal maximum for the string lengths! Put differently, for a given source text, my model produces the lowest entropy with a maximum string length of about a dozen characters, but the entropy increases with longer strings. Now, assuming the longer strings are coded with their appropriate probabilities (as I believe they are :-), it seems strange that longer strings could worsen compression. Intuitively, the presence of longer strings should, ON AVERAGE, improve results. Why? A longer string could be many times less probable and still be coded in fewer bits than two strings of half the size. Methinks, that is. Unfortunately, my experiments don't bear this out.

So, what am I doing wrong? Ideas, flames, pointers et al. accepted.

k ki kit kitt kitte eom
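PS: In case it helps to see what I mean by "model entropy": here is a stripped-down sketch (NOT my real coder, the dictionary and counts are invented) of the greedy longest-match parse and the -log2(p) bit accounting I'm doing per phrase.

/* Sketch only.  Greedy longest-match parse of "text" against a toy
 * hand-made dictionary, charging -log2(count/total) bits per phrase.
 * The dictionary contents and counts are made up for illustration. */
#include <stdio.h>
#include <string.h>
#include <math.h>

struct entry { const char *s; int count; };

static struct entry dict[] = {
    { "t", 10 }, { "h", 6 }, { "e", 12 }, { " ", 14 },
    { "the ", 5 }, { "there", 2 }, { "thereafter", 1 },
};
static const int NDICT = sizeof dict / sizeof dict[0];

int main(void)
{
    const char *text = "thereafter the theatre";
    const char *p = text;
    double bits = 0.0;
    int total = 0, i;

    for (i = 0; i < NDICT; i++)
        total += dict[i].count;

    while (*p) {
        int best = -1, bestlen = 0;
        for (i = 0; i < NDICT; i++) {
            int len = (int)strlen(dict[i].s);
            if (len > bestlen && strncmp(p, dict[i].s, len) == 0) {
                best = i;
                bestlen = len;
            }
        }
        if (best < 0) {        /* no match: skip the raw char, ignore its cost */
            p++;
            continue;
        }
        bits += -log(dict[best].count / (double)total) / log(2.0);
        printf("phrase \"%s\"  %.2f bits so far\n", dict[best].s, bits);
        p += bestlen;
    }
    printf("total: %.2f bits for %d input chars\n", bits, (int)strlen(text));
    return 0;
}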
dcarr@hobbit.gandalf.ca (Dave Carr) (06/11/91)
In <1991Jun10.230518.29430@mathrt0.math.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:

>Not to be outdone, I've invented my very own dictionary-based compression
>... What I don't understand is why there is an optimal maximum for the
>string lengths.

It would depend on your update model. If it's standard LZW, one character at a time, then I am puzzled too. Long strings can only get into the dictionary if the string minus its last character got used. If, however, your update was done by concatenating two strings (a la V.42 bis), then the longer strings may never get used, but they are still taking up space in the dictionary.
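To make the difference concrete, here's a throwaway sketch (not real V.42 bis code, and the example strings are made up) of the two update rules after a greedy match:

/* Sketch of the two dictionary-update rules being discussed.
 * LZW-style: add the current match extended by the next input char.
 * Concatenation: add previous match + current match, so the dictionary
 * grows by whole phrases and can fill up with long entries that are
 * never matched again. */
#include <stdio.h>
#include <string.h>

#define MAXDICT 64
#define MAXSTR  32

static char dict[MAXDICT][MAXSTR];
static int ndict = 0;

static void add_entry(const char *s)
{
    if (ndict < MAXDICT && strlen(s) < MAXSTR)
        strcpy(dict[ndict++], s);
}

int main(void)
{
    /* pretend these were the last two greedy matches and the next char */
    const char *prev_match = "the";
    const char *cur_match  = "re";
    char next_char = 'a';
    char buf[MAXSTR];

    /* LZW-style update: current match plus one character */
    sprintf(buf, "%s%c", cur_match, next_char);
    add_entry(buf);                       /* adds "rea" */

    /* concatenation update: previous match plus current match */
    sprintf(buf, "%s%s", prev_match, cur_match);
    add_entry(buf);                       /* adds "there" */

    printf("added %d entries: %s, %s\n", ndict, dict[0], dict[1]);
    return 0;
}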