[net.math] Average length of keys in index with compression.

chris@peregrine.UUCP (Chris Cole) (06/09/85)

<>
I do not know if the following problem has been solved. If so, please send
a reference:
Consider an alphabetized list of words which has been compressed by replacing
the characters repeated between subsequent entries with a count. For example,
"hello", "help" and "hi" are stored as:
[0] hello
[3] p
[1] i
The numbers in brackets are the counts of repeated characters. This is a common
way of compressing keys stored in indices. The question is:
What is the average length of the stored words?
For the sake of simplicity, we can:
1. Ignore the fixed overhead of the count of repeated characters 
2. Assume some simple distribution of word lengths (e.g., uniform)
Four cases are of interest:
1. The simple case: The probability of a character in a word is the same for
	all characters.
2. The known probability case: The probability of a character is known but
	not necessarily the same for all characters.
3. The realisitic case: The probability of a character or sequence of
	characters is known.
4. The unknown probability case: Nothing is known about the probability of
	characters.
As usual, please reply via mail and I will post a summary of the replies.