chris@peregrine.UUCP (Chris Cole) (06/09/85)
<> I do not know if the following problem has been solved. If so, please send a reference: Consider an alphabetized list of words which has been compressed by replacing the characters repeated between subsequent entries with a count. For example, "hello", "help" and "hi" are stored as: [0] hello [3] p [1] i The numbers in brackets are the counts of repeated characters. This is a common way of compressing keys stored in indices. The question is: What is the average length of the stored words? For the sake of simplicity, we can: 1. Ignore the fixed overhead of the count of repeated characters 2. Assume some simple distribution of word lengths (e.g., uniform) Four cases are of interest: 1. The simple case: The probability of a character in a word is the same for all characters. 2. The known probability case: The probability of a character is known but not necessarily the same for all characters. 3. The realisitic case: The probability of a character or sequence of characters is known. 4. The unknown probability case: Nothing is known about the probability of characters. As usual, please reply via mail and I will post a summary of the replies.