[comp.compression] theoretical compression factor

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/25/91)

I'm investigating a random number technique to compress text.

It runs along the following lines.

Some strings have `low complexity' in the sense of Martin-Löf.
(The program that can type them out is much shorter than the
actual string).

For such strings it is obviously better to store or transmit the _program_ 
than the string. Some method needs to be found to enumerate the possible 
programs useful for this purpose.

I am thinking of using strings generated by pseudo-random means
(in a similar vein, I understand, to LPC coding). The message is first analyzed 
to determine some basic statistics (e.g. frequencies of groups of bits of 
lengths 1 to N) and this is used to select a generator. The generator will 
generate strings with groups in the appropriate proportions -- this sequence 
is compared with the original message and only the `differences' saved. The 
idea is the generator will be able to create a string that is very similar to 
the original and thereby reduce the number of differences that need to be
stored/sent.
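
In rough Python, that scheme might look something like the sketch below. It is
only an illustration: the generator is steered by nothing more than the overall
fraction of 1's, and the names and storage format are made up rather than taken
from any actual implementation.

import random

def compress_against_prng(message_bits, seed):
    """Compare the message with a pseudo-random bit stream (steered only
    by the message's overall fraction of 1's) and record the positions
    where the two streams differ."""
    rng = random.Random(seed)
    p = sum(message_bits) / len(message_bits)          # fraction of 1's
    diffs = [i for i, bit in enumerate(message_bits)
             if (1 if rng.random() < p else 0) != bit]
    # what would actually be stored/sent: p, the length, the generator
    # seed, and the difference positions in some encoded form
    return p, len(message_bits), diffs

def decompress_against_prng(p, n, diffs, seed):
    """Regenerate the same pseudo-random stream and flip the recorded
    positions to recover the original message."""
    rng = random.Random(seed)
    flip = set(diffs)
    return [(1 if rng.random() < p else 0) ^ (1 if i in flip else 0)
            for i in range(n)]

msg = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
assert decompress_against_prng(*compress_against_prng(msg, seed=7), seed=7) == msg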

Besides a great deal of freedom in generating the random strings,
another field of experimentation seems to be the method of _comparison_.
Some kinds of `differences' will be more sensitive than others to
certain types of noise. For example, if the `random string' is:

	010110111011101111011111

and the original message was:

	000010110111011110111110

there are 9 differences using `phase 0' but none using `phase 3'.
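
In Python the `phase' search is just a matter of counting mismatches at each
offset and keeping the best one (a sketch; it assumes the random stream is at
least as long as the message, and the helper name is only illustrative):

def best_phase(message, randstream):
    """Slide the message along the random stream, count mismatches in
    the overlapping window at each offset, and return the offset
    (`phase') with the fewest differences."""
    n = len(message)
    results = []
    for k in range(len(randstream) - n + 1):
        window = randstream[k:k + n]
        results.append((sum(a != b for a, b in zip(message, window)), k))
    mismatches, phase = min(results)
    return phase, mismatches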

What kind of compression factors might be expected? Let's take a simple case 
of bitstrings.

Suppose we have a non-random string in which the proportion of 1's is p and
the proportion of 0's is (1-p). If we generate a random string of bits with
this same proportion of 1's and 0's, in how many places will they differ?
If we call the message A
and the random string B we have:

Prob((A=1 & B=0) or (A=0 & B=1)) =
	Prob(A=1)Prob(B=0) + Prob(A=0)Prob(B=1)

Independence is presumed (which is not absolutely true since we analyze the 
message string to _set up_ the generator). We then find

Prob(difference) = 2p(1-p)

In a message of N bits we would then find 2p(1-p)N differences on average. 
We could adjust a phase parameter to further minimize this number of
differences, but the O(N) dependency would remain (I _think_).

To send all the `differences' we just encode the sequence of i_j where
A_{i_j} != B_{i_j + k}. This can be done using `delta modulation' (i.e. the
differences i_{j+1} - i_j are encoded). The average size of each
difference will be O(log N/N) bits.
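
One concrete way of coding those deltas is sketched below; Elias gamma coding
is just a stand-in here, since the article does not commit to any particular
code for the gaps.

def gaps(diff_positions):
    """First differences of the sorted mismatch positions, i.e. the
    `deltas' i_{j+1} - i_j (the first position is offset by +1 so every
    gap is a positive integer)."""
    prev, out = -1, []
    for i in diff_positions:
        out.append(i - prev)
        prev = i
    return out

def gamma_code(n):
    """Elias gamma code for a positive integer n: about 2*log2(n) bits."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

positions = [4, 7, 15, 16, 40]
encoded = "".join(gamma_code(g) for g in gaps(positions))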

We therefore have a total compressed string of

~	2p(1-p)N log N/N = 2p(1-p) log N

(ignoring the parameters required to reconstitute the message; these would 
also need to be sent but are of O(log N) size anyway) and the 
compression ratio:

	|compressed string|
	---------------
	|original string|

~	2p(1-p) log N / N

I presume the constant of proportionality will be fairly large since with the 
following strings I get only 80-90% compression factors (using the best of 
100 random bit generators chosen at random :-):

abcdefghijklmnopqrstuvwxyz<NEWLINE>
189 bits
max compression=0.873016

this is a test message of a very short string<NEWLINE>
322 bits
max compression=0.813665

These small examples are not very impressive, but the complexity
of the method as N->inf is interesting (log N/N).

Any comments/suggestions? (Please be nice if I've made another of my
classic tremendous goofs :-).

-kym

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/25/91)

I found one little problem with my algebra & understanding.
Delta modulation takes N bits into O(N) bits, NOT less. 

The theoretical compression factor for my binary strings example
would therefore be

	2p(1-p)

where p is the proportion of 0's (or 1's -- it doesn't matter).

For /usr/dict/words the proportion of 1's is about 56% (only
considering the lsb 7 bits of each char). This should give
a compression factor of about 49% with the method outlined.

Tables of random ascii digits contain about .38 1's (taking only lsb 7
bits) which leads to a similar .47 compression factor. If we massage
digit strings to 4-bit groups (by subtracting the '0') we find only
about 6% 1's and therefore the compression factor would be .1 (i.e. a 1Mb 
file of random digits -> 100Kb).
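
For what it is worth, the quoted factors follow directly from the formula (a
quick check, not part of the measurements above):

for p in (0.56, 0.38, 0.06):
    print(p, 2 * p * (1 - p))
# 0.56 -> 0.4928  (the ~49% quoted)
# 0.38 -> 0.4712  (the ~.47 quoted)
# 0.06 -> 0.1128  (roughly the .1 quoted)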

Again, if there are any glaring errors, please comment.

-kym

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/25/91)

In article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> The theoretical compression factor for my binary strings example
> would therefore be
> 	2p(1-p)
> where p is the proportion of 0's (or 1's -- it doesn't matter).

There is no way that a compressor can turn every string with as many 0's
as 1's into a string of half the length.

---Dan

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/26/91)

In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> There is no way that a compressor can turn every string with as many 0's
> as 1's into a string of half the length.

To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n
with n bits set. This exceeds (2n/e)^(2n) / (n/e)^(2n)(2\pi n)exp(1/6n),
i.e., 2^(2n) / (2\pi n)exp(1/6n). So for large n there are at least
2n - 3 - log n bits of information in such strings, and there is no way
they can all be compressed to length n, as the method in question claims
to do.
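
A quick numeric check of that bound, using exact binomials:

from math import comb, log2, pi, e

for n in (10, 100, 1000):
    bits = log2(comb(2 * n, n))                  # bits needed to single out one such string
    lower = 2 * n - log2(2 * pi * n) - log2(e) / (6 * n)
    print(n, bits, lower, bits > n)              # bits exceeds n in every case shown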

---Dan

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/26/91)

In <1991Mar25.031214.25696@bingvaxu.cc.binghamton.edu> I wrote:
>Some strings have `low complexity' in the sense of Martin-Löf.
^^^^^^^^^^^^^
>(The program that can type them out is much shorter than the
>actual string).
>For such strings it is obviously better to store or transmit the _program_ 
     ^^^^^^^^^^^^
>than the string. Some method needs to be found to enumerate the possible 
>programs useful for this purpose.

In <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) wrote:
>There is no way that a compressor can turn every string with as many 0's
					    ^^^^^
>as 1's into a string of half the length.

In case I've misrepresented myself as Dan perhaps thinks, I've included some 
introductory text from my original message.

Dan's point is of course correct -- there are certain bitstrings whose 
complexity prohibits any program from being written that is in any
sense `shorter' than the given bitstring. Essentially the only programs that
can reproduce such bitstrings are:

	PRINT ``SEQUENCE-OF-BITS''

However, by the definition of certain mathematicians, such bitstrings would be 
regarded as essentially ``random'' (see, e.g., Knuth v2, who is/was _not_ one 
of these) and I _think_ the number of text files, with which I am primarily
concerned, that exhibit such a property is vanishingly small;
text files on a computer are USUALLY generated by programs shorter than they 
are!

-kym

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/26/91)

My last attempt to post this article was either bumped or dumped in
Dan's IN tray -- I don't know which. Sorry if you've seen it before.
===

Thanks, Dan, for your input. 

In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> There is no way that a compressor can turn every string with as many 0's
> as 1's into a string of half the length.

In article <18263:Mar2518:10:4091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n
> with n bits set. This exceeds (2n/e)^(2n) / (n/e)^(2n)(2\pi n)exp(1/6n),
> i.e., 2^(2n) / (2\pi n)exp(1/6n). So for large n there are at least
> 2n - 3 - log n bits of information in such strings, and there is no way
> they can all be compressed to length n...

While I have to agree with Dan's argument above, a question that would be 
relevant to this group is WHAT FRACTION of strings can be compressed to 
strings of, say, <= 1/2 their length? Using the (possibly buggy) argument 
below, I find this to be ~ 1/sqrt(N). I would be interested in any refinement 
or other pointers.

\begin{argument}

One way we might approximate an answer is to consider `random' number
generators (sorry, this is an old axe :-).

A generator of bits, say, takes a number of parameters and can output, 
if properly managed, a sequence of 2^n bits. In certain circumstances we can 
guarantee that different n-bit parameter values will generate different bit 
sequences.

Let's say we have a random bit generator N bits long (in some kind of
Turing/Wang formalism). It can generate a sequence of 2^N bits. By changing 
any of the bits in the N-bit program the sequence produced will be different. 
(I am talking about the limit of N here where _most_ of the N bits of 
generator are in fact parameters and not much is actual code where
bit changes are typically `uninteresting' :-).

We therefore arrive at:

N-bit `program'			=>	2^N bits
2^N possible N-bit programs	=>	(2^N)^2 bits

Therefore the number of 2N-bit strings that can be generated (by this 
means) by N-bit programs is (2^N)^2/(2N) = 2^(2N-1)/N.

Obviously there are a total of 2^(2N) 2N-bit strings so our random bit 
programs can generate the fraction:

	2^(2N-1)/N
	----------
	2^(2N)

=	1 / 2N

of them.

Now, the number of 2N-bit strings with 1/2 1's is (2N)!/(N!)^2.  Using the 
simple approximation n! ~ sqrt(2 pi n) (n/e)^n (actually n! is larger by a 
factor of about 1 + 1/(12n)) we find this last to be about:

	sqrt(2 2N pi) (2 N/e)^(2N)
~	--------------------------
	2 N pi (N/e)^(2N)

	2^(2N)
=	------
	sqrt(N pi)

So the fraction of 2N-bit strings that have 1/2 1's is therefore

	2^(2N)
~	--------- / 2^(2N)
	sqrt(N pi)

=	1 / sqrt(N pi)

What proportion of these might our N-bit programs be capable of generating?
All things being equal we find:

	1/2N
~	----------
	1/sqrt(N pi)

=	1/2 sqrt(pi/N)

which obviously ->0 (and is only good for large N anyway).

\end{argument}
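
Both approximations used above are easy to sanity-check against exact binomial
counts (a sketch; the final ratio is the 1/2 sqrt(pi/N) figure):

from math import comb, sqrt, pi

for N in (100, 1000, 10000):
    balanced = comb(2 * N, N) / 2 ** (2 * N)    # fraction of 2N-bit strings with exactly N 1's
    print(N, balanced, 1 / sqrt(pi * N),        # compare with the 1/sqrt(N pi) approximation
          (1 / (2 * N)) / balanced,             # proportion of balanced strings reachable
          0.5 * sqrt(pi / N))                   # compare with 1/2 sqrt(pi/N)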

If N were in the 10k's of bits (typical of files in my directory, anyway) then 
we would expect a 50% compression of `balanced' files using random number 
generators to occur in ~ 1% of cases -- not too often! 

As a kind of VERY ROUGH experiment, I ran all my current files through the Unix 
compress utility and selected those that had close to 1/2 1's. We find the 
following:

Fraction 1's:	Reduction factor:

0.506969	38.07% 
0.47619		-33.33%
0.492063	-55.55%
0.491462	33.09%
0.502947	36.47%
0.503021	35.94%
0.478355	26.01%
0.492063	-55.55%
0.49483		30.57%
0.47619		-55.55%
0.510204	32.50%
0.47619		-55.55%
0.525429	31.15%
0.505414	22.98%
0.487941	12.12%
0.514909	48.48%

Out of about 100 files, the ones above contained between .48 and .52 (approx) 
1's vs 0's.  That gave 16 attempts to compress (using the `standard' 
Lempel-Ziv adaptive coding), of which 5 were failures.  The average reduction 
where successful (geometric mean) was 30.04% (i.e. files compressed to about 
70% of original length).
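
(That geometric-mean figure can be reproduced from the 11 successful entries
in the table above:)

from math import prod

successful = [38.07, 33.09, 36.47, 35.94, 26.01, 30.57,
              32.50, 31.15, 22.98, 12.12, 48.48]      # the positive reductions above
print(prod(successful) ** (1 / len(successful)))      # ~30.04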

Only 1 -- the last -- came _close_ to 50% reduction. Maybe a statistical
accident? But then again this utility doesn't use the method I was
talking about...

Not having a fully-working version of the compression algorithm I was 
discussing in my original post to run on text files, my remaining argument 
must run along the lines of ``text files ain't random'' and therefore 
the ~ 1/sqrt(N) argument above does not apply.

-kym

ttobler@unislc.uucp (Trent Tobler) (03/27/91)

From article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu>, by kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell):
> 
> I found one little problem with my algebra & understanding.
> Delta modulation takes N bits into O(N) bits, NOT less. 
> 
> The theoretical compression factor for my binary strings example
> would therefore be
> 
> 	2p(1-p)
> 
> where p is the proportion of 0's (or 1's -- it doesn't matter).
> 
> For /usr/dict/words the proportion of 1's is about 56% (only
> considering the lsb 7 bits of each char). This should give
> a compression factor of about 49% with the method outlined.
> 
> Tables of random ascii digits contain about .38 1's (taking only lsb 7
> bits) which leads to a similar .47 compression factor. If we massage
> digit strings to 4-bit groups (by subtracting the '0') we find only
> about 6% 1's and therefore the compression factor would be .1 (i.e. a 1Mb 
> file of random digits -> 100Kb).
> 
> Again, if there are any glaring errors, please comment.
> 

Glaring error #1:
  if maximum compression of a binary string is 2p(1-p), what is maximum
compression of the binary data...

 01010101010101010101010101010101

Maximum compression is defined by the entropy of the file, or in other words,
given a file, how many bits are required to uniquely determine the next
bit of information.  What is hoped for is that this number of bits is less
than 1 for the files you want.  Dr. Dobb's Journal had some articles that talked
about compression in #173, Feb. 1991.  I can give you some of the references...

  Huffman, D.A.
  "A Method for the Construction of Minimum-Redundancy Codes."
  _Proceedings of the Institute of Radio Engineers_ 40 (9): 1098-1101,
        September, 1952

  Shannon, C.E. & W. Weaver
  _The Mathematical Theory of Communication_
  Urbana, Illinois:  Univ. of Illinois Press, 1949

  Lelewer, Debra A.  and  Hirschberg, Daniel S.
  "Data Compression."
  _ACM Computing Surveys_, vol. 19, no. 3 New York, September, 1987

For more, I suggest you look at the Dr. Dobb's Journal issue I mentioned.
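
For a concrete feel for the entropy figure (a sketch, not taken from the
references above): the per-bit bound -(p lg p + (1-p) lg (1-p)) can be
tabulated next to the 2p(1-p) figure from the earlier posts, and for the
values below the two clearly differ, with 2p(1-p) the smaller.

from math import log2

def h2(p):
    """Entropy, in bits per symbol, of a 0/1 source with Prob(1) = p."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

for p in (0.06, 0.25, 0.38, 0.50, 0.56):
    print(p, round(h2(p), 3), round(2 * p * (1 - p), 3))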

--
   Trent Tobler - ttobler@csulx.weber.edu

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/27/91)

In article <1991Mar26.220519.1242@unislc.uucp> ttobler@unislc.uucp (Trent Tobler) writes:
>From article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu>, by kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell):
>> would therefore be
>> 	2p(1-p)
>> where p is the proportion of 0's (or 1's -- it doesn't matter).
>Glaring error #1:
>  if maximum compression of a binary string is 2p(1-p), what is maximum
>compression of the binary data...
> 01010101010101010101010101010101
[further explanation and some references]

Thanks for your comments Trent.

I must be doing _something_ right -- 1/2 the respondents say ``too low''
and the other half say ``too high''. :-)

Perhaps I was a little obscure with the subject line or something, but I
have received a lot of email regarding examples similar to the above.

What I'll do here is (a) show you some _measurements_, and then (b)
present a very informal argument.  (I will attempt to show that
EVEN THE 01 EXAMPLE ABOVE can not generally be represented in much
fewer than a given number of bits, despite claims that it comes
`for free').

PART 1 -- some numbers.
======================

After scrounging around all over my disks I came up with about 100
files with a roughly equal (i.e. a balance between 0.45 and 0.55) number of
0's and 1's. I then ran `compress' over them and determined the
average factor of sizes after/before. It worked out to about 67% over all 
attempts; about 56% in those cases where it was actually
successful. A log file is appended, for those that may like to analyse it 
(probably there's plenty it can tell you about my disks, also plenty it might 
tell you about how `random' average text files of various programming languages 
may behave).

PART 2 -- an argument.
=====================

Let's take the 01 example above. (I have also received much email of similar
ilk). It is claimed that this sequence can be represented in a small number 
of bits. But this can not _necessarily_ be done.

Let me ask the question -- how _long_ is this sequence of 01's? If, let's say, 
the sequence of 01's is 2^1000000 bits long we must encode the _length_ as 
part of the compressed code. Representing this length does not come for free. 
We might for example compute the length as a binary number and then use some 
compression technique to compress _that_ into a compressed format. Eventually 
this kind of `iterated compression' comes to a fixpoint.

Let's just say, without proof, that to encode the 01 sequence repeated N times 
takes O(log N) bits. I understand this point is covered in the Shannon paper 
that Trent indicated.
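
For the repeated-01 case the point is easy to make concrete: the `program' is
essentially the two-bit pattern plus a repeat count, and the count is what
costs about lg N bits (a sketch, ignoring the few extra bits needed to delimit
the pattern itself):

pattern = "01"
repeats = 1_000_000                      # i.e. a 2,000,000-bit message
count_bits = bin(repeats)[2:]            # the repeat count in plain binary
print(len(pattern) + len(count_bits))    # about 22 bits, versus 2,000,000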

This is obviously much less than 1/2 as N increases, but it _is_ only a 
single example. With the technique above we can _only_ represent 01 sequences. 
To represent other simple cycles will require more bits to `name' the 
particular cycle (possibly in a `deeper' compressed format).

As Dan Bernstein pointed out in a previous article, and is discussed
briefly in Knuth v2, there are a very large number :-) of strings
that CAN NOT be represented by ANY program shorter than themselves.
I think there are enough of these degenerate examples to bring any
average figure well up past the (only APPARENT, from the 01 example
above) log N/N limit.

CONCLUSION
==========

My little formula was not intended to EQUAL the compression in every case, 
for this is obviously not possible (it can not be a function of just p
for one thing).  Nor was it intended as an AVERAGE over EVERY case. It was 
intended as a ballpark figure after making certain assumptions about the 
specific underlying compression technique. (I am rather pleased, however, with
its predictive ability in the p=1/2 case as illustrated for LZ compression, 
below).

Cheers,

-kym
===

BTW, some people around SUNY-B claim I do such measurements FIRST,
then come up with a theory and subsequently produce them at an opportune
moment -- this is usually not the case, however. I almost NEVER have a 
theory :-).

=================LOG FILE====================

Filename	after size/before size
		using Sun LZ `compress'

:mkshortnames	0.524213
:techinstall	0.751825
ApolloGraphicsMakefile	0.346165
CONTENTS	0.629024
Comp.lang.prolog	0.567824
DPram2.1	1
DPram2.1i	1
DProm.1	1
DProm.1i	1
INDEX	0.771619
MKNEW	0.646409
MKOLD	0.644444
Makefile.vax	0.399232
PORT.LOG	0.63328
READ_ME	0.633803
RR-INRIA-1320.dvi	0.549019
ReadMe	0.621053
abc10201.zip	1
apollo.c	0.610215
cal_findresist.1	1
cal_makefile.1	1
change.l	0.674312
chconfig.1	0.690476
chconfig.1i	0.690476
cman	0.911392
comp.sources.index	0.527516
deutsch.net	0.210469
displays.proto	0.836957
dpram2.1	1
dpram2.1i	1
dprom.1	1
dprom.1i	1
drc.1	1
drc.1i	1
esim.1	0.511555
espresso.1	0.571458
esptype.h	0.357095
ex.rc	1
exrc	1
ext.h	0.403077
extio.h	1
extra+.c	1
filehead.h	0.603837
gemini.1	0.516526
gen_control.1	0.550769
gen_control.1i	0.550769
geometry.3	0.495135
grApollo1.old	0.44325
grApollo4.c	0.522812
grApolloInt.h	0.622093
grOmegaInt.h	0.694301
grSunInt.h.old	0.595443
gram	0.627575
if.how	1
install	1
laygen.1	1
les.txt	0.541202
m.bat	1
m2cmp20.zip	1
m2doc20.zip	1
m2exa20.zip	1
m2lib20.zip	1
m2utl20.zip	1
magic.1	0.417558
magicutils.3	0.609205
makewhatis.csh	0.717697
mkdisttape	0.599204
mod2txt.arc	1
nc.1	0.589175
ndep.down.out	0.909836
ndep.up.out	1
nenh.down.out	0.396806
nenh.up.out	0.422425
nenhp.down.out	0.4109
netgen.1	1
netlist.1	0.598494
netlist.1i	0.598494
nload.up.out	1
nsuper.down.out	0.900826
nsuper.up.out	1
ohmics.1	0.532989
om-esubs.c	0.453863
om-subs.h	0.497336
out_struct.h	0.641509
panda.h	0.482293
paslib.i	0.428077
paslib2.i	0.42933
paslib3.i	0.427255
paslib4.i	0.428455
pattern.c	0.588988
plowDebugInt.h	0.691729
presim.1	0.636028
presim.1i	0.636028
procs.h	0.349669
prolog.lib	0.540272
rofix	1
route1.net	0.742857
route2.net	0.742857
rsleeper	0.893617
rsleeper.csh	0.893617
runstats.3	0.602226
s.Makefile	0.425905
scanner.h	0.714286
select.h	0.607029
showcdrcs.1	1
showcdrcs.1i	1
showsdrcs.1	1
showsdrcs.1i	1
signals.h	0.754472
simsort.1	1
spice_cor.1	0.544245
statlib.index	0.619519
tags	0.433903
test4.m	0.402351
test6.m	0.407745
tlogic.c	0.520635
ttable.c	0.164343
tut7b.net	0.625731
tut7d.net	0.625731
util.c	0.620143
valtbs.1	1
vaxu_unsw_gcc.bench	0.949367
win.c	0.584328
win.c-	0.572215
win.unx	0.584328
wireUndo.c	0.48959


geo avg of all=0.665099
geo avg of `successful' cases= 0.557706

====================END END END====================

jj@alice.att.com (jj, like it or not) (03/29/91)

Please, folks, you're confusing several issues here.

First, the question of "how much can I compress X" must
be qualified.  If you have a historyless compression,
you can compress up to the entropy ( sum of -p log p) bound,
where p is the probability of each token.

If you have history in your compression algorithm, you may (or
may not) be able to do better, depending on the randomness
of your token distribution (even given a fixed distribution of
tokens).

You must separate the two issues, and be sure to mention which
(or what) bound you refer to.
-- 
       -------->From the pyrolagnic keyboard of jj@alice.att.com<--------
Copyright alice!jj 1991,  all rights reserved,  except transmission  by USENET and
like free facilities granted.  Said permission is granted only for complete copies
that include this notice.    Use on pay-for-read services specifically disallowed.

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/29/91)

In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>Please, folks, you're confusing several issues here.

Who/what/where -- _ME_ -- confused??!! :-)

>First, the question of "how much can I compress X" must
>be qualified.  If you have a historyless compression,
>you can compress up to the entropy ( sum of -p log p) bound,
>where p is the probability of each token.
>
>If you have history in your compression algorithm, you may (or
>may not) be able to do better, depending on the randomness
>of your token distribution (even given a fixed distribution of
>tokens).

Ok, I remember something of this ``p log p'' stuff from Shannon
and his buddies, but I'm not too sure whether you (or he, or whoever) are 
saying historyless compression can not, no way, ever go below this bound. 
But it _seems_ like that is what you are saying, doesn't it?

If so, I have a question and another question.

Consider the scheme I outlined earlier -- of comparing an input bitstream with 
a random bitstream and just storing the differences in delta form (e.g. 
fixed-length or Huffman coded). This doesn't _appear_ to use history -- the 
next bit output by an uncompressing program doesn't use the previous bits of 
the uncompressed string to obtain the next bit. The compressing program
doesn't build up any dictionary. Or does it? (That's Q1).

There _is_, however, a quibble about a few hidden bits -- e.g. the length
of the random generator parameters and the compress/uncompress program
code. 

The program, however, is finite. If we make the input string large
enough the program length and any fiddling little parameters can be
swamped by the length of the string itself.

Now consider the following data obtained from an experimental
compression program based on the idea above:

====================
h.dat
nb=32400		<--- number of bits read
n1=0.341605		<--- fraction of 1's
-(p log p + q log q)=0.642094	<---	limit 1
-(p lg p + q lg q)=0.926346	<---	maybe this is log_2 instead of log_e?
2pq=0.449822		<--- my limit (sure looks low, doesn't it?)
minlen=20625		<--- after some trials this is the MINIMUM LENGTH 
				output string
compression=0.636574	<--- less than - sum p ln p

n.dat
nb=8080
n1=0.336015
-(p log p + q log q)=0.638357
-(p lg p + q lg q)=0.920954
2pq=0.446218
minlen=5028
compression=0.622277	<--- less than - sum p ln p

n2.dat
nb=16160
n1=0.334653
-(p log p + q log q)=0.637425
-(p lg p + q lg q)=0.91961
2pq=0.445321
minlen=10042
compression=0.621411	<--- less than - sum p ln p

n4.dat
nb=32320
n1=0.334653
-(p log p + q log q)=0.637425
-(p lg p + q lg q)=0.91961
2pq=0.445321
minlen=20088
compression=0.621535	<--- less than - sum p ln p
====================

Here are 4 data files compressed to LESS THAN the limit quoted above.
Certainly not _substantially less_, but less nevertheless. 8-)

The files aren't big. But they do vary an order of magnitude in
size or so. I therefore presume we _could_ arrange to have the data file
size -> inf. In that case, even including the size of the
compress/decompress program, we have case(s) where the ``p log p'' limit
seems to be beaten.

What gives? Perhaps there are additional O(1) factors involved in the 
``p log p'' limit? (That's Q2).

-kym

BTW, the data files are hex and decimal digits with about 50% ascii 0's.
The balance of 0 vs 1 *bits* is given in each case above.
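
Incidentally, the two `limit' rows in each block are the same quantity in
different log bases; both can be recomputed from the reported fraction of 1's
(a quick check for h.dat):

from math import log, log2

p = 0.341605                               # n1 for h.dat above
q = 1 - p
print(-(p * log(p) + q * log(q)))          # 0.642094...  (natural log)
print(-(p * log2(p) + q * log2(q)))        # 0.926346...  (log base 2)
print(2 * p * q)                           # 0.449822...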

gwyn@smoke.brl.mil (Doug Gwyn) (03/29/91)

In article <18263:Mar2518:10:4091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>... and there is no way they can all be compressed to length n,
>as the method in question claims to do.

A shorter argument is to note that any invertible method that maps the
domain onto the range, which is known to map at least one point to a
shorter representation, MUST map at least one other point to a longer
representation.  (I think this was noted somewhere in Knuth.)
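
The counting behind that observation is tiny: there are 2^n strings of length
n but only 2^n - 1 strings of strictly shorter length, so no invertible
shortening can cover them all.

n = 20
print(2 ** n, sum(2 ** k for k in range(n)))   # 1048576 vs 1048575: one codeword short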

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/29/91)

In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>First, the question of "how much can I compress X" must
>be qualified.  If you have a historyless compression,
>you can compress up to the entropy ( sum of -p log p) bound,
>where p is the probability of each token.

Here is some more data and another argument for your amusement.

[If you're well up on info theory you may as well `n' here].

The p log p ``bound'' is actually a disguised expectation value;
it is an _average_. It denotes the average amount of information
per symbol in a string. For example, if the string consists of
only the symbols 0 and 1 we write -(p lg p +(1-p) lg (1-p)) for
the average amount of information (in bits) per 0/1 symbol.  [lg(x) is log 
base 2 -- the size of the alphabet in this case].  If this average is 0.5, 
say, then we might expect that we could compress the string to 1/2 its 
size in some fashion without losing any ``information'' since each 0/1 
symbol in the string would only be representing 0.5 of a bit on average.

(This measure does not take into account the _sequence_ of 0's
and 1's in the string; we're actually treating the string as a
multiset.)

Since we're talking about an _average_ it is quite possible to have
particular cases where the ``compression factor'' is less. Averages,
after all, are statistical variates and have distributions of their own.

So what is the distribution of the ``p lg p'' compression factor? Rather hard to 
write in closed form as far as I know, so let's look at a table. The following
shows for each ``compression factor'' the likelihood that its
value is less than `x'. 

x	Pr(X<x)
0.1	0.0251
0.2	0.0624
0.3	0.1072
0.4	0.1578
0.5	0.2199
0.6	0.2964
0.7	0.3813
0.8	0.4853
0.9	0.6331

E(X)	0.720872
var(X)	0.0730559

For example, a file can be compressed to 50% or less about 22% of the time 
(presuming the strings being used come uniformly from all possible strings). 
We also see the ``expected'' value for the ``p lg p'' is 72%. We therefore 
might expect, on average, only to be able to compress a file to 72% of its 
original size, all things being equal.
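
Those figures can be reproduced by drawing p uniformly from [0,1] and
tabulating -(p lg p + (1-p) lg (1-p)).  The uniform assumption is mine, but it
gives a mean near 0.72 and Pr(X < 0.5) near 0.22, matching the table (a Monte
Carlo sketch):

import random
from math import log2

def h2(p):
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

samples = [h2(random.random()) for _ in range(200_000)]
print(sum(samples) / len(samples))                        # ~0.72
for x in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(x, sum(s < x for s in samples) / len(samples))  # compare with the table above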

Looks bad? We can do MUCH, MUCH better. 

Order statistics is a relatively small branch of the subject that deals
with things sorted or arranged in order. It allows us to answer such
questions as ``if I pick up N candy bars, what is the largest one
I might expect to get'' (given that candy bars are not all exactly the
same size)? 

In our case we would like to know ``if I compressed my file with 
N different techniques, what might be the BEST I could expect (i.e. the 
_smallest_ result)''?

The tables at the end show how taking the minimum of several
values of ``p lg p'' can quickly reduce the expected value for
a _set_ of compression techniques.

Whereas a file might be compressed to 1/2 its size only 22% of the time using 
a SINGLE technique, using TWO such methods will halve the file 
39% of the time. Using 9 techniques will do so 89% of the time, etc.
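
Those multi-technique figures follow from the usual order-statistic identity
Pr(min of k independent tries < x) = 1 - (1 - F(x))^k, starting from the
single-technique value F(0.5) = 0.2199 (a quick check against the tables at
the end of this post):

F_half = 0.2199                      # Pr(a single technique gets below 1/2)
for k in (2, 3, 9):
    print(k, 1 - (1 - F_half) ** k)
# 2 -> 0.391444, 3 -> 0.525265, 9 -> 0.893008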

Obviously this is a GOOD THING if you can afford the computer
time. Various ideas obviously present themselves.  You might run several 
techniques in parallel and only output the one that seems to be performing the 
best (a little code value in the output might be nice to indicate when you 
switch methods :-).

Another idea is to simply HASH your file by XOR'ing it with 10 different
pseudo-random streams, compress each copy with the SAME technique,
and take the smallest result. More than about 90% of the time you should
get a 50% compression.
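
That last idea might look like the following in Python (zlib stands in for
`the SAME technique', and the whitening streams are just seeded PRNGs -- all
names are illustrative, not an actual implementation):

import random
import zlib

def best_of_k(data: bytes, k: int = 10):
    """XOR the input with k different pseudo-random byte streams,
    compress every copy with the same compressor, and keep the smallest
    output.  The index of the winning stream (0 = no mask) has to be
    stored alongside the result."""
    best_index, best_blob = 0, zlib.compress(data)
    for i in range(1, k + 1):
        rng = random.Random(i)
        masked = bytes(b ^ rng.randrange(256) for b in data)
        blob = zlib.compress(masked)
        if len(blob) < len(best_blob):
            best_index, best_blob = i, blob
    return best_index, best_blob

Whether the whitened copies actually compress better under a dictionary coder
is a separate question; the sketch only mechanizes the `take the smallest of
several tries' idea.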

My little technique is similar to this last -- keep compressing the string 
using different parts of the SAME random-number sequence. To do this
you require a generator that has no correlations. (One of the usual
multiplicative congruential techniques will probably not suffice).

If you now include _HISTORY_ information (i.e. knowledge about the
sequence of symbols in the string) you should be able to do even better
still, on average.

However -- a reality check. There are SOME cases, and they may not be
rare enough, where the string has the property whereby it can NOT
be compressed no matter WHAT the technique. (I rather like the neat
argument posted by Doug :-).

For those interested, my little tables follow.

Cheers all,

-kym
===

BTW, a question. How much computation do I need to perform to
guarantee my previous ``2pq'' result >99% of the time?


Probabilities that expected compression factor -(p lg p + (1-p) lg (1-p))
is less than a given value `x' by selecting the minimum result produced
by several statistically independent techniques.

x	Pr(min{X1,...X2}<x)	Two techniques
0.1	0.04957
0.2	0.120906
0.3	0.202908
0.4	0.290699
0.5	0.391444		39% of the time compression < 1/2
0.6	0.504947
0.7	0.61721
0.8	0.735084
0.9	0.865384

x	Pr(min{X1,...X3}<x)	Three techniques
0.1	0.0734258
0.2	0.175762
0.3	0.288356
0.4	0.402627
0.5	0.525265		53% of the time compression < 1/2
0.6	0.651681
0.7	0.763168
0.8	0.863648
0.9	0.95061

x	Pr(min{X1,...X4}<x)	Four techniques
0.1	0.0966828
0.2	0.227194
0.3	0.364645
0.4	0.496892
0.5	0.62966			63%
0.6	0.754923
0.7	0.853472
0.8	0.929819
0.9	0.981879

x	Pr(min{X1,...X5}<x)	Five techniques
0.1	0.119356
0.2	0.275417
0.3	0.432755
0.4	0.576283
0.5	0.711097		71%
0.6	0.827564
0.7	0.909343
0.8	0.963878
0.9	0.993351

x	Pr(min{X1,...X6}<x)	Six techniques
0.1	0.14146
0.2	0.320631
0.3	0.493563
0.4	0.643145
0.5	0.774627		77%
0.6	0.878674
0.7	0.943911
0.8	0.981408
0.9	0.997561

x	Pr(min{X1,...X7}<x)	Seven techniques
0.1	0.16301
0.2	0.363024
0.3	0.547853
0.4	0.699457
0.5	0.824187		82%
0.6	0.914635
0.7	0.965297
0.8	0.990431
0.9	0.999105

x	Pr(min{X1,...X8}<x)	Eight techniques
0.1	0.184018
0.2	0.402771
0.3	0.596324
0.4	0.746883
0.5	0.862848		86%
0.6	0.939937
0.7	0.97853
0.8	0.995075
0.9	0.999672

x	Pr(min{X1,...X9}<x)	Nine techniques
0.1	0.204499
0.2	0.440038
0.3	0.639598
0.4	0.786825
0.5	0.893008		89%
0.6	0.95774
0.7	0.986716
0.8	0.997465
0.9	0.99988

jj@alice.att.com (jj, like it or not) (03/30/91)

In article <1991Mar29.031127.9128@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:

>Consider the scheme I outlined earlier -- of comparing an input bitstream with 
>a random bitstream and just storing the differences in delta form (e.g. 
>fixed-length or Huffman coded). This doesn't _appear_ to use history -- the 
>next bit output by an uncompressing program doesn't use the previous bits of 
>the uncompressed string to obtain the next bit. The compressing program
>doesn't build up any dictionary. Or does it? (That's Q1).
What you're doing is building a model.  If your model fits, you can
do better than a historyless method.  The model itself is the
history, you have to choose the model (you call it a random number
generator; stochastic methods, LPC methods, and such suggest other
approaches) from the knowledge of the signal.  That's your
history.

If your model (whatever it is) doesn't fit, you'll likely need
MORE bits.

>-(p lg p + q lg q)=0.926346	<---	maybe this is log_2 instead of log_e?
Always log base 2.  Not ever log base e.  Sorry, thought that was
obvious from context, since we're talking about bits here.



>What gives? Perhaps there are additional O(1) factors involved in the 
>``p log p'' limit? (That's Q2).
Anytime you have a model of some sort, you aren't doing historyless
coding.  You are assuming SOMETHING about the signal to be coded
beyond the single-token distribution.
-- 
       -------->From the pyrolagnic keyboard of jj@alice.att.com<--------
Copyright alice!jj 1991,  all rights reserved,  except transmission  by USENET and
like free facilities granted.  Said permission is granted only for complete copies
that include this notice.    Use on pay-for-read services specifically disallowed.

jj@alice.att.com (jj, like it or not) (03/30/91)

Aww, come on.  Rather than use 10 methods, send some side info
that parameterizes the data you have.  It's not hard or
expensive to send the whole (*&( codebook, for heaven's sake,
as part of the message.
-- 
       -------->From the pyrolagnic keyboard of jj@alice.att.com<--------
Copyright alice!jj 1991,  all rights reserved,  except transmission  by USENET and
like free facilities granted.  Said permission is granted only for complete copies
that include this notice.    Use on pay-for-read services specifically disallowed.

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/02/91)

In article <20144@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>What you're doing is building a model.  If your model fits, you can
>do better than a historyless method.  The model itself is the
>history, you have to choose the model (you call it a random number
>generator; stochastic methods, LPC methods, and such suggest other
>approaches) from the knowledge of the signal.  That's your
>history.

I'm not sure I agree with you.

The little program I used previously to demonstrate that `p lg p' could be 
beaten by an historyless technique does not, to my mind, build any model of 
the data.

It does not examine the data to `fit' the random number generator to it 
(although I have others that do this too). It just compares the input with a 
bit stream and records where they're different and stores the `deltas'
between these positions in the output. Nothing is kept to `improve' anything 
as we go along; nothing is assumed before we start (the random bit
generator was selected at random). 

I guess the mere _selection_ of the random bit generator _may_ be considered 
building a model of the data -- in which case _any_ compression technique 
selects such a model & there _is_ no historyless compression (as you
seem to imply that model-building and historylessness are exclusive).

But to get back to my original question.

In article <20135@alice.att.com> jj@alice.att.com (jj, like it or not) writes:
>First, the question of "how much can I compress X" must
>be qualified.  If you have a historyless compression,
>you can compress up to the entropy ( sum of -p log p) bound,
                                                      ^^^^^^
>where p is the probability of each token.

Is p lg p a hard-and-fast bound or not? I still think it's an average.

-kym

weverka@spot.Colorado.EDU (Robert T. Weverka) (04/02/91)

In article <1991Apr2.034441.28170@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
...heavily edited...
>The little program I used previously to demonstrate that `p lg p' could be 
>beaten by an historyless technique does not, to my mind, build any model of 
>the data.
>
>In article <20135@alice.att.com> jj@alice.att.com (jj, like it or not) writes:
>>First, the question of "how much can I compress X" must
>>be qualified.  If you have a historyless compression,
>>you can compress up to the entropy ( sum of -p log p) bound,
>                                                      ^^^^^^
>>where p is the probability of each token.
>
>Is p lg p a hard-and-fast bound or not? I still think it's an average.
>
>-kym

It is an expectation value.  log(1/p_i) is the number of bits required for
a symbol of probability p_i.  The expectation of this over the symbol
probabilities is

E[log(1/p_i)] = sum over i of p_i log(1/p_i) = ( sum of -p log p)

jj@alice.att.com (jj, like it or not) (04/03/91)

In article <1991Apr2.034441.28170@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <20144@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>I guess the mere _selection_ of the random bit generator _may_ be considered 
>building a model of the data -- in which case _any_ compression technique 
Exactly.
>selects such a model & there _is_ no historyless compression (as you
>seem to imply that model-building and historylessness are exclusive).
No, you have a model that subsumes history, in that the RNG
has predictable outputs, and you use it as a filter.
That is different than knowing the single-token probabilities
alone.

>Is p lg p a hard-and-fast bound or not? I still think it's an average.

It's ALWAYS the average rate.  Given that you're using compression,
you can't measure any other thing, unless you have a fixed message
set, and that's all.  (In which case you should enumerate your
messages, perhaps huffmanwise...)  If you have  any set of data
with the overall 'p' known, that is the average rate for THE WHOLE
SET. Parts of it may do better, because they may hit shorter
symbols (for instance), but if you look at the probabilities
of THAT PART, you should find that you don't do better than
the p log p chosen over your locality, without a model.

I think you need to revisit the basic statistical assumptions
of compression.
-- 
       -------->From the pyrolagnic keyboard of jj@alice.att.com<--------
Copyright alice!jj 1991,  all rights reserved,  except transmission  by USENET and
like free facilities granted.  Said permission is granted only for complete copies
that include this notice.    Use on pay-for-read services specifically disallowed.

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/03/91)

In article <20159@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
[I wrote]
>>Is p lg p a hard-and-fast bound or not? I still think it's an average.

>It's ALWAYS the average rate.  Given that you're using compression,
>you can't measure any other thing, unless you have a fixed message
>set, and that's all.  (In which case you should enumerate your
>messages, perhaps huffmanwise...)  If you have  any set of data
>with the overall 'p' known, that is the average rate for THE WHOLE
>SET. Parts of it may do better, because they may hit shorter
>symbols (for instance), but if you look at the probabilities
>of THAT PART, you should find that you don't do better than
>the p log p chosen over your locality, without a model.

Ta.

-kym

chris@ug.cs.dal.ca (Chris Robertson) (04/03/91)

In article <15626@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:

>                              ...  any invertible method that maps the
> domain onto the range, which is known to map at least one point to a
> shorter representation, MUST map at least one other point to a longer
> representation.  (I think this was noted somewhere in Knuth.)


Wouldn't it depend on the characteristics of the data to be compressed?
Under certain restrictions, i.e., if there is redundancy built into the domain,
then why _couldn't_ you get a mapping such that every element of the domain
had a correlate shorter than itself?


 - chris

gwyn@smoke.brl.mil (Doug Gwyn) (04/03/91)

In article <1991Apr3.001158.12011@cs.dal.ca> chris@ug.cs.dal.ca (Chris Robertson) writes:
>Wouldn't it depend on the characteristics of the data to be compressed?

The domain was assumed to be that of all N-bit messages.

readdm@ccwf.cc.utexas.edu (David M. Read) (04/04/91)

In article <> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>But to get back to my original question.
>
>Is p lg p a hard-and-fast bound or not? I still think it's an average.
>
>-kym

Speaking as a physicist (who may be a little naive when it comes to
"information science"), if sum (p lg p) does indeed represent the
total entropy of the ensemble (a proposition which I am not sure I
trust entirely, but that's beside the point), then that is indeed the
limit of what your compressor can do.  You may not violate the Second
Law!

You can get arbitrarily close to that sum, but you may not go under
it, unless you *increase* entropy somewhere else.  Perhaps creating a
"history" (I missed that part of the discussion, I'm afraid) "creates"  
some entropy, which would allow you to dip below the initial entropy
of the ensemble, but it seems to me that doing that would also
increase the number of bits which you need to transmit...so where's the
"win?"

jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (04/04/91)

From article <46618@ut-emx.uucp>,
by readdm@ccwf.cc.utexas.edu (David M. Read):

>>Is p lg p a hard-and-fast bound or not? I still think it's an average.
> 
> You can get arbitrarily close to that sum, but you may not go under
> it, unless you *increase* entropy somewhere else.

Right, but, speaking as a Computer Scientist with a degree in Physics ...

Information theoretic entropy is mathematically identical to the entropy
of statistical mechanics.  I think that this is one of the biggest surprises
in 20th century science, given that Shannon didn't set out to apply
statistical thermodynamics to information flow problems.  Once he
recognized that the mathematics was the same, he was quite happy to
borrow names like entropy for the information theoretic quantities
that he needed to name.

Back to the subject: the entropy of a message is a hard and fast bound
on the compression only in a statistical sense.  If your compression
algorithm compresses some message to fewer bits than the entropy
indicates, then your algorithm must compress some other message to more
bits than the entropy indicates.  This is the tradeoff that parallels the
second law of thermodynamics.

In information theory, the entropy of a message can only be measured
relative to a statistical model of the source.  For example, we can use
observed letter frequencies to build a model for English, but having
done so, we have no guarantee that the language used from that point on
will correspond to the frequencies we measured.

Also, better source models, for example, letter pair frequencies or word
frequencies, lead to lower entropy values.
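
A small illustration of that last point (sample text and code are not Doug's):
measuring letter-pair statistics on the same sample typically gives a lower
per-letter figure than single-letter statistics alone.

from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

text = "the quick brown fox jumps over the lazy dog " * 50
order0 = entropy(Counter(text))                  # bits/letter from letter frequencies alone
order1 = entropy(Counter(zip(text, text[1:]))) - order0
# H(pair) - H(letter) ~ bits per letter given the previous letter
print(order0, order1)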

					Doug Jones
					jones@cs.uiowa.edu

victor@watson.ibm.com (Victor Miller) (04/04/91)

There are two kinds (at least) of information: statistical, of which
entropy is the answer (only in an expected sense), and computational,
as in Kolmogorov-Chaitin-Martin-Löf, etc.  My favorite example to get
people thinking is from a paper of Ziv and Lempel:  Construct a
sequence of bits as follows: first write down in order all 1 digit
binary numbers (with leading zeroes if they are there), then all two
digit, etc.  The surprising fact that they prove is that this sequence
is incompressible by ANY finite-state compressor (of which Huffman, or
arithmetic coders are good examples), but it is rather easy to see
that an efficient way of transmitting this sequence is to send a
program to generate it, along with the length of the sequence.
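
The sequence Victor describes is trivial to generate by program, which is
exactly the point (a sketch of the construction only, not of the Ziv-Lempel
incompressibility proof):

def binary_enumeration(max_width):
    """All 1-digit binary numbers in order, then all 2-digit ones (with
    leading zeroes), then all 3-digit ones, and so on."""
    chunks = []
    for width in range(1, max_width + 1):
        for value in range(2 ** width):
            chunks.append(format(value, "0%db" % width))
    return "".join(chunks)

print(binary_enumeration(3))   # 0100011011000001010011100101110111
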
--
			Victor S. Miller
			Vnet and Bitnet:  VICTOR at WATSON
			Internet: victor@watson.ibm.com
			IBM, TJ Watson Research Center

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/05/91)

In article <VICTOR.91Apr4102353@irt.watson.ibm.com> victor@watson.ibm.com writes:
>people thinking is from a paper of Ziv and Lempel:  Construct a
>sequence of bits as follows: first write down in order all 1 digit
>binary numbers (with leading zeroes if they are there), then all two
>digit, etc.  The surprising fact that they prove is that this sequence
>is incompressible by ANY finite-state compressor (of which Huffman, or
>arithmetic coders are good examples), but it is rather easy to see
>that an efficient way of transmitting this sequence is to send a
>program to generate it, along with the length of the sequence.

Hmmm, interesting.

I presume the same is true for all ``normal'' sequences over a given
alphabet?

-kym