[alt.comp.compression] theoretical compression factor

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/25/91)

I'm investigating a random number technique to compress text.

It runs along the following lines.

Some strings have `low complexity' in the sense of Martin-Löf
(the program that can type them out is much shorter than the
actual string).

For such strings it is obviously better to store or transmit the _program_ 
than the string. Some method needs to be found to enumerate the possible 
programs useful for this purpose.

I am thinking of using strings generated by pseudo-random means
(I understand in similar vein to LPC coding). The message is first analyzed 
to determine some basic statistics (e.g. frequencies of groups of bits of 
lengths 1 to N) and this is used to select a generator. The generator will 
generate strings with groups in the appropriate proportions -- this sequence 
is compared with the original message and only the `differences' saved. The 
idea is the generator will be able to create a string that is very similar to 
the original and thereby reduce the number of differences that need to be
stored/sent.
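To make this a bit more concrete, here is a rough sketch in Python of the kind
of procedure I have in mind (the names -- compress_sketch, num_candidate_seeds
and so on -- are invented for this post; it illustrates the idea rather than
being a working implementation of the real scheme):

import random

# Pick whichever of a family of seeded generators best matches the message,
# then keep the seed, the bias p, and the positions where the two differ.
def compress_sketch(message_bits, num_candidate_seeds=100):
    p = sum(message_bits) / len(message_bits)      # proportion of 1's in the message
    best_seed, best_diffs = None, None
    for seed in range(num_candidate_seeds):
        rng = random.Random(seed)
        candidate = [1 if rng.random() < p else 0 for _ in message_bits]
        diffs = [i for i, (a, b) in enumerate(zip(message_bits, candidate)) if a != b]
        if best_diffs is None or len(diffs) < len(best_diffs):
            best_seed, best_diffs = seed, diffs
    return best_seed, p, best_diffs

# Reconstruction: rerun the chosen generator and flip the recorded positions.
def decompress_sketch(seed, p, diffs, length):
    rng = random.Random(seed)
    bits = [1 if rng.random() < p else 0 for _ in range(length)]
    for i in diffs:
        bits[i] ^= 1
    return bits

The hope, of course, is that the seed plus the list of differences takes fewer
bits to write down than the message itself.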

Besides a great deal of freedom in generating the random strings,
another field of experimentation seems to be the method of _comparison_.
Some kinds of `differences' will be more sensitive than others to
certain types of noise. For example, if the `random string' is:

	010110111011101111011111

and the original message was:

	000010110111011110111110

there are 9 differences using `phase 0' but none using `phase 3'.

What kind of compression factors might be expected? Let's take a simple case 
of bitstrings.

Suppose we have a non-random string in which a proportion p of the bits are 
1's and (1-p) are 0's. If we generate a random string of bits with this same 
proportion of 1's and 0's, in how many places will they differ? If we call the 
message A and the random string B we have:

Prob((A=1 & B=0) or (A=0 & B=1)) =
	Prob(A=1)Prob(B=0) + Prob(A=0)Prob(B=1)

Independence is presumed (which is not absolutely true since we analyze the 
message string to _set up_ the generator). We then find

Prob(difference) = 2p(1-p)

In a message of N bits we would then find 2p(1-p)N differences on average. 
We could adjust a phase parameter to further minimize this number of
differences, but the O(N) dependency would remain (I _think_).
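A quick numerical check of the 2p(1-p) figure, for what it's worth (my own
little script -- two independent bit sources with the same bias):

import random

def expected_vs_observed(p, n=100000, seed=1):
    rng = random.Random(seed)
    a = [1 if rng.random() < p else 0 for _ in range(n)]
    b = [1 if rng.random() < p else 0 for _ in range(n)]
    observed = sum(x != y for x, y in zip(a, b)) / n
    return 2 * p * (1 - p), observed

print(expected_vs_observed(0.56))   # predicted 0.4928; observed comes out close to that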

To send all the `differences' we just encode the sequence of positions i_j
where A_{i_j} != B_{i_j + k} (k being the phase). This can be done using
`delta modulation' (i.e. the differences i_{j+1} - i_j are encoded). The
average size of each difference will be O(log N/N) bits.
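In code, the delta step is nothing more than storing the gaps between
successive difference positions (illustrative helpers only; a real encoder
would still have to write the gaps out in some variable-length code):

def positions_to_gaps(diff_positions):
    gaps, prev = [], -1
    for i in diff_positions:
        gaps.append(i - prev)   # gap from the previous difference
        prev = i
    return gaps

def gaps_to_positions(gaps):
    positions, prev = [], -1
    for g in gaps:
        prev += g
        positions.append(prev)
    return positions

print(positions_to_gaps([4, 9, 10, 37]))   # -> [5, 5, 1, 27]
print(gaps_to_positions([5, 5, 1, 27]))    # -> [4, 9, 10, 37]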

We therefore have a total compressed string of

~	2p(1-p)N log N/N = 2p(1-p) log N

(ignoring the parameters required to reconstitute the message; these would 
also need to be sent but are of O(log N) size anyway) and the 
compression ratio:

	|compressed string|
	---------------
	|original string|

~	2p(1-p) log N / N

I presume the constant of proportionality will be fairly large since with the 
following strings I get only 80-90% compression factors (using the best of 
100 random bit generators chosen at random :-):

abcdefghijklmnopqrstuvwxyz<NEWLINE>
189 bits
max compression=0.873016

this is a test message of a very short string<NEWLINE>
322 bits
max compression=0.813665

These small examples are not very impressive, but the complexity
of the method as N->inf is interesting (log N/N).

Any comments/suggestions? (Please be nice if I've made another of my
classic tremendous goofs :-).

-kym

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/25/91)

I found one little problem with my algebra & understanding.
Delta modulation takes N bits into O(N) bits, NOT less. 

The theoretical compression factor for my binary strings example
would therefore be

	2p(1-p)

where p is the proportion of 0's (or 1's -- it doesn't matter).

For /usr/dict/words the proportion of 1's is about 56% (only
considering the lsb 7 bits of each char). This should give
a compression factor of about 49% with the method outlined.

Tables of random ascii digits contain about .38 1's (taking only lsb 7
bits) which leads to a similar .47 compression factor. If we massage
digit strings to 4-bit groups (by subtracting the '0') we find only
about 6% 1's and therefore the compression factor would be .1 (i.e. a 1Mb 
file of random digits -> 100Kb).
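For anyone who wants to repeat the measurement, something along these lines is
all that's involved (a reconstruction, not the exact script I used): count the
1-bits in the low 7 bits of each byte and print the 2p(1-p) figure.

import sys

def ones_fraction_low7(path):
    ones = total = 0
    with open(path, "rb") as f:
        for byte in f.read():
            ones += bin(byte & 0x7f).count("1")   # only the lsb 7 bits of each char
            total += 7
    return ones / total

if __name__ == "__main__":
    p = ones_fraction_low7(sys.argv[1])
    print("p =", p, " 2p(1-p) =", 2 * p * (1 - p))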

Again, if there are any glaring errors, please comment.

-kym

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/25/91)

In article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> The theoretical compression factor for my binary strings example
> would therefore be
> 	2p(1-p)
> where p is the proportion of 0's (or 1's -- it doesn't matter).

There is no way that a compressor can turn every string with as many 0's
as 1's into a string of half the length.

---Dan

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/26/91)

In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> There is no way that a compressor can turn every string with as many 0's
> as 1's into a string of half the length.

To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n
with n bits set. This exceeds (2n/e)^(2n) / (n/e)^(2n)(2\pi n)exp(1/6n),
i.e., 2^(2n) / (2\pi n)exp(1/6n). So for large n there are at least
2n - 3 - log n bits of information in such strings, and there is no way
they can all be compressed to length n, as the method in question claims
to do.

---Dan

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/26/91)

In <1991Mar25.031214.25696@bingvaxu.cc.binghamton.edu> I wrote:
>Some strings have `low complexity' in the sense of Martin-Lo{f.
^^^^^^^^^^^^^
>(The program that can type them out is much shorter than the
>actual string).
>For such strings it is obviously better to store or transmit the _program_ 
     ^^^^^^^^^^^^
>than the string. Some method needs to be found to enumerate the possible 
>programs useful for this purpose.

In <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) wrote:
>There is no way that a compressor can turn every string with as many 0's
					    ^^^^^
>as 1's into a string of half the length.

In case I've misrepresented myself, as Dan perhaps thinks, I've included some 
introductory text from my original message.

Dan's point is of course correct -- there are certain bitstrings whose 
complexity prohibits any program from being written that is in any
sense `shorter' than the given bitstring. Essentially the only programs that
can reproduce such bitstrings are:

	PRINT ``SEQUENCE-OF-BITS''

However, by the definition of certain mathematicians, such bitstrings would be 
regarded as essentially ``random'' (see, e.g., Knuth v2 -- who is/was _not_ one 
of these) and I _think_ the number of text files, with which I am primarily
concerned, that exhibit such a property is vanishingly small;
text files on a computer are USUALLY generated by programs shorter than they 
are!

-kym

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/26/91)

My last attempt to post this article was either bumped or dumped in
Dan's IN tray -- I don't know which. Sorry if you've seen it before.
===

Thanks, Dan, for your input. 

In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> There is no way that a compressor can turn every string with as many 0's
> as 1's into a string of half the length.

In article <18263:Mar2518:10:4091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n
> with n bits set. This exceeds (2n/e)^(2n) / (n/e)^(2n)(2\pi n)exp(1/6n),
> i.e., 2^(2n) / (2\pi n)exp(1/6n). So for large n there are at least
> 2n - 3 - log n bits of information in such strings, and there is no way
> they can all be compressed to length n...

While I have to agree with Dan's argument above, a question that would be 
relevant to this group is WHAT FRACTION of strings can be compressed to 
strings of, say, <= 1/2 their length? Using the (possibly buggy) argument 
below, I find this to be ~ 1/sqrt(N). I would be interested in any refinement 
or other pointers.
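As a sanity check on Dan's bound first (my arithmetic, not his), the
information content of the balanced strings does indeed exceed 2n - 3 - log n:

from math import comb, log2

for n in (10, 100, 500):
    info = log2(comb(2 * n, n))          # bits needed to index a balanced 2n-bit string
    bound = 2 * n - 3 - log2(n)
    print(n, round(info, 2), round(bound, 2), info > bound)   # True each time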

\begin{argument}

One way we might approximate an answer is to consider `random' number
generators (sorry, this is an old axe :-).

A generator of bits, say, takes a number of parameters and can output, 
if properly managed, a sequence of 2^n bits. In certain circumstances we can 
guarantee that different n-bit parameter values will generate different bit 
sequences.

Let's say we have a random bit generator N bits long (in some kind of
Turing/Wang formalism). It can generate a sequence of 2^N bits. By changing any 
of the bits in the N-bit program the sequence produced will be different. 
(I am talking about the limit of large N here, where _most_ of the N bits of 
the generator are in fact parameters and not much is actual code, where
bit changes are typically `uninteresting' :-).

We therefore arrive at:

N-bit `program'			=>	2^N bits
2^N possible N-bit programs	=>	(2^N)^2 bits

Therefore the number of 2N-bit strings that can be generated (by this 
means) by N-bit programs is (2^N)^2/(2N) = 2^(2N-1)/N.

Obviously there are a total of 2^(2N) 2N-bit strings so our random bit 
programs can generate the fraction:

	2^(2N-1)/N
	----------
	2^(2N)

=	1 / 2N

of them.

Now, the number of 2N-bit strings with 1/2 1's is (2N)!/(N!)^2.  Using the 
simple approx n! ~ sqrt(2 n pi) (n/e)^n (actually it's larger by a factor of 
about (1 + 1/12n)) we find this last to be about:

	sqrt(2 2N pi) (2 N/e)^(2N)
~	--------------------------
	2 N pi (N/e)^(2N)

	2^(2N)
=	------
	sqrt(N pi)

So the fraction of 2N-bit strings that have 1/2 1's is therefore

	2^(2N)
~	--------- / 2^(2N)
	sqrt(N pi)

=	1 / sqrt(N pi)

What proportion of these might our N-bit programs be capable of generating?
All things being equal we find:

	1/2N
~	----------
	1/sqrt(N pi)

=	(1/2) sqrt(pi/N)

which obviously ->0 (and is only good for large N anyway).

\end{argument}

If N were in the 10k's of bits (typical of files in my directory, anyway) then 
we would expect a 50% compression of `balanced' files using random number 
generators to occur in ~ 1% of cases -- not too often! 
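A spot check of the two approximations (taking the counting argument at face
value; the exact binomial is easy enough to compute directly):

from math import comb, sqrt, pi

for N in (100, 10000):
    balanced = comb(2 * N, N) / 4 ** N       # exact fraction of 2N-bit strings with N ones
    print(N, balanced, 1 / sqrt(N * pi), 0.5 * sqrt(pi / N))

For N = 10000 the last column comes out near 0.009, which is where the
`~ 1% of cases' figure above comes from.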

As a kind of VERY ROUGH experiment, I ran all my current files through the Unix 
compress utility and selected those that had close to 1/2 1's. We find the 
following:

Fraction 1's:	Reduction factor:

0.506969	38.07% 
0.47619		-33.33%
0.492063	-55.55%
0.491462	33.09%
0.502947	36.47%
0.503021	35.94%
0.478355	26.01%
0.492063	-55.55%
0.49483		30.57%
0.47619		-55.55%
0.510204	32.50%
0.47619		-55.55%
0.525429	31.15%
0.505414	22.98%
0.487941	12.12%
0.514909	48.48%

Out of about 100 files, the ones above were those whose fraction of 1's fell 
between about .48 and .52.  That gave 16 attempts at compression (using the 
`standard' Lempel-Ziv adaptive coding), of which 5 failed.  The average 
reduction where successful (geometric mean) was 30.04% (i.e. files compressed 
to about 70% of their original length).
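For the record, the 30.04% figure is just the geometric mean of the successful
entries in the table; the calculation (reconstructed here, not the script I
actually ran) is:

from math import exp, log

successful = [0.3807, 0.3309, 0.3647, 0.3594, 0.2601, 0.3057,
              0.3250, 0.3115, 0.2298, 0.1212, 0.4848]
print(exp(sum(log(x) for x in successful) / len(successful)))   # ~ 0.3004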

Only 1 -- the last -- came _close_ to 50% reduction. Maybe a statistical
accident? But then again this utility doesn't use the method I was
talking about...

Not having a fully-working version of the compression algorithm I was 
discussing in my original post to run on text files, my remaining argument 
must run along the lines of ``text files ain't random'', and therefore 
the ~ 1/sqrt(N) argument above does not apply.

-kym

ttobler@unislc.uucp (Trent Tobler) (03/27/91)

From article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu>, by kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell):
> 
> I found one little problem with my algebra & understanding.
> Delta modulation takes N bits into O(N) bits, NOT less. 
> 
> The theoretical compression factor for my binary strings example
> would therefore be
> 
> 	2p(1-p)
> 
> where p is the proportion of 0's (or 1's -- it doesn't matter).
> 
> For /usr/dict/words the proportion of 1's is about 56% (only
> considering the lsb 7 bits of each char). This should give
> a compression factor of about 49% with the method outlined.
> 
> Tables of random ascii digits contain about .38 1's (taking only lsb 7
> bits) which leads to a similar .47 compression factor. If we massage
> digit strings to 4-bit groups (by subtracting the '0') we find only
> about 6% 1's and therefore the compression factor would be .1 (i.e. a 1Mb 
> file of random digits -> 100Kb).
> 
> Again, if there are any glaring errors, please comment.
> 

Glaring error #1:
  if maximum compression of a binary string is 2p(1-p), what is maximum
compression of the binary data...

 01010101010101010101010101010101

Maximum compression is defined by the entropy of the file, or in other words,
given a file, how many bits are required to determine uniquely the next
bit of information.  What is hoped for is that the number of bits is less
than 1 for the files you want.  Dr. Dobb's Journal had some articles that talked
about compression in #173, Feb. 1991.  I can give you some of the references...

  Huffman, D.A.
  "A Method for the Construction of Minimum-Redundancy Codes."
  _Proceedings of the Institute of Radio Engineers_ 40 (9): 1098-1101,
        September, 1952

  Shannon, C.E. & W. Weaver
  _The Mathematical Theory of Communication_
  Urbana, Illinois:  Univ. of Illinois Press, 1949

  Lelewer, Debra A.  and  Hirschberg, Daniel S.
  "Data Compression."
  _ACM Computing Surveys_, vol. 19, no. 3 New York, September, 1987

For more, I suggest you look at the Dr. Dobbs Journal I mentioned.
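To illustrate the `bits to determine the next bit' point (a rough sketch of
mine, not taken from the articles above): the alternating string has p = 1/2,
so 2p(1-p) says 0.5, but once you condition on the previous bit its entropy is
essentially zero.

from math import log2
from collections import Counter

def order0_entropy(bits):
    # bits per symbol if each bit is modelled independently
    n = len(bits)
    return sum(-c / n * log2(c / n) for c in Counter(bits).values())

def order1_entropy(bits):
    # bits per symbol given the immediately preceding bit
    pairs = Counter(zip(bits, bits[1:]))
    prev = Counter(bits[:-1])
    n = len(bits) - 1
    return sum(-c / n * log2(c / prev[a]) for (a, b), c in pairs.items())

s = "01" * 16
print(order0_entropy(s), order1_entropy(s))   # 1.0 vs 0.0 bits per bit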

--
   Trent Tobler - ttobler@csulx.weber.edu

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/27/91)

In article <1991Mar26.220519.1242@unislc.uucp> ttobler@unislc.uucp (Trent Tobler) writes:
>From article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu>, by kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell):
>> would therefore be
>> 	2p(1-p)
>> where p is the proportion of 0's (or 1's -- it doesn't matter).
>Glaring error #1:
>  if maximum compression of a binary string is 2p(1-p), what is maximum
>compression of the binary data...
> 01010101010101010101010101010101
[further explanation and some references]

Thanks for your comments Trent.

I must be doing _something_ right -- 1/2 the respondents say ``too low''
and the other half say ``too high''. :-)

Perhaps I was a little obscure with the subject line or something, but I
have received a lot of email regarding examples similar to the above.

What I'll do here is (a) show you some _measurements_, and then (b)
present a very informal argument.  (I will attempt to show that
EVEN THE 01 EXAMPLE ABOVE can not generally be represented in much
fewer than a given number of bits, despite claims that it comes
`for free').

PART 1 -- some numbers.
======================

After scrounging around all over my disks I came up with about 100
files with a roughly equal number of 0's and 1's (i.e. a fraction of 1's
between 0.45 and 0.55). I then ran `compress' over them and determined the
average ratio of sizes after/before. It worked out to about 67% over all 
attempts; about 56% in those cases where compression was actually
successful. A log file is appended for those who may like to analyse it 
(probably there's plenty it can tell you about my disks, and plenty it might 
tell you about how `random' average text files in various programming languages 
really are).

PART 2 -- an argument.
=====================

Let's take the 01 example above. (I have also received much email of similar
ilk). It is claimed that this sequence can be represented in a small number 
of bits. But this can not _necessarily_ be done.

Let me ask the question -- how _long_ is this sequence of 01's? If, let's say, 
the sequence of 01's is 2^1000000 bits long we must encode the _length_ as 
part of the compressed code. Representing this length does not come for free. 
We might for example compute the length as a binary number and then use some 
compression technique to compress _that_ into a compressed format. Eventually 
this kind of `iterated compression' comes to a fixpoint.

Let's just say, without proof, that to encode the 01 sequence repeated N times 
takes O(log N) bits. I understand this point is covered in the Shannon paper 
that Trent indicated.

This is obviously much less than 1/2 as N increases, but it _is_ only a 
single example. With the technique above we can _only_ represent 01 sequences. 
To represent other simple cycles will require more bits to `name' the 
particular cycle (possibly in a `deeper' compressed format).
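A sketch of the sort of `name the cycle' cost I mean (an invented encoding,
ignoring framing details): send the pattern verbatim, its length, and the
repeat count.

from math import ceil, log2

def cycle_code_cost(pattern, repeats):
    pattern_bits = len(pattern)
    length_bits  = ceil(log2(pattern_bits + 1))
    repeat_bits  = ceil(log2(repeats + 1))
    return pattern_bits + length_bits + repeat_bits

print(cycle_code_cost("01", 1000000))   # about 24 bits standing for 2,000,000 message bits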

As Dan Bernstein pointed out in a previous article, and is discussed
briefly in Knuth v2, there are a very large number :-) of strings
that CAN NOT be represented by ANY program shorter than themselves.
I think there are sufficiently many of these degenerate examples to bring any
average figure well up past the (only APPARENT, from the 01 example
above) log N/N limit.

CONCLUSION
==========

My little formula was not intended to EQUAL the compression in every case, 
for this is obviously not possible (it can not be a function of just p
for one thing).  Nor was it intended as an AVERAGE over EVERY case. It was 
intended as a ballpark figure after making certain assumptions about the 
specific underlying compression technique. (I am rather pleased, however, with
its predictive ability in the p=1/2 case as illustrated for LZ compression, 
below).

Cheers,

-kym
===

BTW, some people around SUNY-B claim I do such measurements FIRST,
then come up with a theory and subsequently produce them at an opportune
moment -- this is usually not the case, however. I almost NEVER have a 
theory :-).

=================LOG FILE====================

Filename	after size/before size
		using Sun LZ `compress'

:mkshortnames	0.524213
:techinstall	0.751825
ApolloGraphicsMakefile	0.346165
CONTENTS	0.629024
Comp.lang.prolog	0.567824
DPram2.1	1
DPram2.1i	1
DProm.1	1
DProm.1i	1
INDEX	0.771619
MKNEW	0.646409
MKOLD	0.644444
Makefile.vax	0.399232
PORT.LOG	0.63328
READ_ME	0.633803
RR-INRIA-1320.dvi	0.549019
ReadMe	0.621053
abc10201.zip	1
apollo.c	0.610215
cal_findresist.1	1
cal_makefile.1	1
change.l	0.674312
chconfig.1	0.690476
chconfig.1i	0.690476
cman	0.911392
comp.sources.index	0.527516
deutsch.net	0.210469
displays.proto	0.836957
dpram2.1	1
dpram2.1i	1
dprom.1	1
dprom.1i	1
drc.1	1
drc.1i	1
esim.1	0.511555
espresso.1	0.571458
esptype.h	0.357095
ex.rc	1
exrc	1
ext.h	0.403077
extio.h	1
extra+.c	1
filehead.h	0.603837
gemini.1	0.516526
gen_control.1	0.550769
gen_control.1i	0.550769
geometry.3	0.495135
grApollo1.old	0.44325
grApollo4.c	0.522812
grApolloInt.h	0.622093
grOmegaInt.h	0.694301
grSunInt.h.old	0.595443
gram	0.627575
if.how	1
install	1
laygen.1	1
les.txt	0.541202
m.bat	1
m2cmp20.zip	1
m2doc20.zip	1
m2exa20.zip	1
m2lib20.zip	1
m2utl20.zip	1
magic.1	0.417558
magicutils.3	0.609205
makewhatis.csh	0.717697
mkdisttape	0.599204
mod2txt.arc	1
nc.1	0.589175
ndep.down.out	0.909836
ndep.up.out	1
nenh.down.out	0.396806
nenh.up.out	0.422425
nenhp.down.out	0.4109
netgen.1	1
netlist.1	0.598494
netlist.1i	0.598494
nload.up.out	1
nsuper.down.out	0.900826
nsuper.up.out	1
ohmics.1	0.532989
om-esubs.c	0.453863
om-subs.h	0.497336
out_struct.h	0.641509
panda.h	0.482293
paslib.i	0.428077
paslib2.i	0.42933
paslib3.i	0.427255
paslib4.i	0.428455
pattern.c	0.588988
plowDebugInt.h	0.691729
presim.1	0.636028
presim.1i	0.636028
procs.h	0.349669
prolog.lib	0.540272
rofix	1
route1.net	0.742857
route2.net	0.742857
rsleeper	0.893617
rsleeper.csh	0.893617
runstats.3	0.602226
s.Makefile	0.425905
scanner.h	0.714286
select.h	0.607029
showcdrcs.1	1
showcdrcs.1i	1
showsdrcs.1	1
showsdrcs.1i	1
signals.h	0.754472
simsort.1	1
spice_cor.1	0.544245
statlib.index	0.619519
tags	0.433903
test4.m	0.402351
test6.m	0.407745
tlogic.c	0.520635
ttable.c	0.164343
tut7b.net	0.625731
tut7d.net	0.625731
util.c	0.620143
valtbs.1	1
vaxu_unsw_gcc.bench	0.949367
win.c	0.584328
win.c-	0.572215
win.unx	0.584328
wireUndo.c	0.48959


geo avg of all=0.665099
geo avg of `successful' cases= 0.557706

====================END END END====================