kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/25/91)
I'm investigating a random number technique to compress text. It runs along the following lines.

Some strings have `low complexity' in the sense of Martin-Löf. (The program that can type them out is much shorter than the actual string.) For such strings it is obviously better to store or transmit the _program_ than the string. Some method needs to be found to enumerate the possible programs useful for this purpose.

I am thinking of using strings generated by pseudo-random means (in a similar vein, I understand, to LPC coding). The message is first analyzed to determine some basic statistics (e.g. frequencies of groups of bits of lengths 1 to N) and this is used to select a generator. The generator will generate strings with groups in the appropriate proportions -- this sequence is compared with the original message and only the `differences' saved. The idea is that the generator will be able to create a string that is very similar to the original and thereby reduce the number of differences that need to be stored/sent.

Besides a great deal of freedom in generating the random strings, another field of experimentation seems to be the method of _comparison_. Some kinds of `differences' will be more sensitive than others to certain types of noise. For example, if the `random string' is:

	010110111011101111011111

and the original message was:

	000010110111011110111110

there are 9 differences using `phase 0' but none using `phase 3'.

What kind of compression factors might be expected? Let's take a simple case of bitstrings. Suppose we have a non-random string with p 1's and (1-p) 0's. If we generate a random string of bits with this same proportion of 1's and 0's, in how many places will they differ? If we call the message A and the random string B we have:

	Prob(A=1 & B=0  or  A=0 & B=1) = Prob(A=1)Prob(B=0) + Prob(A=0)Prob(B=1)

Independence is presumed (which is not absolutely true, since we analyze the message string to _set up_ the generator). We then find

	Prob(difference) = 2p(1-p)

In a message of N bits we would then find 2p(1-p)N differences on average. We could adjust a phase parameter to further minimize this number of differences, but the O(N) dependency would remain (I _think_).

To send all the `differences' we just encode the sequence of i_j where A_{i_j} != B_{i_j + k}. This can be done using `delta modulation' (i.e. the differences of the i_{j+1} and i_j are encoded). The average size of each difference will be O(log N/N) bits each. We therefore have a total compressed string of

	~ 2p(1-p) N log N/N  =  2p(1-p) log N

(ignoring the parameters required to reconstitute the message; these would also need to be sent but are of O(log N) size anyway) and the compression ratio:

	|compressed string|
	-------------------  ~  2p(1-p) log N / N
	|original string|

I presume the constant of proportionality will be fairly large since with the following strings I get only 80-90% compression factors (using the best of 100 random bit generators chosen at random :-):

	abcdefghijklmnopqrstuvwxyz<NEWLINE>
		189 bits   max compression=0.873016

	this is a test message of a very short string<NEWLINE>
		322 bits   max compression=0.813665

These small examples are not very impressive, but the complexity of the method as N->inf is interesting (log N/N).

Any comments/suggestions? (Please be nice if I've made another of my classic tremendous goofs :-).

-kym
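
P.S. Here is a minimal sketch of the kind of experiment I mean (NOT the program I am actually running -- the generator here is just a throwaway LCG, and the `message' is made up on the spot with a known bias):

#include <stdio.h>

#define N 100000                /* message length in bits */

static unsigned long seed = 12345;

/* throwaway LCG returning a uniform deviate in [0,1) */
static double frand(void)
{
    seed = seed * 1103515245UL + 12345UL;
    return (double)((seed >> 8) & 0xffffffUL) / (double)0x1000000UL;
}

int main(void)
{
    static char msg[N], rnd[N];
    long i, ones = 0, diffs = 0;
    double p;

    /* synthetic `message' with a known bias; a real run would read a file */
    for (i = 0; i < N; i++) {
        msg[i] = frand() < 0.56;
        ones += msg[i];
    }
    p = (double)ones / N;

    /* generator tuned to the measured bias, compared bit for bit */
    for (i = 0; i < N; i++) {
        rnd[i] = frand() < p;
        diffs += (rnd[i] != msg[i]);
    }

    printf("p                    = %g\n", p);
    printf("observed differences = %ld\n", diffs);
    printf("predicted 2p(1-p)N   = %g\n", 2.0 * p * (1.0 - p) * N);
    return 0;
}

The real scheme would also search over phases and generators and then delta-code the difference positions; this only checks the 2p(1-p)N difference count.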
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/25/91)
I found one little problem with my algebra & understanding. Delta modulation takes N bits into O(N) bits, NOT less.

The theoretical compression factor for my binary strings example would therefore be

	2p(1-p)

where p is the proportion of 0's (or 1's -- it doesn't matter).

For /usr/dict/words the proportion of 1's is about 56% (only considering the lsb 7 bits of each char). This should give a compression factor of about 49% with the method outlined.

Tables of random ascii digits contain about .38 1's (taking only lsb 7 bits) which leads to a similar .47 compression factor. If we massage digit strings to 4-bit groups (by subtracting the '0') we find only about 6% 1's and therefore the compression factor would be .1 (i.e. a 1Mb file of random digits -> 100Kb).

Again, if there are any glaring errors, please comment.

-kym
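
P.S. For anyone who wants to check the proportions on their own files, this is roughly the sort of counting I mean (a sketch only; the name `bitbias' in the usage line is just made up, and it looks at the low 7 bits of each character, as above):

#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *fp;
    int c, i;
    long ones = 0, bits = 0;
    double p;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: bitbias file\n");
        return 1;
    }
    while ((c = getc(fp)) != EOF) {
        for (i = 0; i < 7; i++)         /* lsb 7 bits of each char */
            ones += (c >> i) & 1;
        bits += 7;
    }
    fclose(fp);
    if (bits == 0) {
        fprintf(stderr, "bitbias: empty file\n");
        return 1;
    }
    p = (double)ones / bits;
    printf("p(1) = %g   predicted factor 2p(1-p) = %g\n",
        p, 2.0 * p * (1.0 - p));
    return 0;
}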
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/25/91)
In article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:

> The theoretical compression factor for my binary strings example
> would therefore be
>	2p(1-p)
> where p is the proportion of 0's (or 1's -- it doesn't matter).

There is no way that a compressor can turn every string with as many 0's as 1's into a string of half the length.

---Dan
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/26/91)
In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

> There is no way that a compressor can turn every string with as many 0's
> as 1's into a string of half the length.

To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n with n bits set. This exceeds

	(2n/e)^(2n) / [ (n/e)^(2n) (2\pi n) exp(1/6n) ],

i.e.,

	2^(2n) / [ (2\pi n) exp(1/6n) ].

So for large n there are at least 2n - 3 - log n bits of information in such strings, and there is no way they can all be compressed to length n, as the method in question claims to do.

---Dan
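
P.S. A quick numeric check of that bound (not part of the argument; it just uses lgamma() for the log-factorials -- compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double n, info, bound;

    for (n = 8; n <= 4096; n *= 2) {
        /* log2 of (2n)!/(n!)^2, the information content */
        info = (lgamma(2.0*n + 1.0) - 2.0*lgamma(n + 1.0)) / log(2.0);
        /* the lower bound claimed above */
        bound = 2.0*n - 3.0 - log(n)/log(2.0);
        printf("n = %4.0f   log2 C(2n,n) = %8.2f   2n - 3 - lg n = %8.2f\n",
            n, info, bound);
    }
    return 0;
}

The first column should always come out at least as large as the second.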
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/26/91)
In <1991Mar25.031214.25696@bingvaxu.cc.binghamton.edu> I wrote:

>Some strings have `low complexity' in the sense of Martin-Löf.
>                   ^^^^^^^^^^^^^^
>(The program that can type them out is much shorter than the
>actual string).
>For such strings it is obviously better to store or transmit the _program_
>                                                                 ^^^^^^^^^
>than the string. Some method needs to be found to enumerate the possible
>programs useful for this purpose.

In <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) wrote:

>There is no way that a compressor can turn every string with as many 0's
>                                         ^^^^^
>as 1's into a string of half the length.

In case I've misrepresented myself as Dan perhaps thinks, I've included some introductory text from my original message.

Dan's point is of course correct -- there are certain bitstrings whose complexity prohibits any program from being written that is in any sense `shorter' than the given bitstring. Essentially the only programs that can reproduce such bitstrings are:

	PRINT ``SEQUENCE-OF-BITS''

However, by the definition of certain mathematicians, such bitstrings would be regarded as essentially ``random'' (see, e.g., Knuth v2, who is/was _not_ one of these), and I _think_ the number of text files, with which I am primarily concerned, that exhibit such a property is vanishingly small; text files on a computer are USUALLY generated by programs shorter than they are!

-kym
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/26/91)
My last attempt to post this article was either bumped or dumped in Dan's IN tray -- I don't know which. Sorry if you've seen it before.

===

Thanks, Dan, for your input.

In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> There is no way that a compressor can turn every string with as many 0's
> as 1's into a string of half the length.

In article <18263:Mar2518:10:4091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n
> with n bits set. This exceeds (2n/e)^(2n) / (n/e)^(2n)(2\pi n)exp(1/6n),
> i.e., 2^(2n) / (2\pi n)exp(1/6n). So for large n there are at least
> 2n - 3 - log n bits of information in such strings, and there is no way
> they can all be compressed to length n...

While I have to agree with Dan's argument above, a question that would be relevant to this group is WHAT FRACTION of strings can be compressed to strings of, say, <= 1/2 their length? Using the (possibly buggy) argument below, I find this to be ~ 1/sqrt(N). I would be interested in any refinement or other pointers.

\begin{argument}

One way we might approximate an answer is to consider `random' number generators (sorry, this is an old axe :-). A generator of bits, say, takes a number of parameters and can output, if properly managed, a sequence of 2^n bits. In certain circumstances we can guarantee that different n-bit parameter values will generate different bit sequences.

Let's say we have a random bit generator N bits long (in some kind of Turing/Wang formalism). It can generate a sequence of 2^N bits. By changing any of the bits in the N-bit program the sequence produced will be different. (I am talking about the limit of N here, where _most_ of the N bits of generator are in fact parameters and not much is actual code where bit changes are typically `uninteresting' :-).

We therefore arrive at:

	N-bit `program'               =>  2^N bits
	2^N possible N-bit programs   =>  (2^N)^2 bits

Therefore the number of 2N-bit strings that can be generated (by this means) by N-bit programs is (2^N)^2/(2N) = 2^(2N-1)/N. Obviously there are a total of 2^(2N) 2N-bit strings, so our random bit programs can generate the fraction:

	2^(2N-1)/N
	----------  =  1 / 2N
	  2^(2N)

of them.

Now, the number of 2N-bit strings with 1/2 1's = (2N)!/(N!)^2. Using the simple approx n! = sqrt(2 n pi) (n/e)^n (actually it's larger by a factor of about O(1+1/12n)) we find this last to be about:

	    sqrt(2 2N pi) (2N/e)^(2N)
	~   -------------------------
	       2 N pi (N/e)^(2N)

	      2^(2N)
	=   ----------
	    sqrt(N pi)

So the fraction of 2N-bit strings that have 1/2 1's is therefore

	      2^(2N)
	~   ----------  /  2^(2N)   =   1 / sqrt(N pi)
	    sqrt(N pi)

What proportion of these might our N-bit programs be capable of generating? All things being equal we find:

	       1/2N
	~   ------------   =   1/2 sqrt(pi/N)
	    1/sqrt(N pi)

which obviously -> 0 (and is only good for large N anyway).

\end{argument}

If N were in the 10k's of bits (typical of files in my directory, anyway) then we would expect a 50% compression of `balanced' files using random number generators to occur in ~ 1% of cases -- not too often!

As a kind of VERY ROUGH experiment, I ran all my current files through the Unix compress utility and selected those that had close to 1/2 1's.
We find the following:

	Fraction 1's     Reduction factor
	  0.506969          38.07%
	  0.47619          -33.33%
	  0.492063         -55.55%
	  0.491462          33.09%
	  0.502947          36.47%
	  0.503021          35.94%
	  0.478355          26.01%
	  0.492063         -55.55%
	  0.49483           30.57%
	  0.47619          -55.55%
	  0.510204          32.50%
	  0.47619          -55.55%
	  0.525429          31.15%
	  0.505414          22.98%
	  0.487941          12.12%
	  0.514909          48.48%

Out of about 100 files, these above contained between .48 and .52 (approx) 1's vs 0's. There were 16 attempts to compress (using the `standard' Lempel-Ziv adaptive coding). There were 5 failures. The average reduction where successful (geometric mean) was 30.04% (i.e. files compressed to about 70% of original length). Only 1 -- the last -- came _close_ to 50% reduction. Maybe a statistical accident? But then again this utility doesn't use the method I was talking about...

My remaining argument, not having a fully-working version of the compression algorithm I was discussing in my original post to run on text files, must run along the lines of ``text files ain't random'' and therefore the ~ 1/sqrt(N) argument above does not apply. (A small numeric check of the 1/sqrt(N pi) step is appended below.)

-kym
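
P.S. The check of the 1/sqrt(N pi) step mentioned above. This is only a sanity check of the approximation (it uses the identity C(2N,N)/2^(2N) = prod_{k=1..N} (2k-1)/(2k)); it is not part of the argument itself:

#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979

int main(void)
{
    long N, k;
    double exact;

    for (N = 10; N <= 100000; N *= 10) {
        /* exact fraction of 2N-bit strings with exactly N ones */
        exact = 1.0;
        for (k = 1; k <= N; k++)
            exact *= (2.0*k - 1.0) / (2.0*k);
        printf("N = %6ld  exact = %.6f  1/sqrt(N pi) = %.6f  (1/2)sqrt(pi/N) = %.6f\n",
            N, exact, 1.0/sqrt((double)N * PI), 0.5*sqrt(PI/(double)N));
    }
    return 0;
}

At N around 10^4 the last column does come out near 1%, as claimed.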
ttobler@unislc.uucp (Trent Tobler) (03/27/91)
From article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu>, by kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell):
>
> I found one little problem with my algebra & understanding.
> Delta modulation takes N bits into O(N) bits, NOT less.
>
> The theoretical compression factor for my binary strings example
> would therefore be
>
>	2p(1-p)
>
> where p is the proportion of 0's (or 1's -- it doesn't matter).
>
> For /usr/dict/words the proportion of 1's is about 56% (only
> considering the lsb 7 bits of each char). This should give
> a compression factor of about 49% with the method outlined.
>
> Tables of random ascii digits contain about .38 1's (taking only lsb 7
> bits) which leads to a similar .47 compression factor. If we massage
> digit strings to 4-bit groups (by subtracting the '0') we find only
> about 6% 1's and therefore the compression factor would be .1 (i.e. a 1Mb
> file of random digits -> 100Kb).
>
> Again, if there are any glaring errors, please comment.
>

Glaring error #1: if the maximum compression of a binary string is 2p(1-p), what is the maximum compression of the binary data...

	01010101010101010101010101010101

Maximum compression is defined by the entropy of the file -- in other words, given a file, how many bits are required to uniquely determine the next bit of information. What is hoped for is that the number of bits is less than 1 for the files you want.

Dr. Dobb's Journal had some articles that talked about compression in #173, Feb. 1991. I can give you some of the references...

Huffman, D.A. "A Method for the Construction of Minimum Redundancy Codes." _Proceedings of the Institute of Radio Engineers_ 40 (9), 1098-1101, September 1952.

Shannon, C.E. and W. Weaver. _The Mathematical Theory of Communication_. Urbana, Illinois: Univ. of Illinois Press, 1949.

Lelewer, Debra A. and Hirschberg, Daniel S. "Data Compression." _ACM Computing Surveys_, vol. 19, no. 3, New York, September 1987.

For more, I suggest you look at the Dr. Dobb's Journal issue I mentioned.

--
Trent Tobler - ttobler@csulx.weber.edu
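
P.S. For a source that really is nothing but independent bits with Prob(1) = p, the entropy is -p lg p - (1-p) lg (1-p) bits per bit. A throwaway program (not from any of the articles above) to put that next to the 2p(1-p) figure; note that 2p(1-p) is the smaller of the two everywhere, i.e. it promises more compression than a memoryless source with that bit bias allows:

#include <stdio.h>
#include <math.h>

/* order-0 entropy of a bit source with Prob(1) = p, in bits per bit */
static double H(double p)
{
    if (p <= 0.0 || p >= 1.0)
        return 0.0;
    return -(p*log(p) + (1.0 - p)*log(1.0 - p)) / log(2.0);
}

int main(void)
{
    double p;

    for (p = 0.05; p < 0.99; p += 0.05)
        printf("p = %.2f   H(p) = %.3f   2p(1-p) = %.3f\n",
            p, H(p), 2.0*p*(1.0 - p));
    return 0;
}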
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/27/91)
In article <1991Mar26.220519.1242@unislc.uucp> ttobler@unislc.uucp (Trent Tobler) writes:
>From article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu>, by kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell):
>> would therefore be
>> 2p(1-p)
>> where p is the proportion of 0's (or 1's -- it doesn't matter).
>Glaring error #1:
> if maximum compression of a binary string is 2p(1-p), what is maximum
>compression of the binary data...
> 01010101010101010101010101010101
[further explanation and some references]

Thanks for your comments Trent. I must be doing _something_ right -- 1/2 the respondents say ``too low'' and the other half say ``too high''. :-)

Perhaps I was a little obscure with the subject line or something, but I have received a lot of email regarding examples similar to the above. What I'll do here is (a) show you some _measurements_, and then (b) present a very informal argument. (I will attempt to show that EVEN THE 01 EXAMPLE ABOVE can not generally be represented in much fewer than a given number of bits, despite claims that it comes `for free'.)

PART 1 -- some numbers.
=======================

After scrounging around all over my disks I came up with about 100 files with an equal (i.e. balance between 0.45 and 0.55) number of 0's and 1's. I then ran `compress' over them and determined the average factor of sizes after/before. It worked out to about 67% over all attempts; about 56% in those cases where it was actually successful. A log file is appended, for those that may like to analyse it (probably there's plenty it can tell you about my disks, also plenty it might tell you about how `random' average text files of various programming languages may behave).

PART 2 -- an argument.
======================

Let's take the 01 example above. (I have also received much email of similar ilk). It is claimed that this sequence can be represented in a small number of bits. But this can not _necessarily_ be done.

Let me ask the question -- how _long_ is this sequence of 01's? If, let's say, the sequence of 01's is 2^1000000 bits long, we must encode the _length_ as part of the compressed code. Representing this length does not come for free. We might for example compute the length as a binary number and then use some compression technique to compress _that_ into a compressed format. Eventually this kind of `iterated compression' comes to a fixpoint. Let's just say, without proof, that to encode the 01 sequence repeated N times takes O(log N) bits. I understand this point is covered in the Shannon paper that Trent indicated. (A tiny illustration of this is appended after the log file below.)

This is obviously much less than 1/2 as N increases, but it _is_ only a single example. With the technique above we can _only_ represent 01 sequences. To represent other simple cycles will require more bits to `name' the particular cycle (possibly in a `deeper' compressed format). As Dan Bernstein pointed out in a previous article, and as is discussed briefly in Knuth v2, there are a very large number :-) of strings that CAN NOT be represented by ANY program shorter than themselves. I think there are enough of these degenerate examples to bring any average figure well up past the (only APPARENT, from the 01 example above) log N/N limit.

CONCLUSION
==========

My little formula was not intended to EQUAL the compression in every case, for this is obviously not possible (it can not be a function of just p for one thing). Nor was it intended as an AVERAGE over EVERY case.
It was intended as a ballpark figure after making certain assumptions about the specific underlying compression technique. (I am rather pleased, however, with its predictive ability in the p=1/2 case as illustrated for LZ compression, below).

Cheers,
-kym

===

BTW, some people around SUNY-B claim I do such measurements FIRST, then come up with a theory and subsequently produce them at an opportune moment -- this is usually not the case, however. I almost NEVER have a theory :-).

=================LOG FILE====================

Filename                  after size/before size using Sun LZ `compress'

:mkshortnames             0.524213
:techinstall              0.751825
ApolloGraphicsMakefile    0.346165
CONTENTS                  0.629024
Comp.lang.prolog          0.567824
DPram2.1                  1
DPram2.1i                 1
DProm.1                   1
DProm.1i                  1
INDEX                     0.771619
MKNEW                     0.646409
MKOLD                     0.644444
Makefile.vax              0.399232
PORT.LOG                  0.63328
READ_ME                   0.633803
RR-INRIA-1320.dvi         0.549019
ReadMe                    0.621053
abc10201.zip              1
apollo.c                  0.610215
cal_findresist.1          1
cal_makefile.1            1
change.l                  0.674312
chconfig.1                0.690476
chconfig.1i               0.690476
cman                      0.911392
comp.sources.index        0.527516
deutsch.net               0.210469
displays.proto            0.836957
dpram2.1                  1
dpram2.1i                 1
dprom.1                   1
dprom.1i                  1
drc.1                     1
drc.1i                    1
esim.1                    0.511555
espresso.1                0.571458
esptype.h                 0.357095
ex.rc                     1
exrc                      1
ext.h                     0.403077
extio.h                   1
extra+.c                  1
filehead.h                0.603837
gemini.1                  0.516526
gen_control.1             0.550769
gen_control.1i            0.550769
geometry.3                0.495135
grApollo1.old             0.44325
grApollo4.c               0.522812
grApolloInt.h             0.622093
grOmegaInt.h              0.694301
grSunInt.h.old            0.595443
gram                      0.627575
if.how                    1
install                   1
laygen.1                  1
les.txt                   0.541202
m.bat                     1
m2cmp20.zip               1
m2doc20.zip               1
m2exa20.zip               1
m2lib20.zip               1
m2utl20.zip               1
magic.1                   0.417558
magicutils.3              0.609205
makewhatis.csh            0.717697
mkdisttape                0.599204
mod2txt.arc               1
nc.1                      0.589175
ndep.down.out             0.909836
ndep.up.out               1
nenh.down.out             0.396806
nenh.up.out               0.422425
nenhp.down.out            0.4109
netgen.1                  1
netlist.1                 0.598494
netlist.1i                0.598494
nload.up.out              1
nsuper.down.out           0.900826
nsuper.up.out             1
ohmics.1                  0.532989
om-esubs.c                0.453863
om-subs.h                 0.497336
out_struct.h              0.641509
panda.h                   0.482293
paslib.i                  0.428077
paslib2.i                 0.42933
paslib3.i                 0.427255
paslib4.i                 0.428455
pattern.c                 0.588988
plowDebugInt.h            0.691729
presim.1                  0.636028
presim.1i                 0.636028
procs.h                   0.349669
prolog.lib                0.540272
rofix                     1
route1.net                0.742857
route2.net                0.742857
rsleeper                  0.893617
rsleeper.csh              0.893617
runstats.3                0.602226
s.Makefile                0.425905
scanner.h                 0.714286
select.h                  0.607029
showcdrcs.1               1
showcdrcs.1i              1
showsdrcs.1               1
showsdrcs.1i              1
signals.h                 0.754472
simsort.1                 1
spice_cor.1               0.544245
statlib.index             0.619519
tags                      0.433903
test4.m                   0.402351
test6.m                   0.407745
tlogic.c                  0.520635
ttable.c                  0.164343
tut7b.net                 0.625731
tut7d.net                 0.625731
util.c                    0.620143
valtbs.1                  1
vaxu_unsw_gcc.bench       0.949367
win.c                     0.584328
win.c-                    0.572215
win.unx                   0.584328
wireUndo.c                0.48959

geo avg of all = 0.665099
geo avg of `successful' cases = 0.557706

====================END END END====================
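
P.S. The tiny illustration promised above of the `encode the length' point (a throwaway sketch: it only counts the bits needed to write the repeat count itself, ignoring the couple of bits for the `01' pattern and the encoding of the count's own length):

#include <stdio.h>

/* number of bits in the binary representation of n */
static int countbits(unsigned long n)
{
    int b = 0;

    while (n) {
        b++;
        n >>= 1;
    }
    return b;
}

int main(void)
{
    unsigned long n;

    for (n = 16; n <= (1UL << 20); n <<= 4)
        printf("N = %8lu repeats of `01' (%9lu bits) -> count needs %2d bits\n",
            n, 2*n, countbits(n));
    return 0;
}

So the `compressed' size grows like lg N while the string itself grows like N, which is the O(log N) claim -- but only for this one family of strings.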