kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/25/91)
I'm investigating a random number technique to compress text. It runs along
the following lines.

Some strings have `low complexity' in the sense of Martin-Löf: the program
that can type them out is much shorter than the actual string. For such
strings it is obviously better to store or transmit the _program_ than the
string. Some method needs to be found to enumerate the possible programs
useful for this purpose.

I am thinking of using strings generated by pseudo-random means (in a
similar vein, I understand, to LPC coding). The message is first analyzed
to determine some basic statistics (e.g. frequencies of groups of bits of
lengths 1 to N) and these are used to select a generator. The generator
will generate strings with groups in the appropriate proportions -- this
sequence is compared with the original message and only the `differences'
are saved. The idea is that the generator will be able to create a string
very similar to the original and thereby reduce the number of differences
that need to be stored/sent.

Besides a great deal of freedom in generating the random strings, another
field of experimentation seems to be the method of _comparison_. Some kinds
of `differences' will be more sensitive than others to certain types of
noise. For example, if the `random string' is:

        010110111011101111011111

and the original message was:

        000010110111011110111110

there are 9 differences using `phase 0' but none using `phase 3'.

What kind of compression factors might be expected? Let's take a simple
case of bitstrings. Suppose we have a non-random string in which a
proportion p of the bits are 1's and (1-p) are 0's. If we generate a random
string of bits with this same proportion of 1's and 0's, in how many places
will they differ? If we call the message A and the random string B we have:

        Prob(A=1 & B=0  or  A=0 & B=1) = Prob(A=1)Prob(B=0) + Prob(A=0)Prob(B=1)

Independence is presumed (which is not absolutely true, since we analyze
the message string to _set up_ the generator). We then find

        Prob(difference) = 2p(1-p)

In a message of N bits we would then find 2p(1-p)N differences on average.
We could adjust a phase parameter to further minimize this number of
differences, but the O(N) dependency would remain (I _think_).

To send all the `differences' we just encode the sequence of positions i_j
where A_{i_j} != B_{i_j+k}. This can be done using `delta modulation' (i.e.
the differences of the i_{j+1} and i_j are encoded). The average size of
each difference will be O(log N/N) bits. We therefore have a total
compressed string of

        ~ 2p(1-p)N log N/N = 2p(1-p) log N

(ignoring the parameters required to reconstitute the message; these would
also need to be sent but are of O(log N) size anyway) and the compression
ratio:

        |compressed string|
        -------------------  ~  2p(1-p) log N / N
        |original string|

I presume the constant of proportionality will be fairly large, since with
the following strings I get only 80-90% compression factors (using the best
of 100 random bit generators chosen at random :-):

        abcdefghijklmnopqrstuvwxyz<NEWLINE>
                189 bits   max compression=0.873016

        this is a test message of a very short string<NEWLINE>
                322 bits   max compression=0.813665

These small examples are not very impressive, but the complexity of the
method as N->inf is interesting (log N/N).

Any comments/suggestions? (Please be nice if I've made another of my
classic tremendous goofs :-).

-kym
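[In outline, the scheme reads like this in code -- a minimal sketch, not
the poster's actual program. A seeded Python random.Random stands in for
the unspecified parameterized generator, and the seed, p, and length are
returned as explicit side information along with the deltas.]

    import random

    def compress(msg, n_trials=100):
        """Compare msg (a list of 0/1) against pseudo-random bit streams
        with the same proportion of 1's; keep the seed giving the fewest
        mismatches, and delta-code the mismatch positions."""
        p = sum(msg) / len(msg)
        best_seed, best_diffs = None, None
        for seed in range(n_trials):
            rng = random.Random(seed)
            stream = [1 if rng.random() < p else 0 for _ in msg]
            diffs = [i for i, (a, b) in enumerate(zip(msg, stream)) if a != b]
            if best_diffs is None or len(diffs) < len(best_diffs):
                best_seed, best_diffs = seed, diffs
        # `delta modulation' of positions: store gaps, not absolute indices
        deltas = [j - i for i, j in zip([0] + best_diffs, best_diffs)]
        return best_seed, p, len(msg), deltas

    def decompress(seed, p, n, deltas):
        """Regenerate the chosen stream and flip the bits that differed."""
        rng = random.Random(seed)
        bits = [1 if rng.random() < p else 0 for _ in range(n)]
        pos = 0
        for d in deltas:
            pos += d
            bits[pos] ^= 1
        return bits

    msg = [int(b) for ch in "a test" for b in format(ord(ch) & 0x7f, "07b")]
    assert decompress(*compress(msg)) == msg

[Note that the seed, p, and N have to travel with the deltas -- the `few
hidden bits' quibble taken up later in the thread.]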
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/25/91)
I found one little problem with my algebra & understanding: delta
modulation takes N bits into O(N) bits, NOT less.

The theoretical compression factor for my binary strings example would
therefore be

        2p(1-p)

where p is the proportion of 0's (or 1's -- it doesn't matter).

For /usr/dict/words the proportion of 1's is about 56% (only considering
the lsb 7 bits of each char). This should give a compression factor of
about 49% with the method outlined.

Tables of random ascii digits contain about .38 1's (taking only lsb 7
bits) which leads to a similar .47 compression factor. If we massage digit
strings to 4-bit groups (by subtracting the '0') we find only about 6% 1's
and therefore the compression factor would be .1 (i.e. a 1Mb file of random
digits -> 100Kb).

Again, if there are any glaring errors, please comment.

-kym
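[A quick check of those three factors -- a throwaway computation using the
p values quoted above:]

    for label, p in [("/usr/dict/words", 0.56),
                     ("random ascii digits", 0.38),
                     ("4-bit digit groups", 0.06)]:
        print(f"{label:20s} 2p(1-p) = {2 * p * (1 - p):.3f}")
    # /usr/dict/words      2p(1-p) = 0.493
    # random ascii digits  2p(1-p) = 0.471
    # 4-bit digit groups   2p(1-p) = 0.113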
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/25/91)
In article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> The theoretical compression factor for my binary strings example
> would therefore be
>
>        2p(1-p)
>
> where p is the proportion of 0's (or 1's -- it doesn't matter).

There is no way that a compressor can turn every string with as many 0's as
1's into a string of half the length.

---Dan
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/26/91)
In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> There is no way that a compressor can turn every string with as many 0's
> as 1's into a string of half the length.

To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n with
n bits set. This exceeds

        (2n/e)^(2n) / [ (n/e)^(2n) (2\pi n) exp(1/6n) ],

i.e.,

        2^(2n) / [ (2\pi n) exp(1/6n) ].

So for large n there are at least 2n - 3 - log n bits of information in
such strings, and there is no way they can all be compressed to length n,
as the method in question claims to do.

---Dan
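[Dan's count is easy to verify numerically -- a quick sketch; math.comb
gives the exact binomial coefficient:]

    import math

    for n in (10, 100, 1000):
        exact = math.log2(math.comb(2 * n, n))   # exact bits of information
        bound = 2 * n - 3 - math.log2(n)         # Dan's lower bound
        print(f"n={n}: exact {exact:.1f} bits, bound {bound:.1f} bits")
    # n=10:   exact 17.5,   bound 13.7
    # n=100:  exact 195.9,  bound 190.4
    # n=1000: exact 1994.2, bound 1987.0
    # In every case far more than n bits, so no fixed invertible scheme
    # can compress all such strings to length n.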
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/26/91)
In <1991Mar25.031214.25696@bingvaxu.cc.binghamton.edu> I wrote:
>Some strings have `low complexity' in the sense of Martin-Löf.
                                                    ^^^^^^^^^^
>(The program that can type them out is much shorter than the
>actual string.)
>For such strings it is obviously better to store or transmit the _program_
                                                                  ^^^^^^^^^
>than the string. Some method needs to be found to enumerate the possible
>programs useful for this purpose.

In <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) wrote:
>There is no way that a compressor can turn every string with as many 0's
                                            ^^^^^
>as 1's into a string of half the length.

In case I've misrepresented myself as Dan perhaps thinks, I've included
some introductory text from my original message.

Dan's point is of course correct -- there are certain bitstrings whose
complexity prohibits any program from being written that is in any sense
`shorter' than the given bitstring. Essentially the only programs that can
reproduce such bitstrings are:

        PRINT ``SEQUENCE-OF-BITS''

However, by the definition of certain mathematicians, such bitstrings would
be regarded as essentially ``random'' (see, e.g., Knuth v2, who is/was
_not_ one of these), and I _think_ the number of text files -- with which I
am primarily concerned -- that exhibit such a property is vanishingly
small; text files on a computer are USUALLY generated by programs shorter
than they are!

-kym
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/26/91)
My last attempt to post this article was either bumped or dumped in Dan's
IN tray -- I don't know which. Sorry if you've seen it before.

===

Thanks, Dan, for your input.

In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> There is no way that a compressor can turn every string with as many 0's
> as 1's into a string of half the length.

In article <18263:Mar2518:10:4091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n
> with n bits set. This exceeds (2n/e)^(2n) / [ (n/e)^(2n) (2\pi n)
> exp(1/6n) ], i.e., 2^(2n) / [ (2\pi n) exp(1/6n) ]. So for large n there
> are at least 2n - 3 - log n bits of information in such strings, and
> there is no way they can all be compressed to length n...

While I have to agree with Dan's argument above, a question that would be
relevant to this group is WHAT FRACTION of strings can be compressed to
strings of, say, <= 1/2 their length? Using the (possibly buggy) argument
below, I find this to be ~ 1/sqrt(N). I would be interested in any
refinement or other pointers.

\begin{argument}

One way we might approximate an answer is to consider `random' number
generators (sorry, this is an old axe :-). A generator of bits, say, takes
a number of parameters and can output, if properly managed, a sequence of
2^n bits. In certain circumstances we can guarantee that different n-bit
parameter values will generate different bit sequences.

Let's say we have a random bit generator N bits long (in some kind of
Turing/Wang formalism). It can generate a sequence of 2^N bits. By changing
any of the bits in the N-bit program the sequence produced will be
different. (I am talking about the limit of N here, where _most_ of the N
bits of generator are in fact parameters and not much is actual code, where
bit changes are typically `uninteresting' :-). We therefore arrive at:

        N-bit `program'              =>  2^N bits
        2^N possible N-bit programs  =>  (2^N)^2 bits

Therefore the number of 2N-bit strings that can be generated (by this
means) by N-bit programs is (2^N)^2/(2N) = 2^(2N-1)/N. Obviously there are
a total of 2^(2N) 2N-bit strings, so our random bit programs can generate
the fraction

        2^(2N-1)/N
        ----------  =  1/2N
          2^(2N)

of them.

Now, the number of 2N-bit strings with 1/2 1's = (2N)!/(N!)^2. Using the
simple approx n! = sqrt(2 n pi) (n/e)^n (actually it's larger by a factor
of about O(1+1/12n)) we find this last to be about:

          sqrt(2 (2N) pi) (2N/e)^(2N)       2^(2N)
        ~ ---------------------------  =  ----------
              2 N pi (N/e)^(2N)           sqrt(N pi)

So the fraction of 2N-bit strings that have 1/2 1's is therefore

           2^(2N)
        ~ ----------  /  2^(2N)  =  1/sqrt(N pi)
          sqrt(N pi)

What proportion of these might our N-bit programs be capable of generating?
All things being equal we find:

             1/2N
        ~ ------------  =  (1/2) sqrt(pi/N)
          1/sqrt(N pi)

which obviously ->0 (and is only good for large N anyway).

\end{argument}

If N were in the 10k's of bits (typical of files in my directory, anyway)
then we would expect a 50% compression of `balanced' files using random
number generators to occur in ~ 1% of cases -- not too often!

As a kind of VERY ROUGH experiment, I ran all my current files through the
Unix compress utility and selected those that had close to 1/2 1's.
We find the following:

        Fraction 1's:   Reduction factor:
        0.506969         38.07%
        0.47619         -33.33%
        0.492063        -55.55%
        0.491462         33.09%
        0.502947         36.47%
        0.503021         35.94%
        0.478355         26.01%
        0.492063        -55.55%
        0.49483          30.57%
        0.47619         -55.55%
        0.510204         32.50%
        0.47619         -55.55%
        0.525429         31.15%
        0.505414         22.98%
        0.487941         12.12%
        0.514909         48.48%

Out of about 100 files, the above contained between .48 and .52 (approx)
1's vs 0's. There were 16 attempts to compress (using the `standard'
Lempel-Ziv adaptive coding). There were 5 failures. The average reduction
where successful (geometric mean) was 30.04% (i.e. files compressed to
about 70% of original length). Only 1 -- the last -- came _close_ to 50%
reduction. Maybe a statistical accident? But then again this utility
doesn't use the method I was talking about...

My remaining argument, not having a fully-working version of the
compression algorithm I was discussing in my original post to run on text
files, must run along the lines of ``text files ain't random'', and
therefore the ~ 1/sqrt(N) argument above does not apply.

-kym
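[The selection step of this experiment is a few lines to rerun; here zlib's
deflate stands in for the Unix compress used above, so the exact ratios
will differ:]

    import sys, zlib

    def ones_fraction(data: bytes) -> float:
        """Fraction of 1 bits over all bytes of a file."""
        return sum(bin(b).count("1") for b in data) / (8 * len(data))

    for name in sys.argv[1:]:
        data = open(name, "rb").read()
        if not data:
            continue
        f = ones_fraction(data)
        if 0.48 <= f <= 0.52:                    # near-balanced files only
            ratio = len(zlib.compress(data)) / len(data)
            print(f"{name}  ones={f:.6f}  reduction={1 - ratio:.2%}")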
ttobler@unislc.uucp (Trent Tobler) (03/27/91)
From article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu>, by kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell):
>
> I found one little problem with my algebra & understanding:
> delta modulation takes N bits into O(N) bits, NOT less.
>
> The theoretical compression factor for my binary strings example
> would therefore be
>
>        2p(1-p)
>
> where p is the proportion of 0's (or 1's -- it doesn't matter).
>
> For /usr/dict/words the proportion of 1's is about 56% (only
> considering the lsb 7 bits of each char). This should give
> a compression factor of about 49% with the method outlined.
>
> Tables of random ascii digits contain about .38 1's (taking only lsb 7
> bits) which leads to a similar .47 compression factor. If we massage
> digit strings to 4-bit groups (by subtracting the '0') we find only
> about 6% 1's and therefore the compression factor would be .1 (i.e. a 1Mb
> file of random digits -> 100Kb).
>
> Again, if there are any glaring errors, please comment.
>

Glaring error #1: if the maximum compression of a binary string is 2p(1-p),
what is the maximum compression of the binary data...

        01010101010101010101010101010101

Maximum compression is defined by the entropy of the file -- in other
words, given a file, how many bits are required to uniquely determine the
next bit of information. What is hoped for is that the number of bits is
less than 1 for the files you want.

Dr. Dobb's Journal had some articles that talked about compression in #173,
Feb. 1991. I can give you some of the references...

        Huffman, D.A. "A Method for the Construction of Minimum-Redundancy
        Codes." _Proceedings of the Institute of Radio Engineers_ 40 (9):
        1098-1101, September 1952.

        Shannon, C.E. & W. Weaver. _The Mathematical Theory of
        Communication_. Urbana, Illinois: Univ. of Illinois Press, 1949.

        Lelewer, Debra A. and Hirschberg, Daniel S. "Data Compression."
        _ACM Computing Surveys_, vol. 19, no. 3. New York, September 1987.

For more, I suggest you look at the Dr. Dobb's Journal issue I mentioned.
--
Trent Tobler - ttobler@csulx.weber.edu
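[Trent's point about the alternating string can be made numerically -- a
small illustrative sketch; the order-0 entropy sees only symbol
frequencies, so it misses the structure entirely:]

    import math

    def h(p):
        """Order-0 binary entropy -(p lg p + (1-p) lg (1-p)), bits per bit."""
        if p in (0, 1):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    s = "01" * 16                 # the alternating example above
    p = s.count("1") / len(s)     # p = 0.5
    print(h(p))                   # 1.0 -- order-0 bound says: incompressible
    print(2 * p * (1 - p))        # 0.5 -- the 2p(1-p) formula says: half size
    # Neither figure reflects the obvious regularity: a coder with memory
    # can describe this string in O(log N) bits, as the follow-ups discuss.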
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/27/91)
In article <1991Mar26.220519.1242@unislc.uucp> ttobler@unislc.uucp (Trent Tobler) writes:
>From article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu>, by kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell):
>> would therefore be
>>        2p(1-p)
>> where p is the proportion of 0's (or 1's -- it doesn't matter).
>Glaring error #1: if the maximum compression of a binary string is 2p(1-p),
>what is the maximum compression of the binary data...
>        01010101010101010101010101010101
[further explanation and some references]

Thanks for your comments, Trent. I must be doing _something_ right -- half
the respondents say ``too low'' and the other half say ``too high''. :-)

Perhaps I was a little obscure with the subject line or something, but I
have received a lot of email regarding examples similar to the above. What
I'll do here is (a) show you some _measurements_, and then (b) present a
very informal argument. (I will attempt to show that EVEN THE 01 EXAMPLE
ABOVE can not generally be represented in much fewer than a given number of
bits, despite claims that it comes `for free'.)

PART 1 -- some numbers.
=======================

After scrounging around all over my disks I came up with about 100 files
with an equal (i.e. balanced between 0.45 and 0.55) number of 0's and 1's.
I then ran `compress' over them and determined the average factor of sizes
after/before. It worked out to about 67% over all attempts; about 56% in
those cases where it was actually successful. A log file is appended, for
those who may like to analyse it (probably there's plenty it can tell you
about my disks, and also plenty it might tell you about how `random'
average text files of various programming languages may behave).

PART 2 -- an argument.
======================

Let's take the 01 example above. (I have also received much email of
similar ilk). It is claimed that this sequence can be represented in a
small number of bits. But this can not _necessarily_ be done.

Let me ask the question -- how _long_ is this sequence of 01's? If, let's
say, the sequence of 01's is 2^1000000 bits long, we must encode the
_length_ as part of the compressed code. Representing this length does not
come for free. We might, for example, compute the length as a binary number
and then use some compression technique to compress _that_ into a
compressed format. Eventually this kind of `iterated compression' comes to
a fixpoint.

Let's just say, without proof, that to encode the 01 sequence repeated N
times takes O(log N) bits (one concrete code of this kind is sketched after
the log file below). I understand this point is covered in the Shannon
paper that Trent indicated. This is obviously much less than 1/2 as N
increases, but it _is_ only a single example. With the technique above we
can _only_ represent 01 sequences. To represent other simple cycles will
require more bits to `name' the particular cycle (possibly in a `deeper'
compressed format).

As Dan Bernstein pointed out in a previous article, and as is discussed
briefly in Knuth v2, there are a very large number :-) of strings that CAN
NOT be represented by ANY program shorter than themselves. I think there
are enough of these degenerate examples to bring any average figure well up
past the (only APPARENT, from the 01 example above) log N/N limit.

CONCLUSION
==========

My little formula was not intended to EQUAL the compression in every case,
for this is obviously not possible (it can not be a function of just p, for
one thing). Nor was it intended as an AVERAGE over EVERY case.
It was intended as a ballpark figure after making certain assumptions about
the specific underlying compression technique. (I am rather pleased,
however, with its predictive ability in the p=1/2 case, as illustrated for
LZ compression below).

Cheers,
-kym

===

BTW, some people around SUNY-B claim I do such measurements FIRST, then
come up with a theory and subsequently produce them at an opportune moment
-- this is usually not the case, however. I almost NEVER have a theory :-).

=================LOG FILE====================
Filename                 after size/before size using Sun LZ `compress'

:mkshortnames            0.524213
:techinstall             0.751825
ApolloGraphicsMakefile   0.346165
CONTENTS                 0.629024
Comp.lang.prolog         0.567824
DPram2.1                 1
DPram2.1i                1
DProm.1                  1
DProm.1i                 1
INDEX                    0.771619
MKNEW                    0.646409
MKOLD                    0.644444
Makefile.vax             0.399232
PORT.LOG                 0.63328
READ_ME                  0.633803
RR-INRIA-1320.dvi        0.549019
ReadMe                   0.621053
abc10201.zip             1
apollo.c                 0.610215
cal_findresist.1         1
cal_makefile.1           1
change.l                 0.674312
chconfig.1               0.690476
chconfig.1i              0.690476
cman                     0.911392
comp.sources.index       0.527516
deutsch.net              0.210469
displays.proto           0.836957
dpram2.1                 1
dpram2.1i                1
dprom.1                  1
dprom.1i                 1
drc.1                    1
drc.1i                   1
esim.1                   0.511555
espresso.1               0.571458
esptype.h                0.357095
ex.rc                    1
exrc                     1
ext.h                    0.403077
extio.h                  1
extra+.c                 1
filehead.h               0.603837
gemini.1                 0.516526
gen_control.1            0.550769
gen_control.1i           0.550769
geometry.3               0.495135
grApollo1.old            0.44325
grApollo4.c              0.522812
grApolloInt.h            0.622093
grOmegaInt.h             0.694301
grSunInt.h.old           0.595443
gram                     0.627575
if.how                   1
install                  1
laygen.1                 1
les.txt                  0.541202
m.bat                    1
m2cmp20.zip              1
m2doc20.zip              1
m2exa20.zip              1
m2lib20.zip              1
m2utl20.zip              1
magic.1                  0.417558
magicutils.3             0.609205
makewhatis.csh           0.717697
mkdisttape               0.599204
mod2txt.arc              1
nc.1                     0.589175
ndep.down.out            0.909836
ndep.up.out              1
nenh.down.out            0.396806
nenh.up.out              0.422425
nenhp.down.out           0.4109
netgen.1                 1
netlist.1                0.598494
netlist.1i               0.598494
nload.up.out             1
nsuper.down.out          0.900826
nsuper.up.out            1
ohmics.1                 0.532989
om-esubs.c               0.453863
om-subs.h                0.497336
out_struct.h             0.641509
panda.h                  0.482293
paslib.i                 0.428077
paslib2.i                0.42933
paslib3.i                0.427255
paslib4.i                0.428455
pattern.c                0.588988
plowDebugInt.h           0.691729
presim.1                 0.636028
presim.1i                0.636028
procs.h                  0.349669
prolog.lib               0.540272
rofix                    1
route1.net               0.742857
route2.net               0.742857
rsleeper                 0.893617
rsleeper.csh             0.893617
runstats.3               0.602226
s.Makefile               0.425905
scanner.h                0.714286
select.h                 0.607029
showcdrcs.1              1
showcdrcs.1i             1
showsdrcs.1              1
showsdrcs.1i             1
signals.h                0.754472
simsort.1                1
spice_cor.1              0.544245
statlib.index            0.619519
tags                     0.433903
test4.m                  0.402351
test6.m                  0.407745
tlogic.c                 0.520635
ttable.c                 0.164343
tut7b.net                0.625731
tut7d.net                0.625731
util.c                   0.620143
valtbs.1                 1
vaxu_unsw_gcc.bench      0.949367
win.c                    0.584328
win.c-                   0.572215
win.unx                  0.584328
wireUndo.c               0.48959

geo avg of all                 = 0.665099
geo avg of `successful' cases  = 0.557706
====================END END END====================
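[The `O(log N) bits to encode the length' step in PART 2 above can be made
concrete with any self-delimiting integer code; Elias gamma coding is one
standard choice -- an illustration, not something from the original posts:]

    def elias_gamma(n: int) -> str:
        """Self-delimiting code for n >= 1: floor(lg n) zeroes, then n in
        binary. Total length 2*floor(lg n) + 1 bits."""
        b = bin(n)[2:]
        return "0" * (len(b) - 1) + b

    for n in (1, 2, 1000, 2 ** 20):
        print(n, len(elias_gamma(n)), "bits")
    # 1 -> 1, 2 -> 3, 1000 -> 19, 2^20 -> 41: the cost grows as O(log N),
    # so "01 repeated N times" costs O(log N) bits plus a constant to name
    # the repeated cycle.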
jj@alice.att.com (jj, like it or not) (03/29/91)
Please, folks, you're confusing several issues here.

First, the question of "how much can I compress X" must be qualified. If
you have a historyless compression, you can compress up to the entropy
(sum of -p log p) bound, where p is the probability of each token.

If you have history in your compression algorithm, you may (or may not) be
able to do better, depending on the randomness of your token distribution
(even given a fixed distribution of tokens).

You must separate the two issues, and be sure to mention which (or what)
bound you refer to.
-- 
-------->From the pyrolagnic keyboard of jj@alice.att.com<--------
Copyright alice!jj 1991, all rights reserved, except transmission by USENET
and like free facilities granted. Said permission is granted only for
complete copies that include this notice. Use on pay-for-read services
specifically disallowed.
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/29/91)
In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>Please, folks, you're confusing several issues here.

Who/what/where -- _ME_ -- confused??!! :-)

>First, the question of "how much can I compress X" must
>be qualified. If you have a historyless compression,
>you can compress up to the entropy ( sum of -p log p) bound,
>where p is the probability of each token.
>
>If you have history in your compression algorithm, you may (or
>may not) be able to do better, depending on the randomness
>of your token distribution (even given a fixed distribution of
>tokens).

OK, I remember something of this ``p log p'' stuff from Shannon and his
buddies, but I'm not too sure whether you (or he, or whoever) are saying
historyless compression can not, no way, ever do LESS than this. But it
_seems_ like you are saying this is your understanding, doesn't it? If so,
I have a question and another question.

Consider the scheme I outlined earlier -- of comparing an input bitstream
with a random bitstream and just storing the differences in delta form
(e.g. fixed-length or Huffman coded). This doesn't _appear_ to use history
-- the next bit output by an uncompressing program doesn't use the previous
bits of the uncompressed string to obtain the next bit. The compressing
program doesn't build up any dictionary. Or does it? (That's Q1).

There _is_, however, a quibble about a few hidden bits -- e.g. the length
of the random generator parameters and the compress/uncompress program
code. The program, however, is finite. If we make the input string large
enough, the program length and any fiddling little parameters can be
swamped by the length of the string itself.

Now consider the following data obtained from an experimental compression
program based on the idea above:

====================
h.dat
  nb=32400                        <--- number of bits read
  n1=0.341605                     <--- fraction of 1's
  -(p log p + q log q)=0.642094   <--- limit
  -(p lg p + q lg q)=0.926346     <--- maybe this is log_2 instead of log_e?
  2pq=0.449822                    <--- my limit (sure looks low, doesn't it?)
  minlen=20625                    <--- MINIMUM LENGTH output string after some trials
  compression=0.636574            <--- less than - sum p ln p

n.dat
  nb=8080
  n1=0.336015
  -(p log p + q log q)=0.638357
  -(p lg p + q lg q)=0.920954
  2pq=0.446218
  minlen=5028
  compression=0.622277            <--- less than - sum p ln p

n2.dat
  nb=16160
  n1=0.334653
  -(p log p + q log q)=0.637425
  -(p lg p + q lg q)=0.91961
  2pq=0.445321
  minlen=10042
  compression=0.621411            <--- less than - sum p ln p

n4.dat
  nb=32320
  n1=0.334653
  -(p log p + q log q)=0.637425
  -(p lg p + q lg q)=0.91961
  2pq=0.445321
  minlen=20088
  compression=0.621535            <--- less than - sum p ln p
====================

Here are 4 data files compressed to LESS THAN the limit quoted above.
Certainly not _substantially_ less, but less nevertheless. 8-)

The files aren't big, but they do vary over an order of magnitude in size.
I therefore presume we _could_ arrange to have the data file size -> inf.
In that case, even including the size of the compress/decompress program,
we have case(s) where the ``p log p'' limit seems to be beaten.

What gives? Perhaps there are additional O(1) factors involved in the
``p log p'' limit? (That's Q2).

-kym

BTW, the data files are hex and decimal digits with about 50% ascii 0's.
The balance of 0 vs 1 *bits* is given in each case above.
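[The `limit' columns in that table can be reproduced from n1 alone, which
also settles the log_2/log_e question asked in the first block -- a quick
check:]

    import math

    for name, p in [("h.dat", 0.341605), ("n.dat", 0.336015),
                    ("n2.dat", 0.334653), ("n4.dat", 0.334653)]:
        q = 1 - p
        nats = -(p * math.log(p) + q * math.log(q))     # first limit line
        bits = -(p * math.log2(p) + q * math.log2(q))   # second limit line
        print(f"{name:7s} ln: {nats:.6f}  lg: {bits:.6f}  2pq: {2*p*q:.6f}")
    # Reproduces both limit columns and 2pq exactly, confirming the first
    # is log_e and the second log_2; the quoted compression figures do
    # come in under both.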
gwyn@smoke.brl.mil (Doug Gwyn) (03/29/91)
In article <18263:Mar2518:10:4091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>... and there is no way they can all be compressed to length n,
>as the method in question claims to do.

A shorter argument is to note that any invertible method that maps the
domain onto the range, which is known to map at least one point to a
shorter representation, MUST map at least one other point to a longer
representation. (I think this was noted somewhere in Knuth.)
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/29/91)
In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>First, the question of "how much can I compress X" must
>be qualified. If you have a historyless compression,
>you can compress up to the entropy ( sum of -p log p) bound,
>where p is the probability of each token.

Here is some more data and another argument for your amusement. [If you're
well up on info theory you may as well `n' here].

The p log p ``bound'' is actually a disguised expectation value; it is an
_average_. It denotes the average amount of information per symbol in a
string. For example, if the string consists of only the symbols 0 and 1 we
write

        -(p lg p + (1-p) lg (1-p))

for the average amount of information (in bits) per 0/1 symbol. [lg(x) is
log base 2 -- 2 being the size of the alphabet in this case].

If this average is 0.5, say, then we might expect that we could compress
the string to 1/2 its size in some fashion without losing any
``information'', since each 0/1 symbol in the string would only be
representing 0.5 of a bit on average. (This measure does not take into
account the _sequence_ of 0's and 1's in the string; we're actually
treating the string as a multiset.)

Since we're talking about an _average_ it is quite possible to have
particular cases where the ``compression factor'' is less. Averages, after
all, are statistical variates and have distributions of their own. So what
is the distribution of the ``p lg p'' compression factor? Rather hard to
write in closed form as far as I know, so let's look at a table. The
following shows, for each value `x' of the compression factor, the
likelihood that the factor is less than x.

        x       Pr(X<x)
        0.1     0.0251
        0.2     0.0624
        0.3     0.1072
        0.4     0.1578
        0.5     0.2199
        0.6     0.2964
        0.7     0.3813
        0.8     0.4853
        0.9     0.6331

        E(X)    0.720872
        var(X)  0.0730559

For example, a file can be compressed to 50% or less about 22% of the time
(presuming the strings being used come uniformly from all possible
strings). We also see the ``expected'' value for the ``p lg p'' is 72%. We
therefore might expect, on average, only to be able to compress a file to
72% of its original size, all things being equal.

Looks bad? We can do MUCH, MUCH better.

Order statistics is a relatively small branch of the subject that deals
with things sorted or arranged in order. It allows us to answer such
questions as ``if I pick up N candy bars, what is the largest one I might
expect to get?'' (given that candy bars are not all exactly the same size).
In our case we would like to know ``if I compressed my file with N
different techniques, what is the BEST I could expect (i.e. the _smallest_
result)?''

The tables at the end show how taking the minimum of several values of
``p lg p'' can quickly reduce the expected value for a _set_ of compression
techniques. Whereas a file might be compressed to 1/2 its size only 22% of
the time using a SINGLE technique, using TWO such methods will halve the
file 39% of the time. Using 9 techniques will do so 89% of the time, etc.
Obviously this is a GOOD THING if you can afford the computer time.

Various ideas obviously present themselves. You might run several
techniques in parallel and only output the one that seems to be performing
the best (a little code value in the output might be nice to indicate when
you switch methods :-). Another idea is to simply HASH your file by XOR'ing
it with 10 different pseudo-random generators, compress each result with
the SAME technique, and take the smallest. More than about 90% of the time
you should get a 50% compression.
My little technique is similar to this last -- keep compressing the string
using different parts of the SAME random-number sequence. To do this you
require a generator that has no correlations. (One of the usual
multiplicative congruential techniques will probably not suffice).

If you now include _HISTORY_ information (i.e. knowledge about the sequence
of symbols in the string) you should be able to do even better still, on
average.

However -- a reality check. There are SOME cases, and they may not be rare
enough, where the string has the property whereby it can NOT be compressed
no matter WHAT the technique. (I rather like the neat argument posted by
Doug :-).

For those interested, my little tables follow.

Cheers all,
-kym

===
BTW, a question: how much computation do I need to perform to guarantee my
previous ``2pq'' result >99% of the time?

Probabilities that the expected compression factor
-(p lg p + (1-p) lg (1-p)) is less than a given value `x' when selecting
the minimum result produced by k statistically independent techniques,
Pr(min{X1,...,Xk} < x):

        x    k=2       k=3        k=4        k=5       k=6       k=7       k=8       k=9
        0.1  0.04957   0.0734258  0.0966828  0.119356  0.14146   0.16301   0.184018  0.204499
        0.2  0.120906  0.175762   0.227194   0.275417  0.320631  0.363024  0.402771  0.440038
        0.3  0.202908  0.288356   0.364645   0.432755  0.493563  0.547853  0.596324  0.639598
        0.4  0.290699  0.402627   0.496892   0.576283  0.643145  0.699457  0.746883  0.786825
        0.5  0.391444  0.525265   0.62966    0.711097  0.774627  0.824187  0.862848  0.893008
        0.6  0.504947  0.651681   0.754923   0.827564  0.878674  0.914635  0.939937  0.95774
        0.7  0.61721   0.763168   0.853472   0.909343  0.943911  0.965297  0.97853   0.986716
        0.8  0.735084  0.863648   0.929819   0.963878  0.981408  0.990431  0.995075  0.997465
        0.9  0.865384  0.95061    0.981879   0.993351  0.997561  0.999105  0.999672  0.99988

Reading across the x=0.5 row: two techniques get compression < 1/2 39% of
the time, three 53%, four 63%, five 71%, six 77%, seven 82%, eight 86%,
nine 89%.
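[The multi-technique columns follow from the single-technique table by the
standard minimum-of-k order-statistic identity,
Pr(min{X1,...,Xk} < x) = 1 - (1 - F(x))^k; a few lines reproduce them from
the first table's column:]

    # Single-technique distribution F(x) = Pr(X < x), from the table above.
    F = {0.1: 0.0251, 0.2: 0.0624, 0.3: 0.1072, 0.4: 0.1578, 0.5: 0.2199,
         0.6: 0.2964, 0.7: 0.3813, 0.8: 0.4853, 0.9: 0.6331}

    for x, f in sorted(F.items()):
        row = " ".join(f"{1 - (1 - f) ** k:.6f}" for k in range(2, 10))
        print(f"{x:.1f}  {row}")
    # The x=0.5 row prints 0.391444 0.525265 0.629660 0.711097 ...,
    # matching the table above, so the independence assumption is doing
    # all the work here.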
jj@alice.att.com (jj, like it or not) (03/30/91)
In article <1991Mar29.031127.9128@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>Consider the scheme I outlined earlier -- of comparing an input bitstream
>with a random bitstream and just storing the differences in delta form
>(e.g. fixed-length or Huffman coded). This doesn't _appear_ to use history
>-- the next bit output by an uncompressing program doesn't use the previous
>bits of the uncompressed string to obtain the next bit. The compressing
>program doesn't build up any dictionary. Or does it? (That's Q1).

What you're doing is building a model. If your model fits, you can do
better than a historyless method. The model itself is the history; you have
to choose the model (you call it a random number generator -- stochastic
methods, LPC methods, and such suggest other approaches) from knowledge of
the signal. That's your history. If your model (whatever it is) doesn't
fit, you'll likely need MORE bits.

>-(p lg p + q lg q)=0.926346     <--- maybe this is log_2 instead of log_e?

Always log base 2, not ever log base e. Sorry, thought that was obvious
from context, since we're talking about bits here.

>What gives? Perhaps there are additional O(1) factors involved in the
>``p log p'' limit? (That's Q2).

Anytime you have a model of some sort, you aren't doing historyless coding.
You are assuming SOMETHING about the signal to be coded beyond the
single-token distribution.
-- 
-------->From the pyrolagnic keyboard of jj@alice.att.com<--------
Copyright alice!jj 1991, all rights reserved, except transmission by USENET
and like free facilities granted. Said permission is granted only for
complete copies that include this notice. Use on pay-for-read services
specifically disallowed.
jj@alice.att.com (jj, like it or not) (03/30/91)
Aww, come on. Rather than use 10 methods, send some side info that
parameterizes the data you have. It's not hard or expensive to send the
whole (*&( codebook, for heaven's sake, as part of the message.
-- 
-------->From the pyrolagnic keyboard of jj@alice.att.com<--------
Copyright alice!jj 1991, all rights reserved, except transmission by USENET
and like free facilities granted. Said permission is granted only for
complete copies that include this notice. Use on pay-for-read services
specifically disallowed.
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/02/91)
In article <20144@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>What you're doing is building a model. If your model fits, you can do
>better than a historyless method. The model itself is the history; you
>have to choose the model (you call it a random number generator --
>stochastic methods, LPC methods, and such suggest other approaches) from
>knowledge of the signal. That's your history.

I'm not sure I agree with you.

The little program I used previously to demonstrate that `p lg p' could be
beaten by an historyless technique does not, to my mind, build any model of
the data. It does not examine the data to `fit' the random number generator
to it (although I have others that do this too). It just compares the input
with a bit stream, records where they're different, and stores the `deltas'
between these positions in the output. Nothing is kept to `improve'
anything as we go along; nothing is assumed before we start (the random bit
generator was selected at random).

I guess the mere _selection_ of the random bit generator _may_ be
considered building a model of the data -- in which case _any_ compression
technique selects such a model & there _is_ no historyless compression (as
you seem to imply that model-building and historylessness are exclusive).

But to get back to my original question.

In article <20135@alice.att.com> jj@alice.att.com (jj, like it or not) writes:
>First, the question of "how much can I compress X" must
>be qualified. If you have a historyless compression,
>you can compress up to the entropy ( sum of -p log p) bound,
                                                       ^^^^^
>where p is the probability of each token.

Is p lg p a hard-and-fast bound or not? I still think it's an average.

-kym
weverka@spot.Colorado.EDU (Robert T. Weverka) (04/02/91)
In article <1991Apr2.034441.28170@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
...heavily edited...
>The little program I used previously to demonstrate that `p lg p' could be
>beaten by an historyless technique does not, to my mind, build any model of
>the data.
>
>In article <20135@alice.att.com> jj@alice.att.com (jj, like it or not) writes:
>>First, the question of "how much can I compress X" must
>>be qualified. If you have a historyless compression,
>>you can compress up to the entropy ( sum of -p log p) bound,
>>where p is the probability of each token.
>
>Is p lg p a hard-and-fast bound or not? I still think it's an average.
>
>-kym

It is an Expectation value. log(1/p_i) is the number of bits required for a
symbol of probability p_i, and the entropy is the expectation of this over
the symbol probabilities:

        E(log(1/p_i)) = sum of p_i log(1/p_i) = sum of -p_i log p_i
jj@alice.att.com (jj, like it or not) (04/03/91)
In article <1991Apr2.034441.28170@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <20144@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>I guess the mere _selection_ of the random bit generator _may_ be
>considered building a model of the data -- in which case _any_ compression
>technique

Exactly.

>selects such a model & there _is_ no historyless compression (as you
>seem to imply that model-building and historylessness are exclusive).

No, you have a model that subsumes history, in that the RNG has predictable
outputs, and you use it as a filter. That is different than knowing the
single-token probabilities alone.

>Is p lg p a hard-and-fast bound or not? I still think it's an average.

It's ALWAYS the average rate. Given that you're using compression, you
can't measure any other thing, unless you have a fixed message set, and
that's all. (In which case you should enumerate your messages, perhaps
huffmanwise...) If you have any set of data with the overall 'p' known,
that is the average rate for THE WHOLE SET. Parts of it may do better,
because they may hit shorter symbols (for instance), but if you look at the
probabilities of THAT PART, you should find that you don't do better than
the p log p chosen over your locality, without a model.

I think you need to revisit the basic statistical assumptions of
compression.
-- 
-------->From the pyrolagnic keyboard of jj@alice.att.com<--------
Copyright alice!jj 1991, all rights reserved, except transmission by USENET
and like free facilities granted. Said permission is granted only for
complete copies that include this notice. Use on pay-for-read services
specifically disallowed.
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/03/91)
In article <20159@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
[I wrote]
>>Is p lg p a hard-and-fast bound or not? I still think it's an average.
>It's ALWAYS the average rate. Given that you're using compression, you
>can't measure any other thing, unless you have a fixed message set, and
>that's all. (In which case you should enumerate your messages, perhaps
>huffmanwise...) If you have any set of data with the overall 'p' known,
>that is the average rate for THE WHOLE SET. Parts of it may do better,
>because they may hit shorter symbols (for instance), but if you look at
>the probabilities of THAT PART, you should find that you don't do better
>than the p log p chosen over your locality, without a model.

Ta.

-kym
chris@ug.cs.dal.ca (Chris Robertson) (04/03/91)
In article <15626@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
> ... any invertible method that maps the
> domain onto the range, which is known to map at least one point to a
> shorter representation, MUST map at least one other point to a longer
> representation. (I think this was noted somewhere in Knuth.)

Wouldn't it depend on the characteristics of the data to be compressed?
Under certain restrictions, i.e., if there is redundancy built into the
domain, then why _couldn't_ you get a mapping such that every element of
the domain had a correlate shorter than itself?

- chris
gwyn@smoke.brl.mil (Doug Gwyn) (04/03/91)
In article <1991Apr3.001158.12011@cs.dal.ca> chris@ug.cs.dal.ca (Chris Robertson) writes:
>Wouldn't it depend on the characteristics of the data to be compressed?

The domain was assumed to be that of all N-bit messages.
readdm@ccwf.cc.utexas.edu (David M. Read) (04/04/91)
In article <> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>But to get back to my original question.
>
>Is p lg p a hard-and-fast bound or not? I still think it's an average.
>
>-kym

Speaking as a physicist (who may be a little naive when it comes to
"information science"): if sum (p lg p) does indeed represent the total
entropy of the ensemble (a proposition which I am not sure I trust
entirely, but that's beside the point), then that is indeed the limit of
what your compressor can do. You may not violate the Second Law!

You can get arbitrarily close to that sum, but you may not go under it,
unless you *increase* entropy somewhere else. Perhaps creating a "history"
(I missed that part of the discussion, I'm afraid) "creates" some entropy,
which would allow you to dip below the initial entropy of the ensemble, but
it seems to me that doing that would also increase the number of bits which
you need to transmit... so where's the "win"?
jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (04/04/91)
From article <46618@ut-emx.uucp>, by readdm@ccwf.cc.utexas.edu (David M. Read):
>>Is p lg p a hard-and-fast bound or not? I still think it's an average.
>
> You can get arbitrarily close to that sum, but you may not go under
> it, unless you *increase* entropy somewhere else.

Right, but, speaking as a Computer Scientist with a degree in Physics...

Information-theoretic entropy is mathematically identical to quantum
mechanical entropy. I think that this is one of the biggest surprises in
20th-century science, given that Shannon didn't set out to apply
statistical thermodynamics to information flow problems. Once he recognized
that the mathematics was the same, he was quite happy to borrow names like
entropy for the information-theoretic quantities that he needed to name.

Back to the subject: the entropy of a message is a hard and fast bound on
compression only in a statistical sense. If your compression algorithm
compresses some message to fewer bits than the entropy indicates, then your
algorithm must compress some other message to more than its p log p bits.
This is the tradeoff that parallels the second law of thermodynamics.

In information theory, the entropy of a message can only be measured
relative to a statistical model of the source. For example, we can use
observed letter frequencies to build a model for English, but having done
so, we have no guarantee that the language used from that point on will
correspond to the frequencies we measured. Also, better source models --
for example, letter-pair frequencies or word frequencies -- lead to lower
entropy values.

Doug Jones
jones@cs.uiowa.edu
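[That last point -- better source models give lower entropy -- is easy to
demonstrate. A sketch comparing a single-letter model with a letter-pair
(first-order) model; /usr/dict/words is simply reused from earlier in the
thread as a sample, and any text file would do:]

    import math
    from collections import Counter

    def order0(s):
        """Bits/symbol under a letter-frequency model."""
        n, counts = len(s), Counter(s)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def order1(s):
        """Bits/symbol under a letter-pair model: H(next | previous)."""
        pairs, firsts, n = Counter(zip(s, s[1:])), Counter(s[:-1]), len(s) - 1
        return -sum(c / n * math.log2(c / firsts[a])
                    for (a, b), c in pairs.items())

    text = open("/usr/dict/words").read()
    print(order0(text), order1(text))  # the pair model comes out lower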
victor@watson.ibm.com (Victor Miller) (04/04/91)
There are (at least) two kinds of information: statistical, of which
entropy is the answer (only in an expected sense), and computational, as in
Kolmogorov-Chaitin-Martin-Löf, etc.

My favorite example to get people thinking is from a paper of Ziv and
Lempel: construct a sequence of bits as follows: first write down in order
all 1-digit binary numbers (with leading zeroes if they are there), then
all two-digit, etc. The surprising fact that they prove is that this
sequence is incompressible by ANY finite-state compressor (of which Huffman
or arithmetic coders are good examples), but it is rather easy to see that
an efficient way of transmitting this sequence is to send a program to
generate it, along with the length of the sequence.
--
Victor S. Miller
Vnet and Bitnet: VICTOR at WATSON
Internet: victor@watson.ibm.com
IBM, TJ Watson Research Center
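[The sequence is simple to generate, which is exactly Victor's point: the
program below plus a length is O(log N) side information for an N-bit
output. A sketch; the function name is mine:]

    def zl_sequence(max_width: int) -> str:
        """All 1-bit binary numbers in order, then all 2-bit ones (with
        leading zeroes), then all 3-bit ones, etc., concatenated."""
        return "".join(format(i, f"0{w}b")
                       for w in range(1, max_width + 1)
                       for i in range(2 ** w))

    print(zl_sequence(3))   # 0100011011000001010011100101110111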
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/05/91)
In article <VICTOR.91Apr4102353@irt.watson.ibm.com> victor@watson.ibm.com writes:
>people thinking is from a paper of Ziv and Lempel: construct a
>sequence of bits as follows: first write down in order all 1-digit
>binary numbers (with leading zeroes if they are there), then all
>two-digit, etc. The surprising fact that they prove is that this
>sequence is incompressible by ANY finite-state compressor (of which
>Huffman or arithmetic coders are good examples), but it is rather easy
>to see that an efficient way of transmitting this sequence is to send
>a program to generate it, along with the length of the sequence.

Hmmm, interesting. I presume the same is true for all ``normal'' sequences
over a given alphabet?

-kym