xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (05/27/90)
In article <1103@tardis.Tymnet.COM> jms@tardis.Tymnet.COM (Joe Smith) writes:

>Another comment about compression algorithms: After reading enough of the
>input file, LZW starts assigning codes to entire words and phrases. For
>instance, "(char *)malloc(" can get replaced by a single 13-bit code. Or a
>single code can replace the phrase "the party of the first part" wherever
>it appears. Files that contain the same words, over and over, become much
>smaller when compressed. As you can imagine, COBOL programs and legalese
>compress well.

I'm playing around with the arithmetic encoding/decoding algorithm from CACM
of a couple of years back, replacing the "most frequent to front" linear array
lookup of the "how often this character was seen so far" information with a
heap-based maintenance of the same information, in hopes of a small speedup
(a rough sketch of the kind of structure I mean follows my signature).  It
works so far, but the speedup is a bit iffy: random input runs faster my
heap-based way, while strongly biased input runs faster with the
move-to-front linear search (proving once again that an order(log N)
algorithm only beats an order(N) algorithm if the constant multipliers are
close and N is "sufficiently large").

Anyway ... in testing my algorithm against the original, I had occasion to
create and compress a 100,000-byte file made up of lines of 49 "0"s and one
newline (a sketch of a generator for this file, with the entropy arithmetic,
also follows the signature).  Because arithmetic compression is capable of
using fractions of a bit of output to represent a byte of input (yes, that
really means what it says, and it is possible!), it compressed the
100,000-byte input into 2338 bytes of encoded data, beating out the LZ
"compress" algorithm by some 400 bytes.

It is interesting to find that arithmetic compression 1) is capable of almost
50-to-1 compression, 2) can, for certain data, beat LZ compression, and
3) runs reasonably fast for this case, even compared to "compress".  Data
where it might be useful to consider arithmetic compression would be any
extremely sparse data, such as line-drawn images.

This is, incidentally, my first excursion into C on the Amiga, using Lattice
C release 5.04.  I am underwhelmed: a too-long "#define" definition not only
caused the compiler to fail to compile my program, it also crashed my system
and trashed Rad: as well.  A little bulletproofing would seem to be a given
by the time a commercial product has gone through this many revisions.  On
the bright side, for a guy with no hard disk but lots and lots of RAM,
compiling with all the files and executables memory-resident is sufficiently
fast to be a pleasant surprise.  It approaches the speed (though with nowhere
near the convenience or productivity) of Benchmark Modula-2 on the same
system.

I recommend spending bucks on memory until the box is full before looking
for a hard disk; the memory is lots more useful.  In 9.5 meg, I keep the
whole Workbench, lots of other goodies (emacs, zoo, AmigaBASIC, etc., etc.),
and the entire Lattice C development system, and still have about 1.5 meg of
Rad: to work in and 2.0 meg of available memory for executing code, screens,
windows, and other active data.

Of course, I'm still watching the optical rewritables like a hawk.  Anybody
put an ORWD on a SCSI interface on the Amiga for a price the impoverished
could enjoy?

Kent, the man from xanth, now zooming in from Zorch.
xanthian@zorch.sf-bay.org
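
In case the "heap based maintenance" remark above is too vague, here is a
rough sketch of the kind of structure I mean.  This is not my actual code,
and the names (NSYM, bump, cum_below, find_symbol) are made up for the
example rather than lifted from the CACM routines, but it shows how an
implicit binary tree over the 256 byte values gives order(log N) updates and
cumulative-count lookups:

    /* Cumulative symbol counts kept in an implicit binary tree.
     * tree[1] is the root; the leaves live at tree[NSYM..2*NSYM-1],
     * one per possible byte value, and each internal node holds the
     * sum of its two children.
     */
    #define NSYM 256

    static long tree[2 * NSYM];

    /* Record one more occurrence of symbol s: order(log NSYM) work. */
    void bump(int s)
    {
        int i;
        for (i = NSYM + s; i >= 1; i >>= 1)
            tree[i] += 1;            /* the leaf plus every ancestor */
    }

    /* Total count of all symbols seen so far. */
    long total(void)
    {
        return tree[1];
    }

    /* Sum of the counts of all symbols strictly below s, 0 <= s < NSYM. */
    long cum_below(int s)
    {
        long sum = 0;
        int i;
        for (i = NSYM + s; i > 1; i >>= 1)
            if (i & 1)               /* a right child: its left sibling
                                        covers symbols entirely below s */
                sum += tree[i - 1];
        return sum;
    }

    /* Decoder side: the symbol s whose count range contains target,
     * i.e. cum_below(s) <= target < cum_below(s) + (count of s).
     */
    int find_symbol(long target)
    {
        int i = 1;
        while (i < NSYM) {
            i <<= 1;                 /* look at the left child first */
            if (target >= tree[i]) { /* target falls in the right subtree */
                target -= tree[i];
                i += 1;
            }
        }
        return i - NSYM;
    }

The encoder bumps a symbol's count after coding it and asks for cum_below(s),
cum_below(s + 1), and total() to form the probability interval; the decoder
uses find_symbol() on the scaled value it pulls from the bit stream.  In
practice every slot starts at one rather than zero, so no symbol ever has
zero probability, and the counts get halved when the total nears the coder's
limit.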
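
And here is a sketch (again, not the actual test harness; the file name is
made up) of a generator for the all-zeros test file, with the zero-order
entropy arithmetic that says how far down such a file can possibly go:

    /* Build the 100,000-byte test file -- 2000 lines, each of 49 '0'
     * characters and a newline -- and print the Shannon bound for a
     * memoryless (order-0) model of it, for comparison with the 2338
     * bytes the arithmetic coder actually produced.
     */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        FILE *fp = fopen("zeros.dat", "w");
        double p0 = 49.0 / 50.0;          /* probability of '0'  */
        double p1 =  1.0 / 50.0;          /* probability of '\n' */
        double bits_per_byte, bound;
        int line;

        if (fp == NULL)
            return 1;
        for (line = 0; line < 2000; line++)
            fprintf(fp, "%049d\n", 0);    /* 49 "0"s and one newline */
        fclose(fp);

        bits_per_byte = -(p0 * log(p0) + p1 * log(p1)) / log(2.0);
        bound = 100000.0 * bits_per_byte / 8.0;
        printf("order-0 entropy %.4f bits/byte, bound about %.0f bytes\n",
               bits_per_byte, bound);
        return 0;
    }

The bound works out to roughly 0.14 bits per input byte, a bit under 1800
bytes for the whole file, so 2338 bytes of real output (including the model's
start-up cost and end-of-stream overhead) is in the right neighborhood, and
far below anything a code that must spend at least one whole bit per input
byte could manage.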