kent@xanth.UUCP (Kent Paul Dolan) (04/21/87)
Amigans,

I ran an extensive series of tests comparing two ways of storing all
the files in a directory: compressed (whether or not compress did
anything with them), and arced, in single-file arcs for large files
or in merged-file arcs for small files. Arc did very well, and it
looks like using arc where we have been using compress will save us
a lot of space in the public directories. Here are the results:

                           Files                   Storage Blocks
              -------------------------------- ----------------------- Percent
Directory     Pictures     Non- Wouldn't After Original After    After   saved
                       Pictures compress   arc    total   arc compress  by arc
1) art              43        2        7    45      960   764      765     0.1
2) digitized        22        1       12     8      985   835      956    12.7
3) images           19        0        3    19      559   413      471    12.3
4) abasic            0       96        1    90      398   243      232    -4.7
5) abasic            0       96        1     3      398   201      232    13.3
6) amigabasic        0       13        0    13       58    39       39     0.0
7) amigabasic        0       13        0     1       58    32       39    17.9

Notes:

1) ~public/amiga/pictures/art files - even though a sixth of the
   files did not compress, this was about a tie. Arc lost on the
   indices (its table of contents), but gained by squeezing the
   files compress couldn't make smaller. This test was done before
   I added the latest pictures.

2) ~public/amiga/pictures/digitized files - these are mostly HAM
   files, and compress lost big by not finding a good technique.
   Every file that compress did shrink came out smaller than the
   corresponding arced file, but the files it could not shrink, all
   of which arc squeezed, more than tipped the balance.

3) ~public/amiga/UNTESTED/taug/pd-12/images files - these are now
   in #1, above. Like the digitized files, the relatively large
   portion of files (again about 1/6) that compress couldn't improve
   tipped the overall balance.

4) ~public/amiga/abasic files - a group of 93 small sources plus 3
   data files. This and the next are the same set of files. Notice
   that the files are roughly 4K in size. Arc packed the only file
   compress couldn't handle; for all the rest, arc and compress were
   using the same algorithm, so the results favored the scheme
   without the table of contents.
   Note that there were two groups of files, one of three files and
   one of two, that were actually related, and had to be arced into
   single arc files to maintain the relationship. There were also
   three other cases in which file names were identical in their
   first eight characters, so arc packed two files into one arc
   file. That accounts for the "missing" six arc files. The trouble
   with either scheme here is that the files are short relative to
   the size of the unused fragment in the last block of each file,
   so that the next result will reward having fewer files.

5) Using the same files, the two natural groups were arced together,
   and the rest of the files were arced into a single arc file. The
   result of cutting the fragmentation loss is an 18% swing (from
   -4.7% to 13.3%) in favor of arc over compress, which always
   compresses each file separately.

6) & 7) ~public/amiga/amigabasic - a set of 13 short source files.
   The same trick works for a smaller set of amigabasic files;
   putting them into one file, even at the cost of the table of
   contents and CRCs, arc saved enough on fragmentation to beat
   compress. Even with separate files, arc ties compress, although
   compress compressed all 13 files here. I think the equality in 6
   was a fluke; there is enough slop in the end block fragmentation
   to occasionally disguise a small effect like the arc table of
   contents; I think arc just lucked out on trial 6.

Discussion:

The advantages of arc are that it keeps related files together,
maintains a CRC for each file, saves on directory entries
(especially on the Amiga, where every file seems to need its own
directory block), and makes directories neater looking and easier to
maintain. Arc usually finds compression methods for files on which
compress fails to save space (always, in the cases seen here). Arc
saves space overall. A mix of arc and compress in a single directory
of large files might save even more space, but would make file
maintenance more of a hand chore.
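The fragmentation argument in notes 4) and 5) can be sketched in a few
lines of Python. This is only an illustration: the 512-byte block
size, the 29-byte per-member header cost, and the file sizes are
assumptions made up for the example, not measurements from the tests
above. Only the block counts passed to percent_saved come from the
table.

```python
import math

BLOCK_SIZE = 512      # assumed block size in bytes; the post does not state it
ENTRY_OVERHEAD = 29   # assumed per-member header cost in an archive, in bytes


def blocks_needed(size_bytes, block_size=BLOCK_SIZE):
    """Blocks occupied by one file; the last block is usually only partly used."""
    return math.ceil(size_bytes / block_size)


def separate_blocks(sizes, block_size=BLOCK_SIZE):
    """Total blocks when every packed file is stored on its own."""
    return sum(blocks_needed(s, block_size) for s in sizes)


def merged_blocks(sizes, block_size=BLOCK_SIZE, overhead=ENTRY_OVERHEAD):
    """Total blocks when all packed files share one archive file."""
    return blocks_needed(sum(s + overhead for s in sizes), block_size)


def percent_saved(arc_blocks, compress_blocks):
    """Percent saved by arc relative to compress, as in the table's last column."""
    return 100.0 * (compress_blocks - arc_blocks) / compress_blocks


# Ten hypothetical packed files of 100 bytes each.
sizes = [100] * 10
print(separate_blocks(sizes))   # every file pays for its own partly empty block
print(merged_blocks(sizes))     # one shared tail fragment instead of ten

# Block counts from rows 4) and 5) of the table.
swing = percent_saved(201, 232) - percent_saved(243, 232)
print(round(swing, 1))
```

The per-member overhead is what makes arc lose in row 4), where nearly
every file still gets its own arc; merging removes most of the tail
fragments, which is the effect rows 5) and 7) reward.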
The disadvantages of arc are that it comes from an MS-DOS
environment, and so cannot handle the longer file and directory
names of the Unix (tm) and AmigaDOS(tm) environments. This is
sometimes a problem; arc will not correctly recognize an arc file
whose name has a root portion longer than 8 letters, and will not
store a file name internally longer than 16 characters. Arc is also
slower to pack files, because it must try six or eight different
compression techniques before deciding on the one to use for a file,
and slower to unpack a file, because it has a distributed table of
contents, and must walk the table entries and the files between them
to find a file. (This latter is a supposition on my part, based on
the way the disk drives behave when arc is running.)

The advantages of compress are that it is faster and keeps full
length file names. The latter advantage is moot in the environment
where we use these files, because kermit also limits file name
length, and that matters a great deal: if the .Z suffix is stripped
from a file name during transmission from the VAX to an Amiga(tm),
compress won't recognize the file. Compress also makes all the files
easily accessible and visible, at the cost of storage and directory
entries. Compress is also a more mature, less buggy piece of
software, although arc has improved greatly over the last year.

Since our primary concern right now is space rather than cpu cycles
(most of the arc and kermit activity takes place in the evening), I
will continue to change directories of compressed files to
directories of arced files in our public domain Amiga software and
data files, based on the above results, unless directed otherwise.

Since this is useful, and almost research, I am going to almost
publish it by posting it to comp.sys.amiga, to share with the folks
who shared this software and data with us in the first place.

Kent.
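To see in advance which files would collide under the 8-letter root
limit described above, something like the following Python sketch
could be used. It models only the rule as stated in this article
(names identical in their first eight characters end up sharing one
arc file); it is not arc's actual code, and the file names are
hypothetical.

```python
import os
from collections import defaultdict

ROOT_LIMIT = 8  # root length limit as described in the article


def arc_root(filename, limit=ROOT_LIMIT):
    """Root of the name (extension dropped), truncated to the limit."""
    root, _ext = os.path.splitext(os.path.basename(filename))
    return root[:limit]


def root_collisions(filenames, limit=ROOT_LIMIT):
    """Map each truncated root shared by more than one file to those files."""
    groups = defaultdict(list)
    for name in filenames:
        groups[arc_root(name, limit)].append(name)
    return {root: names for root, names in groups.items() if len(names) > 1}


# Hypothetical file names, not taken from the directories tested.
names = ["spacewar1.bas", "spacewar2.bas", "maze.bas", "mazegen.bas"]
print(root_collisions(names))
```

Any group this reports would have to be renamed first, or accepted as
a shared arc file, as happened in three cases in test 4).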
--
The Contradictor   Member HUP (Happily Unemployed Programmers)      // Yet
                                                                   // Another
Back at ODU to learn how to program better (after 25 years!)   \\ // Happy
                                                                \// Amigan!
UUCP  :  kent@xanth.UUCP   or    ...{sun,cbosgd,harvard}!xanth!kent
CSNET :  kent@odu.csnet    ARPA  :  kent@xanth.cs.odu.edu
Voice :  (804) 587-7760    USnail:  P.O. Box 1559, Norfolk, Va 23501-1559

Copyright 1987 Kent Paul Dolan.            How about if we keep the human
All Rights Reserved.  Author grants free   race around long enough to see
retransmission rights, recursively only.   a bit more of the universe?
dillon@CORY.BERKELEY.EDU (Matt Dillon) (04/22/87)
Interesting comparison. One thing I like about compress is that it
can take its input from stdin and write its output to stdout. I can
then use PIPE: to buffer data going into compress (since disk IO is
much faster than compress can handle) and thus get 100% utilization
of the CPU while decompressing my VD0: from floppy. (How's that for
a mouthful!)

                        -Matt
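The filter style Matt describes, reading a stream in and writing a
stream out so that another stage can buffer the data, looks roughly
like this in Python. Note the assumptions: zlib stands in for
compress here (it is a different algorithm and file format), the
chunk size is arbitrary, and the Amiga PIPE: device itself is not
modeled.

```python
import io
import zlib

CHUNK_SIZE = 4096  # arbitrary read size, an assumption for this sketch


def stream_compress(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst through a compressor, one chunk at a time."""
    packer = zlib.compressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(packer.compress(chunk))
    dst.write(packer.flush())


def stream_decompress(src, dst, chunk_size=CHUNK_SIZE):
    """Inverse filter: expand a compressed stream chunk by chunk."""
    unpacker = zlib.decompressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(unpacker.decompress(chunk))
    dst.write(unpacker.flush())


# In-memory round trip; any binary file objects work the same way.
original = b"10 PRINT \"HELLO\"\n20 GOTO 10\n" * 500
packed = io.BytesIO()
stream_compress(io.BytesIO(original), packed)
unpacked = io.BytesIO()
stream_decompress(io.BytesIO(packed.getvalue()), unpacked)
print(len(original), len(packed.getvalue()), unpacked.getvalue() == original)
```

Used as a real filter, the same functions can be handed
sys.stdin.buffer and sys.stdout.buffer, so neither side ever needs
the whole file in memory or on disk.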
rs@mirror.UUCP (04/22/87)
Most sites on Usenet use the COMPRESS program to cut down their
transmission costs. Compressing an ARC'd file results in something
bigger than the original. As a courtesy, then, to those sites that
pay big money to ship articles around, the nicest way to post binary
files is to use UUENCODE. The C sources to UUENCODE have been posted
to mod.sources; a portable shar/unshar package was posted in
net.sources. Contact your nearest mod.sources archive site for the
former, me for either.

Also, people have written that "well, it's easiest for ME to do
things this way." In an internetworked electronic conferencing
system of several thousand sites, the ruling theme should be the
greatest good for the greatest number. Rise above selfishness, and
go out of your way a bit to help the net that helps you:
conventional wisdom and empiricism have shown that posting
UUENCODE'd binaries works best, given that COMPRESS is usually in
the transport mechanism.

As lesser arguments, assuming appeals to your "better side" don't
work:
        ARC is shareware, which goes against the grain of the
        so-called free Usenet, don't you think?
        The maintainers and developers of ARC refuse to support
        Unix ports.
Think about the implications of those two sentences...

Comments to me, and I'll summarize.
        /Rich $alz, Moderator of mod.sources (soon to be comp.sources.unix)
--
Rich $alz                                   "Drug tests p**s me off"
Mirror Systems, Cambridge Massachusetts     rs@mirror.TMC.COM
{cbosgd, cca.cca.com, harvard!wjh12, ihnp4, mit-eddie, seismo}!mirror!rs
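The size penalty Rich mentions, where packing an already-packed file
makes it grow, is easy to reproduce. The Python sketch below uses
zlib as a stand-in for both ARC and COMPRESS (an assumption; neither
program uses zlib, and the exact byte counts will differ), and the
sample text is generated just for the example.

```python
import random
import zlib

# Deterministic sample text built from a small made-up vocabulary.
rng = random.Random(1987)
words = ["amiga", "arc", "compress", "kermit", "block", "file", "picture"]
raw = " ".join(rng.choice(words) for _ in range(40000)).encode("ascii")

once = zlib.compress(raw, 9)    # stand-in for the ARC'd file
twice = zlib.compress(once, 9)  # stand-in for COMPRESS in the transport

print(len(raw), len(once), len(twice))
```

The second pass finds almost no redundancy left to remove, so its own
framing overhead usually leaves the result slightly larger than its
input.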