paul@vixie.UUCP (Paul Vixie Esq) (05/25/87)
Well, okay, this is a very late summary of the responses to my question, "should we compress news in the spool directories?"  The answer, it seems, is "no".  Although I feel like I must have been pretty ambitious that day to propose such a thing, I remain convinced that it would be an idea worth trying.  If I try it, I'll let everybody know how it works out.

Since the commonest response was "that would take all kinds of CPU time for the news readers", let me deal with that.  It takes MUCH LESS TIME to uncompress than to compress something.  Try 'zcat' on a very long file, and you'll see what I mean.

The second-most-common response (subjectively) was "it would only be of use on a small-disked, single-user system."  Well, that's me!  I have about 130Mb of disk, and 22Mb is taken up just keeping 7 days of news on line.  I'd like to have that other 10~12Mb for source code and other things ...

As for the complaint that compression doesn't help much with small files, which most news articles are (small, that is), I don't know.  When you have hundreds of files taking up tens of megabytes, a little bit would go a long way.

Here are the responses, with headers and signatures pruned:
---------------------------------------------------------------------------
Organization: MECC Technical Services, St. Paul, MN
Date: 16 Apr 87 10:22:46 CST (Thu)
From: sewilco@meccts.MECC.COM (Scot E. Wilcoxon)

I've been thinking about the same thing for a while.  I think the articles should be compressed by replacing words and phrases with short codes.  The `compress` program is a good general-purpose tool, but one oriented specifically toward English can do a good (better?) job.

Also, the news-reading programs need a format which can be decoded quickly and cheaply.  If the codes happen to be similar to the escape sequences for macro expansion of a terminal (say, the NAPLPS macros), expansion of the text could even be done by a terminal or PC program.

For batching of such articles, the "sys" file entry could specify whether the other site can handle articles with this new compression.  If it can, a new program can handle it instead of 'cunbatch'.
---------------------------------------------------------------------------
Date: Thu, 16 Apr 87 08:48:21 EDT
From: John Owens <sun!xanth.cs.odu.edu!john>

The biggest problem I see with your solution is the incredible use of CPU time this would take whenever anyone reads news.  I know that sometimes, when I'm on a fast terminal, I'll skim through some newsgroups with the space bar when I'm in a hurry, seeing if anything catches my eye.  I wouldn't be able to do this with your scheme - it would take at least a second or two to uncompress an article on a moderately loaded system, and would take much too long, and add much too much to the load, on a heavily loaded system.  Of course, if you're reading news on a private system, like a PC running Xenix, this wouldn't be as important, and since those systems are the ones most likely to have a severe shortage of disk space, this could work out after all.

I would suggest having the compressed data in the article, and piping that part of the file through compress/uncompress as needed - two files don't buy you anything, since this is as easy as piping the entire file.  An X-Compressed: header would be sufficient to mark compressed articles, but please don't compress any article smaller than, say, 2 to 4 blocks!

Actually, you might want to have a configurable size beyond which you compress the article - that would be more useful than per-newsgroup compression.  Each system could then decide what the tradeoff point is - maybe only compress really large things, like digests, source, and encoded binaries....  I presume you're volunteering to do the work....  :-)
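[To make the shape of that suggestion concrete, here is one way such a filter might look as a rough Bourne shell sketch.  Nothing in it exists today: the 4096-byte cutoff, the "yes" value for the X-Compressed: header, and the temp-file shuffle are all invented for illustration.]

    art=$1                                  # one article file
    if [ `wc -c < $art` -gt 4096 ]          # only bother past some cutoff
    then
            ( sed '/^$/,$d' $art            # header lines only
              echo "X-Compressed: yes"      # invented marker header
              echo ""                       # blank line ends the header
              sed '1,/^$/d' $art | compress # body, squeezed
            ) > $art.tmp && mv $art.tmp $art
    fi

[Reading it back is the cheap direction - roughly `sed '1,/^$/d' article | zcat` - though real code would want something more careful than sed once the body is binary.]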
---------------------------------------------------------------------------
From: <ames!rutgers!cbmvax!snark!eric>
Date: 16 Apr 87 10:09:11 EST (Thu)

The news rewrite I've been doing (TMN netnews) allows you to flag a group 'compressed', in which case all its messages are automatically stored compressed, and uncompressed when a reader wants to look at them.  Unfortunately, this turns out not to be a big win.  Most articles aren't large enough for compression to make a big difference.  If you'd like more complete info, or want to join the beta site list, email me.
--------------------------------------------------------------------------
Date: Sat, 18 Apr 87 16:05:46 PST
From: ihnp4!ho95e!wcs
Organization: AT&T Bell Labs 46133, Holmdel, NJ

I've thought about that one a few times too, but haven't had the time to do much with it.  I've seen 3 main problems with it; you've hit the ones about headers and fragmentation.  The other problem is that, purely aside from fragmentation issues, compress is much more efficient on large files than on small ones, because it's seen a lot more strings that it can compress into one word.  Here are some of my suggestions:

- Headers make up a significant fraction of the average article, especially if it's been through a couple of network gateways.  It might be worth developing an abbreviated header format - put a magic word at the beginning (e.g. D-news), then use one-or-two-character abbreviations: start the line with S instead of "Subject: ", etc.  (Yes, you've got to teach all the newsreaders about it.)

- Compress normally starts with a dictionary of 1-character strings; you could modify it to start with a large number of known strings, taken from typical news articles.  In particular, you should include all the standard header strings (Subject: ), the newsgroup names (comp.sys.ibm.pc), and a lot of the important machine names (!ihnp4, !ucbvax, .BITNET, .COM).  You'd need a different magic word than compress normally uses.  However, this does mean you don't have to change the header format at all, just know the words.  You also gain some efficiency, because you've learned information from longer files that you're able to use in shorter files as well.
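[The pre-loaded dictionary would mean changing compress itself, but the abbreviated-header half of this can be sketched in a few lines of shell.  Everything below is for illustration only - the "#D-news" spelling of the magic word, the particular one-letter codes beyond S for Subject:, and the .abbr output file are all made up:]

    ( echo "#D-news"                          # magic first line
      sed -e '1,/^$/s/^Subject: /S /' \
          -e '1,/^$/s/^From: /F /' \
          -e '1,/^$/s/^Newsgroups: /N /' \
          -e '1,/^$/s/^Message-ID: /I /' \
          -e '1,/^$/s/^Date: /D /' $1         # only touch header lines
    ) > $1.abbr

[The '1,/^$/' addresses keep the substitutions out of the article body; a newsreader would just apply the same table in reverse.]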
----------------------------------------------------------------------------
From: joe@auspyr.UUCP
Date: 16 Apr 87 23:37:02 GMT
Organization: Austec, Inc., San Jose, CA. USA

In article <536@vixie.UUCP>, paul@vixie.UUCP (Paul Vixie Esq) says:
> What about compressing the news data in the spool directories?  If compress
> can save half the transmission time, it ought to be able to save almost that
> much in the storage costs as well.  It isn't quite the same win, since there
> is the file system's frag size to consider -- but it's a win, just the same.

It's a great idea ... but on the other hand, is the decompression-to-read time worth it?  I mean, sure, compressing, batching, and then uncompressing would waste a lot of CPU time *if* your proposal were implemented, but that wasted time would be nothing when measured against the CPU time of decompressing the same message for 50 net-news readers (people); not to mention that THEY have to wait for the decompression as well!

It seems to me that decompressing 500 messages x 50 users uses more CPU time than decompressing each of them once.  As things are going ... communication costs are certainly more expensive (money-wise) than disk storage, so what's the problem?
-----------------------------------------------------------------------------
From mit-eddie!husc6!linus!philabs!briar!rob Sun Apr 19 22:12:07 PST 1987

There are several drawbacks to keeping news compressed inside the spool directories.  The two big ones are CPU cycles, and the fact that compress only gives you real gains if what you are trying to compress is reasonably sized.  On big files, such as source archives, compress will give you about a 50-60 percent savings in space.  On the average-length article (30-40 lines) you'll get about a 20-30 percent savings (on just the article, not the header).  On a one-line article you actually lose space to the compression tables.  The point being that you're not going to gain the amount of space you might be expecting.

Now for CPU cycles.  Compress, especially with big articles, is very expensive computationally.  Each time an article is read, you're going to have to invoke compress.  For small articles, this is not that expensive, but for *.sources.*, you're going to notice it in both user wait time and CPU time.  Also, in batching news you're going to have to uncompress the articles you're sending to another site, and recompress them (you want BIG batches to make better use of compress, and you also want compatibility with sites that do not use your method of compression).

This posting is not meant to dissuade you, just to point out some design considerations.  There are several situations where you might want to do this: if you had severe disk limitations, a lot of unused CPU cycles, and/or only a few net readers.
--
william robertson
philabs!rob@seismo.css.gov
-----------------------------------------------------------------------------
From ucbvax!umnd-cs!umn-cs!sundquis Sun Apr 19 22:12:49 PST 1987

In article <536@vixie.UUCP> paul@vixie.UUCP (Paul Vixie Esq) writes:
>
>The hard part is the headers -- they should not be compressed because they
>are examined independent of the (much longer) data quite often -- in expire,
>in subject searches, etc.  In my view, the headers would be better left
>uncompressed.  So we can either put compressed and uncompressed data in the

I computed a few statistics about news articles to help determine how efficient such a compression scheme might be.  The average article size on our system was about 2600 bytes (this includes large archives in various ``source'' newsgroups).  The average header size was about 550 bytes.  The average compression rate of article bodies was roughly 50% (this did not include large ``source'' articles).  Hence the net compressed size would be about

    ((2600 - 550) * 50% + 550) / 2600 = 61%

of the original.  I.e. keeping the headers uncompressed (which is necessary) costs about 10% in compression rate.  I don't know if this is fatal to the argument, but it needs to be considered...

Tom Sundquist
sundquis@umn-cs.arpa
rutgers!meccts!umn-cs!sundquis
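[If you want the same sort of numbers for your own spool, something along these lines will produce them.  This is back-of-the-envelope only; the path (one hierarchy under /usr/spool/news) and the sed/awk plumbing are just one way to do it, not anything that ships with news:]

    for f in `find /usr/spool/news/comp -type f -print`
    do
            # bytes in the whole article, then bytes in the header alone
            echo `wc -c < $f` `sed '/^$/,$d' $f | wc -c`
    done | awk '{ bytes += $1; hdr += $2; n++ }
            END { printf "%d articles, avg size %d, avg header %d\n", n, bytes/n, hdr/n }'

[Getting the 50% body-compression figure would mean also running each body through compress, which takes a while over a whole spool.]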
---------------------------------------------------------------------------
From ptsfa!ames!rutgers!dayton!meccts!sewilco Sun Apr 19 22:13:12 PST 1987

Most posters are assuming that "compression" requires the `compress` program or algorithm.  It should not be implied (except by someone who has implemented the idea using the L-Z algorithm :-).
--
Scot E. Wilcoxon (guest account)	{ihnp4,amdahl,dayton}!meccts!sewilco
----------------------------------------------------------------------------
Date: Mon, 20 Apr 87 13:33:23 est
From: sun!seismo!genat!methods!peter (Peter Blake Mr Sys Adm)

I personally would not like to see the individual news articles stored in compressed format, for the following reasons:

1) It would be far too expensive, CPU-wise, to uncompress an article for someone to read.  Consider a site that has a large number of users, a decent percentage of whom read the news.  Can you justify uncompressing the same article at least 50 times in one day (especially if it's a biggie) when that same site has oodles of free disk space?

2) Compressed files are not convenient to 'grep' through (this is something I've done many times when I can remember the group but not the article that something interesting was mentioned in).

3) Increasing the number of files required to represent an article (e.g. the article's header file and its compressed data file) means that you'll need roughly twice as many inodes on that file system.  Sometimes it is not possible to increase the number of inodes on a file system sufficiently; what does the SA do then?  I think that breaking up an article into two files will dramatically increase the number of disk fragments as well.

Summary: News articles should not be separated into a header file and a compressed data file because:
- the increase in CPU expense is not justifiable.
- it eliminates easy and convenient text manipulation on articles.
- it increases the number of inodes required.
- it increases disk wastage due to fragmentation.

These were just a few points that came to mind.  The last two points are very fresh in my mind, because I've just wasted a glorious day that I could have spent outside, rearranging my file systems in order to get more inodes on my news file system (I had the free blocks but not the free inodes).  If you have any questions regarding what I've said, please feel free to ask.
----------------------------------------------------------------------------
Date: Tue, 21 Apr 87 09:48:28 EDT
From: ames!seismo!rochester!srs!matt (Matt Goheen)

Unless you have a lot of computer cycles to spare, I wouldn't recommend this strategy.  Every time someone reads an article you would have to uncompress the body of it (assuming you have figured out how to do the header stuff correctly).  It's a matter of disk space vs. CPU cycles, and I think in this case, disk space wins hands down...
----------------------------------------------------------------------------
--
Paul A Vixie Esq
329 Noe Street		{ptsfa, crash, hoptoad, ucat}!vixie!paul
San Francisco		ptsfa!vixie!paul@ames.ames.arc.nasa.gov
CA 94116		paul@vixie.UUCP
(415) 864-7013