paul@vixie.UUCP (04/15/87)
I'm willing to do some of the work for this, but I want to see what else has been done or thought about it before I start. So.......... What about compressing the news data in the spool directories? If compress can save half the transmission time, it ought to be able to save almost that much in the storage costs as well. It isn't quite the same win, since there is the file system's frag size to consider -- but it's a win, just the same.

The hard part is the headers -- they should not be compressed, because they are examined independently of the (much longer) data quite often -- in expire, in subject searches, etc. In my view, the headers would be better left uncompressed. So we can either put compressed and uncompressed data in the same file -- not simple, but possible -- or we can separate the data into other files. This aggravates the waste due to the file system frag size, but I think we would gain more in the compression than we would waste in the half-empty blocks we would add in this scheme.

Perhaps an additional header line is in order -- starting with 'X-' in the tradition of lines which should not be passed out of the current system. How about 'X-Data-File: xxx [-c]', where 'xxx' is the name of the file where the data is stored, and '-c' indicates that the data is compressed? (A rough sketch of how a news reader might use such a header appears below.) Alternately, we could use the 'magic number' of the data file to determine its compression -- but uncompressed data has a random magic number, so this may be problematic.

The last architectural problem is in somehow tying the data files back to the header files -- since the header files will inevitably become corrupted at some point, 'orphaning' any associated data files. They could be given similar names -- perhaps the same 'article number'-style name, but in a '.data/' directory? This would remove some of the need for the 'X-Data-File:' header, but there would still have to be something in the headers to tell the various news reading and transmission programs that the data is elsewhere -- otherwise they would think the article had no text. If every stored article had its data stored separately, this would not be a problem; however, as a news administrator, I would prefer to have some things compressed and others not -- during the conversion, mainly, but there are other possibilities. A new flag in the LIBDIR/active file could tell whether to compress new articles added to a particular newsgroup.

It occurs to me that there would be a terrible waste of CPU time if the articles were batched and compressed in the current format, only to be uncompressed, sorted out, then recompressed for storage. A new batching format is called for -- perhaps the headers and data could travel under different covers, or at a minimum, the '#! cunbatch nnn' lines could have a second argument added -- the length of the headers. Leaving the first argument as it is would make the format portable to older software on the receiving end -- but newer software could send the first 'nnn' bytes through compress, yielding the headers, then send the remainder directly to a file somewhere, to be uncompressed later by the news readers.

This may all be moot, given the onset of C news. I don't know if C news does this or not. It's also possible that I'm overlooking something -- there could be very good reasons why this scheme won't work, but if so, they elude me. All the news readers would have to change -- I don't know if their various implementors and maintainers would go for it. I'm aware of 'readnews', 'vnews', 'rn', and 'vn'.
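Here, very roughly, is what I have in mind on the reading side. This is a sketch only -- the function name, buffer sizes, and the use of popen() through 'compress -d' are made up for illustration; the only 'real' part is the proposed header field itself:

    /*
     * Sketch: given one header line, look for the proposed
     * "X-Data-File: xxx [-c]" field and, if present, open the
     * separate data file for reading.
     */
    #include <stdio.h>
    #include <string.h>

    FILE *
    open_body(char *line, char *dir)  /* line: a header line; dir: spool dir */
    {
        char file[256], cmd[600];

        if (strncmp(line, "X-Data-File:", 12) != 0)
            return NULL;            /* body is in the article file as usual */
        if (sscanf(line + 12, " %255s", file) != 1)
            return NULL;            /* malformed; pretend it isn't there */

        if (strstr(line, " -c") == NULL) {
            /* data file stored uncompressed: just open it */
            (void) sprintf(cmd, "%s/%s", dir, file);
            return fopen(cmd, "r");
        }
        /* stored compressed: let compress(1) undo it, reader sees plain text */
        /* (caller must remember to pclose() instead of fclose() here)        */
        (void) sprintf(cmd, "compress -d < %s/%s", dir, file);
        return popen(cmd, "r");
    }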
Beyond the news readers, there's also 'notesfiles' and all their myriad gateway software. Changing 'inews' and 'batch' starts to look easy in comparison. Perhaps this is why it hasn't been done yet?

If you reply via mail, let me know whether I can quote you in the summary that I will post if I get lots of responses. Somehow I think it's more likely that this one will be either (a) shot down quickly by one of those 'good reasons' that are eluding me, or (b) hashed out severely in this and other newsgroups. Send mail if you want to, though -- I started it, so I'll take on the task of summarizing your comments.

--
Paul A. Vixie            {ptsfa, crash, winfree}!vixie!paul
329 Noe Street           dual!ptsfa!vixie!paul@ucbvax.Berkeley.EDU
San Francisco CA 94116   paul@vixie.UUCP            (415) 864-7013
rob@briar.UUCP (Rob Robertson) (04/16/87)
There are several drawbacks to keeping news compressed inside the spool directories. The two big ones are CPU cycles, and the fact that compress only gives you real gains if what you are trying to compress is reasonably sized.

On big files, such as source archives, compress will give you about a 50-60 percent savings in space. On an average-length article (30-40 lines) you'll get about a 20-30 percent savings (on just the article, not the header). On a one-line article, compression actually costs you space, because of the overhead of the compression tables. The point being that you're not going to be gaining the amount of space you might be expecting.

Now for CPU cycles. Compress, especially with big articles, is very expensive computationally. Each time an article is read, you're going to have to invoke compress. For small articles this is not that expensive, but for *.sources.*, you're going to notice it in both user wait time and CPU time. Also, in batching news you're going to have to uncompress the articles you're sending to another site, and recompress them (you want BIG batches to make better use of compress, and you also want compatibility with people who do not use your method of compression).

This posting is not meant to dissuade you, just to point out some design considerations. There are several situations where you might want to do this: if you had severe disk limitations, a lot of unused CPU cycles, and/or only a few net readers.

Correct me if I'm wrong :-).

rob
--
william robertson                 philabs!rob@seismo.css.gov
(914) 945-6300                    philabs!rob.uucp

"indecision is the key to flexibility"
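To put illustrative numbers on the small-article case (back-of-the-envelope figures, not measurements from a real spool): compress writes a 3-byte magic number and then emits codes that start at 9 bits apiece, so a 60-byte article with little internal repetition comes out at roughly

    3 + (60 * 9/8) = about 71 bytes

-- bigger than it went in. A 50,000-byte source posting, on the other hand, has plenty of redundancy for the string table to exploit, and the fixed overhead is noise.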
joe@auspyr.UUCP (Joe Angelo) (04/17/87)
in article <536@vixie.UUCP>, paul@vixie.UUCP (Paul Vixie Esq) says:
>
> What about compressing the news data in the spool directories? If compress
> can save half the transmission time, it ought to be able to save almost that
> much in the storage costs as well. It isn't quite the same win, since there
> is the file system's frag size to consider -- but it's a win, just the same.
>

It's a great idea ... but on the other hand, is the decompression-to-read time worth it? I mean, sure, compressing, batching, and then uncompressing would waste a lot of CPU time *if* your proposal were implemented, but that wasted time would be nothing when measured against the CPU time of decompressing the same message for 50 net-news readers (people); not to forget that THEY have to wait for the decompression as well!

It seems to me that decompressing 500 messages X 50 users tends to use more CPU time than decompressing all of them once.

As things are going ... communication costs are certainly more expensive (money-wise) than disk storage, so what's the problem?

--
"No matter     Joe Angelo, Sr. Sys. Engineer @ Austec, Inc., San Jose, CA.
where you go,  ARPA: aussjo!joe@lll-tis-b.arpa       PHONE: [408] 279-5533
there you      UUCP: {sdencore,cbosgd,amdahl,ptsfa,dana}!aussjo!joe
are ..."       UUCP: {styx,imagen,dlb,jmr,sci,altnet}!auspyr!joe
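To make that concrete with made-up but plausible numbers: say decompressing one average article costs a tenth of a CPU second. Uncompressing a batch of 500 articles once at reception then costs about 500 * 0.1 = 50 CPU seconds, while letting every read trigger a decompression costs up to 500 * 50 * 0.1 = 2500 CPU seconds for 50 readers -- the same work done fifty times over, plus fifty people sitting and waiting for it.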
sundquis@umn-cs.UUCP (Tom Sundquist) (04/17/87)
In article <536@vixie.UUCP> paul@vixie.UUCP (Paul Vixie Esq) writes:
>
>The hard part is the headers -- they should not be compressed because they
>are examined independent of the (much longer) data quite often -- in expire,
>in subject searches, etc. In my view, the headers would be better left
>uncompressed. So we can either put compressed and uncompressed data in the

I computed a few statistics about news articles to help determine how efficient such a compression scheme might be. The average article size on our system was about 2600 bytes (this includes large archives in various ``source'' newsgroups). The average header size was about 550 bytes. The average compression rate of article bodies was roughly 50%. (This did not include large ``source'' articles.)

Hence the net compression rate would be about

    ((2600 - 550) * 50% + 550) / 2600 = (1025 + 550) / 2600 = 61%

I.e. keeping the headers uncompressed (which is necessary) leaves you at 61% of the original size instead of the 50% you would get by compressing everything -- roughly a 10% loss in compression rate. I don't know if this is fatal to the argument, but it needs to be considered...

Tom Sundquist
sundquis@umn-cs.arpa
rutgers!meccts!umn-cs!sundquis
henry@utzoo.UUCP (Henry Spencer) (04/18/87)
> What about compressing the news data in the spool directories? ...

Apart from the issues of CPU time consumed, the slowdown in reading news, etc., remember that compress does not do nearly as well on short files as it does on long ones. Most news articles are quite short.

> ... If compress
> can save half the transmission time, it ought to be able to save almost that
> much in the storage costs as well...

A quick experiment suggests that you'll be lucky to get 25% consistently. Perhaps a smarter data-compression algorithm could do better, but *I* don't want the job of inventing one.

> This may all be moot, given the onset of C news. I don't know if C news
> does this or not.

Nope. Geoff and I talked about this and other issues for a while. After considerable discussion and thought, we concluded that we did not know any way of storing news articles that would be *decisively* superior to the current one. Most of the ideas we kicked around had disadvantages as well as advantages, and we couldn't see anything that was enough of a net win to be worth the trouble of changing over.

The specific problems of storing compressed articles (apart from the issue of header searching, which isn't needed by C expire but would be significant for the news readers) are the very modest saving in space for small files when using current compression algorithms, the awkwardness of applying general Unix tools to non-ASCII files, and above all the large overheads it would add to news reception and reading.

--
"We must choose: the stars or     Henry Spencer @ U of Toronto Zoology
the dust.  Which shall it be?"    {allegra,ihnp4,decvax,pyramid}!utzoo!henry
sewilco@meccts.UUCP (04/19/87)
Most posters are assuming that "compression" requires the `compress` program or algorithm. That shouldn't be assumed (except by someone who has implemented the idea using the L-Z algorithm :-).

--
Scot E. Wilcoxon  (guest account)    {ihnp4,amdahl,dayton}!meccts!sewilco
(612)825-2607     sewilco@MECC.COM                  ihnp4!meccts!sewilco

It may be the event of the century, but "Supernova 1987A" isn't a good catchword.