sewilco@mecc.UUCP (Scot E. Wilcoxon) (06/17/86)
It has occurred to me that disk I/O on news articles may be a little more efficient if articles tend to be stored sequentially on disk. I think there's a simple way of encouraging it as well...

When batching news, have sendbatch sort the batch file. The articles will get sorted alphabetically (the pathnames are sorted). When the batch is received, all articles in each newsgroup will be together. Depending upon the disk allocation scheme, they will tend to go sequentially in clusters on the disk. So when the articles are read, retransmitted, or expired, the activity will tend to be somewhat more sequential than it is now.

Right now, articles get passed on in the order in which they are received. I'm sure few people read news that way. Much news is handled in newsgroup-alphabetical and article-sequential order, which is what the sort will do.

Note that your system doesn't get the benefits of the sort, but rather the systems which receive your batch (although they'll pass the benefits along). Of course, if you do it you should ask them to do it also.

Other than the system-dependent disk allocation, are there any problems with this idea?
--
Scot E. Wilcoxon    Minn. Ed. Comp. Corp.    quest!mecc!sewilco
45 03 N / 93 08 W   (612)481-3507            {ihnp4,philabs}!mecc!sewilco
Successful road sign:  [Accident Reduction Project Area] [down 36%]
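[Scot's scheme is nothing more than a lexical sort of the batch file's pathnames, which groups entries by newsgroup directory. A minimal sketch, in modern Python rather than the sort(1)-driven shell of the era; the pathnames and batch-file layout here are invented for illustration:]

```python
# Sketch of sorting a news batch file (one article pathname per line).
# A plain lexical sort groups pathnames sharing a directory prefix, i.e.
# all articles in one newsgroup end up adjacent in the batch.
# Note the sort is lexical, not numeric, so article numbers of differing
# widths would not be in strict numeric order; for disk locality that
# doesn't matter, since only the grouping by directory does the work.

def sort_batch(lines):
    """Return batch-file entries sorted so same-newsgroup articles are adjacent."""
    return sorted(lines)

batch = [
    "/usr/spool/news/net/unix/1201",
    "/usr/spool/news/net/news/887",
    "/usr/spool/news/net/unix/1199",
    "/usr/spool/news/net/news/885",
]
for path in sort_batch(batch):
    print(path)
```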
roy@phri.UUCP (06/19/86)
In article <510@mecc.UUCP> sewilco@mecc.UUCP (Scot E. Wilcoxon) suggests
having sendbatch sort news batches by article filenames. Scot hopes that
this would lead to better disk performance on the receiving end.
Why is it that somebody always scoops me on good ideas? Anyway,
I've been thinking about sorted batches for the past couple of weeks, but
for a different reason.
We compress our outgoing news batches. The compress algorithm
depends on finding repeated strings which it can collapse into one copy of
the string and a pointer back to the same copy the next time it sees it
(sort of; read the _Computer_ article for more details). You could think
of compress as a software LRU cache. Caches perform better as locality of
reference increases. So, it occurred to me, if you sort the list of
filenames to be batched, you will tend to get all the articles in a single
news group in the same batch. Now all those long included quotes just
become grist for compress's cache.
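[The locality effect Roy describes can be shown with a small experiment. compress(1) used LZW; the sketch below substitutes zlib, an LZ77 coder with a 32 KB back-reference window, purely as a stand-in to illustrate the cache analogy, not as a claim about compress itself. The "articles" are synthetic: each fake newsgroup repeats its own 20 KB payload, sized so that interleaving pushes the previous same-group article out of the window.]

```python
# Demonstrate that grouping similar articles helps a limited-reach compressor.
import os
import zlib

a = os.urandom(20 * 1024)  # payload shared by all articles in fake group A
b = os.urandom(20 * 1024)  # payload shared by all articles in fake group B
articles_a = [a] * 4
articles_b = [b] * 4

# Arrival order (groups interleaved) vs. sorted order (groups adjacent).
interleaved = b"".join(x for pair in zip(articles_a, articles_b) for x in pair)
grouped = b"".join(articles_a + articles_b)

# Interleaved: each article's twin lies 40 KB back, outside zlib's 32 KB
# window, so the random payloads barely compress.  Grouped: each repeat is
# only 20 KB back, inside the window, and collapses to back-references.
print(len(zlib.compress(grouped)) < len(zlib.compress(interleaved)))  # → True
```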
"OK", you say, "that's a nifty idea; does it really work?" Well,
unfortunately, at this point, I'm going to have to waffle a bit on the
answer. I did a few quick tests and the results were not encouraging. Two
weeks ago, I took one day's worth of news and ran it through batch and
compress, with and without sorting the filenames. The sorted batches were
indeed smaller, but only by about 5-10%. Considering it doesn't take much
to sort a list of a few hundred file names, I guess this is a win, but not
the big win I had hoped for.
--
Roy Smith, {allegra,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
jerry@oliveb.UUCP (06/21/86)
In article <510@mecc.UUCP> sewilco@mecc.UUCP (Scot E. Wilcoxon) writes:
> When batching news, have sendbatch sort the batch file.
>
>Right now, articles get passed on in the order in which they are received.

Unfortunately that is not true. For sites using batching it is mostly true. For others it is not true at all. There is no guarantee that UUCP will send jobs in the order that they are queued. Also "L" news feeds can result in a locally posted article getting faster distribution.

Other people have discussed the various problems with articles showing up out of sequence. It is VERY common for a followup to an article to arrive before the original. I would suggest sorting the batches by the article posting date. In addition to moving the followups after the parent, it would tend to shorten the worst-case transmission time, as older articles would have priority.

There are two possible problems with this. The most obvious is the extra overhead associated with sorting. In particular, having to read and parse the header of every article to get its posting date would slow down the batching process considerably. An alternative would be to have rnews set the modification time of created articles to the posting date of the article. This could be done with trivial overhead, and the sort would then require only a stat(2) of each file, much faster than parsing headers. Having the mtime of the articles set to the posting time would probably be useful for other purposes. (Having the "time-warp" articles show up with a date that hasn't happened yet should be fun.)

The second problem would be the recurring one of some site suddenly flooding the net with very old articles. As the new versions of news will junk articles which are over the expiration date, this should not be too much of a problem.

Comments?

    Jerry Aguirre @ Olivetti ATC
{hplabs|fortune|idi|ihnp4|tolerant|allegra|glacier|olhqma}!oliveb!jerry
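[Jerry's scheme can be sketched briefly. The assumptions here: rnews has already stamped each article's mtime with its posting date, and the batch file holds one pathname per line. Sorting then costs one stat per file, with no header parsing; the filenames below are invented for the demonstration.]

```python
# Sort a batch by article posting date, using the file mtime as a proxy.
import os
import tempfile

def sort_batch_by_mtime(paths):
    """Oldest posting date first, so followups tend to follow their parents."""
    return sorted(paths, key=lambda p: os.stat(p).st_mtime)

# Demo: two fake articles; the followup is queued first but posted later.
spool = tempfile.mkdtemp()
parent = os.path.join(spool, "101")
followup = os.path.join(spool, "102")
for p in (followup, parent):
    open(p, "w").close()
os.utime(parent, (1000000, 1000000))    # earlier posting date
os.utime(followup, (2000000, 2000000))  # later posting date

# The parent sorts first despite being listed second.
print(sort_batch_by_mtime([followup, parent]))
```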
mangler@cit-vax.UUCP (06/26/86)
Alphabetic sorting of the batch file will send the group "control" before just about anything else. Thus, cancel messages would almost always arrive before the article they're supposed to cancel.

(Are canceled articles transmitted? Hard to take them out of the batch file once they're in...)

Don Speck   speck@vlsi.caltech.edu   seismo!cit-vax!speck
jerry@oliveb.UUCP (Jerry Aguirre) (06/27/86)
>Alphabetic sorting of the batch file will send the group "control"
>before just about anything else. Thus, cancel messages would almost
>always arrive before the article they're supposed to cancel.
>
>(Are canceled articles transmitted? Hard to take them out of the
>batch file once they're in...)
>
>Don Speck   speck@vlsi.caltech.edu   seismo!cit-vax!speck

The articles may be listed in the batch file, but if they have been canceled then they have been removed from the system and the batcher won't be able to find them. If you have ever monitored the batcher you will get messages complaining about missing files. These are presumably articles that were canceled. If they have already been batched then they are past the sorting point anyway.

I still think they should be sorted by posting date, not file name.

    Jerry Aguirre @ Olivetti ATC
{hplabs|fortune|idi|ihnp4|tolerant|allegra|glacier|olhqma}!oliveb!jerry
fair@styx.UUCP (Erik E. Fair) (06/27/86)
The problem of cancel control messages arriving before the article that they intend to cancel is solved in the versions of news after 2.10.2 by making an entry in the history file. When the article arrives later, it will be rejected as a duplicate.

    Erik E. Fair    styx!fair    fair@lll-tis-b.arpa
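[The history-file mechanism Erik describes can be sketched as a set of seen message-IDs. This is a toy model, an assumption about the behavior rather than the real news history-file format: a cancel records its target's message-ID, so the target is treated as already seen when it finally arrives.]

```python
# Toy model of cancel-before-article handling via a history of message-IDs.
history = set()

def receive(msg_id, is_cancel=False):
    """Record msg_id in the history; reject anything already seen."""
    if msg_id in history:
        return "rejected as duplicate"
    history.add(msg_id)
    return "cancelled" if is_cancel else "accepted"

# Cancel arrives first and enters the history; the article is then rejected.
receive("<510@mecc.UUCP>", is_cancel=True)
print(receive("<510@mecc.UUCP>"))  # → rejected as duplicate
```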
mangler@cit-vax.Caltech.Edu (System Mangler) (06/28/86)
In article <2376@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> [...] I took one day's worth of news and ran it through batch and
> compress, with and without sorting the filenames. The sorted batches were
> indeed smaller, but only about 5-10% so. [...]

Roy's article compresses to 70% of its original size. Two copies concatenated compress to 60%. So even quoting an entire article doesn't do much to the compression ratio. Getting a 5-10% better compression ratio is awfully good, I think. I guess I'll try Roy's suggestion of sorting the filenames.

Perhaps the compression ratio is affected more by the size of the batch than by the degree of repetition... What's the limit on the size of a batch?

Don Speck   speck@vlsi.caltech.edu
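[Don's doubling experiment is easy to reproduce in spirit. Again zlib stands in for compress(1) (an assumption; compress used LZW), and the "article" is an invented body, so the exact 70%/60% figures are not reproduced, only the shape of the effect: a second copy compresses to almost nothing, so the ratio improves, but the saving is bounded by one compressed copy.]

```python
# Concatenating a second copy of a text improves the overall ratio only
# modestly, because the first copy already pays the full cost.
import zlib

article = ("Newsgroups: net.news\nSubject: sorted batches\n\n"
           "I took one day's worth of news and ran it through batch "
           "and compress, with and without sorting the filenames.\n") * 16
article = article.encode()

one = len(zlib.compress(article))       # one copy
two = len(zlib.compress(article * 2))   # two copies concatenated

# Two copies compress to well under twice the size of one copy.
print(two < 2 * one)  # → True
```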
ahby@meccts.UUCP (Shane P. McCarron) (06/28/86)
In article <764@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>Roy's article compresses to 70% of its original size. Two copies
>concatenated compresses to 60%. So even quoting an entire article
>doesn't do much to the compression ratio. Getting a 5-10% better
>compression ratio is awfully good, I think. I guess I'll try Roy's
>suggestion of sorting the filenames.

Another thing which could help would be not just to sort the batch file, but actually to group articles by subject. This will add some CPU overhead to the batching process, but if like articles are grouped together, compress should have a better chance of reducing the size. I am writing some code to do this, and will post it eventually.
--
Shane P. McCarron           UUCP   ihnp4!meccts!ahby
MECC Technical Services     ATT    (612) 481-3589
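[Shane's code isn't shown in the thread, so the following is an assumed approach, not his implementation: read each article's Subject: header, strip any "Re: " prefixes, and order the batch so articles on one subject sit together. The pathnames and headers are invented.]

```python
# Group a batch of articles by normalized Subject: header.

def subject_key(header_lines):
    """Extract the Subject: value, stripping any number of 'Re:' prefixes."""
    for line in header_lines:
        if line.lower().startswith("subject:"):
            s = line[len("subject:"):].strip()
            while s.lower().startswith("re:"):
                s = s[3:].strip()
            return s.lower()
    return ""

def group_by_subject(articles):
    """articles: list of (pathname, header_lines) pairs.
    Returns pathnames ordered so same-subject articles are adjacent."""
    return [p for p, hdr in sorted(articles, key=lambda a: subject_key(a[1]))]

articles = [
    ("/news/1", ["Subject: sorted batches"]),
    ("/news/2", ["Subject: mod.map postings"]),
    ("/news/3", ["Subject: Re: sorted batches"]),
]
print(group_by_subject(articles))  # the followup lands next to its parent
```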
wedgingt@udenva.UUCP (Will Edgington/Ejeo) (07/01/86)
In article <2376@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> [...] I took one day's worth of news and ran it through batch and
> compress, with and without sorting the filenames. The sorted batches were
> indeed smaller, but only about 5-10% so. [...]

There's another reason to do so with BSD 4.3 (and presumably Ultrix v1.2): sorting the filenames will take direct advantage of kernel improvements in directory searching (instead of always returning to the start of a directory to search for a file, the kernel now starts where the last search left off via a caching scheme). Also note that sorting by subject will *NOT* take full advantage of the new kernel routines.
--
------------------------------------------------------------------------
|"Input!"|"Number 5 is Alive!!"|"Beautiful software"|"Not disassemble!!"
------------------------------------------------------------------------
Will Edgington|Computing and Information Resources|University of Denver|
(303) 871-2081|Work: BusAd 469, 2020 S. Race, Denver CO 80210/----------
(303) 722-5738|Home: 2035 So. Josephine #102, Denver CO 80210|BSD 4.2 >
-------------------------------------------------------------/ System 5
{{hplabs,seismo}!hao,ucbvax!nbires,boulder,cires,cisden}!udenva!wedgingt
COMING SOON: wedgingt@eos.cair.du.edu, wedgingt@eos.BITNET ( == udenva )
In article <2376@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes: > [...] I took one day's worth of news and ran it through batch and > compress, with and without sorting the filenames. The sorted batches were > indeed smaller, but only about 5-10% so. [...] There's another reason to do so with BSD 4.3 (and presumably Ultrix v1.2): sorting the filenames will take direct advantage of kernel improvements in directory searching (instead of always returning to the start of a directory to search for a file, the kernel now starts where the last search left off via a cacheing scheme). Also note that sorting by subject will *NOT* take full advantage of the new kernel routines. -- ------------------------------------------------------------------------ |"Input!"|"Number 5 is Alive!!"|"Beautiful software"|"Not disassemble!!" ------------------------------------------------------------------------ Will Edgington|Computing and Information Resources|University of Denver| (303) 871-2081|Work: BusAd 469, 2020 S. Race, Denver CO 80210/---------- (303) 722-5738|Home: 2035 So. Josephine #102, Denver CO 80210|BSD 4.2 > -------------------------------------------------------------/ System 5 {{hplabs,seismo}!hao,ucbvax!nbires,boulder,cires,cisden}!udenva!wedgingt COMING SOON: wedgingt@eos.cair.du.edu, wedgingt@eos.BITNET ( == udenva )