[net.news.b] Sorting batches helpful?

sewilco@mecc.UUCP (Scot E. Wilcoxon) (06/17/86)

It has occurred to me that disk I/O on news articles may be a little
more efficient if articles tend to be stored sequentially on disk.
I think there's a simple way of encouraging it as well...

	When batching news, have sendbatch sort the batch file.

The articles will get sorted alphabetically (the pathnames are sorted).
When the batch is received, all articles in each newsgroup will be together.
Depending upon the disk allocation scheme, they will tend to go sequentially
in clusters on the disk.  So when the articles are read, retransmitted, or
expired, the activity will tend to be somewhat more sequential than it is now.

Right now, articles get passed on in the order in which they are received.
I'm sure few people read news that way.  Much news is handled in
newsgroup-alphabetical, article-sequential order, which is exactly the
order the sort will produce.
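
For concreteness, here is a minimal sketch of the sort step, assuming
the batch file is simply one article pathname per line.  This is a
hypothetical standalone filter, not the real sendbatch internals:

/*
 * sortbatch.c - hypothetical filter: sort a news batch file.
 * Assumes one article pathname per line on stdin; writes the
 * sorted list to stdout.  A sketch, not the real sendbatch.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINES 10000                  /* plenty for one batch */

static int cmp(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

int main(void)
{
    static char *lines[MAXLINES];
    char buf[1024];
    int n = 0, i;

    while (n < MAXLINES && fgets(buf, sizeof buf, stdin) != NULL)
        lines[n++] = strdup(buf);

    /* alphabetic sort; pathnames embed the newsgroup directory,
       so this clusters each newsgroup's articles together */
    qsort(lines, n, sizeof lines[0], cmp);

    for (i = 0; i < n; i++)
        fputs(lines[i], stdout);
    return 0;
}

Run it as "sortbatch < batchfile > batchfile.sorted".  Since each
pathname embeds its newsgroup directory, a plain string sort is all it
takes to cluster each newsgroup's articles.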

Note that your system doesn't get the benefits of the sort; rather, the
systems which receive your batch do (although they'll pass the benefits
along).  Of course, if you do it, you should ask them to do it too.

Other than the system-dependent disk allocation, are there any
problems with this idea?
-- 

Scot E. Wilcoxon  Minn. Ed. Comp. Corp.             quest!mecc!sewilco
45 03 N / 93 08 W   (612)481-3507         {ihnp4,philabs}!mecc!sewilco
Successful road sign: [Accident Reduction Project Area] [down 36%]

roy@phri.UUCP (06/19/86)

In article <510@mecc.UUCP> sewilco@mecc.UUCP (Scot E. Wilcoxon) suggests
having sendbatch sort news batches by article filenames.  Scot hopes that
this would lead to better disk performance on the receiving end.

	Why is it that somebody always scoops me on good ideas?  Anyway,
I've been thinking about sorted batches for the past couple of weeks, but
for a different reason.

	We compress our outgoing news batches.  The compress algorithm
depends on finding repeated strings which it can collapse into one copy of
the string and a pointer back to the same copy the next time it sees it
(sort of; read the _Computer_ article for more details).  You could think
of compress as a software LRU cache.  Caches perform better as locality of
reference increases.  So, it occurred to me, if you sort the list of
filenames to be batched, you will tend to get all the articles in a single
newsgroup in the same batch.  Now all those long included quotes just
become grist for compress's cache.
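
	The locality effect is easy to demonstrate with any dictionary
coder.  Below is a minimal sketch using zlib's compress() as a modern
stand-in for the LZW compress of the day (zlib, the fake "article"
text, and the buffer sizes are all assumptions for illustration):

/*
 * locality.c - show why grouped quoting helps a dictionary coder.
 * zlib's compress() stands in for LZW compress; the text is fake.
 * Build: cc locality.c -lz
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

static unsigned long packed_size(const unsigned char *buf, unsigned long len)
{
    uLongf outlen = compressBound(len);
    unsigned char *out = malloc(outlen);

    compress(out, &outlen, buf, len);   /* ignore errors in a sketch */
    free(out);
    return outlen;
}

int main(void)
{
    char one[4096], two[8192];
    int i;

    /* fake "article": the same line repeated, like quoted text */
    one[0] = '\0';
    for (i = 0; i < 16; i++)
        strcat(one, "This line plays the part of quoted article text.\n");
    sprintf(two, "%s%s", one, one);     /* a followup quoting all of it */

    printf("one copy:   %lu -> %lu bytes\n", (unsigned long)strlen(one),
           packed_size((unsigned char *)one, strlen(one)));
    printf("two copies: %lu -> %lu bytes\n", (unsigned long)strlen(two),
           packed_size((unsigned char *)two, strlen(two)));
    return 0;
}

The doubled buffer should pack to well under twice the size of the
single copy, which is exactly the gain sorted batches are chasing.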

	"OK", you say, "that's a nifty idea; does it really work?"  Well,
unfortunately, at this point, I'm going to have to waffle a bit on the
answer.  I did a few quick tests and the results were not encouraging.  Two
weeks ago, I took one day's worth of news and ran it through batch and
compress, with and without sorting the filenames.  The sorted batches were
indeed smaller, but only by about 5-10%.  Considering it doesn't take much
to sort a list of a few hundred filenames, I guess this is a win, but not
the big win I had hoped for.
-- 
Roy Smith, {allegra,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

jerry@oliveb.UUCP (06/21/86)

In article <510@mecc.UUCP> sewilco@mecc.UUCP (Scot E. Wilcoxon) writes:
>	When batching news, have sendbatch sort the batch file.
>
>Right now, articles get passed on in the order in which they are received.

Unfortunately that is not true.  For sites using batching it is mostly
true.  For others it is not true at all.  There is no guarantee that
UUCP will send jobs in the order that they are queued.  Also "L" news
feeds can result in a locally posted article getting faster
distribution.

Other people have discussed the various problems with articles showing
up out of sequence.  It is VERY common for a followup to an article to
arrive before the original.

I would suggest sorting the batches by the article posting date.  In
addition to moving the followups after the parent, it would tend to
shorten the worst-case transmission time, as older articles would have
priority.

There are two possible problems with this.  The most obvious of these is
the extra overhead associated with sorting.  In particular, having to
read and parse the header of all the articles to get their posting dates
would slow down the batching process considerably.  An alternative would
be to have rnews set the modification time of created articles to the
posting date of the article.  This could be done with trivial overhead
and the sort would require only a stat(2) of each file, much
faster than parsing headers.  Having the mtime of the articles set to
the posting time would probably be useful for other purposes.  (Having
the "time-warp" articles show up with a date that hasn't happened yet
should be fun.)
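
Here is a minimal sketch of both pieces, assuming rnews has already
parsed the Date: header into a time_t.  The names are hypothetical, not
the actual news sources, and a real batcher would stat each file once
up front instead of inside the comparison:

/*
 * Sketch: stamp each article with its posting date, then sort on
 * mtime.  set_posting_time() would be called from rnews once the
 * Date: header is parsed; by_mtime() is a qsort(3) comparison for
 * the batcher.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <utime.h>
#include <time.h>

int set_posting_time(const char *path, time_t posted)
{
    struct utimbuf ut;

    ut.actime = posted;
    ut.modtime = posted;        /* mtime = posting date: one cheap syscall */
    return utime(path, &ut);
}

/* oldest posting date first */
int by_mtime(const void *a, const void *b)
{
    struct stat sa, sb;

    if (stat(*(char *const *)a, &sa) < 0 ||
        stat(*(char *const *)b, &sb) < 0)
        return 0;               /* missing (canceled?) article */
    return (sa.st_mtime > sb.st_mtime) - (sa.st_mtime < sb.st_mtime);
}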

The second problem would be the recurring one of some site suddenly
flooding the net with very old articles.  Since the new versions of news
will junk articles which are past their expiration date, this should not
be too much of a problem.

Comments?

					Jerry Aguirre @ Olivetti ATC
{hplabs|fortune|idi|ihnp4|tolerant|allegra|glacier|olhqma}!oliveb!jerry

mangler@cit-vax.UUCP (06/26/86)

Alphabetic sorting of the batch file will send the group "control"
before just about anything else.  Thus, cancel messages would almost
always arrive before the article they're supposed to cancel.

(Are canceled articles transmitted?  Hard to take them out of the
batch file once they're in...)

Don Speck	speck@vlsi.caltech.edu	seismo!cit-vax!speck

jerry@oliveb.UUCP (Jerry Aguirre) (06/27/86)

>Alphabetic sorting of the batch file will send the group "control"
>before just about anything else.  Thus, cancel messages would almost
>always arrive before the article they're supposed to cancel.
>
>(Are canceled articles transmitted?  Hard to take them out of the
>batch file once they're in...)
>
>Don Speck	speck@vlsi.caltech.edu	seismo!cit-vax!speck

The articles may be listed in the batch file, but if they have been
canceled then they have been removed from the system and the batcher
won't be able to find them.  If you have ever monitored the batcher, you
will have seen messages complaining about missing files.  These are
presumably articles that were canceled.

If they have already been batched then they are past the sorting point
anyway.

I still think they should be sorted by posting date, not file name.

					Jerry Aguirre @ Olivetti ATC
{hplabs|fortune|idi|ihnp4|tolerant|allegra|glacier|olhqma}!oliveb!jerry

fair@styx.UUCP (Erik E. Fair) (06/27/86)

The problem of cancel control messages arriving before the article that
they intend to cancel is solved in versions of news later than 2.10.2 by
making an entry in the history file when the cancel arrives.  When the
article arrives later, it will be rejected as a duplicate.
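
A rough sketch of that bookkeeping, assuming a flat history file of one
Message-ID per line (the real history file keeps more fields per entry,
and the path below is only the traditional location):

/*
 * Sketch of the history-file trick for early cancels.  A cancel
 * for an unseen article records the ID anyway, so the article
 * itself is later rejected as a duplicate.
 */
#include <stdio.h>
#include <string.h>

#define HISTFILE "/usr/lib/news/history"        /* assumed path */

int seen_before(const char *msgid)
{
    FILE *fp = fopen(HISTFILE, "r");
    char line[512];
    int found = 0;

    if (fp == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        found = (strcmp(line, msgid) == 0);
    }
    fclose(fp);
    return found;
}

void remember(const char *msgid)
{
    FILE *fp = fopen(HISTFILE, "a");

    if (fp != NULL) {
        fprintf(fp, "%s\n", msgid);
        fclose(fp);
    }
}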

	Erik E. Fair	styx!fair	fair@lll-tis-b.arpa

mangler@cit-vax.Caltech.Edu (System Mangler) (06/28/86)

In article <2376@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> [...] I took one day's worth of news and ran it through batch and
> compress, with and without sorting the filenames.  The sorted batches were
> indeed smaller, but only about 5-10% so. [...]

Roy's article compresses to 70% of its original size.  Two copies
concatenated compress to 60%.  So even quoting an entire article
doesn't do much to the compression ratio.  Getting a 5-10% better
compression ratio is awfully good, I think.  I guess I'll try Roy's
suggestion of sorting the filenames.

Perhaps the compression ratio is affected more by the size of the
batch than by the degree of repetition... What's the limit on the
size of a batch?

Don Speck	speck@vlsi.caltech.edu

ahby@meccts.UUCP (Shane P. McCarron) (06/28/86)

In article <764@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>Roy's article compresses to 70% of its original size.  Two copies
>concatenated compresses to 60%.  So even quoting an entire article
>doesn't do much to the compression ratio.  Getting a 5-10% better
>compression ratio is awfully good, I think.  I guess I'll try Roy's
>suggestion of sorting the filenames.

Another thing which could help in this would be to not just sort the
batch file, but actually group articles by subject.  This will add
some CPU overhead to the batching process, but if like articles are
grouped together, compress should have a better chance of reducing the
size.  I am writing some code to do this, and will post it
eventually.
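
In the meantime, here is a rough sketch of the idea, assuming the batch
file holds article pathnames and that one open/read per article is
acceptable overhead (the names are hypothetical, not the code I will
post):

/*
 * Sketch: group batch entries by Subject: before batching, so
 * followup threads (and their quoted text) sit together for
 * compress.  Hypothetical helpers, not production code.
 */
#include <stdio.h>
#include <string.h>

struct entry {
    char *path;                 /* article pathname from the batch file */
    char subject[256];
};

void read_subject(struct entry *e)
{
    FILE *fp = fopen(e->path, "r");
    char line[512];

    e->subject[0] = '\0';
    if (fp == NULL)
        return;                 /* canceled article, most likely */
    /* headers end at the first blank line */
    while (fgets(line, sizeof line, fp) != NULL && line[0] != '\n') {
        if (strncmp(line, "Subject: ", 9) == 0) {
            char *s = line + 9;

            while (strncmp(s, "Re: ", 4) == 0)
                s += 4;         /* drop "Re: " so followups sort with parent */
            s[strcspn(s, "\n")] = '\0';
            strncpy(e->subject, s, sizeof e->subject - 1);
            e->subject[sizeof e->subject - 1] = '\0';
            break;
        }
    }
    fclose(fp);
}

/* qsort(3) comparison on the saved subjects */
int by_subject(const void *a, const void *b)
{
    return strcmp(((const struct entry *)a)->subject,
                  ((const struct entry *)b)->subject);
}

Sorting the entries with qsort(entries, n, sizeof *entries, by_subject)
then puts every followup in a thread, with all its repeated quoted text,
side by side in the batch; a real version would probably sort on
newsgroup first and subject second.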
-- 
Shane P. McCarron			UUCP	ihnp4!meccts!ahby
MECC Technical Services			ATT	(612) 481-3589

wedgingt@udenva.UUCP (Will Edgington/Ejeo) (07/01/86)

In article <2376@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> [...] I took one day's worth of news and ran it through batch and
> compress, with and without sorting the filenames.  The sorted batches were
> indeed smaller, but only about 5-10% so. [...]

  There's another reason to do so with BSD 4.3 (and presumably Ultrix
v1.2): sorting the filenames takes direct advantage of a kernel
improvement in directory searching (instead of always returning to the
start of a directory to search for a file, the kernel now starts where
the last search left off, via a caching scheme).  Also note that sorting
by subject will *NOT* take full advantage of the new kernel routines.
-- 
------------------------------------------------------------------------
|"Input!"|"Number 5 is Alive!!"|"Beautiful software"|"Not disassemble!!"
------------------------------------------------------------------------
Will Edgington|Computing and Information Resources|University of Denver|
(303) 871-2081|Work: BusAd 469, 2020 S. Race, Denver CO 80210/----------
(303) 722-5738|Home: 2035 So. Josephine #102, Denver CO 80210|BSD 4.2 >
-------------------------------------------------------------/ System 5
{{hplabs,seismo}!hao,ucbvax!nbires,boulder,cires,cisden}!udenva!wedgingt
COMING SOON: wedgingt@eos.cair.du.edu, wedgingt@eos.BITNET ( == udenva )