seth@ctr.columbia.edu (Seth Robertson) (10/25/89)
Hi.

I run a Cnews machine with a few high-speed (NNTP) feeds.  My problem is
that two of them have excessive (> 80%) duplication.  Here is part of one
day's summary file:

-----------------------------------------------------------------------
NNTP receipt statistics:  4330 total articles.

                             %       %       %
Host Name               from    to      dup     from    to      dup
======================= ======= ======= ======= ======= ======= =======
cica                     31.940  67.252  25.565    1383    2912     475
ginosko                  11.732  32.841  80.873     508    1422    2148
gem.mps.ohio-state.edu   12.956  75.404  77.640     561    3265    1948
uakari.primate.wisc.edu  37.460   0.000  21.262    1622       0     438
-----------------------------------------------------------------------

I switched from dbz back to dbm because with dbz I was getting around 95%
duplication on *all* of my feeds, but I still have some problems on my
fast feeds.

Does anyone have any suggestions?  Thanks.

					-Seth Robertson
					 seth@ctr.columbia.edu
coolidge@brutus.cs.uiuc.edu (John Coolidge) (10/26/89)
seth@ctr.columbia.edu (Seth Robertson) writes:
>I run a Cnews machine with a few high-speed (NNTP) feeds. My problem is that
>two of them have excessive (> 80%) duplication.
>Here is part of one day's summary file:
>
>                             %       %       %
>Host Name               from    to      dup     from    to      dup
>======================= ======= ======= ======= ======= ======= =======
>cica                     31.940  67.252  25.565    1383    2912     475
>ginosko                  11.732  32.841  80.873     508    1422    2148
>gem.mps.ohio-state.edu   12.956  75.404  77.640     561    3265    1948
>uakari.primate.wisc.edu  37.460   0.000  21.262    1622       0     438
>-----------------------------------------------------------------------
>Does anyone have any suggestions? Thanks.

The problem is that with REALLY fast feeds, even processing articles once
a minute is not fast enough.  The problem lies in nntpd accepting multiple
copies because the queue hasn't been run yet.

There are a couple of possible solutions: have nntpd hand articles straight
to relaynews (cuts out dups like mad, but really messes up performance), or
find a way to make newsrun run more than once a minute.  I've adopted the
second solution by writing a newsrun daemon in perl that runs about every
10 sec (alas, my code is nowhere near the point where I'd release it).

There are also a couple of possible solutions which require a lot more
effort.  One is to have nntpd log articles as it receives them, and check
against both that log and the history file before taking in a new article.
The problem comes in telling nntpd that an article it logged has failed
somewhere else (because of disk space, for instance).  Alternatively,
relaynews could be rebuilt as a daemon which takes articles as presented
and writes them.  This has problems with communication and with handling
crashes.

--John
--------------------------------------------------------------------------
John L. Coolidge     Internet:coolidge@cs.uiuc.edu   UUCP:uiucdcs!coolidge
Of course I don't speak for the U of I (or anyone else except myself)
Copyright 1989 John L. Coolidge.
Copying allowed if (and only if) attributed. You may redistribute this article if and only if your recipients may as well.
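[Coolidge's perl daemon was never posted, but the idea is simple enough to
sketch.  The following is a minimal sh sketch, not his code: it assumes
incoming batches land in one directory with numeric names and that
relaynews reads articles on stdin; the `one_pass` helper name and all
paths are inventions for illustration.]

```shell
# One polling pass of a newsrun-style daemon: feed every waiting batch
# in directory $1 to the command named by $2 in a SINGLE invocation
# (amortizing startup cost), and remove the batches only on success so
# a failed run leaves them in place for the next pass.
one_pass() {
    dir=$1; sink=$2
    set -- "$dir"/[0-9]*
    [ -e "$1" ] || return 0          # nothing waiting; do no work
    cat "$@" | "$sink" && rm -f "$@"
}

# The daemon itself is just this pass in a loop, e.g.:
#   while :; do one_pass /usr/spool/news/in.coming relaynews; sleep 10; done
```

[Run from cron this degenerates to the once-a-minute scheme; the whole
point of the loop is sub-minute latency without one relaynews per batch.]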
hoyt@polyslo.CalPoly.EDU (Sir Hoyt) (10/26/89)
In article <1989Oct25.205129.16397@brutus.cs.uiuc.edu> coolidge@cs.uiuc.edu writes:
>seth@ctr.columbia.edu (Seth Robertson) writes:
>>[stuff about lots of dups deleted]
>[About how to speed up article processing by feeding directly to relaynews]
>I've adopted the second solution by writing a newsrun daemon in perl that
>runs about every 10 sec (alas, my code is nowhere near the point where
>I'd release it).

	NNTP starts newsrun on *every* batch of news that arrives.  I
	think that if NNTP is getting a lot of news, it will start up
	newsrun on the batch already there, then go on to receiving
	more news.  I haven't checked this; it's just what I have
	observed.

	I for one see little use in running newsrun every 10 sec,
	especially since NNTP incoming batches are named nntp.XXXXXX
	and newsrun looks for batches named XXXXXXXXX (X = some
	numbers), and also because NNTP processes the batches right
	away.

	There is also another problem between C News and NNTP.  This
	has to do with the slight change in history format.  If you
	ever see the message "malformed history...." from NNTP, you
	just accepted a dup article.  NNTP has a routine called
	gethistinfo() (or something like that), and the C News history
	file format makes it fail.  So gethistinfo() tells NNTP to
	accept the article.

	NNTP needs some more work to work correctly with C News.....
	more things to do....

	P.S. I am talking about NNTP 1.5.6
-- 
John H. Pochmara				A career is great,
UUCP: {sdsu,voder,trwind}!polyslo!hoyt		But you can't run your
Internet: hoyt@polyslo.CalPoly.EDU		fingers through its hair
							-Graffiti 4/13/83
coolidge@brutus.cs.uiuc.edu (John Coolidge) (10/26/89)
hoyt@polyslo.CalPoly.EDU (Sir Hoyt) writes:
>In article <1989Oct25.205129.16397@brutus.cs.uiuc.edu> coolidge@cs.uiuc.edu writes:
>>seth@ctr.columbia.edu (Seth Robertson) writes:
>>>[stuff about lots of dups deleted]
>>[About how to speed up article processing by feeding directly to relaynews]
>>I've adopted the second solution by writing a newsrun daemon in perl that
>>runs about every 10 sec (alas, my code is nowhere near the point where
>>I'd release it).
>	NNTP starts newsrun on *every* batch of news that arrives.
>	I think that if NNTP is getting a lot of news, it will
>	start up newsrun on the batch already there, then go on to
>	receiving more news.  I haven't checked this; it's just what
>	I have observed.

[ARGH! Beat head semi-violently against wall :-)]

Of course you're right for the default setup.  I've had that code turned
off for so long now that I forgot it was there.  Why?  Because with 8-10
active nntpd's all giving me news, that's 8-10 invocations of newsrun EVERY
MINUTE!  (Actually, it's not that bad, since some of my connections leave
the nntp connection open continuously --- but that opens up an entirely
different can of worms wrt nntpd.)  Sure, newsrun locks against itself, but
starting 8-10 fairly large shell scripts per minute is not a good thing.
It was FAR more efficient to just run newsrun once a minute from cron ---
except that this really pushes up the number of dups.  Thus my
every-10-(actually 5)-sec newsrun daemon.

There's also a bug that we noticed in SunOS 4.0.3 involving execing things
from nntpd --- it had a bad effect on network memory.  With 8-10 active
nntpds, portmap would crash after a while, effectively crashing the
machine.

>	I for one see little use in running newsrun every 10 sec.
>	Especially since NNTP incoming batches are named nntp.XXXXXX
>	and newsrun looks for batches of the name XXXXXXXXX.

NNTP renames its batches from nntp.XXXXXX (where X is the pid of the
nntpd) to XXXXXXXXXXX (the timestamp of the batch) just before the exec of
newsrun in the default setup.

My current setup is about as optimized as possible wrt processing articles
fast: nntpd writes each article to a separate batch file (I changed the
naming scheme to avoid conflicts when more than one article appears in a
second).  Newsrund comes along every 5 sec and feeds all of the current
batches down a pipe into relaynews (this is a big win every time more than
one batch is ready --- a common thing here --- since relaynews is fairly
expensive to start).  The overall load caused by this system appears
empirically to be about 1/2 that of doing standard batching with
once-per-minute runs of newsrun, and about 1/5 that of running newsrun from
nntpd.  It's probably not a huge win for sites with <5 feeds, but for us
(10 feeds, 5 of them top-20) it's the only way to go.  The increased file
system traffic caused by writing each article as a batch trades off against
the traffic saved in dups, and the net result is slightly lower load and
much faster propagation.

>	There is also another problem between C News and NNTP.  This
>	has to do with the slight change in history format.  If
>	you ever see the message "malformed history...." from
>	NNTP you just accepted a dup article.

This is caused by expired articles: nntpd for some reason wants to check
for a filename in history, even if the article isn't going to be received
anyway.  There's a simple patch for this that was posted a while back (I
thought it made it into 1.5.6, but I guess not).  I can ship it out on
request.

--John
--------------------------------------------------------------------------
John L. Coolidge     Internet:coolidge@cs.uiuc.edu   UUCP:uiucdcs!coolidge
Of course I don't speak for the U of I (or anyone else except myself)
Copyright 1989 John L. Coolidge. Copying allowed if (and only if) attributed.
You may redistribute this article if and only if your recipients may as well.
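[Coolidge's "naming scheme to avoid conflicts when more than one article
appears in a second" isn't shown.  One way to get that effect, sketched
here in sh purely as an illustration, is to tack the writer's pid and a
persistent counter onto the timestamp; the `.seq` counter file and the
name layout are inventions, not C News or NNTP conventions.]

```shell
# next_batch_name DIR: print a batch name <epoch-seconds>.<pid>.<counter>.
# The counter lives in a dot-file so it survives the subshell of a
# command substitution; two articles in the same second from the same
# process get different counters, and two concurrent writers that race
# on the counter still differ in the pid field.
next_batch_name() {
    dir=$1
    n=$(cat "$dir/.seq" 2>/dev/null || echo 0)
    n=$((n + 1))
    echo "$n" > "$dir/.seq"
    printf '%s.%s.%04d\n' "$(date +%s)" "$$" "$n"
}
```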
henry@utzoo.uucp (Henry Spencer) (10/26/89)
In article <1989Oct25.205129.16397@brutus.cs.uiuc.edu> coolidge@cs.uiuc.edu writes:
>>I run a Cnews machine with a few high-speed (NNTP) feeds. My problem is that
>>two of them have excessive (> 80%) duplication.
>
>The problem is that with REALLY fast feeds, even processing articles
>once a minute is not fast enough. The problem lies in nntpd accepting
>multiple copies because the queue hasn't been run yet...

The fundamental problem, however, goes even deeper.  There are two
inherently conflicting desires:

1. Minimum processing latency, so that an article which has arrived is
   known as soon as possible, e.g. so that it need not be brought in
   again.  This pushes you towards processing individual articles the
   instant they show up.

2. Minimum processing overhead, so that machine resources devoted to news
   are minimized.  One major way of doing this -- one of C News's biggest
   performance wins over B News -- is to amortize setup and teardown
   overhead over multiple articles.  This means delaying processing until
   several articles can be run as a batch.

There is NO WAY to satisfy both of these desires simultaneously.  All you
can do is strike some sort of compromise, depending on your own priorities.
In particular, if you have very fast feeds and optimize for minimum
overhead, you will inevitably receive lots of articles more than once,
although C News will throw away the duplicates quite efficiently.  (I don't
really see that there is cause for great alarm about efficiently-discarded
duplicates.)  C News is generally slanted towards minimum overhead, given
our observation that B News was increasingly eating our machines alive
doing one article at a time.

(Incidentally, the oft-seen suggestion of running relaynews as a daemon
doesn't really help very much.  An article is not really received until it,
its history-file line, and the update to the relevant active-file line(s),
are flushed out to disk.  Flushing data to disk is a big part of the
setup/teardown overhead.  If you do it once per article, the overhead goes
way up.  If you batch the disk flushing, you're back to having a
significant window in which the article has been received but this fact is
not universally and positively known yet.  The relaynews daemon might be a
net improvement, but it is *not* an escape from the fundamental dilemma.)
-- 
A bit of tolerance is worth a    |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.             | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
rick@uunet.UU.NET (Rick Adams) (10/27/89)
Wouldn't having nntpd keep a history file of the articles it had accepted
that day give you both?  E.g., when nntpd processes an IHAVE, it checks to
see if the article is in the regular history file AND in its own history
file.  The nntpd history file could be purged every time expire ran.

This way you could run the batches as slowly as you wanted without getting
duplicates.

--rick
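[A sketch of what Rick describes, in sh for brevity --- a real nntpd would
do this in C with the history lookup it already has.  The idea: check an
IHAVE's message-id against the regular history file and against a private
log of articles this nntpd has accepted but which newsrun may not have
processed yet.  Both file paths and the grep-based lookup are illustrative
assumptions, not actual C News or NNTP file locations.]

```shell
# HIST is the regular history file; NNTPHIST is nntpd's own log of
# accepted-but-maybe-unprocessed message-ids, to be truncated whenever
# expire runs.  Both paths are hypothetical placeholders.
HIST=/usr/lib/news/history
NNTPHIST=/usr/lib/news/nntp.history

# seen_before MSGID: succeed if the article is in either file,
# i.e. the IHAVE should be declined as a duplicate.
seen_before() {
    grep -qF "$1" "$HIST" 2>/dev/null && return 0
    grep -qF "$1" "$NNTPHIST" 2>/dev/null
}

# record_accept MSGID: note that we have taken this article in, so a
# second offer arriving before the next newsrun is still refused.
record_accept() {
    echo "$1" >> "$NNTPHIST"
}
```

[The hard part, as noted later in the thread, is un-recording an article
that failed downstream --- which is why a "try again later" response is
also needed.]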
coolidge@brutus.cs.uiuc.edu (John Coolidge) (10/27/89)
[Apologies for reposting --- the first copy was cancelled due to header trouble]

henry@utzoo.uucp (Henry Spencer) writes:
>In article <1989Oct25.205129.16397@brutus.cs.uiuc.edu> coolidge@cs.uiuc.edu writes:
>>The problem is that with REALLY fast feeds, even processing articles
>>once a minute is not fast enough. The problem lies in nntpd accepting
>>multiple copies because the queue hasn't been run yet...
>The fundamental problem, however, goes even deeper. There are two inherently
>conflicting desires:
>1. Minimum processing latency. [...]
>2. Minimum processing overhead. [...]
>There is NO WAY to satisfy both of these desires simultaneously. All you
>can do is strike some sort of compromise, depending on your own priorities.

That's in general true.  I've found a specific exception, but I think
that's mainly because of the vast difference in cost of 1) using perl and
not sh (at the cost of some portability, but not too much), and 2) starting
just one relaynews per pass over the incoming directory, rather than one
per file as the default newsrun does.  In any case, my system (write each
incoming article as a separate file, then feed them all down a pipe into
one relaynews) seems to have done a great job of minimizing both 1 and 2.
For the moment, most of my compromising and hacking have been aimed
explicitly at goal (1).  Goal (2) has come along as an unexpected bonus.

>In particular, if you have very fast feeds and optimize for minimum overhead,
>you will inevitably receive lots of articles more than once, although C News
>will throw away the duplicates quite efficiently. (I don't really see that
>there is cause for great alarm about efficiently-discarded duplicates.)

Quite right.  Duplicates mainly cost in doing some extra disk traffic
(when I got duplicates, it was generally an entire batch full --- now each
duplicate, like everything else, is a separate file).  They don't cost
very much, but I still prefer to keep them down, in part because a low
duplicate count is a sign that our processing latency is low.

>C News is generally slanted towards minimum overhead, given our observation
>that B News was increasingly eating our machines alive doing one article at
>a time.

Thanks for the work!  With all of the performance hacks I've put in, most
of the speed of our system (Sun 3/160, SunOS 4.0.3) is due to C News.  The
overall cost of news on our machine is very low --- so low that we can use
it as a YP server and user machine along with running news, and still have
reasonable response time for everything.

>(Incidentally, the oft-seen suggestion of running relaynews as a daemon
>doesn't really help very much. [...] The relaynews daemon might be a net
>improvement, but it is *not* an escape from the fundamental dilemma.)

This I'm not so sure of (but you've probably measured it, so...).  On the
other hand, it's another part of why running one relaynews per article (or
even per batch) isn't a good idea.  Relaynews has costs: in opening active,
history, and sys, in writing out the article, and in updating active and
history.

--John
--------------------------------------------------------------------------
John L. Coolidge     Internet:coolidge@cs.uiuc.edu   UUCP:uiucdcs!coolidge
Of course I don't speak for the U of I (or anyone else except myself)
Copyright 1989 John L. Coolidge. Copying allowed if (and only if) attributed.
You may redistribute this article if and only if your recipients may as well.
zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (10/27/89)
>>(Incidentally, the oft-seen suggestion of running relaynews as a daemon
>>doesn't really help very much. An article is not really received until
>>it, its history-file line, and the update to the relevant active-file
>>line(s), are flushed out to disk. Flushing data to disk is a big part
>>of the setup/teardown overhead. If you do it once per article, the
>>overhead goes way up.

I don't believe this.  The per-batch overhead in C News is huge compared
to a couple of flushes per article.  Articles themselves are always
flushed to disk on a per-article basis.
-- 
Branch Technology  <zeeff@b-tech.ann-arbor.mi.us>
henry@utzoo.uucp (Henry Spencer) (10/27/89)
In article <1989Oct27.012640.27706@brutus.cs.uiuc.edu> coolidge@cs.uiuc.edu writes:
>2) starting just one relaynews per pass over the incoming directory,
>rather than one per file as the default newsrun does. In any case,
>my system (write each incoming article as a separate file, then
>feed them all down a pipe into one relaynews) seems to have done
>a great job...

In general, the bigger the lump you can feed to relaynews, the better.
The really ambitious could "cat * | relaynews" (more or less) rather than
using the "for" loop of the current "newsrun".

One thing you do lose, though, is error recovery -- it is no longer
possible to localize problems to a single file.  One can imagine a hybrid
approach that would pour it all in, and then *if* there was an error
message, back off and push the files in one at a time to see who was to
blame (since feeding the same stuff through twice is generally harmless,
and also fairly quick [rejection of duplicates is really fast]).  At the
time it didn't seem worth the trouble.
-- 
A bit of tolerance is worth a    |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.             | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
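[Henry's hybrid --- pour everything in at once, and re-feed file by file
only on trouble --- can be sketched in a few lines of sh.  Here `relaynews`
is stood in for by an arbitrary sink command, and the error convention
(nonzero exit on a bad batch) is an assumption for illustration; the retry
pass is safe precisely because re-fed duplicates are rejected cheaply.]

```shell
# feed_all DIR SINK: try all batches in one invocation of SINK; if that
# fails, fall back to one file at a time so the bad batch can be named.
# Articles already accepted in the aborted first pass are simply
# rejected as duplicates on the retry, so re-feeding is harmless.
feed_all() {
    dir=$1; sink=$2
    if ! cat "$dir"/* | "$sink"; then
        for f in "$dir"/*; do
            "$sink" < "$f" || echo "bad batch: $f" >&2
        done
    fi
}
```

[The common case pays only one sink startup; the per-file slow path runs
only when something actually went wrong.]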
flee@shire.cs.psu.edu (Felix Lee) (10/28/89)
Rick Adams <rick@uunet.UU.NET> writes:
> Wouldn't having nntpd keep a history file of the articles it had
> accepted that day give you both?

Yes, except you need a "maybe-I-have" NNTP response if you want reliable
article transmission.  "Maybe-I-have" means: you should try offering the
article to me again later, and see what I say then.
--
Felix Lee	flee@shire.cs.psu.edu	*!psuvax1!flee
coolidge@brutus.cs.uiuc.edu (John Coolidge) (10/28/89)
flee@shire.cs.psu.edu (Felix Lee) writes:
>Rick Adams <rick@uunet.UU.NET> writes:
>> Wouldn't having nntpd keep a history file of the articles it had
>> accepted that day give you both?
>Yes, except you need a "maybe-I-have" NNTP response, if you want
>reliable article transmission. "Maybe-I-have" means you should try
>offering the article to me later, and see what I say.

It's in the NNTP spec, more or less.  Just use NNTP response code 436
(transfer failed - try again later).  It's not entirely the truth, but it
does achieve the same result.  On the other hand, I'm not at all sure that
anything actually implements this feature...

--John
--------------------------------------------------------------------------
John L. Coolidge     Internet:coolidge@cs.uiuc.edu   UUCP:uiucdcs!coolidge
Of course I don't speak for the U of I (or anyone else except myself)
Copyright 1989 John L. Coolidge. Copying allowed if (and only if) attributed.
You may redistribute this article if and only if your recipients may as well.
coolidge@brutus.cs.uiuc.edu (John Coolidge) (10/28/89)
henry@utzoo.uucp (Henry Spencer) writes:
>In article <1989Oct27.012640.27706@brutus.cs.uiuc.edu> coolidge@cs.uiuc.edu writes:
>>[...]my system (write each incoming article as a separate file, then
>>feed them all down a pipe into one relaynews) seems to have done
>>a great job...
>In general, the bigger the lump you can feed to relaynews, the better.
>The really ambitious could "cat * | relaynews" (more or less) rather than
>using the "for" loop of the current "newsrun".

This is more or less what I do.  The current work-in-progress is a
newsrund in C that does: for each file in the incoming directory, send it
down a pipe into a waiting relaynews (started only after it's determined
that there is at least one article to be received).  I have an alpha
version that does exactly that (and nothing else: no locking, no error
logging, _nothing_).

>One thing you do lose, though, is error recovery -- it is no longer possible
>to localize problems to a single file. One can imagine a hybrid approach
>that would pour it all in, and then *if* there was an error message, back
>off and push the files in one at a time to see who was to blame (since
>feeding the same stuff through twice is generally harmless, and also fairly
>quick [rejection of duplicates is really fast]). At the time it didn't
>seem worth the trouble.

I'd already granted losing some error recovery.  I'm not planning to be as
cautious about disk space as the C News authors were, and I'm not planning
to catch errors as neatly.  Actually, I hadn't thought of this roll-back
scheme --- it seems like a very good idea that I may try to implement.  It
could be further optimized by sending some sort of "article key" down the
pipe into relaynews (perhaps "#! key <n>", with <n> being the batch file
name) --- this makes localizing errors extremely simple.  It also
increases the workload for relaynews, but perhaps not too much.

All of my changes are intended to feed the performance monster.  If at all
possible, I want to feed both mouths of the performance beast --- the
propagation delay mouth and the system load mouth.  If I can't feed both,
propagation delay gets a little extra weight on the balance, but not much
--- I can't bring our news machine to its knees (or anywhere near).  If I
lose some error recovery along the way, then I'm willing to pay the price
to get the performance.  A lot of people aren't, and frankly I'm not too
sure that there exist all that many sites who really need to optimize the
sorts of things I'm optimizing.  Most sites just want news delivered at a
reasonable pace; I suspect most sites get news at most hourly, and more
commonly once or twice a day.  On the other hand, there are the NNTP
high-speed feeds (like me!) who want all the articles processed yesterday
and everything done double-quick.  The occasional odd failed batch (and I
get very, very few of them) can be diverted into the slow lane for
attention, but most of the news should be in and out ASAP.  Dropping a bad
batch is even often an option; I've got 9 other people all trying to give
the same articles to me...

Anyway, there are several classes of performance patches:

1) Those that improve performance without any cost (in portability,
   recovery, or anything else).  Clearly these should go everywhere.
   Here lie things like Chris Myers' nntpxmit patches (to put the
   message-id in the batch file and have nntpxmit use it) and David
   Robinson's ihave patch (remove a useless file read in nntpd's ihave
   processing).

2) Those that improve performance at the cost of some portability, but
   with no loss in recovery or anything else.  Things like writing
   spacefor and newsrun in C come under this heading.  Very much worth it
   if done as an optimization, and if a portable alternative is available
   to those sites who can't run the optimized version.

3) Those that improve performance at the cost of recovery, but not at the
   cost of portability.  Changing the code in the stock newsrun to do
   "cat * | relaynews" is an example of this.  Useful, but only if the
   people using it know what they're giving up.

4) Those that improve performance at the cost of both recovery and
   portability.  Things like my newsrund-in-progress do this, since I'm
   both working in C and doing "cat * | relaynews".  Useful only if it'll
   run where you are and if you are willing to give up the reliability.

5) And, of course, things that _don't_ improve performance, even though
   someone thought they might.  I've tried a number of experiments that
   just didn't yield the benefits I thought they would.  Oh well, back to
   the drawing board...

I guess that's enough of the soapbox for now.  There's a fair bit of
philosophical ground here.  Geoff and Henry have provided a very high
performance, yet portable and reliable news system.  Those of us who still
aren't satisfied can easily go in and hack things up to run even faster
--- but it's up to us to watch our steps and not break something
important.

--John
--------------------------------------------------------------------------
John L. Coolidge     Internet:coolidge@cs.uiuc.edu   UUCP:uiucdcs!coolidge
Of course I don't speak for the U of I (or anyone else except myself)
Copyright 1989 John L. Coolidge. Copying allowed if (and only if) attributed.
You may redistribute this article if and only if your recipients may as well.