pete@mimsy.umd.edu (Pete Cottrell) (09/23/90)
I am running Bnews 2.11.14 and have HISTEXP set to 28 days.  One of my
major feeds is running Cnews, and several times in the last few weeks it
has been feeding me lots of old articles.  One time I got about 10,000 in
2 or 3 days, and I have gotten several thousand more in the last 3 days.
The latest batch are mostly 30-40 days old, but I've seen some going back
to July and June.  They all have the same Path leading up to mimsy:

	mimsy!haven!decuac!e2big.mko.dec.com!bacchus.pa.dec.com!decwrl

On the other side of decwrl are various hosts, so it seems a good bet that
one of the DEC machines listed above is recycling the news.  I have heard
that Cnews doesn't have any mechanism to junk old news and prevent its
propagation, but I don't know for sure since I'm not familiar with it.
(I'll probably go to it sometime this fall; I seem to remember Henry
stating that they would be reorganizing the release, and I was waiting for
that.  Any updates on this?)

So, Cnews experts, how does Cnews handle the question of old news?  Is
there anything I or my feeds can do to prevent my news partition from
filling up with junk?
--
Spoken: Pete Cottrell		UUCP: uunet!mimsy!pete
INTERNET: pete@cs.umd.edu	Phone: 301-454-2025
USPS: U of Maryland, CS Dept., College Park, Md 20742
henry@zoo.toronto.edu (Henry Spencer) (09/23/90)
In article <26675@mimsy.umd.edu> pete@mimsy.umd.edu (Pete Cottrell) writes:
>	I have heard that Cnews doesn't have any mechanism to junk the
>old news and prevent its propagation, but I don't know for sure ...

Sigh.  Unfortunately correct.  The trouble is that getdate() is relatively
costly, and Geoff is reluctant to run it on every single article just on
the off-chance that it might be too old.  (There is also a lesser problem:
it is *not* easy to pick a good number for "how old is too old?"; on the
fringes of Usenet, fairly long propagation delays are not unheard-of.)

We had hoped that keeping more history, using dbz's more efficient history
database, would be sufficient.  I think the answer is no.
--
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday| henry@zoo.toronto.edu   utzoo!henry
zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (09/23/90)
>> I have heard that Cnews doesn't have any mechanism to junk the
>>old news and prevent its propagation, but I don't know for sure ...
>
>We had hoped that keeping more history, using dbz's more efficient history
>database, would be sufficient.  I think the answer is no.

I agree; a netwide meltdown will happen sooner or later if it doesn't have
such a check.  I'd make it a #define so that leaf sites (which tend to
have slower machines) don't have to run it.  Speedups to getdate() might
be possible also.
--
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us
emv@math.lsa.umich.edu (Edward Vielmetti) (09/24/90)
In article <1990Sep23.042833.24834@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>Sigh.

Sigh indeed.

>We had hoped that keeping more history, using dbz's more efficient history
>database, would be sufficient.  I think the answer is no.
I've upped my history to 45 days, in the hopes that's enough. I should
know by October some time....
--Ed
Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>
moderator, comp.archives
"No funding, no fixing."
flee@dictionopolis.cs.psu.edu (Felix Lee) (09/24/90)
>The trouble is that getdate() is relatively costly
How about adding a "Machine-Readable-Date:" header in Unix time or
network time? Would it be worth it even if only a few sites added
such a field?
--
Felix Lee flee@cs.psu.edu
wb8foz@mthvax.cs.miami.edu (David Lesher) (09/24/90)
In <Fvrnzd92@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:
>>The trouble is that getdate() is relatively costly
>How about adding a "Machine-Readable-Date:" header in Unix time or
>network time?  Would it be worth it even if only a few sites added
>such a field?

Could the original much-flamed Cnews message-ID be considered a
"Machine-Readable-Date", so that it carried two pieces of data in one
header?
--
A host is a host from coast to coast.....wb8foz@mthvax.cs.miami.edu
& no one will talk to a host that's close............(305) 255-RTFM
Unless the host (that isn't close)......................pob 570-335
is busy, hung or dead....................................33257-0335
rickert@mp.cs.niu.edu (Neil Rickert) (09/24/90)
In article <Fvrnzd92@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:
>>The trouble is that getdate() is relatively costly
>
>How about adding a "Machine-Readable-Date:" header in Unix time or
>network time?  Would it be worth it even if only a few sites added
>such a field?

How about checking the date, but making the date check optional?  It could
be controlled by an environment variable set in ${NEWSCTL}/bin/config, so
that it would be easy to switch on and off.  That would let sites whose
neighbors periodically release old news throw it out; other sites would
not need to do anything.

Incidentally, in looking at some of the approximately 8 MB of month-old
news we received over the last two days, I notice that the Path header has
some sites shown twice.  Isn't there supposed to be a check for this?
--
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
  Neil W. Rickert, Computer Science               <rickert@cs.niu.edu>
  Northern Illinois Univ.                         +1-815-753-6940
  DeKalb, IL 60115.
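[Editor's sketch: the opt-in gating Neil describes could look something like the following.  DATECHECK is a hypothetical variable name, not an actual C News control, and this shows only the on/off switch, not the date check itself.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of an optional date check: gate it on an environment
 * variable (DATECHECK is a hypothetical name) that a site could set
 * or unset in its news configuration, e.g. ${NEWSCTL}/bin/config.
 * Unset or "off" means skip the check entirely, so leaf sites pay
 * nothing. */
static int
datecheck_enabled(void)
{
	const char *v = getenv("DATECHECK");

	return v != NULL && strcmp(v, "off") != 0;
}
```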
jef@well.sf.ca.us (Jef Poskanzer) (09/24/90)
In the referenced message, henry@zoo.toronto.edu (Henry Spencer) wrote:
}The trouble is that getdate() is relatively
}costly and Geoff is reluctant to run it on every single article

...and then all sorts of people started coming up with Rube Goldberg
schemes to avoid parsing dates.  However, it turns out that even using C
news's getdate (which is 10% slower than the B news version), parsing the
dates in every article in a full Usenet feed takes about five Sun 3 CPU
seconds per day.  And if you were to use the lex-based date parser
included in the MH distribution, you could get it down below a second per
day, although it hardly seems worth the (minimal) effort.

Would anyone care to post a patch to have C news do the date check?  Seems
like it should be about two lines of code.
---
Jef

 Jef Poskanzer  jef@well.sf.ca.us  {ucbvax, apple, hplabs}!well!jef
"Everything that needs to be said has already been said.  But since no
 one was listening, everything must be said again." -- Andre Gide
tneff@bfmny0.BFM.COM (Tom Neff) (09/24/90)
In article <1990Sep23.042833.24834@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>Sigh.  Unfortunately correct.  The trouble is that getdate() is relatively
>costly and Geoff is reluctant to run it on every single article just on
>the off-chance that it might be too old.

As currently written, getdate() is a marvelously general parser which does
a wonderful job in the general case, but it may be passing up an
opportunity to fast-parse the most common, well-behaved cases.  Might it
be possible to add a 'fast check' for date strings of the form

	dd Mmm yy hh:mm:ss {GMT|EST|EDT|PST|PDT}

and bypass the complete state parser if this matches?  Surely 90% of
articles would pass the cheap test, and Cnews might then be able to afford
to check article dates during rnews processing.  Just a thought.
--
To have a horror of the bourgeois   (\(    Tom Neff
is bourgeois. -- Jules Renard        )\)   tneff@bfmny0.BFM.COM
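[Editor's sketch: Tom's fast check could be written roughly as below.  This is a hypothetical helper, not C News code; it deliberately ignores the time zone, which is harmless for a coarse "way too old?" test but unacceptable for a real date parser, and anything it cannot match (e.g. the RFC 822 form with a leading weekday) would have to fall through to the full getdate().]

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Try to recognize the common "dd Mmm yy hh:mm:ss ZONE" form with a
 * single sscanf() before falling back to the general parser.
 * Returns (time_t)-1 when the string doesn't match the fast form;
 * the caller should then invoke getdate() (or give up). */
static time_t
fastdate(const char *s)
{
	static const char *months[] = { "Jan", "Feb", "Mar", "Apr",
	    "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" };
	char mon[4], zone[4];
	int day, yy, hh, mm, ss, i;
	struct tm tm;

	if (sscanf(s, "%d %3s %d %d:%d:%d %3s",
	    &day, mon, &yy, &hh, &mm, &ss, zone) != 7)
		return (time_t)-1;
	for (i = 0; i < 12; i++)
		if (strcmp(mon, months[i]) == 0)
			break;
	if (i == 12)
		return (time_t)-1;

	memset(&tm, 0, sizeof tm);
	tm.tm_mday = day;
	tm.tm_mon = i;
	tm.tm_year = (yy < 100) ? yy : yy - 1900;  /* 90 or 1990 */
	tm.tm_hour = hh;
	tm.tm_min = mm;
	tm.tm_sec = ss;
	tm.tm_isdst = 0;
	/* NOTE: the zone is ignored; a few hours of error doesn't
	 * matter when asking "is this article weeks old?". */
	return mktime(&tm);
}
```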
henry@zoo.toronto.edu (Henry Spencer) (09/24/90)
In article <Fvrnzd92@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:
>>The trouble is that getdate() is relatively costly
>
>How about adding a "Machine-Readable-Date:" header in Unix time or
>network time?  Would it be worth it even if only a few sites added
>such a field?

Any header change for this would take years to have real impact, I'm
afraid.  There is too much inertia in the net.
--
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday| henry@zoo.toronto.edu   utzoo!henry
henry@zoo.toronto.edu (Henry Spencer) (09/24/90)
In article <20730@well.sf.ca.us> Jef Poskanzer <jef@well.sf.ca.us> writes:
>... it turns out that even using
>C news's getdate (which is 10% slower than the B news version)

I'm a little curious to know where the timing difference comes in, because
C News getdate *is* the B News getdate.  The only thing we did to it was a
small fix that John Gilmore (I think) pointed out, for an unchecked array
bound.

> parsing
>the dates in every article in a full Usenet feed takes about five Sun 3
>CPU seconds per day...

The question is not absolute time but relative time: how much does it add
to the time needed to file articles?  I am frankly surprised that it's
this quick, actually; some investigation is in order.

>Would anyone care to post a patch to have C news do the date check?
>Seems like it should be about two lines of code.

It's not quite that easy.  C News doesn't even notice the Date header at
present; we don't even parse headers we don't need to know about.
--
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday| henry@zoo.toronto.edu   utzoo!henry
flee@guardian.cs.psu.edu (Felix Lee) (09/25/90)
> parsing the dates in every article in a full Usenet feed takes about
> five Sun 3 CPU seconds per day.

Hmm.  Parsing 6000 dates takes 3 CPU seconds on a Sun 4/490, and 12 CPU
seconds on an IBM RT.  Is a Sun 3 really that fast?

12 CPU seconds is probably minuscule; I don't have timings for relaynews
handy.
--
Felix Lee	flee@cs.psu.edu
michael@fts1.uucp (Michael Richardson) (09/25/90)
In article <1990Sep23.042833.24834@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>on the fringes of Usenet, fairly long propagation delays are not unheard-of.)
>
>We had hoped that keeping more history, using dbz's more efficient history
>database, would be sufficient.  I think the answer is no.

One problem that I've been having recently (news feeds have been changing
around a lot in Ottawa; at one point I was getting a near-complete
newsfeed from two places and passing it on --- UNBATCHED.  Whoops.) is
that the dbz routines hit the ulimit (which seems to be set around 1.5 MB
for reasons that I haven't quite figured out --- probably something to do
with newsrun getting started from cron) and start dying.  Since this is
SVR3.2, I have to run setnewsids in front of relaynews, so I stuck a
'ulimit(2,ulimit(1,1)*2);' in there and recompiled.  I've also reduced the
number of days of history that are kept.

It might be a good idea to stick some code into setnewsids to deal with
this situation.  I personally would rather just have disk quotas...

I haven't checked out cron to see what the ulimit is when it is run.  I'm
going to try rebooting in case cron was stopped at some point and then
restarted manually for some reason (I'm not the only sysadmin here).
--
   :!mcr!:            | 'Golf sucks anyway --- give the land to the
   Michael Richardson |  Indians' --- recent CANACHAT comment.
 Play: mcr@julie.UUCP Work: michael@fts1.UUCP Fido: 1:163/109.10 1:163/138
 Amiga----^   - Pay attention only to _MY_ opinions. -   ^--Amiga--^
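[Editor's sketch: the same idea expressed with the BSD/POSIX getrlimit()/setrlimit() interface, which is the route on systems that lack the SVID ulimit() call.  The function name is made up, and "raise the soft file-size limit to the hard limit" is just one plausible policy, not what C News does.]

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the process file-size soft limit as far as an unprivileged
 * process may (i.e. up to the hard limit) before exec'ing relaynews,
 * in the spirit of the 'ulimit(2,ulimit(1,1)*2)' hack above.
 * Returns 0 on success, -1 on failure. */
static int
raise_fsize_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_FSIZE, &rl) != 0)
		return -1;
	if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur < rl.rlim_max)
		rl.rlim_cur = rl.rlim_max;
	return setrlimit(RLIMIT_FSIZE, &rl);
}
```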
zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (09/25/90)
>Might it be possible to add a 'fast check' for date strings of the form
>
>	dd Mmm yy hh:mm:ss {GMT|EST|EDT|PST|PDT}

Not only that, but for this particular use, just the dd Mmm yy info is
probably enough.  Up to 24 hours of error isn't going to make much
difference in "is this article way too old?".
--
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us
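[Editor's sketch: Jon's day-granularity version could be even cheaper, since it needn't build a real time_t at all.  This is a hypothetical helper: it maps "dd Mmm yy" onto a crude monotonic day number, padding every month to 31 days and ignoring leap years, so it is off by a few days at most --- fine for "way too old?", useless for anything else.]

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Crude day number from a date string beginning "dd Mmm yy".
 * Returns -1 when the string doesn't look like that form; the
 * caller compares the result against today's number minus the
 * cutoff (e.g. 28 days). */
static long
daynum(const char *date)
{
	static const char months[] = "JanFebMarAprMayJunJulAugSepOctNovDec";
	char mon[4];
	const char *p;
	int dd, yy;

	if (sscanf(date, "%d %3s %d", &dd, mon, &yy) != 3)
		return -1;
	p = strstr(months, mon);
	if (p == NULL || (p - months) % 3 != 0)
		return -1;
	if (yy >= 1900)
		yy -= 1900;		/* accept 4-digit years too */
	/* months padded to 31 days: monotonic, though inexact */
	return (long)yy * 365 + ((p - months) / 3) * 31L + dd;
}
```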
henry@zoo.toronto.edu (Henry Spencer) (09/25/90)
In article <1990Sep25.025244.27071@fts1.uucp> michael@fts1.uucp (Michael Richardson) writes:
>... the dbz
>routines hit the ulimit... and start dying.
> It might be a good idea to stick some code into setnewsids to deal with
>this situation...

Unfortunately, the existence of ulimit() is system-dependent.  Sane
systems don't have it.  See further comments about it in
notebook/problems.
--
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday| henry@zoo.toronto.edu   utzoo!henry
urlichs@smurf.sub.org (Matthias Urlichs) (09/27/90)
In news.software.b, article <1990Sep24.153234.26911@zoo.toronto.edu>,
henry@zoo.toronto.edu (Henry Spencer) writes:
< In article <Fvrnzd92@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:
< >>The trouble is that getdate() is relatively costly

Could it be possible that shipping lots of outdated articles around the
world is even more costly?  ;-)  (Of course, there are different
definitions of "cost" involved, but you'll get the idea.)

< >How about adding a "Machine-Readable-Date:" header in Unix time or
< >network time?  Would it be worth it even if only a few sites added
< >such a field?
<
< Any header change for this would take years to have real impact, I'm
< afraid.  There is too much inertia in the net.

The header would be required for C News only, and I suspect that it would
get around rather more quickly.  On the other hand, if only getdate would
support a more machine-palatable date format...

(On the third hand, it isn't that costly anyway.  And, while you're at it,
how about getdate()ing the Expires: header, and writing that number into
the history file?)
--
Matthias Urlichs -- urlichs@smurf.sub.org -- urlichs@smurf.ira.uka.de /(o\
Humboldtstrasse 7 - 7500 Karlsruhe 1 - FRG -- +49+721+621127(0700-2330) \o)/
henry@zoo.toronto.edu (Henry Spencer) (09/28/90)
In article <xhrpf2.!p3@smurf.sub.org> urlichs@smurf.sub.org (Matthias Urlichs) writes:
>(On the third hand, it isn't that costly anyway.  And, while you're at it, how
> about getdate()ing the Expires: header, and writing that number into the
> history file?)

We're looking at the problem, including Tom Neff's suggestion.  As for the
Expires: header, why bother?  Expire will do the getdate() the first time
it runs, and it's one more unnecessary delay in relaynews.
--
Imagine life with OS/360 the standard  | Henry Spencer at U of Toronto Zoology
operating system.  Now think about X.  | henry@zoo.toronto.edu   utzoo!henry