[news.software.b] Cnews and old news

pete@mimsy.umd.edu (Pete Cottrell) (09/23/90)

	I am running Bnews 2.11.14 and have HISTEXP set to 28 days. One of 
my major feeds is running Cnews and several times in the last few weeks has 
been feeding me lots of old articles. One time I got about 10,000 in 2
or 3  days, and I have gotten several thousand more in the last 3 days.
The latest batch are mostly 30-40 days old, but I've seen some going back 
to July and June. They all have the same Path leading up to mimsy:

mimsy!haven!decuac!e2big.mko.dec.com!bacchus.pa.dec.com!decwrl

On the other side of decwrl are various hosts, so it seems a good bet that
one of the DEC machines listed above is recycling the news.
	I have heard that Cnews doesn't have any mechanism to junk the
old news and prevent its propagation, but I don't know for sure since
I'm not familiar with it. (I'll probably go to it sometime this fall - I
seem to remember Henry stating that they would be reorganizing the release,
and I was waiting for that. Any updates on this?) So, Cnews experts, how
does Cnews handle the question of old news? Is there anything I or my feeds
can do to prevent my news partition from filling up with junk?
-- 
Spoken: Pete Cottrell 	UUCP: uunet!mimsy!pete      INTERNET: pete@cs.umd.edu  
Phone: 301-454-2025	USPS: U of Maryland, CS Dept., College Park, Md 20742

henry@zoo.toronto.edu (Henry Spencer) (09/23/90)

In article <26675@mimsy.umd.edu> pete@mimsy.umd.edu (Pete Cottrell) writes:
>	I have heard that Cnews doesn't have any mechanism to junk the
>old news and prevent its propagation, but I don't know for sure ...

Sigh.  Unfortunately correct.  The trouble is that getdate() is relatively
costly and Geoff is reluctant to run it on every single article just on
the off-chance that it might be too old.  (There is also a lesser problem
in that it is *not* easy to pick a good number for "how old is too old?";
on the fringes of Usenet, fairly long propagation delays are not unheard-of.)

We had hoped that keeping more history, using dbz's more efficient history
database, would be sufficient.  I think the answer is no.
-- 
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday|  henry@zoo.toronto.edu   utzoo!henry

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (09/23/90)

>>	I have heard that Cnews doesn't have any mechanism to junk the
>>old news and prevent its propagation, but I don't know for sure ...
>
>We had hoped that keeping more history, using dbz's more efficient history
>database, would be sufficient.  I think the answer is no.

I agree.  A netwide meltdown will happen sooner or later if C news doesn't
have such a check.  I'd make it a #define so that leaf sites (which
tend to have slower machines) don't have to run it.  Speedups to
getdate() might also be possible.
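
A minimal sketch of what such a compile-time switch might look like.  The
macro STALE_CHECK, the cutoff MAX_AGE_DAYS, and the function
article_is_stale() are hypothetical names chosen for illustration; none of
them come from C news, and the 28-day cutoff is only an assumed value.

```c
#include <time.h>

/* Hypothetical build-time switch: a leaf site could compile with
 * -DSTALE_CHECK=0 to leave the check out entirely. */
#ifndef STALE_CHECK
#define STALE_CHECK 1
#endif

/* Hypothetical cutoff in days; choosing a good value is the hard part. */
#define MAX_AGE_DAYS 28L

/* Return 1 if an article posted at `posted` is too old at time `now`. */
int article_is_stale(time_t posted, time_t now)
{
#if STALE_CHECK
    return difftime(now, posted) > (double)MAX_AGE_DAYS * 24.0 * 60.0 * 60.0;
#else
    (void)posted;
    (void)now;
    return 0;
#endif
}
```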

-- 
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us

emv@math.lsa.umich.edu (Edward Vielmetti) (09/24/90)

In article <1990Sep23.042833.24834@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

   Sigh.  

Sigh indeed.

   We had hoped that keeping more history, using dbz's more efficient history
   database, would be sufficient.  I think the answer is no.

I've upped my history to 45 days, in the hopes that's enough.  I should
know by October some time....

--Ed

Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>
moderator, comp.archives

"No funding, no fixing."

flee@dictionopolis.cs.psu.edu (Felix Lee) (09/24/90)

>The trouble is that getdate() is relatively costly

How about adding a "Machine-Readable-Date:" header in Unix time or
network time?  Would it be worth it even if only a few sites added
such a field?
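
A sketch of how cheaply such a header could be read, assuming its value is
a plain decimal count of seconds since the Unix epoch.  The header does not
exist in any standard; parse_mr_date() is a hypothetical name, and the
case-sensitive name match is a simplification.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Parse a hypothetical "Machine-Readable-Date: <seconds>" header line.
 * Returns 0 and stores the seconds since the Unix epoch in *out on
 * success, or -1 if the line is some other header or is malformed. */
int parse_mr_date(const char *line, long *out)
{
    static const char name[] = "Machine-Readable-Date:";
    size_t n = sizeof name - 1;
    char *end;
    long v;

    if (strncmp(line, name, n) != 0)
        return -1;
    errno = 0;
    v = strtol(line + n, &end, 10);     /* strtol skips leading blanks */
    if (end == line + n || errno == ERANGE || v < 0)
        return -1;                      /* no digits, overflow, or negative */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;
    if (*end != '\0')
        return -1;                      /* trailing junk after the number */
    *out = v;
    return 0;
}
```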
--
Felix Lee	flee@cs.psu.edu

wb8foz@mthvax.cs.miami.edu (David Lesher) (09/24/90)

In <Fvrnzd92@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:

>>The trouble is that getdate() is relatively costly

>How about adding a "Machine-Readable-Date:" header in Unix time or
>network time?  Would it be worth it even if only a few sites added
>such a field?

Could the original much_flamed Cnews message_ID be considered a
"Machine-Readable-Date" so that it carried two pieces of data in
one header?

-- 
A host is a host from coast to coast.....wb8foz@mthvax.cs.miami.edu 
& no one will talk to a host that's close............(305) 255-RTFM
Unless the host (that isn't close)......................pob 570-335
is busy, hung or dead....................................33257-0335

rickert@mp.cs.niu.edu (Neil Rickert) (09/24/90)

In article <Fvrnzd92@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:
>>The trouble is that getdate() is relatively costly
>
>How about adding a "Machine-Readable-Date:" header in Unix time or
>network time?  Would it be worth it even if only a few sites added
>such a field?
>
 How about checking the date, but making the date check optional?  It
could be controlled by an environment variable set in ${NEWSCTL}/bin/config
so that it would be easy to switch on and off.  That would allow sites
which have neighbors who periodically release old news to throw it out.
Other sites would not need to do anything.
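
 A sketch of the run-time switch, assuming a variable named NEWSDATECHECK
that the config file exports with the value "yes".  Both the variable name
and its value are hypothetical; C news defines no such setting.

```c
#include <stdlib.h>
#include <string.h>

/* Interpret a setting string: only the exact value "yes" turns the
 * check on, so an unset or misspelled variable leaves it off. */
int flag_is_on(const char *value)
{
    return value != NULL && strcmp(value, "yes") == 0;
}

/* Return 1 if the hypothetical NEWSDATECHECK variable enables the check. */
int date_check_enabled(void)
{
    return flag_is_on(getenv("NEWSDATECHECK"));
}
```

 Defaulting to off means sites that never set the variable see no change
in behaviour.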

 Incidentally, in looking at some of the approximately 8 Meg of month-old
news we received over the last two days, I notice that the Path header has
some sites shown twice.  Isn't there supposed to be a check for this?
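
 A sketch of one way to spot a repeated site in a Path value.  The function
name path_has_repeat() is hypothetical, the comparison is case-sensitive,
and every '!'-separated component is treated alike (including the trailing
user name); this is not how any particular news system does it.

```c
#include <string.h>

/* Return 1 if any component appears more than once in a '!'-separated
 * Path value, 0 otherwise.  Quadratic, but Path lines are short. */
int path_has_repeat(const char *path)
{
    const char *a = path;

    while (*a != '\0') {
        size_t alen = strcspn(a, "!");  /* length of the current component */
        const char *b = a + alen;

        while (*b == '!') {             /* compare with each later component */
            size_t blen;

            b++;
            blen = strcspn(b, "!");
            if (alen > 0 && alen == blen && strncmp(a, b, alen) == 0)
                return 1;
            b += blen;
        }
        a += alen;
        if (*a == '!')
            a++;
    }
    return 0;
}
```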

-- 
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
  Neil W. Rickert, Computer Science               <rickert@cs.niu.edu>
  Northern Illinois Univ.
  DeKalb, IL 60115.                                  +1-815-753-6940

jef@well.sf.ca.us (Jef Poskanzer) (09/24/90)

In the referenced message, henry@zoo.toronto.edu (Henry Spencer) wrote:
}The trouble is that getdate() is relatively
}costly and Geoff is reluctant to run it on every single article

...and then all sorts of people started coming up with Rube Goldberg
schemes to avoid parsing dates.  However, it turns out that even using
C news's getdate (which is 10% slower than the B news version), parsing
the dates in every article in a full Usenet feed takes about five Sun 3
CPU seconds per day.  And if you were to use the lex-based date parser
included in the MH distribution, you could get it down below a second
per day, although it hardly seems worth the (minimal) effort.

Would anyone care to post a patch to have C news do the date check?
Seems like it should be about two lines of code.
---
Jef

  Jef Poskanzer  jef@well.sf.ca.us  {ucbvax, apple, hplabs}!well!jef
"Everything that needs to be said has already been said.  But since no
   one was listening, everything must be said again." -- Andre Gide

tneff@bfmny0.BFM.COM (Tom Neff) (09/24/90)

In article <1990Sep23.042833.24834@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>Sigh.  Unfortunately correct.  The trouble is that getdate() is relatively
>costly and Geoff is reluctant to run it on every single article just on
>the off-chance that it might be too old.  

As currently written, getdate() is a marvelously general parser which
does a wonderful job in the general case but may be passing up an
opportunity for fast-parsing of the most common, well-behaved cases.
Might it be possible to add a 'fast check' for date strings of the form

	dd Mmm yy hh:mm:ss {GMT|EST|EDT|PST|PDT}

and bypass the complete state parser if this matches?  Surely 90% of
articles would pass the cheap test, and Cnews might then be able to
afford to check article dates during rnews processing.
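
A sketch of such a fast path, under stated assumptions: fast_date() and
days_from_civil() are hypothetical names, two-digit years 70-99 are taken
as 19yy and 00-69 as 20yy, and anything that does not match exactly is
rejected so the caller can fall back to the general getdate().

```c
#include <stdio.h>
#include <string.h>

/* Days from 1 Jan 1970 to the given Gregorian date (m is 1..12). */
static long days_from_civil(long y, int m, int d)
{
    long era, yoe, doy, doe;

    y -= m <= 2;                        /* treat Jan and Feb as months 11, 12 */
    era = (y >= 0 ? y : y - 399) / 400;
    yoe = y - era * 400;
    doy = (153L * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;
    doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097L + doe - 719468L;
}

/* Returns 0 and stores seconds since the epoch (UTC) in *out if `s` has
 * the form "dd Mmm yy hh:mm:ss ZONE"; returns -1 otherwise. */
int fast_date(const char *s, long *out)
{
    static const char months[] = "JanFebMarAprMayJunJulAugSepOctNovDec";
    static const struct { const char *name; int hours_west; } zones[] = {
        {"GMT", 0}, {"EST", 5}, {"EDT", 4}, {"PST", 8}, {"PDT", 7}
    };
    int dd, yy, hh, mi, ss, used = 0;
    char mmm[4], zone[4];
    const char *p;
    size_t i;

    if (sscanf(s, "%2d %3s %2d %2d:%2d:%2d %3s%n",
               &dd, mmm, &yy, &hh, &mi, &ss, zone, &used) != 7)
        return -1;
    if (s[used] != '\0' && s[used] != '\n')
        return -1;                      /* trailing junk, e.g. a longer zone */
    p = strstr(months, mmm);
    if (strlen(mmm) != 3 || p == NULL || (p - months) % 3 != 0)
        return -1;                      /* not a three-letter month name */
    if (dd < 1 || dd > 31 || yy < 0 || hh < 0 || hh > 23 ||
        mi < 0 || mi > 59 || ss < 0 || ss > 59)
        return -1;
    for (i = 0; i < sizeof zones / sizeof zones[0]; i++) {
        if (strcmp(zone, zones[i].name) == 0) {
            long year = (yy >= 70 ? 1900L : 2000L) + yy;  /* assumed pivot */
            long days = days_from_civil(year, (int)((p - months) / 3) + 1, dd);

            *out = days * 86400L
                 + (hh + zones[i].hours_west) * 3600L + mi * 60L + ss;
            return 0;
        }
    }
    return -1;                          /* unknown zone: use the full parser */
}
```

Dates with a leading weekday, a four-digit year, or a numeric zone simply
fail the match and take the slow path.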

Just a thought.

-- 
To have a horror of the bourgeois   (\(    Tom Neff
is bourgeois. -- Jules Renard        )\)   tneff@bfmny0.BFM.COM

henry@zoo.toronto.edu (Henry Spencer) (09/24/90)

In article <Fvrnzd92@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:
>>The trouble is that getdate() is relatively costly
>
>How about adding a "Machine-Readable-Date:" header in Unix time or
>network time?  Would it be worth it even if only a few sites added
>such a field?

Any header change for this would take years to have real impact, I'm
afraid.  There is too much inertia in the net.
-- 
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday|  henry@zoo.toronto.edu   utzoo!henry

henry@zoo.toronto.edu (Henry Spencer) (09/24/90)

In article <20730@well.sf.ca.us> Jef Poskanzer <jef@well.sf.ca.us> writes:
>... it turns out that even using
>C news's getdate (which is 10% slower than the B news version)

I'm a little curious to know where the timing difference comes in, because
C News getdate *is* the B News getdate.  The only thing we did to it was
a small fix that John Gilmore (I think) pointed out, for an unchecked array
bound.

> parsing
>the dates in every article in a full Usenet feed takes about five Sun 3
>CPU seconds per day...

The question is not absolute time but relative time:  how much does it add
to the time needed to file articles?  I am frankly surprised that it's
this quick, actually; some investigation is in order.

>Would anyone care to post a patch to have C news do the date check?
>Seems like it should be about two lines of code.

It's not quite that easy.  C News doesn't even notice the Date header
at present; we don't even parse headers we don't need to know about.
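
A sketch of the extra step this implies: locating the Date header in the
header block before anything can be parsed.  The function name
find_date_header() is hypothetical and the code is not taken from relaynews;
it assumes unfolded headers and matches the name case-sensitively.

```c
#include <stddef.h>
#include <string.h>

/* Find the value of the Date header in a block of article headers.
 * Returns a pointer into `hdrs` at the first character of the value and
 * stores its length (newline excluded) in *len.  Returns NULL if no Date
 * header appears before the blank line that ends the headers. */
const char *find_date_header(const char *hdrs, size_t *len)
{
    const char *line = hdrs;

    while (*line != '\0' && *line != '\n') {    /* stop at the blank line */
        const char *eol = strchr(line, '\n');

        if (eol == NULL)
            eol = line + strlen(line);
        if (strncmp(line, "Date:", 5) == 0) {
            const char *v = line + 5;

            while (v < eol && (*v == ' ' || *v == '\t'))
                v++;
            *len = (size_t)(eol - v);
            return v;
        }
        line = (*eol == '\n') ? eol + 1 : eol;
    }
    return NULL;
}
```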
-- 
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday|  henry@zoo.toronto.edu   utzoo!henry

flee@guardian.cs.psu.edu (Felix Lee) (09/25/90)

> parsing the dates in every article in a full Usenet feed takes about
> five Sun 3 CPU seconds per day.

Hmm.  Parsing 6000 dates takes 3 cpu seconds on a Sun 4/490, and 12
cpu seconds on an IBM RT.  Is a Sun 3 really that fast?

12 cpu seconds is probably minuscule; I don't have timings for
relaynews handy.
--
Felix Lee	flee@cs.psu.edu

michael@fts1.uucp (Michael Richardson) (09/25/90)

In article <1990Sep23.042833.24834@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>on the fringes of Usenet, fairly long propagation delays are not unheard-of.)
>
>We had hoped that keeping more history, using dbz's more efficient history
>database, would be sufficient.  I think the answer is no.

  One problem that I've been having recently (news feeds have been changing
around a lot in Ottawa. At one point I was getting a near-complete newsfeed
from two places, and passing it on --- UNBATCHED. Whoops.) is that the dbz
routines hit the ulimit (which seems to be set around 1.5meg for reasons that
I haven't quite figured out... Probably something to do with newsrun getting
started from cron.) and start dying. 
  Since this is SVR3.2, I have to run setnewsids in front of relaynews, so
I stuck a 'ulimit(2,ulimit(1,1)*2);' in there and recompiled.
  I've also reduced the number of days of history that are kept. 
  It might be a good idea to stick some code into setnewsids to deal with
this situation. I personally would rather just have disk quotas...
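
  A slightly more defensive sketch of the same idea, assuming a system that
provides <ulimit.h> with UL_GETFSIZE and UL_SETFSIZE (the symbolic forms of
commands 1 and 2).  The function names are hypothetical, the limit is
counted in 512-byte blocks (so about 1.5meg is 3072 blocks), and raising
the limit normally needs privilege.

```c
#include <limits.h>
#include <ulimit.h>

/* Double a limit given in 512-byte blocks, saturating at LONG_MAX
 * rather than overflowing when the limit is already huge. */
long doubled_limit(long blocks)
{
    if (blocks <= 0)
        return blocks;
    return blocks > LONG_MAX / 2 ? LONG_MAX : blocks * 2;
}

/* Try to double the process file-size limit.  Returns the new limit in
 * blocks, or -1 if it cannot be read or raised. */
long raise_file_limit(void)
{
    long cur = ulimit(UL_GETFSIZE);

    if (cur == -1)
        return -1;
    return ulimit(UL_SETFSIZE, doubled_limit(cur));
}
```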

  I haven't checked cron to see what the ulimit is when it runs.
  I'm going to try rebooting in case cron was stopped at some point and
then restarted manually for some reason (I'm not the only sysadmin here).

-- 
   :!mcr!:            |  'Golf sucks anyway --- give the land to the
   Michael Richardson |    Indians'  --- recent CANACHAT comment.
 Play: mcr@julie.UUCP Work: michael@fts1.UUCP Fido: 1:163/109.10 1:163/138
    Amiga----^     - Pay attention only to _MY_ opinions. -   ^--Amiga--^

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (09/25/90)

>Might it be possible to add a 'fast check' for date strings of the form
>
>	dd Mmm yy hh:mm:ss {GMT|EST|EDT|PST|PDT}

Not only that, but for this particular use, just the dd Mmm yy info is
probably enough.  Up to 24 hrs of error isn't going to make much difference
in "is this article way too old".
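
A sketch of a day-granularity check that reads only the dd Mmm yy prefix
and ignores the time and zone.  The names stale_by_day() and day_number()
are hypothetical, two-digit years 70-99 are assumed to mean 19yy, and the
leap-year shortcut is only valid for 1970 through 2099.

```c
#include <stdio.h>
#include <string.h>

/* Days since 1 Jan 1970 for a date in 1970..2099; mon is 0..11. */
static long day_number(int year, int mon, int dd)
{
    static const int before[12] =
        {0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334};
    long days = (year - 1970) * 365L + (year - 1969) / 4 + before[mon] + dd - 1;

    if (mon > 1 && year % 4 == 0)
        days++;                         /* past 29 Feb in a leap year */
    return days;
}

/* `today` is the current day number, e.g. time(NULL) / 86400.
 * Returns 1 if the article is more than max_days old, 0 if it is not,
 * and -1 if the string does not start with "dd Mmm yy". */
int stale_by_day(const char *date, long today, long max_days)
{
    static const char months[] = "JanFebMarAprMayJunJulAugSepOctNovDec";
    int dd, yy;
    char mmm[4];
    const char *p;

    if (sscanf(date, "%2d %3s %2d", &dd, mmm, &yy) != 3)
        return -1;
    p = strstr(months, mmm);
    if (strlen(mmm) != 3 || p == NULL || (p - months) % 3 != 0 ||
        dd < 1 || dd > 31 || yy < 0)
        return -1;
    return today - day_number((yy >= 70 ? 1900 : 2000) + yy,
                              (int)((p - months) / 3), dd) > max_days;
}
```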


-- 
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us

henry@zoo.toronto.edu (Henry Spencer) (09/25/90)

In article <1990Sep25.025244.27071@fts1.uucp> michael@fts1.uucp (Michael Richardson) writes:
>... the dbz
>routines hit the ulimit... and start dying. 
>  It might be a good idea to stick some code into setnewsids to deal with
>this situation...

Unfortunately, the existence of ulimit() is system-dependent.  Sane systems
don't have it.  See further comments about it in notebook/problems.
-- 
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday|  henry@zoo.toronto.edu   utzoo!henry

urlichs@smurf.sub.org (Matthias Urlichs) (09/27/90)

In news.software.b, article <1990Sep24.153234.26911@zoo.toronto.edu>,
  henry@zoo.toronto.edu (Henry Spencer) writes:
< In article <Fvrnzd92@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:
< >>The trouble is that getdate() is relatively costly
< >
Could it be possible that shipping lots of outdated articles around the world
is even more costly?  ;-)
(Of course, there are different definitions of "cost" involved, but you'll
 get the idea.)

< >How about adding a "Machine-Readable-Date:" header in Unix time or
< >network time?  Would it be worth it even if only a few sites added
< >such a field?
< 
< Any header change for this would take years to have real impact, I'm
< afraid.  There is too much inertia in the net.

The header would be required for C News only, and I suspect that it would get
around rather more quickly.

On the other hand, if only getdate would support a more machine-palatable
date format...
(On the third hand, it isn't that costly anyway. And, while you're at it, how
 about getdate()ing the Expires: header, and writing that number into the 
 history file?)

-- 
Matthias Urlichs -- urlichs@smurf.sub.org -- urlichs@smurf.ira.uka.de     /(o\
Humboldtstrasse 7 - 7500 Karlsruhe 1 - FRG -- +49+721+621127(0700-2330)   \o)/

henry@zoo.toronto.edu (Henry Spencer) (09/28/90)

In article <xhrpf2.!p3@smurf.sub.org> urlichs@smurf.sub.org (Matthias Urlichs) writes:
>(On the third hand, it isn't that costly anyway. And, while you're at it, how
> about getdate()ing the Expires: header, and writing that number into the 
> history file?)

We're looking at the problem, including Tom Neff's suggestion.  As for
the Expires: header, why bother?  Expire will do the getdate() the first
time it runs, and doing it earlier is one more unnecessary delay in relaynews.
-- 
Imagine life with OS/360 the standard  | Henry Spencer at U of Toronto Zoology
operating system.  Now think about X.  |  henry@zoo.toronto.edu   utzoo!henry