[news.software.b] news software speedup

stuart@bms-at.UUCP (Stuart D. Gathman) (03/23/88)

I have been using the Bnews software for about 2 years now.  It is
very useful for in-house as well as usenet purposes.

I have had some ideas about implementing the article storage.  Please
send flames via E-mail.  (Really!)

A major problem with the current system is scanning article headers
in many separate files.  (And unix doesn't like big directories to boot.)
My idea is to have 2 files per newsgroup directory (other than sub-
directories).  All headers for a newsgroup would be in one file
with offsets into another file containing all articles.
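
Roughly, I picture each header-file record as a fixed-size thing along
these lines (the names are invented purely for illustration; nothing
like this exists in the current release):

/*
 * Hypothetical fixed-size record in the per-newsgroup header file,
 * pointing at the article text kept in the companion data file.
 */
#include <time.h>

struct hdr_rec {
    long   art_offset;      /* lseek() offset of the article in the data file */
    long   art_length;      /* length of the article text in bytes            */
    long   art_number;      /* article number, as seen by the news readers    */
    time_t arrived;         /* arrival time; replaces the history entry       */
    char   msgid[64];       /* Message-ID, for duplicate rejection            */
    char   subject[64];     /* Subject: line, so readers need not seek        */
};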

Processing incoming news would then be faster.  Programs like 'vn' would
be orders of magnitude faster.  The only problem is 'expire'.  I maintain
that 'expire' would still be reasonable.  It would work by reading the
header file and writing a new version for each newsgroup.  It can seek
past articles to be deleted while copying the ones to be retained
to a new file.  When finished, move the new versions into place.  This
needs to be done only one newsgroup at a time, so there is no disk
space problem.  Not only that, but no history file is needed!  (The
actual arrival time can be stored in the header file if desired.)  Since
expire is run on a batch basis, its decreased performance is not
a problem.  (At least compared with the current slow performance of
interactive programs.)
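
As a sketch only (using the same made-up record layout as above, with
most error handling omitted), the per-newsgroup expire pass might look
something like this:

#include <stdio.h>
#include <time.h>

struct hdr_rec {                     /* the hypothetical record from above */
    long   art_offset, art_length, art_number;
    time_t arrived;
    char   msgid[64], subject[64];
};

static void copy_bytes(FILE *from, long off, long len, FILE *to)
{
    char buf[8192];
    size_t chunk, n;

    fseek(from, off, SEEK_SET);
    while (len > 0) {
        chunk = len < (long)sizeof buf ? (size_t)len : sizeof buf;
        if ((n = fread(buf, 1, chunk, from)) == 0)
            break;
        fwrite(buf, 1, n, to);
        len -= (long)n;
    }
}

/*
 * Copy the unexpired records and articles of one newsgroup into
 * headers.new/articles.new, then move the new versions into place.
 */
int expire_group(time_t cutoff)
{
    FILE *oh = fopen("headers", "r"),     *od = fopen("articles", "r");
    FILE *nh = fopen("headers.new", "w"), *nd = fopen("articles.new", "w");
    struct hdr_rec r;
    long oldoff;

    if (oh == NULL || od == NULL || nh == NULL || nd == NULL)
        return -1;
    while (fread(&r, sizeof r, 1, oh) == 1) {
        if (r.arrived < cutoff)          /* expired: just skip it */
            continue;
        oldoff = r.art_offset;
        r.art_offset = ftell(nd);        /* article lands at a new offset */
        copy_bytes(od, oldoff, r.art_length, nd);
        fwrite(&r, sizeof r, 1, nh);
    }
    fclose(oh); fclose(od); fclose(nh); fclose(nd);
    rename("headers.new", "headers");
    rename("articles.new", "articles");
    return 0;
}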

The time to read the header file is comparable
to that required to read the current history files.  It can be reduced
by storing only some headers in the header file.  The rest can be stored
in the article file.  Checking an incoming article to see if it is
already present would be faster than the current SysV scheme of the
ten history files, because a single header file will be much smaller than
a tenth of the history data.  (A dbm file could still be used as well.)
This assumes that the newsgroup information is consistent across duplicate
articles.  If only the article ID can be relied on, a history
database of some form is still necessary.

This method also uses fewer inodes, which matters on small systems.

The only improvement on this method I can think of would be to introduce
an indexing file structure that handles variable length records.  But
this loses the advantage of simplicity.
-- 
Stuart D. Gathman	<stuart@bms-at.uucp>
			<..!{vrdxhq|daitc}!bms-at!stuart>

lyndon@ncc.UUCP (Lyndon Nerenberg) (03/24/88)

In article <649@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:

   A major problem with the current system is scanning article headers
   in many separate files.  (And unix doesn't like big directories to boot.)
   My idea is to have 2 files per newsgroup directory (other than sub-
   directories).  All headers for a newsgroup would be in one file
   with offsets into another file containing all articles.

   Processing incoming news would then be faster.  Programs like 'vn' would
   be orders of magnitude faster.  The only problem is 'expire'.  I maintain
   that 'expire' would still be reasonable.  It would work by reading the
   header file and writing a new version for each newsgroup.  It can seek
   past articles to be deleted while copying the ones to be retained
   to a new file.  When finished, move the new versions into place.  This
   needs to be done only one newsgroup at a time, so there is no disk
   space problem.

There is some potential for trouble on System V implementations using
this method. Most USG systems are configured with a ulimit of 1 meg.
If you exceed this limit within a particular newsgroup, articles will
be lost when the append to the file fails.

There are two solutions: 

	1) Increase the ulimit by reconfiguring the kernel, or
	2) Make rnews setuid to root, allowing it to increase
	   its ulimit value.

In many cases, 1) is impossible (or very difficult), depending on
how the vendor has packaged their USG port. Also, an administrator
may have a good reason for maintaining a small(er) ulimit value.

As for making rnews setuid to root, I have to be a bit nervous about
the idea given the recent discussion on setuid security. I would want
to look at the rnews source quite closely before implementing it in 
this manner.

Not that I'm trying to knock the idea - it looks promising.
-- 
--lyndon  {alberta,uunet}!ncc!lyndon  lyndon%ncc@uunet.uu.net

wunder@hpcea.CE.HP.COM (Walter Underwood) (03/24/88)

   / hpcea:news.software.b / stuart@bms-at.UUCP (Stuart D. Gathman) / 12:17 pm  Mar 22, 1988 /
   My idea is to have 2 files per newsgroup directory (other than sub-
   directories).  All headers for a newsgroup would be in one file
   with offsets into another file containing all articles.
   ----------

Congratulations.  You have just invented Notesfiles.  Of course,
Ray Essick invented it in 1981.

Didn't somebody else re-invent uuencode two weeks ago?

wunder

rsalz@bbn.com (Rich Salz) (03/24/88)

>In article <649@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:
>   A major problem with the current system is scanning article headers
>   in many separate files.  (And unix doesn't like big directories to boot.)
>   My idea is to have 2 files per newsgroup directory (other than sub-
>   directories).  All headers for a newsgroup would be in one file
>   with offsets into another file containing all articles.

This is basically how notes stores stuff.

It can get you some speed benefits, but makes updates (e.g., expire,
cancelled articles, new articles) much harder, and some things much
slower.  Currently, to add a new article, you need merely lock the active
file for a moment to bump a number.  This way you'd have to lock two big
files for a significant amount of time.

You'd also have to change the .newsrc (and all the news readers), unless
you had a third file that mapped "article number" to header file
offset.  (You need to be able to remove expired articles and close up
holes in the header file, lest it grow without bound.)
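
(For what it's worth, such a mapping file could be nothing fancier than
fixed-size pairs; the following is purely hypothetical, not anything any
current reader uses:)

#include <stdio.h>

/* One entry per article: where its record currently sits in the
 * per-group header file; the offset changes whenever expire compacts. */
struct map_ent {
    long artnum;        /* article number, as seen in the .newsrc */
    long hdr_offset;    /* current offset of its header record    */
};

/* Return the header-file offset for an article number, or -1. */
long lookup_article(FILE *map, long artnum)
{
    struct map_ent e;

    rewind(map);
    while (fread(&e, sizeof e, 1, map) == 1)
        if (e.artnum == artnum)
            return e.hdr_offset;
    return -1;
}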

Also having one article/file makes hundreds of Unix utilities more useful
than they'd otherwise be (doing grep on a notes "text" file brings no joy;
I've done it).
	/r$
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.

chuq@plaid.Sun.COM (Chuq Von Rospach) (03/24/88)

>It can get you some speed benefits, but makes updates (e.g., expire,
>cancelled articles, new articles) much harder, and some things much
>slower.  Currently, to add a new article, you need merely lock the active
>file for a moment to bump a number.  This way you'd have to lock two big
>files for a significant amount of time.

At one point, I circulated a preliminary design that did something similar.
Basically, you took the batch file, and instead of splitting it up, you
simply stuck filename and lseek pointers into the history file for each
article. No unpacking overhead at all, and significantly reduced filesystem
and system call overhead. 

Expiration is fairly trivial, too. Once a batch file gets over a certain age,
you zap it and update the pointers.

It got shot down then for a number of reasons, generally good ones. 

o You lose the "Expires:" header. Stuff that is supposed to stay longer
	can't.

o You lose adjustable expirations. You can't expire talk.* faster, because
	it's all stuck in with everything else. Not useful for small disk
	systems (and this is back when usenet traffic was really heavy, like
	almost 8 megabytes a month!)

o It isn't clean for locally posted or non-batched articles. At the simplest
	layer, they're simply batches with single articles. But if you've 
	got lots of local posting or non-batched articles floating around,
	the system degenerates into a setup WORSE than the current system,
	because the tree is completely flat. ooph.

One possibility is to create three files for every newsgroup: a pointer file,
an index file, and a data file.

The pointer file has three pieces of data: the message number, the lseek
pointer, and the Subject line (maybe a fourth, the poster). You add new
entries to the bottom of the file; once a day a program runs that expires
articles by zapping the data and rewriting the pointer file with the updated
pointers.

The fun part is the data. You want to avoid doing a copy of the data file
each night. So rather than brute force, you get fancy. Since the file is
split up into known size blocks, you can simply delete an article in place
and mark the block free. When a new article comes in, it looks for a first
fit (or best fit, I don't care) in the free space and writes itself into the
block, using the index file to look for free space without having to
linearly search the data file (perhaps the index file could be put in the
data file in the block header info....). If there's no space, append it to
the end and increase the size of the file.
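
Just to make the bookkeeping concrete (this is a toy sketch, not code
from any news release), a first-fit free list over the data file could
be as simple as:

#include <stdlib.h>

/* A hole is an (offset, length) gap left behind by an expired article. */
struct hole {
    long offset, length;
    struct hole *next;
};

static struct hole *freelist;
static long file_end;                    /* current size of the data file */

/* Return the offset at which an article of 'need' bytes should be written. */
long alloc_block(long need)
{
    struct hole **hp, *h;
    long off;

    for (hp = &freelist; (h = *hp) != NULL; hp = &h->next) {
        if (h->length >= need) {         /* first fit */
            off = h->offset;
            h->offset += need;
            h->length -= need;
            if (h->length == 0) {        /* hole entirely used up */
                *hp = h->next;
                free(h);
            }
            return off;
        }
    }
    file_end += need;                    /* no hole fits: grow the file */
    return file_end - need;
}

/* Deleting an article in place just puts its block back on the free list. */
void free_block(long offset, long length)
{
    struct hole *h = malloc(sizeof *h);

    h->offset = offset;
    h->length = length;
    h->next = freelist;
    freelist = h;
}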

This would require a program that would allow an admin to scrunch the file
as well, to zap holes and fragmentation. But unless you get into a space
crunch, you shouldn't have to run it.

Thinking about it, this has another interesting advantage. In theory, you
could get rid of the expiration phase completely. Instead, set a maximum
size for each newsgroup. When a new message comes in, if it doesn't fit in
the file, the installation program goes to the oldest message and deletes
it, and continues to delete messages until it does fit. This requires a more
sophisticated indexing scheme than above, because you'd likely want to get
into garbage collection and compacting fragmented space. But ALL of the
maintenance of the files would be done at article creation time, with no
midnight processing required. 

Sounds like a neat masters thesis, if you ask me.

comments? 


Chuq Von Rospach			chuq@sun.COM		Delphi: CHUQ

                               Speed it up. Keep it Simple. Ship it on time.
                                                            -- Bill Atkinson

sl@van-bc.UUCP (pri=-10 Stuart Lynne) (03/27/88)

In article <10150@ncc.UUCP> lyndon@ncc.UUCP (Lyndon Nerenberg) writes:
>In article <649@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:
>
>   A major problem with the current system is scanning article headers
>   in many separate files.  (And unix doesn't like big directories to boot.)
>   My idea is to have 2 files per newsgroup directory (other than sub-
>   directories).  All headers for a newsgroup would be in one file
>   with offsets into another file containing all articles.


>There is some potential for trouble on System V implementations using
>this method. Most USG systems are configured with a ulimit of 1 meg.
>If you exceed this limit within a particular newsgroup, articles will
>be lost when the append to the file fails.
>
>There are two solutions: 
>
>	1) Increase the ulimit by reconfiguring the kernel, or
>	2) Make rnews setuid to root, allowing it to increase
>	   its ulimit value.
>

Actually there is a third solution. The original suggestion was to store
headers in one file and all articles in a second file, the rationale being
to significantly reduce the use of inodes, processing time for news readers,
etc.

This could be modified to be n+1 files: the first file stores the headers,
plus the offset *and* file name for each article. Instead of storing all
articles in one file, simply spread them among several (n) files. Several
different types of criteria could be applied to place the articles among
the files:

	1. a new file every time the 1MB ulimit was in danger of being breached
	2. a new file every day/week/arbitrary period
	3. multiple files based on keyword/subject/other sorting criteria
	4. a new file for each article

The fourth option is essentially the status quo, except that the history
file moves from /usr/lib/news/history to be distributed into the news spool
directories.
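
(Purely as illustration, a record in that first file might look like
the following; the names are made up:)

#include <time.h>

struct spread_rec {
    char   artfile[32];   /* which of the n article files to open      */
    long   offset;        /* lseek() offset of the article in artfile  */
    long   length;        /* length of the article in bytes            */
    time_t arrived;       /* arrival time, for expiry decisions        */
};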

I do see some problems with cross-posting, but certainly they wouldn't be
insurmountable.



-- 
{ihnp4!alberta!ubc-vision,uunet}!van-bc!Stuart.Lynne Vancouver,BC,604-937-7532

gnu@hoptoad.uucp (John Gilmore) (03/27/88)

lyndon@ncc.UUCP (Lyndon Nerenberg) wrote:
> There is some potential for trouble on System V implementations using
> this method. Most USG systems are configured with a ulimit of 1 meg.

I think this is considered a bug by one and all.  (At least by me!)
If we distribute a netnews release that requires that the
bug be fixed, the *&^$%# vendors who ship systems with this bug will
scramble to fix it.

Saying "we shouldn't speed up netnews with a new architecture because of a
System V bug" is like saying "we shouldn't run netnews in the current
architecture because a System V bug loses inodes when you do".  The
solution is to fix the bug, not to stop working on making netnews faster
and easier to manage.

I am pretty happy with the speed of "rnews" under C news.  Where I want
more performance is on "expire"; it takes hoptoad more than an hour to
run a single expire each night, and disk response on that drive goes to
hell.  I *think* the problem is in the dbm code.  Note that expire
spends most of its time handling entries for already-expired messages,
since typically the message-ID-retention is set to e.g. 28 days while
the expire threshold is at 14 or 10 or 8.  Nevertheless, it seems to grind
up the disk even while processing these, and all it's doing is reading
a line, doing an ftell(), writing the line out, and adding a dbm entry.

If somebody gets fired up to rewrite dbm in the public domain, please
make an "in core" option where it won't write to the disk at all, just
malloc the blocks it wants, and keep track of where it wants to put
them.  My dbm files are under 3MB total, and if I can keep most of that
in memory while building them, we should be able to zip right through
the expired stuff in no time, without moving the disk heads at all.
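
Just to show the shape of the idea (this is not dbm and not code from
any news release, only a sketch of what an "in core" pass could do):
hash the Message-IDs into memory and never touch the disk until done.

#include <stdlib.h>
#include <string.h>

#define NBUCKET 8191

struct ent {
    char *key;
    struct ent *next;
};

static struct ent *bucket[NBUCKET];

static unsigned hashval(const char *s)
{
    unsigned h = 0;

    while (*s != '\0')
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKET;
}

/* Remember a Message-ID in core; returns 1 if it was already there. */
int remember(const char *msgid)
{
    unsigned h = hashval(msgid);
    struct ent *e;

    for (e = bucket[h]; e != NULL; e = e->next)
        if (strcmp(e->key, msgid) == 0)
            return 1;                    /* duplicate */
    e = malloc(sizeof *e);
    e->key = strdup(msgid);
    e->next = bucket[h];
    bucket[h] = e;
    return 0;
}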
-- 
{pyramid,ptsfa,amdahl,sun,ihnp4}!hoptoad!gnu			  gnu@toad.com
		"Watch me change my world..." -- Liquid Theatre

geoff@utstat.uucp (Geoff Collyer) (03/28/88)

I rebuilt the dbm history files from the ASCII history file on dalcs
(a Sun 4 with 32Mb of main memory) in 23 seconds elapsed time.  dalcs
gets a full news feed and it took five minutes just to build the ASCII
history file using find and some small filters.  With 32Mb of memory,
the buffer cache was 3.2Mb, and that gives essentially the same speedup
that John Gilmore proposed to achieve by modifying dbm itself.

I recently increased the buffer cache on our file servers to 25% of
available memory rather than 10%, as the servers used to have plenty of
free memory, even at busy times.  (Those of you on Suns can just patch
_bufpages in /vmunix to be the number of pages to devote to the cache,
then reboot.)  I haven't timed expire recently and in any case we get
a small news feed here.
-- 
Geoff Collyer	utzoo!utstat!geoff, utstat.toronto.{edu,cdn}!geoff

davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) (03/28/88)

In article <4246@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
|  [...]
| Saying "we shouldn't speed up netnews with a new architecture because of a
| System V bug" is like saying "we shouldn't run netnews in the current
| architecture because a System V bug loses inodes when you do".  The
| solution is to fix the bug, not to stop working on making netnews faster
| and easier to manage.

You stated that pretty poorly. Obviously having the ulimit (or any other
configurable parameter) set to an inappropriate value is a "vendor
problem" rather than a "System V bug." There is a fix for that at boot
time, but I don't remember what it is just now, although I used it some
time ago.

I think it would be unwise to speed up expire or anything else by using
3-4 MB of memory (forgive me if I misread your intention). There are a
LOT of machines which don't have that much memory, real or virtual. The
use of a dbm allows fast access to data too large for memory.

AT&T states that Xenix represents 60+% of all systems sold (by systems,
not users). Add in the unix-pc and other small boxes, and it seems that
there are more systems which would have trouble than would not. Add in
the VAXen which wouldn't like someone running a 3-4MB program even with
virtual memory, and it's probably not a good idea.

I think news has done a good job of balancing efficiency with
usefulness, and I would hate to have a new version not usable on all the
PDP-11's, AT's, unix-pc's, small VAXen, etc. There are a lot of diskless
workstations which wouldn't do well paging over a network, either.

I hope I misread your comment, or that you were making it as an
observation, rather than what was really intended or desirable. If
programs on large memory machines can run faster, great, I will use all
the memory in my personal system, while sparing the little box I use at
work.

-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

lyndon@ncc.UUCP (Lyndon Nerenberg) (03/28/88)

In article <4246@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> lyndon@ncc.UUCP (Lyndon Nerenberg) wrote:
> > There is some potential for trouble on System V implementations using
> > this method. Most USG systems are configured with a ulimit of 1 meg.
> 
> I think this is considered a bug by one and all.  (At least by me!)
> If we distribute a netnews release that requires that the
> bug be fixed, the *&^$%# vendors who ship systems with this bug will
> scramble to fix it.

I doubt that very much - if the vendors paid any attention to what we
say on the net there wouldn't be any bugs left :-)

I agree that ulimit is braindamaged. I *love* my new 3/280 :-)
However, I still try to write my code so it will work on these
braindead machines. After all, I spent four years working on one,
and I have to feel sorry for those people who (for whatever reason)
can't upgrade. Fortunately, many of the people who posted source
to the net tried to make their code portable (no flexnames, properly
ifdef'd job control stuff, etc) - to them I am forever grateful!

Does anyone have *ACCURATE* numbers showing the number of USG 5.x (x < 5.3)
machines vs. BSD 4.x (x > 1)?  I have a funny feeling that there are
a lot more of these low IQ boxes out there than anyone imagines. I
don't think it's fair to cut off 40% of the net just because AT&T
is out to lunch...

-- 
lyndon  {alberta,uunet}!ncc!lyndon  lyndon%ncc@uunet.uu.net

gnu@hoptoad.uucp (John Gilmore) (03/28/88)

davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) wrote:
>                                Obviously having the ulimit (or any other
> configurable parameter) set to an inappropriate value is a "vendor
> problem" rather than a "System V bug."

Not if in the default System V, that AT&T supplies to all the vendors,
the parameter is configured wrong.  That's a bug.  It's a bug that Sun
ships SunOS configured to run out of text table entries if you bring up
seven windows and run the compiler.  Just because you can fix it
without sources doesn't mean it isn't a bug.  If it's a "vendor problem"
howcum all the Sys V vendors have the problem?

The default file size limit should always be no limit.  People who want
a limit can always reduce it in their .login file, but the way it's
implemented you can't *increase* it if you don't like it, the way you
can in BSD.  I don't see why J Random User would ever want to limit the
size of a file he can manipulate though.  It's useless as a system
management tool; somebody who wants to fill the file system can always
just create N files, each of 1 megabyte.  I've heard the stories about
saving yourself from runaway processes filling the disk; tell me, when
is the last time that happened to *you*?  And it only took one "rm"
command to fix it, right?  Ulimit is just a hassle with no benefit, aka a bug.

> I think it would be unwise to speed up expire or anything else by using
> 3-4 MB of memory (forgive me if I misread your intention). There are a
> LOT of machines which don't have that much memory, real or virtual.

My intention was to make an *option* to use large virtual memory to turn
a 1.5 hour process into a 10 minute process.  People on small machines
are welcome to thrash their disk heads for 90 minutes; I just don't
see any reason to make my (4MB physical memory) Sun do that, when a
pretty simple option in the code would eliminate it.

======

On the ulimit question, I find it really hard to believe that *anybody*
seriously thinks we should write all our applications such that they
will never, under any circumstances, produce files larger than 1MB,
because AT&T busted one of its releases.  While we're at it, let's make
sure that nobody is allowed to type any lines wider than 80 columns,
and build file systems that can only hold 16 megabytes before you have
to partition your disk.  Since AT&T doesn't support TCP/IP, let's tear
down the Internet, toss NNTP, and go back to 300 baud modems.  No,
wait!  Let's build a whole computer so brain damaged that you can never
use more than 64K of data at once, even though the machine can have
megabytes of main memory!  Oh...your hardware has lots of disk and
columns and networks and main memory?  No problem, we'll get AT&T to
set a limit in software!  Now we're getting somewhere...    :-)
-- 
{pyramid,ptsfa,amdahl,sun,ihnp4}!hoptoad!gnu			  gnu@toad.com
		"Watch me change my world..." -- Liquid Theatre

paul@vixie.UUCP (Paul Vixie Esq) (03/30/88)

In article <4265@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>					While we're at it, let's make
>sure that nobody is allowed to type any lines wider than 80 columns,
>and build file systems that can only hold 16 megabytes before you have
>to partition your disk.

Hey, John, as long as we're going to improve things, let's set a 14-character
limit on filenames.  Let's make sure that you have to be super-user to move
a subdirectory into a different parent.  Can't be too careful, you know.

I wish I felt :-) about this... But I don't.
-- 
Paul A Vixie Esq
paul%vixie@uunet.uu.net
{uunet,ptsfa,hoptoad}!vixie!paul
San Francisco, (415) 647-7023

davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) (03/30/88)

In article <4265@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
| [...]
| My intention was to make an *option* to use large virtual memory to turn
| a 1.5 hour process into a 10 minute process.  People on small machines
| are welcome to thrash their disk heads for 90 minutes; I just don't
| see any reason to make my (4MB physical memory) Sun do that, when a
| pretty simple option in the code would eliminate it.

I think that's what I said in part of my posting you didn't quote. What
I want to avoid is *requiring* a lot of memory, not using it if you have
it.

| On the ulimit question, I find it really hard to believe that *anybody*
| seriously thinks we should write all our applications such that they
| will never, under any circumstances, produce files larger than 1MB,
| because AT&T busted one of its releases.  While we're at it, let's make
| sure that nobody is allowed to [ lines and lines of stuff ]

You can approach this in (at least) three ways. You can set the program
so that it will never produce a large file (which I didn't propose), so that
it will seldom produce a large file (which I do), or so that it will almost
always produce a large file (such as putting all news for all groups in
one file or somesuch), which isn't proposed at the moment, either.

The choice of setting options with restrictive limits or not is a matter
of opinion, and not a technical issue. For that reason I don't feel that
either of us will convince the other, although I still feel that your
classification of a choice you don't like as a "bug" might mislead
someone into thinking that it doesn't work as documented.

The consequences of having a low limit are that a few programs may fail. 
A high limit will result in the first program hung in a write
loop running a filesystem out of space.  If this is a user filespace, other
users won't be able to save their work, while if it's the system tmp
space, some compilers and editors will stop working.  ulimit is not
useful against a malicious user who wants to hurt the system, only for
the klutz who loops a program (like someone learning C or F77, perhaps). 

-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

wcs@ho95e.ATT.COM (Bill.Stewart.<ho95c>) (03/30/88)

In article <4265@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
:it takes hoptoad more than an hour to run a single expire each night,

	ho95e (an ancient 3B20, with 1MB ulimit) expires in 10 minutes.
This is for expire -e6 -E10 -n !comp ; on even-numbered nights
I expire everything with -e10, which probably takes twice as long.
I'm running 2.11.8, with no -dbm (sigh; why isn't it in SysV? It was in V7),
so I'm using the big-history-file method.  (A 3B20 is a 1-MIPS bit-sliced
machine, built when we were The Phone Company and 4MB was very big.)

: [..].  I *think* the problem is in the dbm code.  Note that expire
:spends most of its time handling entries for already-expired messages,
:since typically the message-ID-retention is set to e.g. 28 days while
:the expire threshold is at 14 or 10 or 8.

Is it really necessary to retain message-IDs past about 14 days?  The
news.lists postings indicate 99% of everything reaches uunet in 5 days;
surely there aren't too many 15-day-long loops any more?  Probably better
to keep the expire time down and tolerate the <.1% duplication load.
(Admittedly, one reason there aren't many delay loops is that the
backbone machines probably keep a lot of history.)  For ho95e, the
retention limit is forced on us by ulimit; we get history file overflows
if we keep records longer than 11 days, and VOLUME is up.

In article <4246@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
: [various true things about SysV ulimit braindamage]
: [maybe if everyone bitches at their vendors simultaneously they'll fix it]
	Naaah...  At least SVR3 gives you configurable amounts of braindamage.
-- 
#				Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs
# So we got out our parsers and debuggers and lexical analyzers and various 
# implements of destruction and went off to clean up the tty driver...

heiby@falkor.UUCP (Ron Heiby) (03/30/88)

John Gilmore (gnu@hoptoad.uucp) writes:
> davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) wrote:
> >                                Obviously having the ulimit (or any other
> > configurable parameter) set to an inappropriate value is a "vendor
> > problem" rather than a "System V bug."
> 
> If it's a "vendor problem" howcum all the Sys V vendors have the problem?

Proof by counter-example:  Motorola Microcomputer Division's VME Delta
Series computers use System V Release 3.0 and have "ulimit" as a tunable
parameter.  So, you see, at least one vendor has "fixed" this "bug".

(A bit of history:  When AT&T entered the computer business (commercially),
some of us (I used to work there.) saw the 1Meg ulimit as a bad thing.  Over
the course of about two years, we managed to get UNIX development to make
it a tunable.  It is tunable in AT&T's Release 3.1.  The value is referenced
in a total of ONE place in the kernel.)
-- 
Ron Heiby, heiby@mcdchg.UUCP	Moderator: comp.newprod & comp.unix
"I believe in the Tooth Fairy."  "I believe in Santa Claus."
	"I believe in the future of the Space Program."

david@ms.uky.edu (David Herron -- Resident E-mail Hack) (03/31/88)

In article <2090@ho95e.ATT.COM> wcs@ho95e.UUCP (46323-Bill.Stewart.<ho95c>,2G218,x0705,) writes:
>Is it really necessary to retain message-IDs past about 14 days?  The
>news.lists postings indicate 99% of everything reaches uunet in 5 days;
>surely there aren't too many 15-day-long loops any more?  Probably better
>to keep the expire time down and tolerate the <.1% duplication load.
>(Admittedly, one reason there aren't many delay loops is that the
>backbone machines probably keep a lot of history.)  For ho95e, the
>retention limit is forced on us by ulimit; we get history file overflows
>if we keep records longer than 11 days, and VOLUME is up.

I had the chance recently to be looking closely at what news was
coming through the system and saw "quite a bit" of stuff which was
over a month old.
-- 
<---- David Herron -- The E-Mail guy            <david@ms.uky.edu>
<---- or:                {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- I don't have a Blue bone in my body!

gerry@syntron.UUCP (G. Roderick Singleton) (03/31/88)

In article <4265@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) wrote:
>>                                Obviously having the ulimit (or any other
>> configurable parameter) set to an inappropriate value is a "vendor
>> problem" rather than a "System V bug."
>
>Not if in the default System V, that AT&T supplies to all the vendors,

 [almost whole message deleted]

>wait!  Let's build a whole computer so brain damaged that you can never
>use more than 64K of data at once, even though the machine can have
>megabytes of main memory!  Oh...your hardware has lots of disk and

 [smiley and 2 lines deleted ]

Smiley or not,  you'll have to get ATT to do something else.  IBM has
already created this monster and sold it to an unsuspecting public.
Oh well, ATT can try too :-)

ger

-- 
G. Roderick Singleton, Technical Services Manager
{ syntron | geac | eclectic }!gerry
"ALL animals are created equal, BUT some animals are MORE equal than others."
George Orwell

rick@seismo.CSS.GOV (Rick Adams) (04/01/88)

95% of the sites should not bother with keeping a history of expired
articles. However, I think it is important that "major" sites keep the
old history. They usually have the resources to do it, and by doing so
will "defend" the "minor" sites from old articles.

--rick

emv@starbarlounge (Edward Vielmetti) (04/01/88)

In article <44282@beno.seismo.CSS.GOV> rick@seismo.CSS.GOV (Rick Adams) writes:
>95% of the sites should not bother with keeping a history of expired
>articles. However, I think it is important that "major" sites keep the
>old history. They usually have the resources to do it, and by doing so
>will "defend" the "minor" sites from old articles.
>
>--rick                        

How many days of expired articles are necessary?  Or, to put it another way,
how many do you keep lying around?  I think mailrus is at about 30 days
right now.


Edward Vielmetti, U of Michigan mail group.

matt@oddjob.UChicago.EDU (My Name Here) (04/02/88)

Rick Adams writes:
) ... I think it is important that "major" sites keep the old history.
) ... by doing so [they] will "defend" the "minor" sites from old articles.

It depends how you define "major".  Twice last month oddjob received
very large hiccups of old articles.  One seemed to be from the west
coast and one from the east.  As far as I could tell, each hiccup
went several hops to get to oddjob and never passed through one of
Spafford's "official" backbone sites.

				Matt Crawford

farren@gethen.UUCP (Michael J. Farren) (04/02/88)

In article <46774@sun.uucp> chuq@sun.UUCP (Chuq Von Rospach) writes:

[Discussing a means whereby the batch input files are held whole, while
 the history file contains pointers to it]

>o You lose the "Expires:" header. Stuff that is supposed to stay longer
>	can't.

Why?  If you modify the history file format (which you're going to have
to do anyhow), you could simply add an "expire-date" field, which could,
if you want to get fancy, be either the article's own expire date, or
a calculated one based on how long your site wants to keep that article.
Then, running expire is a simple scan of the history file, zapping the
pointers that are out-of-date.  Once in a while, you'd want to do a
cross-check to see if all of the articles in a given batch are expired,
and if so, then remove that batch file.  This could be made quite
efficient if the batching software pre-sorted the batches into
some newsgroup hierarchical order, such as soc. batches, comp. batches,
etc.
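
As a sketch (assuming a made-up history format whose last tab-separated
field is that expire date, in seconds since the epoch), the scan is just:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy history lines through, dropping any whose expire date (the last
 * tab-separated field) has already passed. */
int prune_history(FILE *in, FILE *out)
{
    char line[BUFSIZ];
    time_t now = time((time_t *)0);
    char *tab;

    while (fgets(line, sizeof line, in) != NULL) {
        tab = strrchr(line, '\t');
        if (tab != NULL && (time_t)atol(tab + 1) < now)
            continue;                    /* expired: zap the pointer */
        fputs(line, out);
    }
    return 0;
}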

>o You lose adjustable expirations. You can't expire talk.* faster, because
>	it's all stuck in with everything else.

See above.  Nothing says you can't expire some articles differently than
others; it's just a matter of when you zap the history file (read: index).
And if you're batching in discrete groups, you can just expire entire
batch files, instead of individual articles (presuming you are using a
local expiration, rather than the date in the article).

>o It isn't clean for locally posted or non-batched articles. At the simplest
>	layer, they're simply batches with single articles. But if you've 
>	got lots of local posting or non-batched articles floating around,
>	the system degenerates into a setup WORSE than the current system,
>	because the tree is completely flat. ooph.

True.  However, locally posted articles could be held in a temporary
holding pattern, and the batch files generated when they're batched for
transmission could then replace them, and be handled just like the others.
Or, you could batch them up as they are posted, closing the batch file
and opening a new one when the first one got too large.

Non-batched articles would be a special case, and would have to be
batched as they arrived, perhaps in a special "non-batched" batch, for
local use only.  If you're providing a full feed, they'd just get batched
up for the next site down anyhow.   Also - how many non-batched articles
does a typical site see?  I haven't seen any for months, but I don't know
if I'm typical or not.

You do lose some stuff with a scheme like this, such as the easy ability
to manipulate individual articles (you'd have to extract them individually,
which is a loss of efficiency), but you'd also gain some.  You would no
longer necessarily have to maintain a fairly enormous directory tree -
batches could conceivably be kept in a much more compact structure.  If
the history file contains the Subject: line, you could build a utility
quite easily which would allow "K"illing articles by each user in a
much more efficient manner than the present one of looking at each
individual article.  And if you had enormous amounts of CPU (well, I
can dream, can't I? :-) you could even implement some sort of
compression scheme, allowing you to keep a lot more articles on-line
at any given time, or use less disk, whichever you preferred.

Cross-posted articles, by the way, would just be duplicate pointers
to the same batched article.  A little less efficient than the present,
but not bad.

-- 
Michael J. Farren             | "INVESTIGATE your point of view, don't just 
{ucbvax, uunet, hoptoad}!     | dogmatize it!  Reflect on it and re-evaluate
        unisoft!gethen!farren | it.  You may want to change your mind someday."
gethen!farren@lll-winken.llnl.gov ----- Tom Reingold, from alt.flame