stuart@bms-at.UUCP (Stuart D. Gathman) (03/23/88)
I have been using the B news software for about 2 years now. It is very useful for in-house as well as Usenet purposes. I have had some ideas about implementing the article storage. Please send flames via E-mail. (Really!)

A major problem with the current system is scanning article headers in many separate files. (And unix doesn't like big directories, to boot.) My idea is to have 2 files per newsgroup directory (other than sub-directories). All headers for a newsgroup would be in one file, with offsets into another file containing all the articles. Processing incoming news would then be faster, and programs like 'vn' would be orders of magnitude faster.

The only problem is 'expire'. I maintain that 'expire' would still be reasonable. It would work by reading the header file and writing a new version for each newsgroup. It can seek past articles to be deleted while copying the ones to be retained to a new file. When finished, move the new versions into place. This needs to be done only one newsgroup at a time, so there is no disk space problem. Not only that, but no history file is needed! (The actual arrival time can be stored in the header file if desired.)

Since expire is run on a batch basis, its decreased performance is not a problem. (At least compared with the current slow performance of interactive programs.) The time to read the header file is comparable to that required to read the current history files. It can be reduced by storing only some headers in the header file; the rest can be stored in the article file.

Checking an incoming article to see if it is already present would be faster than the current SysV scheme of 10 history files, because a single header file will be much smaller than a tenth of the history files. (A dbm file could still be used also.) This assumes that the newsgroup information is consistent in duplicate articles. If only the article ID can be relied on, a history database of some form is still necessary.
This method also uses fewer inodes, which matters on small systems where inodes are scarce. The only improvement on this method I can think of would be to introduce an indexing file structure that handles variable-length records, but that loses the advantage of simplicity.
-- 
Stuart D. Gathman	<stuart@bms-at.uucp>
			<..!{vrdxhq|daitc}!bms-at!stuart>
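The two-files-per-newsgroup layout can be sketched concretely. The following C fragment is a minimal illustration only, not anything from the actual B news sources; the record fields and sizes are assumptions invented for the example:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Hypothetical fixed-size index record, one per article, kept in the
 * per-newsgroup header file.  The article bodies live in a second file
 * and are located via (offset, length). */
struct hdr_rec {
    long offset;            /* lseek() offset into the article data file */
    long length;            /* article length in bytes */
    long arrival;           /* arrival time, replacing the history file */
    char msgid[64];         /* Message-ID, for duplicate detection */
};

/* The expire pass described above: copy the records worth keeping into
 * a new header file, skipping the rest.  Returns the number retained. */
int expire_copy(const struct hdr_rec *in, int n,
                struct hdr_rec *out, long cutoff)
{
    int i, kept = 0;
    for (i = 0; i < n; i++)
        if (in[i].arrival >= cutoff)
            out[kept++] = in[i];
    return kept;
}
```

Since only one newsgroup's header file is rewritten at a time, the temporary disk-space cost stays bounded, which is the point made in the post.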
lyndon@ncc.UUCP (Lyndon Nerenberg) (03/24/88)
In article <649@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:
> A major problem with the current system is scanning article headers
> in many separate files. (And unix doesn't like big directories to boot.)
> My idea is to have 2 files per newsgroups directory (other than sub-
> directories). All headers for a news group would be in one file
> with offsets into another file containing all articles.
> Processing incoming news would then be faster. Programs like 'vn' would
> be orders of magnitude faster. The only problem is 'expire'. I maintain
> that 'expire' would still be reasonable. It would work by reading the
> header file and writing a new version for each newsgroup. It can seek
> past articles to be deleted while copying the ones to be retained
> to a new file. When finished, move the new versions into place. This
> needs to be done only one newsgroup at a time, so there is no disk
> space problem.
There is some potential for trouble on System V implementations using
this method. Most USG systems are configured with a ulimit of 1 meg.
If you exceed this limit within a particular newsgroup, articles will
be lost when the append to the file fails.
There are two solutions:
1) Increase the ulimit by reconfiguring the kernel, or
2) Make rnews setuid to root, allowing it to increase
its ulimit value.
In many cases, 1) is impossible (or very difficult), depending on
how the vendor has packaged their USG port. Also, an administrator
may have a good reason for maintaining a small(er) ulimit value.
As for making rnews setuid to root, I have to be a bit nervous about
the idea given the recent discussion on setuid security. I would want
to look at the rnews source quite closely before implementing it in
this manner.
Not that I'm trying to knock the idea - it looks promising.
--
--lyndon {alberta,uunet}!ncc!lyndon lyndon%ncc@uunet.uu.net
wunder@hpcea.CE.HP.COM (Walter Underwood) (03/24/88)
/ hpcea:news.software.b / stuart@bms-at.UUCP (Stuart D. Gathman) / 12:17 pm  Mar 22, 1988 /
My idea is to have 2 files per newsgroups directory (other than sub-
directories). All headers for a news group would be in one file
with offsets into another file containing all articles.
----------

Congratulations. You have just invented Notesfiles. Of course, Ray Essick invented it in 1981.

Didn't somebody else re-invent uuencode two weeks ago?

wunder
rsalz@bbn.com (Rich Salz) (03/24/88)
>In article <649@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:
> A major problem with the current system is scanning article headers
> in many separate files. (And unix doesn't like big directories to boot.)
> My idea is to have 2 files per newsgroups directory (other than sub-
> directories). All headers for a news group would be in one file
> with offsets into another file containing all articles.

This is basically how notes stores stuff. It can get you some speed benefits, but it makes updates (e.g., expire, cancelled articles, new articles) much harder, and some things much slower.

Currently, to add a new article, you need merely lock the active file for a moment to bump a number. This way you'd have to lock two big files for a significant amount of time. You'd also have to change the .newsrc (and all the news readers), unless you had a third file that mapped "article number" to header file offset. (You need to be able to remove expired articles and close up holes in the header file, lest it grow without bound.)

Also, having one article per file makes hundreds of Unix utilities more useful than they'd otherwise be (doing grep on a notes "text" file brings no joy; I've done it).
	/r$
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
chuq@plaid.Sun.COM (Chuq Von Rospach) (03/24/88)
>It can get you some speed benefits, but makes updates (e.g., expire,
>cancelled articles, new articles) much harder, and some things much
>slower. Currently, to add a new article, you need merely lock the active
>file for a moment to bump a number. This way you'd have to lock two big
>files for a significant amount of time.

At one point, I circulated a preliminary design that did something similar. Basically, you took the batch file, and instead of splitting it up, you simply stuck filename and lseek pointers into the history file for each article. No unpacking overhead at all, and significantly reduced filesystem and system call overhead. Expiration is fairly trivial, too: a batch file gets over a certain age, you zap it and update the pointers.

It got shot down then for a number of reasons, generally good ones:

o You lose the "Expires:" header. Stuff that is supposed to stay longer
  can't.

o You lose adjustable expirations. You can't expire talk.* faster, because
  it's all stuck in with everything else. Not useful for small disk
  systems (and this is back when usenet traffic was really heavy, like
  almost 8 megabytes a month!)

o It isn't clean for locally posted or non-batched articles. At the
  simplest layer, they're simply batches with single articles. But if
  you've got lots of local postings or non-batched articles floating
  around, the system degenerates into a setup WORSE than the current
  system, because the tree is completely flat. ooph.

One possibility is to create three files for every newsgroup: a pointer file, an index file, and a data file. The pointer file has three pieces of data: the message number, the lseek pointer, and the Subject line (maybe a fourth, the poster). You add new entries to the bottom of the file, and once a day a program runs that expires articles by zapping the data and rewriting the pointer file with the updated pointers. The fun part is the data: you want to avoid doing a copy of the data file each night.
So rather than brute force, you get fancy. Since the file is split up into known-size blocks, you can simply delete an article in place and mark the block free. When a new article comes in, it looks for a first fit (or best fit, I don't care) in the free space and writes itself into the block, using the index file to look for free space without having to linearly search the data file (perhaps the index file could be put in the data file, in the block header info...). If there's no space, append it to the end and increase the size of the file.

This would require a program that would allow an admin to scrunch the file as well, to zap holes and fragmentation. But unless you get into a space crunch, you shouldn't have to run it.

Thinking about it, this has another interesting advantage. In theory, you could get rid of the expiration phase completely. Instead, set a maximum size for each newsgroup. When a new message comes in, if it doesn't fit in the file, the installation program goes to the oldest message and deletes it, and continues to delete messages until it does fit. This requires a more sophisticated indexing scheme than above, because you'd likely want to get into garbage collection and compacting fragmented space. But ALL of the maintenance of the files would be done at article creation time, with no midnight processing required.

Sounds like a neat masters thesis, if you ask me. Comments?

Chuq Von Rospach	chuq@sun.COM		Delphi: CHUQ

Speed it up. Keep it Simple. Ship it on time. -- Bill Atkinson
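The free-space bookkeeping described above amounts to a first-fit allocator over the data file. A rough C sketch follows; the extent structure and function names are made up for illustration, since the post leaves the real index format open:

```c
#include <assert.h>

/* One free or used extent in the per-newsgroup data file, as tracked
 * by the (hypothetical) index file. */
struct extent {
    long off;       /* byte offset of the block in the data file */
    long len;       /* block size */
    int  used;      /* nonzero once an article has claimed it */
};

/* First-fit placement: take the first free hole big enough for the
 * incoming article; if none fits, append at the end and grow the file.
 * Returns the offset at which to write the article. */
long place_article(struct extent *holes, int nholes,
                   long *file_end, long need)
{
    int i;
    long at;
    for (i = 0; i < nholes; i++) {
        if (!holes[i].used && holes[i].len >= need) {
            holes[i].used = 1;          /* claim the hole */
            return holes[i].off;
        }
    }
    at = *file_end;                     /* no hole fits: grow the file */
    *file_end += need;
    return at;
}
```

The periodic "scrunch" program Chuq mentions would be the compaction pass that merges adjacent free extents and slides used blocks down, exactly as in any disk-style allocator.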
sl@van-bc.UUCP (pri=-10 Stuart Lynne) (03/27/88)
In article <10150@ncc.UUCP> lyndon@ncc.UUCP (Lyndon Nerenberg) writes:
>In article <649@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:
> A major problem with the current system is scanning article headers
> in many seperate files. (And unix doesn't like big directories to boot.)
> My idea is to have 2 files per newsgroups directory (other than sub-
> directories). All headers for a news group would be in one file
> with offsets into another file containing all articles.
>
>There is some potential for trouble on System V implementations using
>this method. Most USG systems are configured with a ulimit of 1 meg.
>If you exceed this limit within a particular newsgroup, articles will
>be lost when the append to the file fails.
>
>There are two solutions:
>
>	1) Increase the ulimit by reconfiguring the kernel, or
>	2) Make rnews setuid to root, allowing it to increase
>	   its ulimit value.

Actually there is a third solution. The original suggestion was to store headers in one file and all articles in a second file, the rationale being to reduce significantly the use of inodes, processing time for news readers, etc.

This could be modified to be n+1 files: the first file stores the headers, plus offset *and* file name for each article. Instead of storing all articles in one file, simply spread them among several (n) files. Several different types of criteria could be applied to place the articles among the files:

	1. a new file every time the 1MB ulimit was in danger of being breached
	2. a new file every day/week/arbitrary period
	3. multiple files based on keyword/subject/other sorting criteria
	4. a new file for each article

The fourth option is essentially the status quo, except for moving the history file from /usr/lib/news/history to be distributed into the news spool directories.

I do see some problems for cross-posting. But certainly they wouldn't be insurmountable.
-- 
{ihnp4!alberta!ubc-vision,uunet}!van-bc!Stuart.Lynne Vancouver,BC,604-937-7532
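Criterion 1 in the list above (roll over to a new article file before the 1MB ulimit can be breached) is simple to state in code. This is an illustrative sketch only; the constant and the bookkeeping arrays are assumptions, not part of any actual news implementation:

```c
#include <assert.h>

/* Hypothetical 1 MB USG ulimit. */
#define ULIMIT_BYTES (1024L * 1024L)

/* Given the running sizes of the n article files for a newsgroup,
 * decide which file receives the next article.  If appending to the
 * current (last) file would exceed the ulimit, start a fresh file.
 * Returns the index of the file to append to. */
int pick_file(long *sizes, int *nfiles, long article_len)
{
    int cur = *nfiles - 1;
    if (sizes[cur] + article_len > ULIMIT_BYTES) {
        cur = (*nfiles)++;      /* roll over to a new article file */
        sizes[cur] = 0;
    }
    sizes[cur] += article_len;
    return cur;
}
```

The header file would then record (file name, offset) per article rather than offset alone, which is exactly the extra field Lynne proposes.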
gnu@hoptoad.uucp (John Gilmore) (03/27/88)
lyndon@ncc.UUCP (Lyndon Nerenberg) wrote:
> There is some potential for trouble on System V implementations using
> this method. Most USG systems are configured with a ulimit of 1 meg.

I think this is considered a bug by one and all. (At least by me!) If we distribute a netnews release that requires that the bug be fixed, the *&^$%# vendors who ship systems with this bug will scramble to fix it. Saying "we shouldn't speed up netnews with a new architecture because of a System V bug" is like saying "we shouldn't run netnews in the current architecture because a System V bug loses inodes when you do". The solution is to fix the bug, not to stop working on making netnews faster and easier to manage.

I am pretty happy with the speed of "rnews" under C news. Where I want more performance is in "expire"; it takes hoptoad more than an hour to run a single expire each night, and disk response on that drive goes to hell. I *think* the problem is in the dbm code. Note that expire spends most of its time handling entries for already-expired messages, since typically the message-ID retention is set to e.g. 28 days while the expire threshold is at 14 or 10 or 8. Nevertheless, it seems to grind up the disk even while processing these, and all it's doing is reading a line, doing an ftell(), writing the line out, and adding a dbm entry.

If somebody gets fired up to rewrite dbm in the public domain, please make an "in core" option where it won't write to the disk at all, just malloc the blocks it wants and keep track of where it wants to put them. My dbm files are under 3MB total, and if I can keep most of that in memory while building them, we should be able to zip right through the expired stuff in no time, without moving the disk heads at all.
-- 
{pyramid,ptsfa,amdahl,sun,ihnp4}!hoptoad!gnu	gnu@toad.com
"Watch me change my world..." -- Liquid Theatre
geoff@utstat.uucp (Geoff Collyer) (03/28/88)
I rebuilt the dbm history files from the ASCII history file on dalcs (a Sun 4 with 32Mb of main memory) in 23 seconds elapsed time. dalcs gets a full news feed, and it took five minutes just to build the ASCII history file using find and some small filters.

With 32Mb of memory, the buffer cache was 3.2Mb, and that gives essentially the same speedup that John Gilmore proposed to achieve by modifying dbm itself. I recently increased the buffer cache on our file servers to 25% of available memory rather than 10%, as the servers used to have plenty of free memory, even at busy times. (Those of you on Suns can just patch _bufpages in /vmunix to be the number of pages to devote to the cache, then reboot.)

I haven't timed expire recently, and in any case we get a small news feed here.
-- 
Geoff Collyer	utzoo!utstat!geoff, utstat.toronto.{edu,cdn}!geoff
davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) (03/28/88)
In article <4246@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
| [...]
| Saying "we shouldn't speed up netnews with a new architecture because of a
| System V bug" is like saying "we shouldn't run netnews in the current
| architecture because a System V bug loses inodes when you do". The
| solution is to fix the bug, not to stop working on making netnews faster
| and easier to manage.

You stated that pretty poorly. Obviously having the ulimit (or any other configurable parameter) set to an inappropriate value is a "vendor problem" rather than a "System V bug." There is a fix for that at boot time, but I don't remember what it is just now, although I used it some time ago.

I think it would be unwise to speed up expire or anything else by using 3-4 MB of memory (forgive me if I misread your intention). There are a LOT of machines which don't have that much memory, real or virtual. The use of a dbm allows fast access to data too large for memory. AT&T states that Xenix represents 60+% of all systems sold (by systems, not users). Add in the unix-pc and other small boxes, and it seems that there are more systems which would have trouble than would not. Add in the VAXen which wouldn't like someone running a 3-4MB program even with virtual memory, and it's probably not a good idea.

I think news has done a good job of balancing efficiency with usefulness, and I would hate to have a new version not usable on all the PDP-11's, AT's, unix-pc's, small VAXen, etc. There are a lot of diskless workstations which wouldn't do well paging over a network, either.

I hope I misread your comment, or that you were making it as an observation rather than what was really intended or desirable. If programs on large memory machines can run faster, great; I will use all the memory in my personal system, while sparing the little box I use at work.
-- 
bill davidsen (wedu@ge-crd.arpa)
{uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
lyndon@ncc.UUCP (Lyndon Nerenberg) (03/28/88)
In article <4246@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> lyndon@ncc.UUCP (Lyndon Nerenberg) wrote:
> > There is some potential for trouble on System V implementations using
> > this method. Most USG systems are configured with a ulimit of 1 meg.
> 
> I think this is considered a bug by one and all. (At least by me!)
> If we distribute a netnews release that requires that the
> bug be fixed, the *&^$%# vendors who ship systems with this bug will
> scramble to fix it.

I doubt that very much - if the vendors paid any attention to what we say on the net there wouldn't be any bugs left :-)

I agree that ulimit is braindamaged. I *love* my new 3/280 :-) However, I still try to write my code so it will work on these braindead machines. After all, I spent four years working on one, and I have to feel sorry for those people who (for whatever reason) can't upgrade. Fortunately, many of the people who posted source to the net tried to make their code portable (no flexnames, properly ifdef'd job control stuff, etc.) - to them I am forever grateful!

Does anyone have *ACCURATE* numbers showing the number of USG 5.x (x < 3) machines vs. BSD 4.x (x > 1)? I have a funny feeling that there are a lot more of these low-IQ boxes out there than anyone imagines. I don't think it's fair to cut off 40% of the net just because AT&T is out to lunch...
-- 
lyndon {alberta,uunet}!ncc!lyndon lyndon%ncc@uunet.uu.net
gnu@hoptoad.uucp (John Gilmore) (03/28/88)
davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) wrote:
> Obviously having the ulimit (or any other
> configurable parameter) set to an inappropriate value is a "vendor
> problem" rather than a "System V bug."

Not if, in the default System V that AT&T supplies to all the vendors, the parameter is configured wrong. That's a bug. It's a bug that Sun ships SunOS configured to run out of text table entries if you bring up seven windows and run the compiler. Just because you can fix it without sources doesn't mean it isn't a bug. If it's a "vendor problem", how come all the Sys V vendors have the problem?

The default file size limit should always be no limit. People who want a limit can always reduce it in their .login file, but the way it's implemented you can't *increase* it if you don't like it, the way you can in BSD. I don't see why J Random User would ever want to limit the size of a file he can manipulate, though. It's useless as a system management tool; somebody who wants to fill the file system can always just create N files, each of 1 megabyte. I've heard the stories about saving yourself from runaway processes filling the disk; tell me, when is the last time that happened to *you*? And it only took one "rm" command to fix it, right? Ulimit is just a hassle with no benefit, aka a bug.

> I think it would be unwise to speed up expire or anything else by using
> 3-4 MB of memory (forgive me if I misread your intention). There are a
> LOT of machines which don't have that much memory, real or virtual.

My intention was to make an *option* to use large virtual memory to turn a 1.5 hour process into a 10 minute process. People on small machines are welcome to thrash their disk heads for 90 minutes; I just don't see any reason to make my (4MB physical memory) Sun do that, when a pretty simple option in the code would eliminate it.
======

On the ulimit question, I find it really hard to believe that *anybody* seriously thinks we should write all our applications such that they will never, under any circumstances, produce files larger than 1MB, because AT&T busted one of its releases. While we're at it, let's make sure that nobody is allowed to type any lines wider than 80 columns, and build file systems that can only hold 16 megabytes before you have to partition your disk. Since AT&T doesn't support TCP/IP, let's tear down the Internet, toss NNTP, and go back to 300 baud modems. No, wait! Let's build a whole computer so brain damaged that you can never use more than 64K of data at once, even though the machine can have megabytes of main memory!

Oh... your hardware has lots of disk and columns and networks and main memory? No problem, we'll get AT&T to set a limit in software! Now we're getting somewhere... :-)
-- 
{pyramid,ptsfa,amdahl,sun,ihnp4}!hoptoad!gnu	gnu@toad.com
"Watch me change my world..." -- Liquid Theatre
paul@vixie.UUCP (Paul Vixie Esq) (03/30/88)
In article <4265@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>While we're at it, let's make
>sure that nobody is allowed to type any lines wider than 80 columns,
>and build file systems that can only hold 16 megabytes before you have
>to partition your disk.

Hey, John, as long as we're going to improve things, let's set a 14-character limit on filenames. Let's make sure that you have to be super-user to move a subdirectory into a different parent. Can't be too careful, you know.

I wish I felt :-) about this... But I don't.
-- 
Paul A Vixie Esq
paul%vixie@uunet.uu.net
{uunet,ptsfa,hoptoad}!vixie!paul
San Francisco, (415) 647-7023
davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) (03/30/88)
In article <4265@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
| [...]
| My intention was to make an *option* to use large virtual memory to turn
| a 1.5 hour process into a 10 minute process. People on small machines
| are welcome to thrash their disk heads for 90 minutes; I just don't
| see any reason to make my (4MB physical memory) Sun do that, when a
| pretty simple option in the code would eliminate it.

I think that's what I said in the part of my posting you didn't quote. What I want to avoid is *requiring* a lot of memory, not using it if you have it.

| On the ulimit question, I find it really hard to believe that *anybody*
| seriously thinks we should write all our applications such that they
| will never, under any circumstances, produce files larger than 1MB,
| because AT&T busted one of its releases. While we're at it, let's make
| sure that nobody is allowed to [ lines and lines of stuff ]

You can approach this in (at least) three ways. You can set the program up so that it will never produce a large file (which I didn't propose), so that it will seldom produce a large file (which I do), or so that it will almost always produce a large file (such as putting all news for all groups in one file or somesuch), which isn't proposed at the moment, either.

The choice of setting options with restrictive limits or not is a matter of opinion, and not a technical issue. For that reason I don't feel that either of us will convince the other, although I still feel that your classification of a choice you don't like as a "bug" might mislead someone into thinking that it doesn't work as documented.

The consequences of having a low limit are that a few programs may fail. A high limit will result in having the first program hung in a write loop run a filesystem out of space. If this is a user filespace, other users won't be able to save their work, while if it's the system tmp space, some compilers and editors will stop working.

ulimit is not useful against a malicious user who wants to hurt the system, only for the klutz who loops a program (like someone learning C or F77, perhaps).
-- 
bill davidsen (wedu@ge-crd.arpa)
{uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
wcs@ho95e.ATT.COM (Bill.Stewart.<ho95c>) (03/30/88)
In article <4265@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
:it takes hoptoad more than an hour to run a single expire each night,

ho95e (an ancient 3B20, with 1MB ulimit) expires in 10 minutes. This is for "expire -e6 -E10 -n !comp"; on even-numbered nights I expire everything with -e10, which probably takes twice as long. I'm running 2.11.8, with no dbm (sigh; why isn't it in SysV? It was in V7), so I'm using the big-history-file method. (A 3B20 is a 1-MIPS bit-sliced machine, built when we were The Phone Company and 4MB was very big.)

: [..]. I *think* the problem is in the dbm code. Note that expire
:spends most of its time handling entries for already-expired messages,
:since typically the message-ID-retention is set to e.g. 28 days while
:the expire threshold is at 14 or 10 or 8.

Is it really necessary to retain message-IDs past about 14 days? The news.lists postings indicate 99% of everything reaches uunet in 5 days; surely there aren't too many 15-day-long loops any more? Probably better to keep the expire time down and tolerate the <.1% duplication load. (Admittedly, one reason there aren't many delay loops is that the backbone machines probably keep a lot of history.) For ho95e, the retention limit is forced on us by ulimit; we get history file overflows if we keep records longer than 11 days, and VOLUME is up.

In article <4246@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
: [various true things about SysV ulimit braindamage]
: [maybe if everyone bitches at their vendors simultaneously they'll fix it]

Naaah... At least SVR3 gives you configurable amounts of braindamage.
-- 
# Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs
# So we got out our parsers and debuggers and lexical analyzers and various
# implements of destruction and went off to clean up the tty driver...
heiby@falkor.UUCP (Ron Heiby) (03/30/88)
John Gilmore (gnu@hoptoad.uucp) writes:
> davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) wrote:
> > Obviously having the ulimit (or any other
> > configurable parameter) set to an inappropriate value is a "vendor
> > problem" rather than a "System V bug."
> 
> If it's a "vendor problem" howcum all the Sys V vendors have the problem?

Proof by counter-example: Motorola Microcomputer Division's VME Delta Series computers use System V Release 3.0 and have "ulimit" as a tunable parameter. So, you see, at least one vendor has "fixed" this "bug".

(A bit of history: When AT&T entered the computer business commercially, some of us saw the 1Meg ulimit as a bad thing. (I used to work there.) Over the course of about two years, we managed to get UNIX development to make it a tunable. It is tunable in AT&T's Release 3.1. The value is referenced in a total of ONE place in the kernel.)
-- 
Ron Heiby, heiby@mcdchg.UUCP	Moderator: comp.newprod & comp.unix
"I believe in the Tooth Fairy."  "I believe in Santa Claus."
"I believe in the future of the Space Program."
david@ms.uky.edu (David Herron -- Resident E-mail Hack) (03/31/88)
In article <2090@ho95e.ATT.COM> wcs@ho95e.UUCP (46323-Bill.Stewart.<ho95c>,2G218,x0705,) writes:
>Is it really necessary to retain message-IDs past about 14 days? The
>news.lists postings indicate 99% of everything reaches uunet in 5 days;
>surely there aren't too many 15-day-long loops any more? Probably better
>to keep the expire time down and tolerate the <.1% duplication load.
>(Admittedly, one reason there aren't many delay loops is that the
>backbone machines probably keep a lot of history.) For ho95e, the
>retention limit is forced on us by ulimit; we get history file overflows
>if we keep records longer than 11 days, and VOLUME is up.

I had the chance recently to be looking closely at what news was coming through the system and saw "quite a bit" of stuff which was over a month old.
-- 
<---- David Herron -- The E-Mail guy			<david@ms.uky.edu>
<----     or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- I don't have a Blue bone in my body!
gerry@syntron.UUCP (G. Roderick Singleton) (03/31/88)
In article <4265@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>davidsen@steinmetz.steinmetz.ge.com (William E. Davidsen Jr) wrote:
>> Obviously having the ulimit (or any other
>> configurable parameter) set to an inappropriate value is a "vendor
>> problem" rather than a "System V bug."
>
>Not if in the default System V, that AT&T supplies to all the vendors,
[almost whole message deleted]
>wait! Let's build a whole computer so brain damaged that you can never
>use more than 64K of data at once, even though the machine can have
>megabytes of main memory! Oh...your hardware has lots of disk and
[smiley and 2 lines deleted]

Smiley or not, you'll have to get ATT to do something else. IBM has already created this monster and sold it to an unsuspecting public. Oh well, ATT can try too :-)

ger
-- 
G. Roderick Singleton, Technical Services Manager
{ syntron | geac | eclectic }!gerry
"ALL animals are created equal, BUT some animals are MORE equal than others." George Orwell
rick@seismo.CSS.GOV (Rick Adams) (04/01/88)
95% of the sites should not bother with keeping a history of expired articles. However, I think it is important that "major" sites keep the old history. They usually have the resources to do it, and by doing so will "defend" the "minor" sites from old articles.

--rick
emv@starbarlounge (Edward Vielmetti) (04/01/88)
In article <44282@beno.seismo.CSS.GOV> rick@seismo.CSS.GOV (Rick Adams) writes:
>95% of the sites should not bother with keeping a history of expired
>articles. However, I think it is important that "major" sites keep the
>old history. They usually have the resources to do it, and by doing so
>will "defend" the "minor" sites from old articles.
>
>--rick

How many days of expired articles are necessary? Or, to put it another way, how many do you keep lying around? I think mailrus is at about 30 days right now.

Edward Vielmetti, U of Michigan mail group.
matt@oddjob.UChicago.EDU (My Name Here) (04/02/88)
Rick Adams writes:
) ... I think it is important that "major" sites keep the old history.
) ... by doing so [they] will "defend" the "minor" sites from old articles.
It depends how you define "major". Twice last month oddjob received
very large hiccups of old articles. One seemed to be from the west
coast and one from the east. As far as I could tell, each hiccup
went several hops to get to oddjob and never passed through one of
Spafford's "official" backbone sites.
Matt Crawford
farren@gethen.UUCP (Michael J. Farren) (04/02/88)
In article <46774@sun.uucp> chuq@sun.UUCP (Chuq Von Rospach) writes:
[Discussing a means whereby the batch input files are held whole, while
the history file contains pointers into them]
>o You lose the "Expires:" header. Stuff that is supposed to stay longer
>  can't.

Why? If you modify the history file format (which you're going to have to do anyhow), you could simply add an "expire-date" field, which could, if you want to get fancy, be either the article's own expire date or a calculated one based on how long your site wants to keep that article. Then running expire is a simple scan of the history file, zapping the pointers that are out of date. Once in a while, you'd want to do a cross-check to see if all of the articles in a given batch are expired, and if so, remove that batch file. This could be made quite efficient if the batching software pre-sorted the batches into some newsgroup hierarchical order, such as soc. batches, comp. batches, etc.

>o You lose adjustable expirations. You can't expire talk.* faster, because
>  it's all stuck in with everything else.

See above. Nothing says you can't expire some articles differently than others; it's just a matter of when you zap the history file (read: index). And if you're batching in discrete groups, you can just expire entire batch files instead of individual articles (presuming you are using a local expiration, rather than the date in the article).

>o It isn't clean for locally posted or non-batched articles. At the simplest
>  layer, they're simply batches with single articles. But if you've
>  got lots of local posting or non-batched articles floating around,
>  the system degenerates into a setup WORSE than the current system,
>  because the tree is completely flat. ooph.

True. However, locally posted articles could be held in a temporary holding pattern, and the batch files generated when they're batched for transmission could then replace them and be handled just like the others. Or, you could batch them up as they are posted, closing the batch file and opening a new one when the first one got too large. Non-batched articles would be a special case, and would have to be batched as they arrived, perhaps in a special "non-batched" batch, for local use only. If you're providing a full feed, they'd just get batched up for the next site down anyhow. Also - how many non-batched articles does a typical site see? I haven't seen any for months, but I don't know if I'm typical or not.

You do lose some stuff with a scheme like this, such as the easy ability to manipulate individual articles (you'd have to extract them individually, which is a loss of efficiency), but you'd also gain some. You would no longer necessarily have to maintain a fairly enormous directory tree; batches could conceivably be kept in a much more compact structure. If the history file contains the Subject: line, you could build a utility quite easily which would allow "K"illing articles by each user in a much more efficient manner than the present one of looking at each individual article. And if you had enormous amounts of CPU (well, I can dream, can't I? :-) you could even implement some sort of compression scheme, allowing you to keep a lot more articles on-line at any given time, or use less disk, whichever you preferred.

Cross-posted articles, by the way, would just be duplicate pointers to the same batched article. A little less efficient than the present, but not bad.
-- 
Michael J. Farren           | "INVESTIGATE your point of view, don't just
{ucbvax, uunet, hoptoad}!   | dogmatize it! Reflect on it and re-evaluate
unisoft!gethen!farren       | it. You may want to change your mind someday."
gethen!farren@lll-winken.llnl.gov ----- Tom Reingold, from alt.flame
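The expire pass sketched in this thread (one linear scan over a history file whose records carry their own expire date, zapping out-of-date pointers and counting survivors per batch so empty batch files can be removed) can be written in a few lines of C. All structures and field names below are invented for illustration; no such format exists in the actual news software:

```c
#include <assert.h>

/* Hypothetical history record: a pointer into a batch file, plus the
 * per-article expire date Farren proposes adding to the format. */
struct hist {
    int  batch;     /* which batch file holds the article */
    long expire;    /* site-computed or article-supplied expire date */
    int  live;      /* nonzero while the pointer is valid */
};

/* One scan of the history "index": zap entries whose date has passed,
 * and tally survivors per batch.  A batch whose tally is zero can then
 * be unlinked.  Returns the number of still-live entries. */
int expire_scan(struct hist *h, int n, long now, int *per_batch)
{
    int i, kept = 0;
    for (i = 0; i < n; i++) {
        if (h[i].live && h[i].expire <= now)
            h[i].live = 0;          /* zap the pointer */
        if (h[i].live) {
            kept++;
            per_batch[h[i].batch]++;
        }
    }
    return kept;
}
```

Because only the index is rewritten, the "Expires:" header and per-group expiry policies both survive, which answers the first two of Chuq's objections.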