[news.software.b] Is the history file really needed anymore?

david@ukma.ms.uky.csnet (David Herron, NPR Lover) (01/14/87)

I got to thinking the other morning and decided we're close to not
needing to have the history file any more.

My idea was to have a master directory in which the file names are
the article-i.d. (minus the '<' and '>').  here are the pro's and
con's I considered at the time.

1) knowing which links point to this file.  (i.e. which newsgroup
   article number pairs are associated with this article).

   The Xrefs: header takes care of that right now ... the information
   is in approximately the same format as in the history file.

2) knowing when to cancel an article.

   Well, the ctime of the inode will be able to tell us a lot about
   when to do this ... the rest would come from opening the article
   and looking for the Expires: line.  The only new limitation would
   be knowing when there was an Expires: line present which specified
   a shorter-than-default time.  I wonder if anybody ever does this?

3) knowing when articles have been canceled.

   currently this is done by having an entry in the history file that
   says it has been canceled.  I've seen some *really*old* entries
   of that sort (I don't remember which news version) ... Anyway a
   similar thing could be done by just leaving a file in the master
   directory.

4) the Unix behaviour of being quadratic on directory searches.

   hmmm ... offhand I'd say that the only program which would directly
   use the master directory is expire and it should only be run once a day.
   Further, expire could be written to be intelligent about the way
   it runs through the directories.  (i.e. not a lot of random looking
   about in the directory, but linear search ... )

5) article id's are too long for non-flexname-in-the-file-system-people.

   I'm tempted to just say "aaawwwww" but that's not fair.  Something
   could be done like split it into a hierarchy of some sort...
   (which would help in point 4 as well).  I'm thinking of the way
   terminfo keeps its terminal descriptions.

6) we'd be able to get away from dbm...

   it's a "yay" for xenix and sysv people ... they'd no longer be second
   class citizens.  we'd also be able to get away from the hackery in
   place to make sure the history file doesn't get corrupted by insertions
   happening during an expire (etc).

   [I did kind of like the comment in the 2.11 installation doc about
    how to get dbm if you don't have it already.]

Are there any other issues?  Like I said, it's just random thinking.


-- 
-----   David Herron,  cbosgd!ukma!david, david@UKMA.BITNET, david@ms.uky.csnet
-----           (also "postmaster", "news", "netnews", "uucp", "mmdf", and ...)
-----                                    (and the map maintainer for Kentucky.)
-----       "Don't put your money in South Africa -- Give it to me!" -- Cerebus

lwall@sdcrdcf.UUCP (Larry Wall) (01/15/87)

david@ukma.ms.uky.csnet (David Herron) proposes doing away with the history
file and using a master directory of article-i.d.s linked to the articles,
and says:
>   The Xrefs: header takes care of that right now ... the information
>   is in approximately the same format as in the history file.

Currently you don't have to open the article to get that info.  Of course,
you don't need the info unless you are actually going to delete the article.

>   hmmm ... offhand I'd say that the only program which would directly
>   use the master directory is expire and it should only be run once a day.
>   Further, expire could be written to be intelligent about the way
>   it runs through the directories.  (i.e. not a lot of random looking
>   about in the directory, but linear search ... )

You'd have to scan the entire directory every time you want to install an
article.  This is the killer.

>5) article id's are too long for non-flexname-in-the-file-system-people.
>
>   I'm tempted to just say "aaawwwww" but that's not fair.  Something
>   could be done like split it into a heirarchy of some sort...
>   (which would help in point 4 as well).

Add in the overhead of making new directories via a separate process part
of the time.  Add in the overhead of scanning periodically to eliminate
empty directories, though I suppose expire could do this easily enough.
Add in the hassle of making expire work recursively.

>6) we'd be able to get away from dbm...
>
>   it's a "yay" for xenix and sysv people ... they'd no longer be second
>   class citizens.  we'd also be able to get away from the hackery in
>   place to make sure the history file doesn't get corrupted by insertions
>   happening during an expire (etc).

You'd still have to be careful about insertions in the case of two rnewses
running simultaneously.  The link call probably provides sufficient locking
capability, but subdirectories would complicate matters.

>Are there any other issues?  Like I said, it's just random thinking.

Speaking of random thinking, eunice systems don't have linking.  You'd have
to emulate it, something like the LINKART code I threw into inews once upon a
time.  Eunice systems also tend to get slow when they have to think about
filenames that don't fit into the VMS filename straitjacket.

Larry Wall
{allegra,burdvax,cbosgd,hplabs,ihnp4,sdcsvax}!sdcrdcf!lwall

jerry@oliveb.UUCP (Jerry F Aguirre) (01/16/87)

Regarding eliminating the news/history file:

I think this is a great idea.  The idea of a separate history and news
data base always introduces the problem of having the two disagree.
Besides, the history file, even with the DBM copy, is not a very
efficient way to access articles.  Even the common problem of trying to
find an article from its article ID is inefficient.

The directory search and length problems are no different than those
introduced by the news group names.  Comp.os.unix gets mapped to
comp/os/unix, changing the dots to directory levels.  For article IDs I
think something useful could be done with this like mapping:

		<5504@ukma.ms.uky.csnet>
to:
		csnet/uky/ms/ukma/5504

This has the advantage of grouping all articles posted by a given site
in the same directory.  The only problem with a scheme like this is that
it places more restrictions on the format of the ID.  Written
correctly the software could provide for some less restrictive format
when it ran into a nonstandard ID.  (I have never seen any IDs not in
the above format.)
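
A minimal sketch of this mapping in C, assuming IDs of the simple
<local@host> form.  The name msgid_to_path is hypothetical, and anything
that does not fit the pattern (no '@', an embedded '/', an empty host
component) is rejected so that a caller could fall back to some less
restrictive scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: map "<local@a.b.c>" to "c/b/a/local".
 * Returns 0 on success, -1 if the ID is not in the simple form
 * or the result does not fit in out[outsz].
 */
int msgid_to_path(const char *msgid, char *out, size_t outsz)
{
    assert(msgid != NULL && out != NULL);

    size_t len = strlen(msgid);
    if (len < 5 || msgid[0] != '<' || msgid[len - 1] != '>')
        return -1;

    const char *local = msgid + 1;
    const char *end = msgid + len - 1;          /* points at '>' */
    size_t body = (size_t)(end - local);        /* length of "local@host" */

    const char *at = memchr(local, '@', body);
    if (at == NULL || at == local || at + 1 == end)
        return -1;
    if (memchr(local, '/', body) != NULL)       /* '/' would split the path */
        return -1;
    if (memchr(at + 1, '@', (size_t)(end - (at + 1))) != NULL)
        return -1;

    size_t local_len = (size_t)(at - local);
    if ((local_len == 1 && local[0] == '.') ||
        (local_len == 2 && local[0] == '.' && local[1] == '.'))
        return -1;                              /* would name "." or ".." */

    /* The path is exactly as long as "local@host": each '.' and the
     * '@' turn into one '/'. */
    if (body + 1 > outsz)
        return -1;

    size_t pos = 0;
    const char *comp_end = end;
    /* Walk the host right to left, emitting one component per step. */
    for (const char *p = end; p > at; p--) {
        if (p[-1] == '.' || p - 1 == at) {
            size_t n = (size_t)(comp_end - p);
            if (n == 0)
                return -1;                      /* empty component */
            memcpy(out + pos, p, n);
            pos += n;
            out[pos++] = '/';
            comp_end = p - 1;
        }
    }
    memcpy(out + pos, local, local_len);
    out[pos + local_len] = '\0';
    return 0;
}
```

Called with "<5504@ukma.ms.uky.csnet>" this fills the buffer with
"csnet/uky/ms/ukma/5504".  Component lengths are not checked against
any file system limit here.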

To cancel an article merely map the ID to a pathname and create a zero
length file.  This will hold the article ID while eliminating the data
portion of the article.

Eliminating the history file does force expire to do more work.  Perhaps
another link could be created for each article based on when it is
supposed to be expired.  A directory like "expire_date/17" could hold a
link to all articles that are due to expire on the 17th of the month.
In this case almost every article expire read would be one it would have
to remove anyway.  Expire could then use the "Xref:" line to track down
all links to each article and remove them.

This scheme would not be perfect, but then again neither is the current
one.  Every so often I have to track down articles that are not in the
history and therefore don't get expired.  Please don't tell me about the
nohistory/rebuild options to expire.  They violate all the advantages of
the history file and don't correctly handle corrupted articles.

					Jerry Aguirre

matt@ncr-sd.UUCP (Matt Costello) (01/16/87)

In article <5504@ukma.ms.uky.csnet> david@ukma.ms.uky.csnet (David Herron, NPR Lover) writes:
>I got to thinking the other morning and decided we're close to not
>needing to have the history file any more.
>
>My idea was to have a master directory in which the file names are
>the article-i.d. (minus the '<' and '>').  here are the pro's and
>con's I considered at the time.

How about having a separate history file for every newsgroup, located in
the newsgroup directory with a name of ".history"?  For those articles
residing in multiple newsgroups, the first newsgroup on the Newsgroups
header line would be used.  Its advantages would be:

1.  This would cut down the size of each history file so that searching
    the history file would be fast.

2.  It is simple to understand and implement.

3.  It would allow expire to be run with different values in different
    newsgroups.  (It has always struck me as pretty silly that net.jokes
    survives the same length of time as mod.sources.)

-- 
Matt Costello, matt.costello@SanDiego.NCR.COM (registered w/ CSNET)
	{sdcsvax,cbatt,dcdwest,nosc.ARPA,ihnp4}!ncr-sd!matt

sewilco@mecc.UUCP (01/16/87)

Aren't we talking about an awful lot of inodes here?
-- 
Scot E. Wilcoxon   Minn Ed Comp Corp  {quest,dayton,meccts}!mecc!sewilco
(612)481-3507           sewilco@MECC.COM       ihnp4!meccts!mecc!sewilco
   
  National Enquirer seers: 4 		Reality: 360

henry@utzoo.UUCP (Henry Spencer) (01/16/87)

> 3.  It would allow expire to be run with different values in different
>     newsgroups.  (It has always struck me as pretty silly that net.jokes
>     survives the same length of time as mod.sources.)

This has nothing to do with the format of the history file.  It is quite
possible to write an expire that does selective expiry with the current
history-file format.  It isn't even hard.  (The C news expire, which I
wrote, does it.)
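
One way selective expiry could be driven, independent of the history
format, is a table of retention periods keyed by newsgroup prefix.  The
sketch below is only an illustration of that idea, not a description of
how any existing expire works; the rule table, its values, and the name
retention_days are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct expiry_rule {
    const char *prefix;   /* newsgroup name or leading components */
    int days;             /* how long to keep matching articles */
};

/* Hypothetical values, for illustration only. */
static const struct expiry_rule rules[] = {
    { "mod.sources", 30 },
    { "net.jokes",    3 },
    { "net",          7 },
};

/* Return the retention period of the longest matching rule,
 * or default_days if no rule matches. */
int retention_days(const char *group, int default_days)
{
    assert(group != NULL);

    size_t best = 0;
    int days = default_days;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = strlen(rules[i].prefix);
        /* Match whole components only: "net" covers "net.jokes"
         * but not "network". */
        if (n > best && strncmp(group, rules[i].prefix, n) == 0 &&
            (group[n] == '\0' || group[n] == '.')) {
            best = n;
            days = rules[i].days;
        }
    }
    return days;
}
```

With this table, net.jokes would be kept 3 days and mod.sources 30, while
a group with no matching rule gets whatever default the caller passes.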

Eliminating the history file is, on the whole, a silly idea.  Precisely
what benefits is the change supposed to produce?  If your news system
lets articles get filed without putting them in the history file, this
is a defect of the implementation, not the concept.  Ditto for foulups in
coordination between inews and expire.  (What makes you think that the
implementation of the new concept will be any better?  Of course, it won't
foul up in precisely the same way...)

Making expire dig out the inode of each file to decide whether to expire
it would be an awful performance disaster.  Expire wins very heavily indeed
from having a central, efficiently-scanned database of articles.

The assumption that article-ids always have reasonable formats is terribly
naive.  Have any of the people suggesting this done *systematic* *surveys*
of article-ids?  I thought not.  The vast majority of article-ids are in
simple and reasonable format; it's the not-insignificant minority that
aren't that causes trouble.  (I speak from experience.)

The *only* thing I've seen mentioned which is really a significant advantage
for this idea is eliminating the need to go through dbm (or not go through
it, on lobotomized Unixes like System V).  Not because there is anything
grossly wrong with dbm, but simply because it isn't universal and it is not
ideally portable in some ways.  The fix for this is to write a public-domain
dbm equivalent.  Assuming that Unix directories will do it for you at high
efficiency is, again, terribly naive.

Just what is the big win from throwing out the history file, as opposed to
fixing the bugs in the current handling of it?
-- 
Legalize			Henry Spencer @ U of Toronto Zoology
freedom!			{allegra,ihnp4,decvax,pyramid}!utzoo!henry

hansen@pegasus.UUCP (01/17/87)

If you compile 2.11 without DBM defined, it will use a subdirectory called
LIBDIR/history.d as well as the history file. The history file is split into
10 pieces (files named 0-9) based on the last character before the '@'
within the message ID.

I have sent off mods to seismo!rick for inclusion in a future patch such
that the history file isn't even kept around; everything works off of the
history.d directory. So the answer is No, the history file isn't needed
anymore, and the code's already been written to get rid of it. It saves 1500
blocks of file space too!

If anyone wants the patch to do this, send mail.

(By the way, there are message ID's which have a "/" within them. This
wreaks havoc on any scheme to use the message ID itself as a file name.)

					Tony Hansen
					ihnp4!pegasus!hansen

heiby@falkor.UUCP (01/18/87)

In article <2924@pegasus.UUCP> hansen@pegasus.UUCP (60021254-Tony L. Hansen;LZ 3B-315;6243) writes:
>So the answer is No, the history file isn't needed
>anymore, and the code's already been written to get rid of it.

Tony is right that the news software no longer needs the history file.
However, I've found that I, as the news administrator, sometimes do.
I have found it useful in determining whether an article has arrived,
which newsgroup(s) it can be found in, etc.  Also, when one of my neighbors
has had a glitch where they have lost a day's news, but they have multiple
feeds, I've found it useful for creating a file of message-ids that
arrived at my site during the period in question.  That file of message-ids
can be easily packaged into an "ihave" control message.  So, I guess I'd
prefer to have the history file around, at least as an option which I
could select.
-- 
Ron Heiby, cuae2!falkor!heiby	Moderator: mod.newprod & mod.os.unix
	Between jobs this weekend.
"They are the best selling knives of its kind ever sold by us!" [sic(k)]

james@bigtex.uucp (James Van Artsdalen) (01/18/87)

I made slight modifications to the mdbm package and am using that as a dbm
for many programs, including news, on my System V machine.  It appears that
the overhead in mdbm associated with opening a database is high enough that
not a great deal is gained generally, although it does "feel" faster.  My
question is what is kept in the dbm database?  It appears to be little more
than an offset into the LIB/history file.  Are the LIB/history.d/[0-9] files
necessary or useful with dbm?  I deleted the LIB/history.d directory after
getting dbm running, with no apparent problems.  It appears that the dbm
database is nothing more than an index by message-id into the LIB/history file.
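
If the database really is just a message-id to byte-offset index, as it
appears to be, then a lookup would reduce to one seek and one read on
the history file.  A minimal sketch under that assumption; read_line_at
is a hypothetical name, and the offset would come from the dbm fetch,
which is not shown.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read the line that starts at byte `offset` of an open history file.
 * Returns 0 on success, -1 if the seek fails or nothing can be read.
 */
int read_line_at(FILE *fp, long offset, char *buf, size_t bufsz)
{
    assert(fp != NULL && buf != NULL && bufsz > 0);

    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    if (fgets(buf, (int)bufsz, fp) == NULL)
        return -1;
    return 0;
}
```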
-- 
James R. Van Artsdalen   ...!ut-sally!utastro!bigtex!james   "Live Free or Die"
(512)-328-0282

wunder@hpcea.HP.COM (Walter Underwood) (01/19/87)

Relying on the ctime of the article sounds like you could get into
trouble with some varieties of tape backup, and strange versions of
Unix.  Yes, spool disks do get backed up, and they do get crashed.

If you haven't ever seen message-ID's that weren't of the form
<1234@vax.dom>, then here is a new experience for you.  All of these
ID's were extracted from my news archives.  Most of the strange ones
seem to be created by mailers, and posted into moderated groups.

I've randomly deleted about half of the non-atsign ID's that my search
found.

  <VAX-MM(180)+TOPSLIB(116)+PONY(0).26-Mar-86.09:08:28.SRI-IU.ARPA>
  <[USC-ISI.ARPA].8-Apr-86.13:30:51.ISAACSON>
  <VAX-MM(186)+TOPSLIB(117)+PONY(0).10-Apr-86.11:17:53.SRI-IU.ARPA>
  <[AI.AI.MIT.EDU].30508.860424.SILVER>
  <[USC-ISIB.ARPA].6-May-86.12:02:48.YAMAZAKI>
  <VAX-MM(187)+TOPSLIB(118).11-May-86.08:50:06.ISI-VENERA.ARPA>
  <VAX-MM(194)+TOPSLIB(120)+PONY(0).30-Jul-86.17:43:10.SRI-WARBUCKS.ARPA>
  <[OAK.SAINET.MFENET].701C0320.008F2E4C.SECRIST>
  <VAX-MM(162)+TOPSLIB(114)+PONY(0).3-Apr-86.15:31:35.ISI-VLSIG.ARPA>
  <VAX-MM(162)+TOPSLIB(114)+PONY(0).10-Apr-86.16:18:31.ISI-VLSIH.ARPA>
  <[AI.AI.MIT.EDU].30285.860423.KFL>
  <VAX-MM(162)+TOPSLIB(114)+PONY(0)..9-May-86.12:19:17.ISI-VLSIG.ARPA>
  <[USC-ISIE.ARPA]14-May-86.16:09:44.HCBROWN>
  <VAX-MM(194)+TOPSLIB(120)+PONY(0).15-May-86.23:08:42.SRI-IU.ARPA>
  <[G.BBN.COM]24-Jun-86.16:28:11.WOLF>
  <VAX-MM(194)+TOPSLIB(120)+PONY(0).25-Jun-86.16:07:48.SU-STAR.ARPA>
  <VAX-MM(194)+TOPSLIB(120)+PONY(0).3-Jul-86.11:03:10.SRI-IU.ARPA>
  <VAX-MM(194)+TOPSLIB(120)+PONY(0).6-Jul-86.23:07:30.SRI-IU.ARPA>
  <[MX.LCS.MIT.EDU].931527.860706.KFL>
  <[AI.AI.MIT.EDU].67100.860708.KFL>
  <[MX.LCS.MIT.EDU].932390.860710.KFL>
  <"MS11(5206)+GLXLIB5(0)".12221054599.25.595.46644.at.MARLBORO.DEC.COM>
  <[MX.LCS.MIT.EDU].932363.860710.KFL>
  <VAX-MM(194)+TOPSLIB(120)+PONY(0) 13-Jul-86 13:22:47.SRI-IU.ARPA>
  <[MX.LCS.MIT.EDU].936349.860726.KFL>
  <[G.BBN.COM]30-Jul-86.09:15:51.WOLF>
  <[OAK.SAINET.MFENET].AD56D260.008F2CA8.SECRIST>
  <[MC.LCS.MIT.EDU].850641.860314.GUMBY>
  <[USC-ISI.ARPA]18-Apr-86.18:24:14.CERF>
  <"MS11(5146)+GLXLIB0(4)-4".12199814474.19.126.16540.at.MARLBORO.DEC.COM>
  <"MS11(5146)+GLXLIB0(4)-4" 12199813787.19.126.7685 at MARKET>
  <[USC-ISI.ARPA]20-Apr-86.18:15:30.CERF>
  <[SU-SCORE.ARPA]20-Apr-86.16:43:03.BILLW>
  <[BUCS20.BU.EDU].JSOL.25-Apr-86.18:10:41>
  <[SRI-NIC.ARPA].7-May-86.04:09:01.STJOHNS>
  <VAX-MM(187)+TOPSLIB(118)..6-May-86.12:49:10.IPTO.ARPA>
  <[F.BBN.COM]19-May-86.16:09:16.JDELSIGNORE>
  <VAX-MM(187)+TOPSLIB(118).27-May-86.16:53:11.ISI-VENERA.ARPA>
  <[USC-ISIB.ARPA]28-May-86.01:04:26.CHASE>
  <[A.BBN.COM]29-May-86.09:56:39.CLYNN>
  <[SRI-NIC.ARPA]30-Jun-86.04:59:52.STJOHNS>

There are lots more than this, but two things are clear:

  1) You can't rely on anything except the angle brackets.
  2) You aren't going to change every mailer/news-reader in
     the universe to use <1234@newsvax.foobar>.
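
In code terms, that means an ID extractor should look for the brackets
and nothing else.  A minimal sketch, assuming the ID runs from the first
'<' to the last '>' on the line; extract_msgid is a hypothetical name,
and no attempt is made to interpret what lies between the brackets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the text between the first '<' and the last '>' of `line`
 * into out[outsz], without the brackets.
 * Returns 0 on success, -1 if the brackets are missing, enclose
 * nothing, or the result does not fit.
 */
int extract_msgid(const char *line, char *out, size_t outsz)
{
    assert(line != NULL && out != NULL);

    const char *lt = strchr(line, '<');
    if (lt == NULL)
        return -1;
    const char *gt = strrchr(lt, '>');
    if (gt == NULL || gt == lt + 1)
        return -1;

    size_t n = (size_t)(gt - lt - 1);
    if (n + 1 > outsz)
        return -1;
    memcpy(out, lt + 1, n);     /* blanks, colons, etc. pass through */
    out[n] = '\0';
    return 0;
}
```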

If it is any consolation, these ID's give older versions of notes
a serious case of heartburn.

wunder

lwall@sdcrdcf.UUCP (Larry Wall) (01/20/87)

In article <825@mecc.MECC.COM> sewilco@mecc.UUCP (Scot E. Wilcoxon) writes:
>Aren't we talking about an awful lot of inodes here?

No.  These are links to already existing articles.  Only the new subdirectories
would need new inodes.

Larry Wall
{allegra,burdvax,cbosgd,hplabs,ihnp4,sdcsvax}!sdcrdcf!lwall

hansen@pegasus.UUCP (Tony L. Hansen) (01/21/87)

< Tony is right that the news software no longer needs the history file.
< However, I've found that I, as the news administrator, sometimes do.
< I have found it useful in determining whether an article has arrived,
< which newsgroup(s) it can be found in, etc.  Also, when one of my neighbors
< has had a glitch where they have lost a day's news, but they have multiple
< feeds, I've found it useful for creating a file of message-ids that
< arrived at my site during the period in question.  That file of message-ids
< can be easily packaged into an "ihave" control message.  So, I guess I'd
< prefer to have the history file around, at least as an option which I
< could select.

(This discussion concerns a system which does not use the DBM library
option.)

The contents of history can always be recreated by doing:

	cat LIBDIR/history.d/?

Oh, they need to be sorted by time of arrival before being looked at? Then
merge them back together with the '-m' option to sort:

	sort -m +1 LIBDIR/history.d/?

Want a list of all message id's?

	cut -f1 LIBDIR/history.d/?

Need to find article <116@falkor.UUCP>? Just look in LIBDIR/history.d/6, and
your search time will be 1/10th of the old search time. Just use the digit
before the '@' sign to find the history file to look in. If there is no
digit or no '@', then use file 0.
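
The rule for picking the file can be sketched in a few lines of C.  This
is an illustration of the rule as described, not the 2.11 source;
history_bucket is a hypothetical name, and using the first '@' when an
ID contains several is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Return which of the ten history.d files (0-9) a message ID
 * belongs to: the digit just before the '@', or 0 if there is
 * no '@' or the character before it is not a digit.
 */
int history_bucket(const char *msgid)
{
    assert(msgid != NULL);

    const char *at = strchr(msgid, '@');
    if (at == NULL || at == msgid)
        return 0;

    unsigned char c = (unsigned char)at[-1];
    return isdigit(c) ? c - '0' : 0;
}
```

For <116@falkor.UUCP> this gives 6, matching the example above.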

The only things which I've found to be not as easy are:

	1) what time did the last article arrive?

	ls -l LIBDIR/history
		becomes
	ls -lt LIBDIR/history.d

	2) what were the last few articles which came across?

	tail LIBDIR/history
		becomes
	for I in LIBDIR/history.d/?;do tail $I;done | sort +1 | tail

I've been using this code for over a year now (it was first installed into
2.10.3), and I appreciate not having the extra 2000 blocks wasted space.

					Tony Hansen
					ihnp4!pegasus!hansen

jerry@oliveb.UUCP (Jerry F Aguirre) (01/21/87)

In article <7529@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Eliminating the history file is, on the whole, a silly idea.  Precisely
>what benefits is the change supposed to produce?  If your news system
>lets articles get filed without putting them in the history file, this
>is a defect of the implementation, not the concept.  Ditto for foulups in
>coordination between inews and expire.  (What makes you think that the
>implementation of the new concept will be any better?  Of course, it won't
>foul up in precisely the same way...)

The map is not the territory.  Whenever you keep a separate index to
something there is the potential for the index to not agree.  With the
index built into the data there is less potential for disagreement.  In
this case we have three different versions, the history file, the dbm
copy, and the articles themselves.

I see a constant series of complaints on the net about the history file
being out of sync.  The classic case seems to be running out of disk
space which causes the history file to wind up zero length.  It is not
quite clear to me what changes to the news software could reliably
handle the problem of running out of disk space.

With no history file the potential for duplicates of the same article
arriving can cause even more disk space problems as well as user
annoyance.

>The assumption that article-ids always have reasonable formats is terribly
>naive.  Have any of the people suggesting this done *systematic* *surveys*
>of article-ids?  I thought not.  The vast majority of article-ids are in
>simple and reasonable format; it's the not-insignificant minority that
>aren't that causes trouble.  (I speak from experience.)

The format of the message IDs and the difficulty in translating them
into filenames is not a serious problem.  It is trivial to design a mapping
that can handle any character value, even non-ASCII characters.  For
example: take the first character and split it into two hex digits, make
directories for those names (00 - ff).  Repeat this with the next
character until the number of files per directory is reasonable.

As far as characters for the final file name, Unix allows any characters
in a filename except the '/' (and of course null).  Message IDs are
spec'ed to allow any character except the blank and null.  Translate any
'/' into blanks and you have a legal Unix file name.  Length of the ID
is more of a problem and could be handled by taking the first N (14)
characters, making a directory of that name, etc.
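
A minimal sketch of the hex-directory idea in C, assuming a fixed number
of directory levels rather than growing them on demand, and reading "two
hex digits" as one directory per leading character.  The name
msgid_hex_path is hypothetical, the angle brackets are assumed to be
stripped already, and the length limit on the final name is not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "xx/yy/.../rest": one two-hex-digit directory per leading
 * character of `id`, then the remaining characters as the file
 * name with every '/' translated to a blank.
 * Returns 0 on success, -1 if the ID is too short, the file name
 * would be "." or "..", or the result does not fit in out[outsz].
 */
int msgid_hex_path(const char *id, size_t levels, char *out, size_t outsz)
{
    assert(id != NULL && out != NULL);

    size_t len = strlen(id);
    if (len <= levels)
        return -1;              /* nothing left for the file name */

    const char *rest = id + levels;
    if (strcmp(rest, ".") == 0 || strcmp(rest, "..") == 0)
        return -1;

    size_t need = levels * 3 + (len - levels) + 1;
    if (need > outsz)
        return -1;

    size_t pos = 0;
    for (size_t i = 0; i < levels; i++) {
        /* "%02x/" is three characters plus the terminating NUL */
        snprintf(out + pos, 4, "%02x/",
                 (unsigned int)(unsigned char)id[i]);
        pos += 3;
    }
    for (const char *p = rest; *p != '\0'; p++)
        out[pos++] = (*p == '/') ? ' ' : *p;
    out[pos] = '\0';
    return 0;
}
```

With two levels, "5504@ukma.ms.uky.csnet" becomes
"35/35/04@ukma.ms.uky.csnet".  Note that the '/' to blank translation is
only collision-free if no ID ever contains a blank of its own.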

Non-Unix systems would have different restrictions but the same
principle could be applied.  For instance VMS systems have a more
restricted filename character set but do not suffer from the directory
length problem.  An extension of the translate characters to hex
strategy could be used.

If the article were created without write permission (mode 444) then the
create itself provides all the (portable) locking necessary to prevent
two "simultaneous" receptions of the same article.
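
A sketch of the same check-and-create step in C.  It swaps in open()
with O_CREAT|O_EXCL rather than relying on creat() failing against a
mode 444 file, since the permission check does not stop the superuser;
that substitution is mine, not part of the proposal.  claim_article is a
hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to claim `path` for a newly arrived article.
 * Returns 1 if the file was created (the article is new),
 *         0 if it already exists (duplicate, or another receiver won),
 *        -1 on any other error.
 */
int claim_article(const char *path)
{
    assert(path != NULL);

    /* O_EXCL makes "does it exist?" and "create it" one atomic step. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0444);
    if (fd >= 0) {
        /* A real receiver would write the article through fd here. */
        close(fd);
        return 1;
    }
    return (errno == EEXIST) ? 0 : -1;
}
```

Two receivers racing on the same ID would both call this; only one of
them gets 1 back.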

As far as overhead, it is hard to understand how a single call to create
a file can have a prohibitive overhead.  That is all that is required to
check the "history" AND make the new entry.  As a directory need only be
made for each N files (say 100) the overhead for that can not be
significant.  I have heard some mention of "searching" directories for
each received article.  I don't understand what prompted someone to
think this.  How, exactly, does the proposed method differ from the
current method of creating an article (and possibly the directory for
it)?

I think that most of the same arguments could be made about the current
hierarchy for storing the news.  Right now the news software uses links
for posting to multiple news groups when it could just as easily use
index files to cross post.  I am no expert on "notes" but doesn't it use
that kind of scheme?  I seem to remember postings arguing the same kind
of "advantages" for the notes system as are being given for the history
file.

					Jerry Aguirre
					Olivetti ATC

sewilco@mecc.MECC.COM (Scot E. Wilcoxon) (01/22/87)

In article <3831@sdcrdcf.UUCP> lwall@sdcrdcf.UUCP (Larry Wall) writes:
>In article <825@mecc.MECC.COM> sewilco@mecc.UUCP (Scot E. Wilcoxon) writes:
>>Aren't we talking about an awful lot of inodes here?
>No.  These are links to already existing articles.  Only the new subdirectories
>would need new inodes.

I apparently was trying too hard to be brief.  I meant for the expired
articles, not just the current ones.  Many sites seem to keep articles
for 3-7 days, and history entries for 1-4 weeks.  As previous posters have
said, an expired article would have to keep some kind of 'file' for a
Message-ID entry until the 1-4 week period expired.

I have since realized that all expired articles could point to the same
file, perhaps one simply saying "Expired Article."  Similar ideas are to
have one such file for each day ("Expires: somedate"), and for the
file to contain a list of Message-IDs which expire on that date (expire
just has to find the file, then delete everything on the list).  Remember
I'm talking about articles for which expire has already deleted the
text, and just needs to keep track of the Message-ID for a while.
-- 
Scot E. Wilcoxon   Minn Ed Comp Corp  {quest,dayton,meccts}!mecc!sewilco
(612)481-3507           sewilco@MECC.COM       ihnp4!meccts!mecc!sewilco
   
  "Who's that lurking over there?  Is that Merv Griffin?"

shor@sphinx.UUCP (01/27/87)

[]
Those of us at sites with limited disk, lots of users, and reasons 
for being other than news (though I can't remember what those reasons 
are ... ) find that keeping the history of expired articles around for
several weeks after the article itself has expired is pretty handy.  We
bounce a surprising number of duplicate articles.  I have no idea why
they aren't being caught by upstream sites.  Anyway, I think I would
continue to use the current version of news rather than converting to
one that didn't use a history file.

Perhaps we should be talking about better ways of encoding the same
information.  My history files (with indices) take up over a half meg
of disk that could otherwise be used by talk.bizarre.
-- 
Melinda Shore                               ..!ihnp4!gargoyle!sphinx!shor
University of Chicago Computation Center         shor@sphinx.uchicago.edu

chris@mimsy.UUCP (02/06/87)

>In article <7529@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>>Eliminating the history file is, on the whole, a silly idea.

Eliminating it is indeed silly; replacing it may not be.  (Well,
yes, `eliminating the history file' might be a *part* of the
replacement operation....)

>>... (What makes you think that the implementation of the new concept
>>will be any better? ...)

In the words of that mortal bard, `Ay, there's the rub.'

In article <414@oliveb.UUCP> jerry@oliveb.UUCP (Jerry F Aguirre) writes:
>I see a constant series of complaints on the net about the history file
>being out of sync.  The classic case seems to be running out of disk
>space which causes the history file to wind up zero length.  It is not
>quite clear to me what changes to the news software could reliably
>handle the problem of running out of disk space.

	oldf = open("history", ...);
	newf = creat("history.new", ...);
	...
	/* write exclusively to newf, catching errors */
	/* finally: */
	if (close(newf)) ...	/* handle error */
	(void) unlink("history");
	if (link("history.new", "history")) ...
	(void) unlink("history.new");
	/* or: if (rename("history.new", "history")) ... */

The `handle errors' part is tricky; one could delay unlinking
expired articles until the entire history.new has been made
successfully, or simply wait for Intelligent help.  This also does
nothing for incoming articles.  But this is a conventional
database coherency problem, with conventional solutions.
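
The rename() variant of the outline above can be written out as a small
self-contained routine.  This is a sketch, not expire's actual code:
replace_file and its parameters are hypothetical, the whole new contents
are passed in as one string for brevity, and nothing is forced to disk
with fsync.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write `text` to tmp_path, then rename it over final_path.
 * Returns 0 on success.  On any failure returns -1, removes the
 * temporary file, and leaves the old final_path untouched.
 */
int replace_file(const char *final_path, const char *tmp_path,
                 const char *text)
{
    assert(final_path != NULL && tmp_path != NULL && text != NULL);

    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    int failed = (fputs(text, fp) == EOF);
    /* fclose flushes the buffer, so a full disk shows up here at
     * the latest. */
    if (fclose(fp) == EOF)
        failed = 1;

    /* Only a completely written file is moved into place. */
    if (!failed && rename(tmp_path, final_path) != 0)
        failed = 1;

    if (failed) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```

Because the old file is never opened for writing, running out of space
part way through costs only the temporary file.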

>As far as overhead, it is hard to understand how a single call to create
>a file can have a prohibitive overhead.  That is all that is required to
>check the "history" AND make the new entry.

Name lookups are, in all Unix systems that I have seen, *the*
highest cost operation performed by the kernel.  In particular,
creating a link is hard even with caching, as the entire target
directory must be scanned to ensure that the file does not yet
exist.  Longer pathnames take correspondingly longer to translate,
so short message-IDs would be desirable.  In comparison, a history
file can be opened once by rnews, and one line written per article
stored with only a relatively cheap write() syscall.  (This is, of
course, only a factor if you batch articles.)

>As a directory need only be made for each N files (say 100) the
>overhead for that can not be significant.  I have heard some
>mention of "searching" directories for each received article.
>I don't understand what prompted someone to think this.

Expire.  Expire must gather information about every unexpired
article.  Without a history file, expire would have to read the
directory tree.

If you think you have a good replacement, try it.  You may find
it hard to beat C news, however.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris@mimsy.umd.edu