jc@minya.UUCP (John Chambers) (08/29/87)
Say, I was wondering if anyone out there could explain something that has had me puzzled for a while. Usenet has, as is well known, this "history" file, and in addition, on this Sys/V machine, it has a history.d directory that contains files 0 thru 9. Each of these is a subset of "history", and this sorta makes sense; as the install doc describes it, this is to speed up history lookups. However, the software as compiled here insists on keeping *both* the large history file and the 10 little ones. This strikes me as a bit on the bizarre side. I mean, isn't that getting the worst of both worlds? Poor rnews is looking everything up in both of them, and updating both. It seems that just using the one large file would save a whole lot of time. The only other program that regularly grovels in these files is expire, and it does both of them, too.

When I first installed usenet, I figured that I'd keep it simple and just use one big history file. One of my first puzzles was why each line was there twice. I examined the code, and discovered that it quite intentionally does this if the history.d files are missing. So I created history.d/0, history.d/1, and so on, and it now writes the two lines to two different files.

OK, I thought, perhaps it is like the log file; perhaps you don't have to have the big history file at all. So I deleted it. (Well, actually, in my paranoia, I merely renamed it; good thing, too...) The next morning I found I had a mail message saying that there was a problem with the history file, and it had been fixed. I investigated, and sure enough, "history" was back, with all the new articles in it. On further investigation, I found something rather startling. All the files in history.d had been truncated, discarding all the history, and they also contained only the new entries.

Now, this really strikes me as inexplicable. According to the install doc, the history.d files exist to speed things up. Instead, they seem to be a totally redundant history: the software insists that it *will* write two copies of everything, and if I try to remove things, it reaches out and garbages the rest. Considering that the entries aren't sorted, so two linear searches are being done thru some very large files, this seems like hardly a nice way to do things.

Am I totally mistaken, or is this really supposed to be as wasteful of the machine's cycles and disk space as it appears? Or did I perhaps do something stupid, and some obscure tweak will make it all work sanely? If so, I guess I'm not smart enough to figure it out from the doc, so someone out there maybe oughta explain it before I go in and wreak some bodily harm on the poor unsuspecting code. (It'd be easy enough to get rid of all the history.d code.) Or has this perhaps been corrected somewhere and I just didn't notice? (I admit that, because patch does some very bizarre things with some of the patches, producing code that is totally uncompilable, I have installed only a couple of them so far.)

Bye for now.
-- 
John Chambers <{adelie,ima,maynard}!minya!{jc,root}> (617/484-6393)
henry@utzoo.UUCP (Henry Spencer) (08/31/87)
> Say, I was wondering if anyone out there could explain something that
> has had me puzzled for a while. Usenet has, as well known, this "history"
> file, and in addition, on this Sys/V machine, it has a history.d directory
> that contains files 0 thru 9. Each of these is a subset of "history"...
> However, the software as compiled here insists on keeping *both* the
> large history file and the 10 little ones. This strikes me as a bit
> on the bizarre side. I mean, isn't that getting the worst of both
> worlds? ...

What you're seeing is a SysV emulation of something that makes a bit more sense in non-SysV Unixes. In the non-SysV world, there's a little library called "dbm" which lets one maintain sort of a symbol table. The point of this is that news can use dbm to find out *quickly* whether an article with a given message-id exists on the system already. Using dbm is *much much* faster than scanning the whole history file, especially in non-SysV systems that historically have rather slow implementations of fgets. The history file provides the authoritative record; dbm's role is to provide fast access. It's worth the price of having to keep both up to date.

Unfortunately, there's no dbm in SysV, so it was faked. I don't know whether anybody looked carefully at the tradeoffs in SysV, although the 0-9 split should be cutting search overhead noticeably. The point is that it's not really a duplication of effort -- the main history file and its subsidiaries are specialized for different uses.
-- 
"There's a lot more to do in space   | Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry
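
For readers who want to see why a 0-9 split cuts search overhead, here is a minimal sketch in Python. It is not the news software's code: the bucket rule (byte sum of the message-id modulo 10), the tab-separated line layout, and the names `bucket_for`, `add_article`, and `have_article` are all assumptions made for illustration. It only shows the general idea that a lookup has to scan one small file instead of the whole history.

```python
import os
import tempfile

NUM_BUCKETS = 10  # stands in for the files history.d/0 .. history.d/9


def bucket_for(message_id: str) -> int:
    """Hypothetical bucket rule: byte sum of the message-id, modulo 10."""
    return sum(message_id.encode("utf-8")) % NUM_BUCKETS


def add_article(history_dir: str, message_id: str, info: str) -> None:
    """Append one entry to the single bucket file chosen by the message-id."""
    path = os.path.join(history_dir, str(bucket_for(message_id)))
    with open(path, "a") as f:
        f.write(f"{message_id}\t{info}\n")


def have_article(history_dir: str, message_id: str) -> bool:
    """Linear search, but only through the one bucket that could hold the id."""
    path = os.path.join(history_dir, str(bucket_for(message_id)))
    if not os.path.exists(path):
        return False
    with open(path) as f:
        for line in f:
            # The message-id is the first tab-separated field (assumed layout).
            if line.split("\t", 1)[0] == message_id:
                return True
    return False


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # throwaway stand-in for history.d
    add_article(demo_dir, "<123@minya.UUCP>", "08/29/87 example.group/1")
    print(have_article(demo_dir, "<123@minya.UUCP>"))
    print(have_article(demo_dir, "<456@utzoo.UUCP>"))
```

If the ids spread evenly over the buckets, each lookup reads roughly a tenth of the lines that a scan of the full history file would. It is still a linear search, which is why a keyed lookup such as dbm is faster where it is available.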
rick@seismo.CSS.GOV (Rick Adams) (09/01/87)
Yes, it is a bit bizarre to keep two copies of the history file. This was fixed in patch #4 (we're up to 8 with #9 coming soon).

--rick