mark@sickkids.UUCP (Mark Bartelt) (12/05/88)
In article <115@sickkids.UUCP> mark@sickkids.UUCP (Mark Bartelt -- oh, yeah, that's me, no wonder the name sounded so familiar) writes:

   [ Complaints about messages reappearing, with new Message-IDs, and
     various other header lines changed as well ]

In article <10573@s.ms.uky.edu> david@ms.uky.edu (David Herron) replies:

   [ Explanation about Usenet=>MailingList=>Usenet gatewaying, and such ]

Several other people responded via e-mail, with essentially the same explanation. Many suggested that it wasn't a major enough issue to be worth worrying about.

Being somewhat curious as to how much of this duplication there really was, I set a quickie shell script to work, finding all duplicate articles in all the newsgroups that we receive. In case anyone cares, here are the results ...

Of the 180 or so groups that we get, only 28 had any duplicates at all. (This is based on about four weeks of unexpired netnoise.) The worst offenders were the following:

   100.0   comp.sys.ibm.pc.digest
    37.2   comp.society.women
    29.4   comp.protocols.ibm
    18.3   comp.sys.next
    12.8   comp.archives
     9.1   bionet.general

The number preceding each newsgroup name is the percentage increase in its article count attributable to duplication.

As we can see, we have a clear winner (loser?) here. Of the 18 articles in comp.sys.ibm.pc.digest at the time the script ran, only nine were "real" ones. Six were second copies of some of the "real" nine. One was a *third* copy of one article! And two were parts 1 and 2 of an article (which was already there twice in its non-chopped-up form) that someone had cut into two pieces for transmission to god-knows-where.

Unfortunately, I didn't have the foresight to have the script keep track of the space wasted by article duplication. (And I don't feel like running it *again*, since it cranked along for seven hours or so on my poor old VAX/750.) But, assuming that the duplicated articles are average in size for the newsgroup in which they appear, a rough estimate is 600-700kb wasted.
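For the curious, here is a minimal sketch of how such a duplicate-finder might be put together. This is not the script that actually ran; the directory layout (one article per file), the trick of checksumming only the body, and the function name find_dups are all my assumptions.

```shell
#!/bin/sh
# find_dups DIR: list every article file under DIR whose body duplicates
# the body of a file already seen. "Body" is taken to be everything after
# the first blank line, so articles that differ only in their headers
# (fresh Message-IDs, rewritten Paths, etc.) still count as duplicates.
# Using a checksum instead of a full byte-for-byte comparison is my own
# shortcut, not necessarily what the original script did.
find_dups() {
    find "$1" -type f -print | while read -r f; do
        # Strip the headers, checksum the body, emit "checksum filename".
        printf '%s %s\n' "$(sed '1,/^$/d' "$f" | cksum | cut -d' ' -f1)" "$f"
    done | sort |
    awk 'seen[$1]++ { print $2 }'   # print each repeat of a body already seen
}

# Usage (hypothetical spool path):
#   find_dups /usr/spool/news/comp/sys/ibm/pc/digest
```

Sorting on the checksum before the awk pass means repeats of the same body land on adjacent lines, so a single pass with a seen[] array suffices; the first copy of each body is treated as the "real" article and only the extras are printed.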
And since the script looked only for true duplicates (i.e., articles identical except for their headers), it missed things like the INFO-UNIX and UNIX-WIZARDS digests appearing in (and duplicating many articles in) comp.unix.{questions,wizards}. So, adding a fudge factor for this sort of thing, I'd guess a megabyte or thereabouts is being wasted, out of the 60mb or so that all the articles on the system occupied at the time the script was run.

In other words, not a big deal, but perhaps something that should be watched (and, one hopes, eventually fixed), since a couple of truly-out-of-control gateways between netnoise and mailing lists could, conceivably, clutter our disks up with all manner of duplicate rubbish.

By the way, if I see this article reappear a second time, does that mean that I'm really Douglas Hofstadter?

Mark Bartelt                          UUCP:     {utzoo,decvax}!sickkids!mark
Hospital for Sick Children, Toronto   BITNET:   mark@sickkids.utoronto
416/598-6442                          INTERNET: mark@sickkids.toronto.edu