[news.admin] Header munging, duplicate articles

mark@sickkids.UUCP (Mark Bartelt) (12/05/88)
In article <115@sickkids.UUCP> mark@sickkids.UUCP (Mark Bartelt --
oh, yeah, that's me, no wonder the name sounded so familiar) writes:

[  Complaints about messages reappearing, with new Message-IDs, and
   various other header lines changed as well  ]

In article <10573@s.ms.uky.edu> david@ms.uky.edu (David Herron) replies:

[  Explanation about Usenet=>MailingList=>Usenet gatewaying, and such  ]

Several other people responded via e-mail, with essentially the same
explanation.  Many suggested that it wasn't a major enough issue to
be worth worrying about.  Being somewhat curious as to how much of
this duplication there really was, I set a quickie shell script to
work, finding all duplicate articles in all the newsgroups that we
receive.  In case anyone cares, here are the results ...
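
For anyone wanting to repeat the experiment, here is a minimal sketch in
Python of one way to do such a check.  It is not the shell script that
produced the numbers below.  It assumes that "duplicate" means the bodies
match exactly while the headers differ, and that the header ends at the
first blank line.  The names body_of and find_duplicates, and the dict of
article texts, are hypothetical; reading the articles out of the spool
directory is left out.

```python
import hashlib
from collections import defaultdict


def body_of(article_text):
    """Return the text after the first blank line (the header/body separator)."""
    _head, sep, body = article_text.partition("\n\n")
    return body if sep else ""


def find_duplicates(articles):
    """Group article names whose bodies are identical.

    `articles` is a hypothetical mapping of article name -> full article text.
    Articles with an empty body are skipped so they are not lumped together.
    """
    groups = defaultdict(list)
    for name, text in articles.items():
        body = body_of(text)
        if not body:
            continue
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[digest].append(name)
    # Keep only the bodies that occur more than once.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)
```

Hashing each body once keeps the comparison linear in the number of
articles, rather than comparing every article against every other.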

Of the 180 or so groups that we get, only 28 had any duplicates at
all.  (This is based on about four weeks of unexpired netnoise.)
The worst offenders were the following:

	100.0  comp.sys.ibm.pc.digest
	 37.2  comp.society.women
	 29.4  comp.protocols.ibm
	 18.3  comp.sys.next
	 12.8  comp.archives
	  9.1  bionet.general

The number preceding each newsgroup name is the percentage increase in
article count attributable to duplication.  As we can
see, we have a clear winner (loser?) here.  Of the 18 articles in
comp.sys.ibm.pc.digest at the time this ran, only 9 were "real" ones.
Six were second copies of some of the "real" nine.  One was a *third*
copy of one article!  And two were parts 1 and 2 of an article (which
was already there twice in its non-chopped-up form) that someone cut
into two pieces for transmission to god-knows-where.
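
To make the arithmetic behind the table explicit, here is a small sketch
in Python.  The function name duplication_increase is hypothetical; it
assumes the figure is the number of extra articles divided by the number
of "real" ones, times 100, which matches the 18-versus-9 case above.

```python
def duplication_increase(total_articles, real_articles):
    """Percentage increase in article count caused by duplicates."""
    extra = total_articles - real_articles
    return 100.0 * extra / real_articles


# comp.sys.ibm.pc.digest: 18 articles on disk, 9 of them "real".
print(duplication_increase(18, 9))  # 100.0
```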

Unfortunately, I didn't have the foresight to have the script keep
track of the space wasted by article duplication.  (And I don't feel
like running it *again*, as it cranked along for seven hours or so
on my poor old VAX/750.)  But, assuming that the duplicated articles
are average in size (for the newsgroup in which they appear), a rough
estimate is 600-700kb wasted.  And since the script looked only for
true duplicates (i.e. articles identical except for the headers), it
missed things like the INFO-UNIX and UNIX-WIZARDS digests appearing
(and duplicating many articles in) comp.unix.{questions,wizards}.  So,
to add a fudge factor for this sort of thing, I'd guess a megabyte or
thereabouts is being wasted, out of 60mb or so that all the articles
on the system occupied at the time the script was run.  In other words,
not a big deal, but perhaps something that should be watched (and, one
hopes, eventually fixed), since a couple of truly-out-of-control gateways
between netnoise and mailing lists could, conceivably, clutter up our
disks with all manner of duplicate rubbish.
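
The space estimate above can be sketched the same way, again in Python
and again as an illustration rather than anything that was actually run.
It assumes each duplicate is as large as the average article in its
newsgroup.  The names estimated_waste and wasted_fraction, and the
(total_bytes, total_articles, duplicate_articles) tuple layout, are
hypothetical.

```python
def estimated_waste(groups):
    """Estimate bytes wasted on duplicates.

    `groups` is an iterable of (total_bytes, total_articles,
    duplicate_articles) tuples, one per newsgroup.  Each duplicate is
    assumed to be of average size for its group.
    """
    return sum(
        total_bytes / total_articles * duplicates
        for total_bytes, total_articles, duplicates in groups
        if total_articles
    )


def wasted_fraction(wasted, total):
    """Fraction of the total occupied by waste (same units for both)."""
    return wasted / total


# Roughly a megabyte wasted out of roughly 60 megabytes of articles.
print(f"{wasted_fraction(1, 60):.1%}")  # 1.7%
```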

By the way, if I see this article reappear a second time, does that mean
that I'm really Douglas Hofstadter?

Mark Bartelt                          UUCP: {utzoo,decvax}!sickkids!mark
Hospital for Sick Children, Toronto   BITNET: mark@sickkids.utoronto
416/598-6442                          INTERNET: mark@sickkids.toronto.edu