[net.news.group] too many new groups

lauren@vortex.UUCP (Lauren Weinstein) (02/25/85)

If we keep creating new newsgroups whenever someone finds "more interest
than he/she expected" in a topic, we're going to kill Usenet
faster than really necessary.  For many topics, we find a fairly
small group of people who are interested in the discussion, and a MASSIVE
group of people who couldn't care less.  Yet the netnews goes pretty much
everywhere, regardless of the actual interest.  For many of the
topics now being proposed as groups, coordinated mailing lists could be
far more efficient in terms of both delivery time and cost.

If we gave the matter some thought, we might find that some of the
backbone sites, which have to bear the brunt of mass netnews postings,
might be willing to help support some mailing lists.  These might
obviate the need to create a new newsgroup whenever someone finds
a new collection of people who want to discuss a particular topic.

The current netnews philosophy, which basically amounts to
"send everything, everywhere, and damn the cost" in most cases,
is becoming increasingly self-destructive.

--Lauren--

paul@mogwai.UUCP (Paul H. Mauritz) (02/26/85)

> If we keep creating new newsgroups whenever someone finds "more interest
> than he/she expected" in a topic, we're going to kill Usenet
> faster than really necessary.
> 
> If we gave the matter some thought, we might find that some of the
> backbone sites, which have to bear the brunt of mass netnews postings,
> might be willing to help support some mailing lists.  These might

 Bravo Lauren.  As an additional note, I was a member of a group
 proposing a new news group.  One thoughtful person suggested that we
 instead start a mailing list, which would carry all the articles people
 wanted to submit.  One person volunteered as coordinator -
 NOTE:  I said coordinator, not moderator.  He/she simply mails out to
 all interested people everything he/she receives on the topic.  It has
 been around for several months, and is working out fine - in my
 opinion.

 Mailing lists could effectively - and less expensively - replace a
 large number of the current news groups.

-- 
Paul H. Mauritz - Digital Equipment Corporation

UUCP:   decvax!grendel!paul
ARPA:   grendel!paul@seismo.ARPA
AT&T:   (301) 459-7956
USPS:   8301 Professional Place, Landover MD USA 20785, MS-DCO/913

"We do not inherit the world from our ancestors, we borrow it
 from our children."

hokey@plus5.UUCP (Hokey) (02/28/85)

I would almost think it is time to re-examine the possibilities for a
keyword-based news system.  We could create all the keywords we wanted,
and just come up with an aliases database to handle the similarities.

Sure would make things different!
-- 
Hokey           ..ihnp4!plus5!hokey
		  314-725-9492

chuqui@nsc.UUCP (The Phantom) (03/03/85)

In article <614@plus5.UUCP> hokey@plus5.UUCP (Hokey) writes:
>I would almost think it is time to re-examine the possibilities for a
>keyword-based news system.  We could create all the keywords we wanted,
>and just come up with an aliases database to handle the similarities.
>
>Sure would make things different!

My opinion is that it would make life worse, not better. I actually
have some 'facts' to base this comment on.

I have been experimenting with a new archiving system for netnews-- I've
needed it because I have something like 55 Megabytes of usenet archive from
various groups that I'd like to find things in once in a while, since
there IS a lot of good information out there, somewhere (as you might
imagine, 'grep' doesn't really cut it on things of that size).

I started looking at this last spring, when I realized I was going to need
it (this was about when 'grep <word> /usr/spool/oldnews/net/unix-wizards/*'
failed from too many file names). I finally got a design I thought was
workable, which got posted as an RFC to unix-wizards and net.news sometime
around Christmas (I got no comments back, BTW). I've been playing with
various parts of the design as prototypes since then, looking at the design
versus the reality, and trying to decide whether or not it will work. The
answer, for now, is that it won't.

Here are the problems I ran into-- To some degree I'm simplifying, to some
degree I've probably missed something obvious. Comments are welcome
(preferably by mail) since I'm still trying to build an archiver...

    My database is 50+ meg, and growing by 1-2 meg a week. This is
    a LOT of data, and a lot of it is out of date, duplicated, or not
    exceptionally useful. Excepting net.sources, it is broken into tiny
    pieces (somewhere between 1000 and 3000 bytes seems to be the average),
    and so you have a LOT of files (estimate left as an exercise for the
    reader).
    
    Implementing this on a standard news system means dealing with
    a lot LESS data, but you also need to worry about expiration of
    data, which is a whole new ballgame.

    One strong requirement for keyword lookup is speed, which means that
    the keyword database must be created before it is needed. For me, this
    meant when an article was added to the archive; for news, it would be
    when the article is given to rnews (or whatever replaces it). Rnews is
    already a hog. Generating keywords is not what I would call a trivial
    exercise, and while I found a fairly elegant way of doing it, we are
    still talking about a fair number of cycles.

    We are also talking about a fair number of keywords. Dbm is out-- it
    doesn't allow multiple entries under the same key, which means you get
    to figure out some way of storing them for quick access. You also need
    to store them, which takes space. I just took a quick look at my
    system: Unix-wizards has 370 articles, singles has 312, jokes has 530,
    flame has 449, politics has 528 and music has 385. That is about 2574
    articles in 6 of the larger groups. My history file shows 12,000
    articles total. I was generating an average of about 10 keywords per
    article by pulling significant words out of the first block of the
    article past the header (the Keywords: line is either ignored or used
    for 'cute' words often enough that I stopped playing with it for now).
    Assume 5 characters for an average word. That gives us, just for
    keyword storage, 600,000 characters, and no information about where
    those keywords point to. So add a filename like 'unix-wizards/10043'
    for another 20 characters-- giving us 25 characters an entry, or about
    3 megabytes.
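
    For the curious, a rough sketch of that kind of keyword generation
    (this is not my actual prototype-- the block size, minimum word
    length, and stop-word list here are just guesses) might look like:

        /* Skip the header, look only at the first block of the body,
         * and pull out up to 10 "significant" words. */
        #include <stdio.h>
        #include <ctype.h>
        #include <string.h>

        #define BLOCKSIZE 1024  /* first block of the body only */
        #define MAXKEYS   10    /* about 10 keywords per article */
        #define MINLEN    4     /* shorter words are rarely significant */

        static char *stopwords[] = {
            "this", "that", "with", "from", "have", "will", "your",
            "there", "would", "about", "which", "what", NULL
        };

        static int
        is_stopword(char *w)
        {
            char **s;

            for (s = stopwords; *s != NULL; s++)
                if (strcmp(w, *s) == 0)
                    return 1;
            return 0;
        }

        int
        main()
        {
            char line[BUFSIZ], block[BLOCKSIZE + 1];
            char word[64], *keys[MAXKEYS];
            int nblock = 0, nkeys = 0, wlen = 0, i, c;

            /* Skip the news header: everything up to the first
             * empty line. */
            while (fgets(line, sizeof line, stdin) != NULL)
                if (line[0] == '\n')
                    break;

            /* Read the first block of the article body. */
            while (nblock < BLOCKSIZE && (c = getchar()) != EOF)
                block[nblock++] = c;
            block[nblock] = '\0';

            /* Collect lowercased alphabetic words, skipping short
             * ones, stop words, and duplicates. */
            for (i = 0; i <= nblock && nkeys < MAXKEYS; i++) {
                c = (unsigned char)block[i];
                if (isalpha(c) && wlen < (int)sizeof word - 1) {
                    word[wlen++] = tolower(c);
                    continue;
                }
                word[wlen] = '\0';
                if (wlen >= MINLEN && !is_stopword(word)) {
                    int dup = 0, k;
                    for (k = 0; k < nkeys; k++)
                        if (strcmp(word, keys[k]) == 0)
                            dup = 1;
                    if (!dup)
                        keys[nkeys++] = strdup(word);
                }
                wlen = 0;
            }

            /* One keyword per line; rnews (or the archiver) would
             * store these along with the article's path. */
            for (i = 0; i < nkeys; i++)
                printf("%s\n", keys[i]);
            return 0;
        }

    Even something this simple-minded costs cycles once it runs on every
    incoming article.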

    What this means is we need to create, maintain, and update a 3 megabyte
    keyword database using a history-file-like format (each keyword in
    each article gets a line in a file of the form '<keyword> <path>').
    That is about as compact as you will get, but lookup is SLOW (linear
    searching) and record deletion is essentially impossible without
    regenerating the database (anyone seen how much time expire already
    takes? *grin*).
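
    For concreteness, a few lines of such an index might look like this
    (the second article number is made up, purely for illustration):

        vax        unix-wizards/10043
        megabyte   unix-wizards/10043
        fortran    unix-wizards/10127

    Finding every article on vaxen means reading the whole file; expiring
    unix-wizards/10043 means rewriting it.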

    What you end up doing is setting up a database of databases keyed on
    keywords, which helps lookup speed but not the deletion problem.
    Perhaps a set of dbm files, with a single dbm file for each keyword,
    would work, but you run into the problem that the standard dbm only
    understands one file. mdbm is a solution, if it can be made generally
    available. The other problem with this is that the more you do to make
    the data flexible, the more you are going to make the data spread out,
    taking up more disk. If you go to a full-fledged dbm scheme, you'll be
    lucky to store your three megabytes in less than 6 or 7. As far as I
    can tell, to get a system that allows fast keyword lookup, reasonably
    fast deletion for expire, and overhead that rnews can absorb, you are
    looking at something like an added 10-15% of CPU overhead in rnews for
    keyword generation and database setup, about 5% in expire, and a 50%
    increase in disk storage. I simply can't afford that right now.
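
    To make the 'standard dbm' problems concrete, here is roughly what
    stuffing 'keyword -> article paths' into the classic dbm(3) routines
    looks like (the database name, the article path, and the
    newline-packed value format are just for illustration, not anything
    out of existing news code):

        #include <stdio.h>
        #include <string.h>
        #include <dbm.h>   /* datum, dbminit(), fetch(), store(); on some
                            * systems you have to declare these yourself */

        int
        main()
        {
            datum key, val, old;
            char buf[1024];

            if (dbminit("kwindex") < 0) {  /* kwindex.dir / kwindex.pag */
                fprintf(stderr, "cannot open keyword index\n");
                return 1;
            }

            key.dptr = "vax";
            key.dsize = strlen(key.dptr);

            /* Append the new article path to whatever list is already
             * stored under this keyword. */
            buf[0] = '\0';
            old = fetch(key);
            if (old.dptr != NULL && old.dsize < (int)sizeof buf - 64) {
                memcpy(buf, old.dptr, old.dsize);
                buf[old.dsize] = '\0';
                strcat(buf, "\n");
            }
            strcat(buf, "unix-wizards/10043");  /* hypothetical article */

            val.dptr = buf;
            val.dsize = strlen(buf);
            if (store(key, val) != 0)
                fprintf(stderr, "store failed-- probably block overflow\n");
            return 0;
        }

    Only one database can be open per process, and a key plus its value
    has to fit in a single dbm block, so a popular keyword overflows
    almost immediately-- which is why a dbm-per-keyword (or mdbm) scheme
    starts looking attractive, and why the disk usage spreads out.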

    An added problem is simply that keywords are highly dependent upon the
    submitter. What is f77 to one person is fortran to another, ftn to a
    third, fortrash to a fourth, and obsolete to a fifth. This makes
    finding something in a keyword database a bit of a game-- you look for
    an obvious word, which gives you hints towards other words (maybe),
    and so on. I found myself running four or five iterations to find
    things, and either getting a lot of unrelated stuff (except
    incidentally) or not getting what I was looking for. Example: a while
    back there was a discussion in unix-wizards about Vaxen greater than 8
    Megabytes, and the software changes needed to make it work. Look up
    'megabyte'. No luck. Look up '8'. You'll be there for months. Look up
    'meg', or 'mbyte', or 'vax' or 'vaxen' or 'memory'-- I KNOW that those
    articles are in there somewhere, but I never DID find them. So the
    ultimate usefulness of a keyword system lies in having a set of
    keywords that is both defined and actually used. Trying to find things
    in the prototypes I hacked together simply proved to me that we don't
    have an agreement on the words we use for various things (you should
    see the number of words used to describe the C compiler, or f77, for
    example...). It is a lot like trying to find a store in the yellow
    pages when it isn't listed under what you would expect it to be
    (racquetball? Hmm... how about sporting goods? retail stores? health
    food?).
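
    An alias table of the sort Hokey suggests would attack exactly this
    problem. A rough sketch of the idea (the table contents are only
    illustrative) is nothing more than a map from the variants people
    actually type to one canonical keyword, applied both when keywords
    are generated and when a lookup is done:

        #include <stdio.h>
        #include <string.h>

        struct alias {
            char *variant;
            char *canonical;
        };

        static struct alias aliases[] = {
            { "f77",      "fortran"  },
            { "ftn",      "fortran"  },
            { "fortrash", "fortran"  },
            { "vaxen",    "vax"      },
            { "meg",      "megabyte" },
            { "mbyte",    "megabyte" },
            { NULL,       NULL       }
        };

        /* Map a submitted keyword onto its canonical form before it is
         * stored in, or looked up against, the keyword database. */
        char *
        canonical_keyword(char *word)
        {
            struct alias *a;

            for (a = aliases; a->variant != NULL; a++)
                if (strcmp(word, a->variant) == 0)
                    return a->canonical;
            return word;    /* no alias known; use the word as given */
        }

        int
        main()
        {
            printf("%s\n", canonical_keyword("ftn"));  /* "fortran" */
            return 0;
        }

    The catch, of course, is that somebody has to build and maintain that
    table, and it only helps when the submitter's word is in it.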

    I still have hopes I can do something for the archiver that doesn't
    require a dedicated 750 and an Eagle to do it with. I don't know--
    comments are welcome. Keywords sound like a nice thing to do-- a lot
    like mice, pop-up menus, user-friendly software, distributed systems,
    and other buzzwords. In the news environment, I can't find what I
    would consider a workable system-- perhaps we need to rethink news
    completely, but I'm not up to that one. It certainly looks to me like
    we can't just add it onto the side of what we have.

chuq


-- 
From behind the eight ball:                       Chuq Von Rospach
{cbosgd,fortune,hplabs,ihnp4,seismo}!nsc!chuqui   nsc!chuqui@decwrl.ARPA

We'll be recording at the Paradise Friday night. Live, on the Death label.