[news.misc] keyword-based news

brian@ucsd.EDU (Brian Kantor) (09/07/88)

In an attempt to characterize the size of the problem posed by a
keyword-based news system, I looked at a week's worth of news,
omitting the sources groups.

In 6 days of articles, there were over 20,000 articles on file, containing
100,000+ unique strings, of which only 17,000 were to be found in the 4.3BSD
spelling dictionary of 25,000+ words.  The average article-id is 22
characters long.  (All numbers are rounded.)
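
For anyone wanting to repeat the count on their own spool, a minimal
sketch in Python follows.  It is not the program that produced the
numbers above: the rule for what counts as a "string" (a run of letters,
folded to lower case) and the name count_strings are assumptions made
for illustration only.

```python
import re

# Assumption: a "string" is a run of ASCII letters, folded to lower case.
# The post does not say how strings were actually delimited.
WORD_RE = re.compile(r"[A-Za-z]+")

def count_strings(articles, dictionary_words):
    """Return (unique strings seen, how many of those are in the dictionary)."""
    dictionary = {w.lower() for w in dictionary_words}
    seen = set()
    for text in articles:
        seen.update(w.lower() for w in WORD_RE.findall(text))
    return len(seen), len(seen & dictionary)

if __name__ == "__main__":
    # Toy inputs; a real run would read the spool and the system word list.
    sample_articles = ["The quick brown fox", "the lazy dog, foo-bar xyzzy"]
    sample_dictionary = ["the", "quick", "brown", "fox", "lazy", "dog"]
    print(count_strings(sample_articles, sample_dictionary))
```

The gap between the two counts is the interesting part: it is the pile of
names, typos, and nonsense that a filter would have to deal with.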

Thus the keyword index is going to be a rather large database, even with
nonsense and trivial words filtered out.

A quick back-of-the-envelope calculation says that we could fill up one
or two WORM platters a year holding the index and articles.
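
The arithmetic behind an estimate of this kind can be sketched as below.
Only the 20,000 articles, the 6 days, and the 22-character article-id
come from the figures above; the average article size and the number of
index postings per article are not given in the post, so they are left
as parameters the reader must supply.  The function name and the cost
model (one article-id stored per posting, no compression) are
assumptions.

```python
def yearly_storage_bytes(articles_per_period, period_days,
                         avg_article_bytes, postings_per_article,
                         article_id_bytes=22):
    """Rough bytes per year for article text plus an uncompressed index.

    Assumption: each index posting costs one article-id (22 characters
    on average); the keyword strings themselves are ignored as small
    by comparison.
    """
    articles_per_year = articles_per_period * 365 / period_days
    text_bytes = articles_per_year * avg_article_bytes
    index_bytes = articles_per_year * postings_per_article * article_id_bytes
    return text_bytes + index_bytes

if __name__ == "__main__":
    # The last two arguments are placeholders, not measured values.
    total = yearly_storage_bytes(20000, 6,
                                 avg_article_bytes=2000,
                                 postings_per_article=100)
    print(f"{total / 1e9:.1f} GB per year under these assumptions")
```

Dividing the result by the capacity of whatever platter you have in mind
gives platters per year.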

Yeah, I know, lots of ways to compress, filter out headers, signatures,
etc. etc. etc.  None of those will do more than cut the problem in half.
For a while.  Maybe I can inspire some student....

And that's as far as I'm going to go with it for now.  Just thought
you'd like to see some numbers.

	Brian Kantor	UCSD Office of Academic Computing
			Academic Network Operations Group  
			UCSD B-028, La Jolla, CA 92093 USA
			brian@ucsd.edu ucsd!brian BRIAN@UCSD

brad@looking.UUCP (Brad Templeton) (09/10/88)

When I created my concept of keyword-based news, I never meant an
arbitrary keyword search as you might find with on-line databases.
(Indeed, doing that does not involve changing the structure of news at all;
you could write that sort of search program for today's articles.)

Perhaps "keyword" news was a bad choice of name.
What I really meant was highly classified news, and "keyword" seemed as
good a name as any for the new classifications.

The KNEWS concept (can't call it C news!) involves switching from around
500 categories (newsgroups) to several thousand, arranged in an
easy-to-understand hierarchy.  This gets combined with a nice
pattern-matching language to select articles based on categories and
other parameters.
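
To give the flavor of selecting by category pattern plus other
parameters, here is a small sketch in Python.  It is not the KNEWS
pattern language, which is not specified here; it borrows shell-style
wildcards from the standard fnmatch module, and the category names, the
article fields, and the function select are all made up for
illustration.

```python
from fnmatch import fnmatchcase

def select(articles, include, exclude=(), max_lines=None):
    """Keep articles whose category matches some include pattern,
    matches no exclude pattern, and is not longer than max_lines."""
    chosen = []
    for art in articles:
        cat = art["category"]
        if not any(fnmatchcase(cat, p) for p in include):
            continue
        if any(fnmatchcase(cat, p) for p in exclude):
            continue
        if max_lines is not None and art["lines"] > max_lines:
            continue
        chosen.append(art)
    return chosen

if __name__ == "__main__":
    # Hypothetical fine-grained categories in a dotted hierarchy.
    articles = [
        {"id": "<1@a>", "category": "comp.lang.c.bugs", "lines": 40},
        {"id": "<2@a>", "category": "comp.lang.c.flames", "lines": 30},
        {"id": "<3@a>", "category": "rec.food.recipes", "lines": 20},
    ]
    for art in select(articles, ["comp.lang.*"], exclude=["*.flames"]):
        print(art["id"], art["category"])
```

Note that the "*" of fnmatch also matches dots, so "comp.lang.*" covers
everything below comp.lang at any depth.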

I still would love to see it happen, but who has the time?  So many newsgroups
are just too big to read these days.
-- 
Brad Templeton, Looking Glass Software Ltd.  --  Waterloo, Ontario 519/884-7473

brian@ucsd.EDU (Brian Kantor) (09/10/88)

Brad:

Good heavens, I wasn't criticising your KNEWS proposal - for the
excellent reason that I never heard (saw?) it.  It does seem sound, but
somewhat impractical, given the difficulties in changing the installed
base of existing news systems.

Maybe on the "new usenet" (gag, choke)?

No, those numbers and the resulting article came out of some blue-sky
dreaming several of us here at UCSD were doing one day, and since it
took our otherwise-unoccupied 3B15 nearly half a day to generate them,
I thought I'd post them in case no one else had had the time to play.

Seriously, the model we intended was along the lines of MEDLINE and
other on-line databases --- something we could add to the >existing<
Usenet news system to make nifty retrieval possible.  It actually
may be used in some modified form by our library people, who are
asking about some way to do simple yet broad classification of news
(of the AP, UPI, and Reuters variety, not necessarily Usenet).

Simply storing the articles and building the keyword index is a
relatively trivial matter, except for the sheer quantity of data
involved.  It is the ability to perform set operations on retrieved
indices that is interesting, and it may make a nice one- or two-quarter
undergraduate project for some independent study student here.
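
As a rough illustration of what set operations on retrieved indices
could look like, here is a minimal sketch in Python.  It is not a design
for the project: the in-memory dictionary, the word-splitting rule, and
the names build_index and lookup are assumptions, and a real index at
this volume would have to live on disk.

```python
import re
from collections import defaultdict

def build_index(articles):
    """Map each lower-cased word to the set of article-ids containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(article_id)
    return index

def lookup(index, word):
    """Return a copy of the article-id set for word (empty if unseen)."""
    return set(index.get(word.lower(), ()))

if __name__ == "__main__":
    # Made-up article-ids and text.
    articles = {
        "<1@ucsd>": "WORM platters for the news index",
        "<2@ucsd>": "keyword index for news",
        "<3@ucsd>": "WORM drive prices",
    }
    idx = build_index(articles)
    print(sorted(lookup(idx, "worm") & lookup(idx, "index")))  # AND
    print(sorted(lookup(idx, "worm") | lookup(idx, "index")))  # OR
    print(sorted(lookup(idx, "index") - lookup(idx, "worm")))  # AND NOT
```

Each query term yields a set of article-ids, and AND, OR, and AND NOT
become intersection, union, and difference on those sets.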

I'll let you know if anything ever comes of it.

	Brian Kantor	UCSD Postmaster & Chief News Weenie
		UCSD Office of Academic Computing
		Academic Network Operations Group  
		UCSD B-028, La Jolla, CA 92093 USA
		brian@ucsd.edu	BRIAN@UCSD ucsd!brian