[net.news] keyword-based news

lauren@vortex.UUCP (Lauren Weinstein) (09/30/85)

For quite a few years, I've been using a very elaborate keyword-based
system for searching a large newswire story database.  This database
is in a centralized location so there is no concern about COSTS associated
with extra matches, unlike the Usenet situation.

One thing I learned long ago thanks to this system--it is almost
IMPOSSIBLE to avoid major missed matches AND extra matches.  If you
try to make your keyword choices very specific and negate out topics
of no interest, you frequently (*VERY* frequently) find that you're missing
great numbers of stories that you really DID want to see, but where
a particular keyword you specified wasn't used.  Or you find that *MANY*
stories you wanted to filter OUT still get through since the keywords
you wanted to SKIP weren't used.  There are so many similar ways to specify
keywords, and there are so many personal choices involved, that getting
the proper match between the person choosing the article keywords and the
person trying to find (or ignore) particular stories is very difficult.

In a keyword-based news system, with users attempting to choose
their own keywords (and probably spelling them wrong part of the time,
or leaving typos in them, let's face it!) getting CORRECT matches without
getting lots of ERRONEOUS matches would be a nightmare.

Let's say I wanted to see all stories that discussed TELEPHONES.
But what if a story about AT&T was only keyworded with "PHONES"
or "COMMUNICATIONS"?  Well, you of course never see those stories.
The same sort of problems can occur in the reverse direction when
you're trying to avoid certain stories.  It is VERY hard to create
flexible keyword-based systems that avoid these problems.  The
issues involved with parts-of-speech and word usage alone are
very significant.  Even the advanced systems won't match on PHONE
when you want TELEPHONE... there are infinite similar examples.
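The TELEPHONE/PHONES problem can be made concrete with a small sketch (a modern illustration, not part of any system discussed here; the function and data are hypothetical). Exact-match keyword lookup has no notion of synonyms, so the AT&T story is silently missed:

```python
# Hypothetical sketch: exact keyword matching misses near-synonyms.
# The poster's tags and the reader's query are independent free choices,
# so "TELEPHONE" never matches an article tagged only "PHONES".

def matches(article_keywords, wanted):
    """Return True if any wanted keyword appears verbatim among the tags."""
    tags = {k.lower() for k in article_keywords}
    return any(w.lower() in tags for w in wanted)

att_story = ["PHONES", "COMMUNICATIONS"]  # how the poster tagged it
print(matches(att_story, ["TELEPHONE"]))  # False: the story is missed
print(matches(att_story, ["PHONES"]))     # True only with the exact word
```

The same mechanism run in reverse (excluding keywords) produces the mirror-image failure: stories slip through because the tag you wanted to skip was never attached.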

Even if you're willing to sit for five minutes trying to figure out all
the "correct" keywords for an article when you submit it, you still frequently
make personal choices that are not going to match another person's 
view of that same article.  Two people will tend to keyword any given
article in different ways.  This means that matching is a serious problem.

Before people jump on the keyword bandwagon, I STRONGLY suggest that
some time be spent looking at the numerous problems with existing
keyword-based systems, such as DIALOG.  I've used that service quite
a bit, and it is very, very frustrating to wade through lots of junk
you didn't want, and miss items you did want, due to keyword
"mismatch" problems of various sorts.  For netnews sites trying to cut
back on the phone bills by only sending, for example, technical
items, the volume of erroneously matched stories could be massive.
The odds are that about half the stories that would be sent would 
be "incorrect" and that about half of the stories you WANTED to send
wouldn't get sent.

There is a lot of existing research in keyword systems that the proponents
of keyword-based news seem to be ignoring.  My own opinion is that
in our distributed environment, with volumes of material and costs
going up steadily (and many sites faced with cutting back on both,
one way or another) keyword-based systems might make our current
mess look like a paradise by comparison.

--Lauren--

chuqui@nsc.UUCP (Chuq Von Rospach) (10/02/85)

In article <820@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>For quite a few years, I've been using a very elaborate keyword-based
>system for searching a large newswire story database.  This database
>is in a centralized location so there is no concern about COSTS associated
>with extra matches, unlike the Usenet situation.
>
>One thing I learned long ago thanks to this system--it is almost
>IMPOSSIBLE to avoid major missed matches AND extra matches.  If you
>try to make your keyword choices very specific and negate out topics
>of no interest, you frequently (*VERY* frequently) find that you're missing
>great numbers of stories that you really DID want to see, but where
>a particular keyword you specified wasn't used.  Or you find that *MANY*
>stories you wanted to filter OUT still get through since the keywords
>you wanted to SKIP weren't used.

Lauren has a point, but if this system is like all of the other newswire
searching systems I've seen it has limited applicability to a keyword based
news system. The problem is that doing keyword searches on a general
database IS going to bring forward lots of silly matches because the words
just happen to be used in otherwise unrelated articles. What I'm planning
on doing for NNTN, though, is to have the author attach the appropriate
keywords to the article. Rather than simply grepping text for the words,
you look at only the keywords the author thinks are important. Even if the
author is completely incompetent with this keyword selection this should
keep the accidental matches down to a minimum. You can't ignore the
problem, but you also have to realize that Lauren's example is to a great
extent an apples-and-oranges comparison to USENET and its problems. 
-- 
:From under the bar at Callahan's:   Chuq Von Rospach 
nsc!chuqui@decwrl.ARPA               {decwrl,hplabs,ihnp4,pyramid}!nsc!chuqui

If you can't talk below a bellow, you can't talk...

lauren@vortex.UUCP (Lauren Weinstein) (10/04/85)

First of all, Chuqui, I noticed that you ignored the second part of
my argument, where I pointed out how limited or poor keyword choices
result in many MISSED articles.  Ya see, that's the problem with 
keyword systems.  Put in too many keywords, or "inappropriate" ones, and
you get all sorts of mismatches.  Put in too few, or (once again)
"inappropriate," ones and you miss most of the articles you really wanted
to see.  And both these points apply both to the person choosing the 
keywords to go with the article AND to the person searching for articles
of interest.  In other words, there are four different modalities
of screwup in such systems, plus combinations, of course.

Ya' want something with greater applicability to netnews?  OK, try
DIALOG or any of the other large commercial database systems where
keywords are assigned on a carefully organized basis, and are kept
fairly limited to (supposedly) *try* to avoid many mismatches.  They still
are horribly mucked up.  It takes a great deal of real skill to choose
correct keywords (either when posting an article or searching for one).
And even with skill and practice, you end up with piles of junk
AND missing items of real interest.

Many of the commercial services have people who do nothing all day
but read articles and spend a great deal of time assigning keywords
that will hopefully maximize correct matches.  You know what?  You
STILL get floods of useless matches (you'd be amazed) and you still
miss 80% of the stuff you really wanted.  And that's with pros spending
lots of time choosing appropriate words, not some frazzled netnews
user trying to dash something off in a hurry.

I'm sorry Chuqui, but I've used lots of keyword systems (both commercial
and non-commercial in all sorts of different applications) and I consider
them to be a real mess.  Even very elaborate, sophisticated systems 
are a royal pain to use.  And most of these systems don't have the
additional consideration of trying to decide what material they
can afford to pass on to other sites, and of avoiding mushrooming
of discussions into all sorts of sidetracks that can massively
increase costs.  In other words, keyword systems tend not to work
very well even in centralized environments where costs are not
a significant factor.  In our distributed environment, keywords
cannot replace newsgroups without causing an immense amount of
waste, hassle, and increased costs.  Depending on keywords also brings
forth the problems of unwanted and missed articles discussed above.
In a time when many sites are faced with either limiting traffic or 
dropping off the net entirely, keyword systems, apart from the hassles they 
cause the users ("pros" as well as casual users) could make attempts at
thoughtful traffic limitations impossible, and the result could
be the loss of hub sites and many other sites as the traffic continues
to grow.

--Lauren--

mjl@ritcv.UUCP (Mike Lutz) (10/05/85)

In article <820@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>For quite a few years, I've been using a very elaborate keyword-based
>system for searching a large newswire story database ...
>
>One thing I learned long ago thanks to this system--it is almost
>IMPOSSIBLE to avoid major missed matches AND extra matches.

Lauren, as usual, is right on the money.  This problem is known to the
Information Retrieval folks as precision vs. recall.  Precision is the
fraction of retrieved items that are relevant; recall is the fraction
of relevant articles retrieved.  As Lauren has noted, you generally
have to trade one off against the other.  And this ignores entirely the
subjective nature of 'relevancy.'
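The precision/recall definitions above reduce to two fractions over sets. A minimal illustration (the function and example data are mine, not from any IR system):

```python
# Illustrative only: precision and recall exactly as defined above.
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 6 articles retrieved, 3 of them relevant, out of 10 relevant overall:
p, r = precision_recall(range(6), [0, 1, 2, 10, 11, 12, 13, 14, 15, 16])
print(p, r)  # 0.5 precision, 0.3 recall
```

Broadening the query raises recall at the expense of precision, and vice versa, which is the tradeoff Lauren keeps running into.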

There's been a lot of work in this area (see, for example, the evolving
SMART system from Salton's group at Cornell).  However, the incremental
CPU time needed to make even small gains in both precision and recall
can be staggering.  When combined with the volume of database updates
represented by a day's worth of news, I don't see how the use of
keywords at the transmission level is practical.
-- 
Mike Lutz	Rochester Institute of Technology, Rochester NY
UUCP:		{allegra,seismo}!rochester!ritcv!mjl
CSNET:		mjl%rit@csnet-relay.ARPA

chuqui@nsc.UUCP (Chuq Von Rospach) (10/05/85)

In article <825@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>First of all, Chuqui, I noticed that you ignored the second part of
>my argument, where I pointed out how limited or poor keyword choices
>result in many MISSED articles.

Actually, I didn't ignore it, I simply didn't respond to it because I
felt that adding more needless postings to the issue would get us about as
far as the discussions about domains have gotten in net.mail -- nowhere.

If I spend my time refuting every article on the net on a point by point
basis, all I'd ever do is refute articles on a point by point basis. Where
I think a clarification is necessary, I'll clarify. Where I think someone
is off base, I'll try to explain the concepts better. When a discussion
turns religious, I'll bow out because I'd rather finish the design and get
it implemented and SEE if it works. We can argue forever on the subject
(since you seem to be against it and I seem to be for it) without
convincing each other or anyone else, and the end result is a lot of words,
a few bruised egos, and absolutely no results. I'd rather at least TRY the
silly thing and if it doesn't work get rid of it again. I don't ignore
stuff, Lauren, I just don't always respond to things because I don't want
to waste the net's time and volume arguing aimlessly or reiterating things
for the 400th time...

>Put in too many keywords, or "inappropriate" ones, and
>you get all sorts of mismatches.  Put in too few, or (once again)
>"inappropriate," ones and you miss most of the articles you really wanted
>to see.  And both these points apply both to the person choosing the 
>keywords to go with the article AND to the person searching for articles
>of interest.  In other words, there are four different modalities
>of screwup in such systems, plus combinations, of course.

Those modalities exist in the current news because of the way newsgroups
are set up (newsgroups are just a funny flavor of keyword with very little
flexibility, hooks into the database layer (sigh) and a distribution
tacked on). If I don't solve that problem (and I very well may not) but
simply carry them along into a new interface that gives you a new
flexibility and some added power (and/or is a bit easier to use and/or a
bit harder to misuse) what have we lost? If I can solve SOME of netnews
problems, I will be more than happy. If I can solve ALL of them, I'll run
for president or something. I think I have a chance to make things better.
I'm not even trying to make things perfect...

>I'm sorry Chuqui, but I've used lots of keyword systems (both commercial
>and non-commercial in all sorts of different applications) and I consider
>them to be a real mess.  Even very elaborate, sophisticated systems 
>are a royal pain to use.  And most of these systems don't have the
>additional consideration of trying to decide what material they
>can afford to pass on to other sites, and of avoiding mushrooming
>of discussions into all sorts of sidetracks that can massively
>increase costs.  In other words, keyword systems tend not to work
>very well even in centralized environments where costs are not
>a significant factor.  In our distributed environment, keywords
>cannot replace newsgroups without causing an immense amount of
>waste, hassle, and increased costs.

If you could prove your assertions, why don't we all just unplug our modems
and go home? If things really ARE that bad, Lauren, why haven't you just
split? Rather than simply doomsaying, why can't we recognize the problems
(something you seem quite good at) and try to make things better. We may
well fail horribly, but doing nothing guarantees failure, and doing
something may help the odds out.

If you spill some milk on the floor, how do you react? Do you slit your
throat for being a clumsy oaf, or do you go find your cat? Slitting your
throat keeps you from spilling it again, but it creates all sorts of
complications. If you find the cat, the spill can be cleaned up AND you can
keep your cat happy. I'd rather go off looking for my cat, thank you --
that knife looks terribly sharp...
-- 
:From under the bar at Callahan's:   Chuq Von Rospach 
nsc!chuqui@decwrl.ARPA               {decwrl,hplabs,ihnp4,pyramid}!nsc!chuqui

If you can't talk below a bellow, you can't talk...

lauren@vortex.UUCP (Lauren Weinstein) (10/08/85)

I almost couldn't stop laughing here when I saw Chuqui's article
claiming "he didn't want to waste the network's time with point-by-point
refutations of arguments."  Uh huh.  He has always seemed perfectly
happy to post long point-by-point refutation articles when he thought
he had some valid way to argue.  But hey, if he has nothing to say
on some of these points (whatever the reason), that's OK by me.

---

My arguments against keyword-based netnews are not new.  I've pointed
out these same issues whenever this topic has appeared (and it has
appeared in the past, and has been argued about in netnews, long
before Chuqui appeared on the scene).  I consider keyword-based netnews,
if implemented in the manner so far described, to be ANOTHER PROBLEM,
not a solution.  My opinions regarding the way to solve our
existing problems are pretty well known, and I think that most
of the network knows how I personally am proceeding.  I feel that
moderated groups and alternate services (e.g. Stargate) show the most
promise for providing useful services in the medium to long-run.
I feel that "solutions" that do not include some form of moderation
are stopgaps that will have no long-term usefulness as the network
continues to grow.

Keyword systems do nothing to help control our traffic and cost
problems.  In fact, they make it MORE likely that discussions will
mushroom off in bizarre directions (no pun intended) and make it HARDER
for individual sites to control what they wish to send or receive.
Keyword systems don't even necessarily help find items of interest,
due to the four error modalities already discussed in
previous messages.  Now, if someone wants to implement keywords
as an ADDITION to newsgroups (not to REPLACE newsgroups)
the problems are less serious--but the four error
modes still exist and will present major problems, especially
with everyone attempting to choose their own keywords OR automatic
systems trying to do so.  It takes pros to pick and manage keyword
systems, and this implies a level of centralization we simply don't
have and can't have in the Usenet environment.  And even with
pros, most of the same basic problems with keywords exist.  Perhaps 
in Stargate some centrally organized keywords will be possible--but even
in such a situation I would tend to discourage too much reliance
on keyword systems.

--Lauren--

preece@ccvaxa.UUCP (10/10/85)

> /* Written  9:16 am  Oct  5, 1985 by mjl@ritcv.UUCP in ccvaxa:net.news
> */ However, the incremental CPU time needed to make even small gains in
> both precision and recall can be staggering.  When combined with the
> volume of database updates represented by a day's worth of news, I
> don't see how the use of keywords at the transmission level is
> practical.
----------
Well, you can visualize the problem more easily if you recognize
that the present newsgroup system is isomorphic to a keyword system,
with the name of the newsgroup being the keyword.  We get lots of
false hits because people cross-post without thinking about it
and we get lots of missed matches because people post things in the
wrong groups.  It's not clear to me that changing to an explicitly
keyword based system is going to have much effect on the human
failings that cause the problems.

Everything in life is a tradeoff.  Using keywords allows the author
to specify more closely (by selecting more keywords) what the posting is
about, but also makes the posting appear to be relevant to more topics:
the difference is whether you consider the keywords to be ORed together
or ANDed together.  The best results come from using small numbers
of keywords assigned from a large, carefully thought out vocabulary
with hierarchical relationships among index terms, but that kind of
vocabulary is hard to learn, easy to mis-use, and not well attuned to
change in usage over time (ask a librarian about de-superimposition).
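The OR-versus-AND distinction in the paragraph above fits in a few lines (an illustrative sketch; the tag and query sets are invented):

```python
# Sketch of the tradeoff: the same tagged article under OR vs AND matching.
def match_or(tags, query):
    """Article is considered relevant if it shares ANY query term."""
    return bool(set(tags) & set(query))

def match_and(tags, query):
    """Article is considered relevant only if it carries ALL query terms."""
    return set(query) <= set(tags)

tags = ["unix", "news", "keywords", "usenet"]
query = ["keywords", "databases"]
print(match_or(tags, query))   # True: one shared term is enough
print(match_and(tags, query))  # False: "databases" is missing
```

More keywords per article push OR matching toward recall (and false hits) and AND matching toward precision (and misses), which is the tradeoff being described.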

There are also search systems based on associations between articles
or between articles and queries.  These offer a lot of promise,
especially when (as on Usenet) the full text of the items is available.
The user says "Article X is what I'm really interested in" and the
system finds all the documents that are "like" article X, or the
user provides a natural language description of the topic of interest
and the system finds all the articles that are similar to that
description.  Similarity is usually based on similar vocabulary,
weighted so that some words count more than others, or other common
factors.  Citation patterns are very strong tools for similarity, too.
Articles that share many citations or that are often cited in the
same place are very likely to be similar in content (don't bother
sending counter-examples, all of this is biased by the law of
large numbers).
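The weighted-vocabulary similarity described above can be roughly sketched as cosine similarity over word counts, with common words down-weighted (the weights and sample texts are invented; real systems use proper TF-IDF weighting and much larger vocabularies):

```python
import math
from collections import Counter

# Toy sketch of "find articles like article X": cosine similarity over
# word counts, with common words down-weighted by a crude hand-set IDF.
def vector(text, idf):
    counts = Counter(text.lower().split())
    return {w: c * idf.get(w, 1.0) for w, c in counts.items()}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

idf = {"the": 0.1, "a": 0.1, "of": 0.1}  # assumed stop-word weights
x = vector("the cost of netnews phone bills", idf)
y = vector("phone bills dominate the cost", idf)
z = vector("recipes for a chocolate cake", idf)
print(cosine(x, y) > cosine(x, z))  # the phone-bill article ranks closer
```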

It's not clear that these mechanisms, developed for searching a
retrospective collection of documents, are applicable to running
a newsletter, which is a better model of Usenet.  The user might
have specific discussions (tied together by citation) that were
assumed to be of continuing interest (any new items citing any
of a set of existing items would be displayed), but the basic
mechanism for viewing new material would have to depend on a set
of profiles of interest, which is probably too specific.  The
reader might very well want to read generally in the area of
feminism, with no more specific topic, and that kind of connection
is hard to make by association unless the author has specifically
tagged the item with a descriptor that can be placed in a
hierarchical subject space, which brings us right back to the
original problem -- that keyword might as well be the name of
a newsgroup.

Oh, one more problem with keyword based approaches: speed.  The
big database systems depend on inverted indexes: given a word, you
can get a list of all the items containing that word.  Maintaining
that kind of index for a database whose contents change daily would
be very expensive; doing without an inverted index would slow
the user interface to unusability.  The present system, of course,
has an inverted index: the list of newsgroups.
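The inverted index described above maps each word to the set of items containing it; the expense is in keeping it current as articles arrive. A toy version (illustrative only; article texts are invented):

```python
from collections import defaultdict

# Minimal inverted index of the kind described: word -> article ids.
# Lookup is one dictionary access; the cost is paid at update time,
# which is exactly the problem for a database that changes daily.
index = defaultdict(set)

def add_article(article_id, text):
    for word in set(text.lower().split()):
        index[word].add(article_id)

add_article(1, "keyword systems and precision")
add_article(2, "newsgroup naming conventions")
add_article(3, "keyword indexes are expensive")

print(sorted(index["keyword"]))  # [1, 3] -- one lookup, no full scan
```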

-- 
scott preece
gould/csd - urbana
ihnp4!uiucdcs!ccvaxa!preece

sewilco@mecc.UUCP (Scot E. Wilcoxon) (10/11/85)

Keywords are the wrong term.  What's really wanted is selection
based on the concepts or ideas discussed in an article, as well
as a few tags (adjectives? e.g. announcement, question, response, etc.).

The English language does not have standard names for all concepts,
and it is full of synonyms.  These are two reasons for the fuzziness of
keyword searches.  Most people would use "doorknob" instead of
"device which opens door", but English does not force either.
Even for this little message I could not use
	Keywords: keyword concept news thesaurus
and had to resort to
	Keywords: keyword concept idea news thesaurus
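One stand-in for the concept index is query expansion through a synonym table. A sketch (the table entries are invented examples; building and maintaining such a table is exactly the hard part under discussion):

```python
# Hypothetical sketch: widening a keyword query via a synonym table,
# a manual stand-in for a shared index of concepts.
SYNONYMS = {
    "telephone": {"telephone", "phone", "phones", "communications"},
    "keyword": {"keyword", "concept", "idea", "descriptor"},
}

def expand(query_terms):
    """Replace each query term with its synonym set (or itself)."""
    expanded = set()
    for term in query_terms:
        expanded |= SYNONYMS.get(term.lower(), {term.lower()})
    return expanded

print(sorted(expand(["telephone"])))
# Every table entry is a human judgment call -- the fuzziness doesn't
# disappear, it just moves into the table.
```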

In an AI system, concepts would be nodes in a knowledge base, but 
I haven't heard of knowledge bases which are as wide-ranging as
the net is.

What really is necessary is a widely available index of concepts
which could be used by posters and moderators as a standard.
Roget's Thesaurus has one possibility for abstract concepts,
but falls short on the many objects which we discuss.  Anyone
seen anything better?  Or start the monthly mod.keyword.list?
-- 

Scot E. Wilcoxon	Minn. Ed. Comp. Corp.      circadia!mecc!sewilco
45 03 N / 93 15 W	(612)481-3507 {ihnp4,uwvax}!dicomed!mecc!sewilco

chuqui@nsc.UUCP (Chuq Von Rospach) (10/13/85)

In article <834@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>I almost couldn't stop laughing here when I saw Chuqui's article
>claiming "he didn't want to waste the network's time with point-by-point
>refutations of arguments."  Uh huh.  He has always seemed perfectly
>happy to post long point-by-point refutation articles when he thought
>he had some valid way to argue.  But hey, if he has nothing to say
>on some of these points (whatever the reason), that's OK by me.

My hope on going public with my NNTN project was to try to get some
reasonable feedback. As seems to be typical of most of the network, and of
Lauren in particular, all I've gotten are rather childish attempts at
minimizing any attempt to do something positive for this beast we
laughingly call a network. I'm now, thanks largely to the continuing
attempt by Lauren to ridicule anything I say, quite sorry I ever opened my
mouth to try to make this place better. If I continue on NNTN, and that is
by no means sure, it'll be silently. If I have results I feel are
promising, I'll bring them forward, but I'm no longer going to sit here and
argue with braying idiots who think that facts are superfluous to their
religious beliefs. I'll let my results, success or failure, speak for
me in the future. I'm frankly tired of pretending to respect people who
don't deserve respect, or wasting my time trying to hold a discussion with
throwers of rotten tomatoes. If I want advice, I'll go ask a tree stump --
I think I'll get a LOT more information out of it than I get out of the
idiots... 

Well, I'm now VERY sorry I made any public comment at all, because all
it has done is give Lauren another topic to practice his petty
character assassinations on me. Given five or six more years, he might
even get them right. Lauren, I don't think it is any secret that we
don't like each other. I'm rather proud to admit publicly that I don't
like you. I also believe that technical discussions should be dealt
with on a technical level, something you evidently haven't figured out.
So, as my final posting on the subject, let me drop something down to
your level and see if it sinks in.....

			    *PPHHHHHHHHHTTTTTTTT*

(I've been wanting to do that for over a year.... Ahhhhhh.......)

Now that I've gotten my obligatory childish behavior out of the way, I'd
also like to point out that when Lauren DOES decide to act mature, he has
the technical knowledge to really help pull things together (stargate
proves that point). I only wish he'd use his brain more often.

A final boon to my erstwhile loyal opposition: Since I'm leaving National
and the front line of the network, I'm also going to drop this discussion
since it isn't going anywhere. Now that I've offended Lauren's manhood, he
can post a followup ranting and raving at me, defend his ego, and since he
will have gotten in the last word, believe that he won the argument. The
least I can do for everything Lauren has done for me over the years....

Enjoy yourself, Lauren. You won't have to worry about arguing with facts
for a while -- most people seem to enjoy raw emotional outbursts...

Anyway, a few final points.... hopefully not in the massive copious detail
Lauren seems to dislike (my apologies for that, but facts tend to take up
space, and I like to have a basis in fact for my comments)

>My arguments against keyword-based netnews are not new.  I've pointed
>out these same issues whenever this topic has appeared (and it has
>appeared in the past, and has been argued about in netnews, long
>before Chuqui appeared on the scene).

Well, there is an implied "Chuqui's a young whippersnapper, listen to me
because I've been solving these problems since he was in diapers" comment in
there. Well, I could make a snide comment about old and senile hackers, but
that doesn't contribute to the situation... I'll even agree that his
arguments aren't new. Old arguments aren't necessarily right, they're just
old....

>I consider keyword-based netnews,
>if implemented in the manner so far described, to be ANOTHER PROBLEM,
>not a solution.  My opinions regarding the way to solve our
>existing problems are pretty well known, and I think that most
>of the network knows how I personally am proceeding.  I feel that
>moderated groups and alternate services (e.g. Stargate) show the most
>promise for providing useful services in the medium to long-run.

Well, I have supported Stargate from day one, because I felt that it was
something whose usefulness could only be proven by implementation and test.
That was despite the fact that I personally (and until now, silently) feel
that Stargate is dead wrong, and that taking it into a full and
useful production scheme the size of our current network is impossible.
Even though I'm completely against Stargate, I believe that we can learn
from our failures (or be pleased by our successes) and that we should carry
forward and see what happens. Unfortunately, I seem to be unique in these
thoughts, because the attitude of others is to stomp out anything that they
don't personally agree with. Even if NNTN is a complete blowoff, we could
learn something in the implementation that makes whatever comes next better, but
Lauren and others seem to feel that if it isn't obviously perfect (by their
frames of reference) in the first design discussions, it should simply be
tossed away... 

>I feel that "solutions" that do not include some form of moderation
>are stopgaps that will have no long-term usefulness as the network
>continues to grow.

I feel that "solutions" that DO include moderation may work on some nets,
but won't work on USENET. You want a different net, fine, but I want the 
decision to read or not read an article in the hands of the reader. I
CERTAINLY wouldn't want a newsgroup moderated by Lauren, if only because he
and I disagree on everything and therefore the stuff I'd consider
interesting wouldn't get in.... What we need to do is build a system that
makes it easier to screen messages and less likely to mispost messages.
The only person I really trust as moderator is me, and I wouldn't trust me
as moderator for anyone else. The one thing I think we NEED to save if we
want USENET to survive is the spontaneity, and stargate and other moderated
schemes lose that. Without the ability to post without supervision, you no
longer have USENET, and I think you lose a lot of the appeal and wonder
that makes this net as much fun as it is. (wow, Lauren and I disagree
again. Surprise, surprise). 

Despite that, I've been happy to try to move stargate forward and kept my
disagreements to myself -- let a project stand or fall on its own merits...
Altruistic? hell, no. Just interested in getting a better net, and treating
people right. Ideas well ahead of their times, it seems.

>Keyword systems do nothing to help control our traffic and cost
>problems.  In fact, they make it MORE likely that discussions will
>mushroom off in bizarre directions (no pun intended) and make it HARDER
>for individual sites to control what they wish to send or receive.

Who says? NNTN isn't nearly at the detail of design to say one way or
another. Why can't we simply set things up so that specific keywords don't
get forwarded? (things like flame, bizarre, jokes, lauren, whatever...)
That is actually BETTER than the current situation because you exclude
anything posted to a specific keyword. Now, if something is cross posted to
an excluded group and a group you do get, you get the posting, which means
that turning off a group doesn't do as well as you would hope -- especially
with heavily cross posted groups like net.flame.
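The difference described above can be sketched in a few lines (a hypothetical illustration, not NNTN's actual design): keyword-style exclusion drops an article if any excluded keyword is attached, whereas today's newsgroup filtering forwards a cross-posted article if any one carried group matches:

```python
# Illustrative only; the keyword list and group names are examples.
EXCLUDED = {"flame", "bizarre", "jokes"}

def forward_keyword_style(keywords):
    """Drop the article if ANY of its keywords is on the exclusion list."""
    return not (set(k.lower() for k in keywords) & EXCLUDED)

def forward_newsgroup_style(groups, subscribed):
    """Current behavior: a cross-posted article gets through if any
    one of its groups is carried by the site."""
    return bool(set(groups) & subscribed)

print(forward_keyword_style(["flame", "unix"]))          # False: excluded
print(forward_newsgroup_style(["net.flame", "net.unix"],
                              {"net.unix"}))             # True: it leaks
```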

>Keyword systems don't even necessarily help find items of interest,
>due to the four error modalities already discussed in
>previous messages.

I don't see that this is better or worse with keywords than it is with the
current situation. It is just as difficult to track down the appropriate
place for things with the current group structure as it is with keywords.
Keywords may well not be better than newsgroups, but they also aren't
worse, and they give you a level of flexibility that is difficult to design
into our current structure.

Well, have at me, Lauren. I've been looking forward to tossing a posting at
you like the ones you've been lobbing at me for a long time. I don't
expect it to be very productive, but I certainly feel better for it...
Hope you don't mind a little of your own medicine...

Tired of trying to act like a gentleman in a room full of pigs, call me:

	Ishmael
-- 
:From the caverns of the Crystal Cave:  Chuq Von Rospach 
Currently: nsc!chuqui@decwrl.ARPA       {decwrl,hplabs,ihnp4,pyramid}!nsc!chuqui
Soon to be:				..!sun!<somethingorother>

Our time is past -- it is a time for men, not magic. Come, let us leave
this world to the usurpers and rest our weary bones....

preece@ccvaxa.UUCP (10/14/85)

> /* Written  4:58 pm  Oct 10, 1985 by sewilco@mecc.UUCP in
> ccvaxa:net.news */ What really is necessary is a widely available index
> of concepts which could be used by posters and moderators as a
> standard.  Roget's Thesaurus has one possibility for abstract concepts,
> but falls short on the many objects which we discuss.  Anyone seen
> anything better?  Or start the monthly mod.keyword.list?
----------
Roget's Thesaurus is, obviously, inadequate for technical topics, but
it's in the right direction.  The two major bibliographic databases in
this area are Engineering Index and Inspec's Computer and Control
Abstracts.  Both use several indexing schemes, including controlled
vocabulary and natural language indexing applied by indexers.
The ACM has a more limited CS classification scheme used for
Computing Reviews.  Indexing using a large controlled vocabulary is
not a task for a beginner.  It takes a LONG time to learn how
the vocabulary works and how to determine the most appropriate
indexing for a particular document.

-- 
scott preece
gould/csd - urbana
ihnp4!uiucdcs!ccvaxa!preece

lauren@vortex.UUCP (Lauren Weinstein) (10/14/85)

The temptation to ignore Chuqui's last outburst is considerable.
While I've certainly not hidden my feelings that keyword-based news
schemes are not appropriate for Usenet, I can't recall ever saying
that Chuqui or others shouldn't work on it if they so desire.  Nor
can I recall calling him or others "idiots," "pigs," or other
similar terms that he found it necessary to use in what he himself called
his "obligatory childish behavior."  

Upon reflection, I suspect that the biggest problem is that many
persons are simply not familiar with the work already done, and the
problems already encountered, in the areas of query/response and
keyword systems.  It isn't
as if it's a new invention.  Such systems have existed for quite a long
time, and a considerable body of published work exists that clearly
points out the positive and negative aspects of such systems.  One previous
poster on this topic mentioned some of the formal terms of these
systems--I've steered clear of the formal terminology since I figured
most people wouldn't be too interested, but perhaps some formal
discussion of the theory and practice of these systems is in order at
some point.

In any case, let's briefly address a few issues from Chuqui's message:

> My hope on going public with my NNTN project was to try to get some
> reasonable feedback. As seems to be typical of most of the network, and of
> Lauren in particular, all I've gotten are rather childish attempts at
> minimizing any attempt to do something positive for this beast we
> laughingly call a network.

How many people remember some of the incredible pressure I was under
when I first proposed Stargate?  It makes the sorts of messages we've
seen here on the topic of keywords seem like nothing by comparison.
I never saw any message that said, "People who work on keyword systems
are idiots."  What I did see were messages (some of them written by me)
that said, "Keyword systems (as described) won't work in the Usenet
environment and may create far more problems than they would solve."
Personal opinion to be sure.  Not based on a desire to see Usenet die,
but rather based on the desire to avoid seeing new problems created.

> Well, there is an implied "Chuqui's a young whippersnapper, listen to me
> because I've been solving these problems since he was in diapers" comment in
> there. Well, I could make a snide comment about old and senile hackers, but
> that doesn't contribute to the situation... I'll even agree that his
> arguments aren't new. Old arguments aren't neccessarily right, they're just
> old....

Then again, old arguments might be right, too.  Ignoring history is
often a serious mistake.  When topics have been discussed in the past,
and when a body of technical work concerning the topic of interest
already exists, a great deal of time can be wasted if a person chooses
to simply ignore all that has come previously.  Whether this is done
on purpose or through naivete doesn't much matter--the result is 
usually the same.  An important issue revolves around how much time
is spent "re-inventing" the wheel, only to come up against the same
old problems, in such situations.  Another issue concerns whether
or not well-intentioned efforts that might have a short-term benefit
create additional long-term problems.

> Stargate...

Stargate is designed to be a medium and long-term alternative for
collecting and transmitting information.  It isn't meant to provide
exactly the same sorts of "services" we get now from Usenet.   
As an aside, the project is going quite well, and I hope to have
some significant announcements regarding service organization and
availability in the fairly near future.  I hope to have more
hardware available soon to allow more sites to receive Stargate
transmissions--mass production of the decoders is already underway,
and the prototype "buffer box" is under construction now.
At the same time, various non-technical discussions relating to the
evolution of the project from an experiment to a service continue.
More in net.news.stargate as developments warrant.

> I feel that "solutions" that DO include moderation may work on some nets,
> but won't work on USENET. You want a different net, fine, but I want the 
> decision to read or not read an article in the hands of the reader. I
> CERTAINLY wouldn't want a newsgroup moderated by Lauren, if only because he
> and I disagree on everything and therefore the stuff I'd consider
> interesting wouldn't get in.... What we need to do is build a system that
> makes it easier to screen messages and less likely to mispost messages.

This is a fine short-term concept.  And keywords used in ADDITION to
newsgroups might be of use in that area.  My primary objection to keyword
systems appears when people want to REPLACE newsgroups with keywords.
In essence, newsgroups provide something that has been found to be critical
in real-world keyword systems--keyword list control.  That is, a newsgroup
is, in essence, a base keyword that all users are required to choose from
which provides a conceptual "anchor" for the message.  If they want to
add additional keywords also (as some people do now) that's OK... people
with the appropriate software may choose to use or ignore those
keywords (of course, they should keep in mind the keyword error modalities
we've discussed previously when making such a decision).

But without some sort of "forced keyword selection" (which is what
newsgroups really are) we're faced with a serious problem.  When trying
to decide how to spend their time and money, sites are put at the mercy
of the keyword choices of individual users.  Variation of
keywords has been shown to be one of the biggest problems with
uncontrolled keyword-based systems.  Not only does it make it difficult
to find articles of interest, but it makes controlling what you DON'T
want to see very difficult.  There are just TOO MANY KEYWORD POSSIBILITIES
in an uncontrolled system, even assuming that all users choose keywords
conscientiously, accurately, and in detail.
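The effect Lauren describes can be sketched in a few lines.  This is a
toy illustration only, not anything proposed for news software; the
articles, keyword choices, and the synonym table are all invented:

```python
# Toy sketch: why uncontrolled keyword choice fragments matching.
# Three posters tag articles on the same topic with different words.
articles = {
    1: {"telephone"},
    2: {"phones"},
    3: {"communications"},
}

def match(query_keywords, articles):
    """Return ids of articles sharing at least one keyword with the query."""
    return {aid for aid, kws in articles.items() if kws & query_keywords}

# Uncontrolled: a reader searching on "telephone" misses two of three.
print(match({"telephone"}, articles))      # finds only article 1

# Controlled: map every variant onto one canonical term before indexing --
# the role a newsgroup name plays as a "forced" base keyword.
canonical = {"telephone": "telephony", "phones": "telephony",
             "communications": "telephony"}
normalized = {aid: {canonical.get(k, k) for k in kws}
              for aid, kws in articles.items()}
print(match({"telephony"}, normalized))    # finds all three articles
```

The synonym table is exactly the "keyword list control" that a central
authority provides; without it, every poster's word choice silently
fragments the index.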

Enough on this point.  I'll just add that virtually all "successful"
keyword-based systems have a central authority that controls keyword
use.  Often this authority actually chooses the keywords, or else
"corrects" poor user keyword choices before letting articles enter the
database.  Often official keyword lists are also published or made otherwise
available so that users will see what sorts of words will appear and 
thusly allow intelligent usage of keywords on the part of both posters
and readers of articles.  Keyword control is CRITICAL.  I can't emphasize 
this enough.  I don't see any way to make this work in a distributed 
environment such as Usenet.  Newsgroups are the underlying structure that
holds the current net together to the extent that it is now.

> The only person I really trust as moderator is me, and I wouldn't trust me
> as moderator for anyone else...

This would be OK, if there were no costs associated with traffic.  Let's
take an example.  Let's say we had this "perfect" AI program that could
precisely and accurately filter all our incoming traffic and show us ONLY
what we TRULY wanted to see.  It's never fooled--it never makes mistakes.
EVERYONE is running it (so we don't have to worry about the poor slobs
running older software who have no way to do automatic filtering!)
Would this solve Usenet's problems?  Of course not.  The problem is that
getting the traffic to the AI program costs money, time, and other
(hardware) resources (disk, dialups, CPU cycles, etc.)  What happens
when we have 100,000 sites on the net?  Can't happen?  Well,
probably not, for much the same reason that we're unlikely to reach
a population of 100 billion on this planet--everything will collapse
long before then.  But traffic continues to grow, and the percentage of
useful traffic, by virtually ANY reasonable definition, will continue
to decline.  When ANYBODY, ANYWHERE on the net, can post a message to
EVERYONE, we start to look into the face of problems that will
be nearly exponential in nature.  To take an old example:  Someone asks
the network what "foo" means.  What do we do when a few thousand
people respond?  Or more?  Even given the fancy AI program that can
show us only the "meaningful" responses--we've still paid to send all
those answers (99.9% of which will impart no new useful 
information) throughout the world.  As the net grows, this sort of behavior
will simply become impossible to support, from either a time or a
resource standpoint.

---

The course that Usenet is taking is becoming very clear.  It isn't 
even necessarily a BAD course--but it's in keeping with evolution.
What we're going to see is increased fragmentation.  The recent
announcement by utzoo regarding newsgroup cutoffs is an example
of such fragmentation in action.  As the volume of materials 
continues to grow more sites will be forced to make hard decisions
about what they can afford, under various criteria, to support.

To the extent that some sites feel wealthy enough to continue taking
full traffic in an ever-expanding network with 10's of 1000's of sites,
they will be free to do so.  After all, any site can arrange to call
any other site and pass whatever articles they wish.  Other sites
may wish to try alternatives (e.g. Stargate) which will offer what
will hopefully be a far more cost effective and lower noise information
flow.  The model of rapid-turnaround "information conduits," with
users submitting items for "publication through the conduits," is the
one I like to use for Stargate.  To the extent that people like or dislike
the way these conduits are managed the service will evolve and change.
Persons with the resources to support all or part of the free-for-all
on Usenet can do so also, of course.  Participating in Stargate doesn't
require giving up everything else.

It will always be up to the individual sites to make
these decisions.  Usenet won't just DIE.  But its nature will gradually
continue to change as more sites join the fray, and as the volume
of postings continues to increase.  The noise level WILL continue
to rise in an unmoderated environment, and traffic will 
continue to grow rapidly to the extent that backbone cutoffs do not occur.
Article filtering techniques may have some short term benefit--but only
if they do not make it MORE difficult for sites and users to 
accurately control their traffic, costs, and time.  Traffic growth,
however, will prevent such techniques from being a long-term solution to 
what are really systemic problems in Usenet itself--problems that have
appeared as netnews grew far beyond the size envisioned by those who 
created it.

--Lauren--

hes@ecsvax.UUCP (Henry Schaffer) (10/16/85)

> Upon reflection, I suspect that the biggest problem is that many
> persons are simply not familiar with the work already done, and the
> problems already encountered, in the areas of query/response and
> keyword systems.  It isn't
> as if it's a new invention.  Such systems have existed for quite a long
> time, and a considerable body of published work exists that clearly
> points out the positive and negative aspects of such systems.
> 
> ...  I'll just add that virtually all "successful"
> keyword-based systems have a central authority that controls keyword
> use.  Often this authority actually chooses the keywords, or else
> "corrects" poor user keyword choices before letting articles enter the
> database.  Often official keyword lists are also published or made otherwise
> available so that users will see what sorts of words will appear and 
> thusly allow intelligent usage of keywords on the part of both posters
> and readers of articles.

  Let me suggest something that each person can experience.  Go to your
university or company library and talk to the reference (or other
appropriate) librarian about doing a computer based search of some
bibliographic database in a field in which you have an interest.  Choose
a topic with which you are familiar so you will be able to estimate
what percentage of the material you get is undesired, and what percent
of the literature which is relevant doesn't show up in answer to your
request.
  You'll probably have to sit down with pages of rules and keywords,
and with help, and with several iterations you may very well get what
you know is a good answer.  (To be reasonable about this, you have to
choose a *topic* that has some coherence but is not too specific.  It
is easy to say "Give me all articles about unix", or "give me all the
articles authored by xyz since 1980".  I'm thinking more of the type of
topic on which you would actually be doing research - e.g., comparative
performance of different datacom network architectures.)
  What keywords would one use for that?  "data communication" would
include much of what you want, and a *ton* of other things.  How about
anything with "queue" or "queueing" as a keyword?  Whoops, look at all
that telephony and industrial engineering stuff.  Let's add star, ring,
..., whoops again - let's qualify them with "data communication", whoops,
we lost many articles which were so obviously on data communications,
but were labeled "computer communications", or whatever.
  Last time I did a search, I was given a *book* of keywords, and was
told that authors had been *required* to describe their research 
using those (and only those.)  Therefore, I could only request from
that book, along with operators to allow generalizations like other
endings.
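The mechanics Henry describes -- requests restricted to an authorized
term list, plus an operator to cover other word endings -- can be
sketched briefly.  Everything here (the vocabulary, the documents, and
the use of a trailing "*" as the truncation operator) is invented for
illustration, not taken from any real search service:

```python
# Minimal sketch of a controlled-vocabulary search with truncation.
vocabulary = {"queueing theory", "data communication",
              "computer communication", "ring network"}

docs = {
    "A": {"queueing theory", "telephony"},
    "B": {"data communication", "ring network"},
    "C": {"computer communication"},
}

def expand(term):
    """Expand a (possibly truncated) request against the vocabulary."""
    if term.endswith("*"):                      # truncation operator
        stem = term[:-1]
        hits = {v for v in vocabulary if v.startswith(stem)}
    else:
        hits = {term} if term in vocabulary else set()
    if not hits:
        raise ValueError("request matches nothing in the vocabulary")
    return hits

def search(term):
    wanted = expand(term)
    return sorted(d for d, kws in docs.items() if kws & wanted)

print(search("data communication"))   # ["B"] -- note doc C is missed,
                                      # since it was indexed under the
                                      # sibling term "computer communication"
print(search("queu*"))                # ["A"] via truncation
```

Even with the controlled list, the "data communication" request misses
document C -- exactly the whoops Henry walks through above.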
  After you have gone through a few non-trivial searches, then let's
continue this discussion.
--henry schaffer
Disclaimer: I really don't want to get into the ad hominem aspects of
this discussion, and refuse to comment directly on what view(s) I
personally favor.

tim@k.cs.cmu.edu.ARPA (Tim Maroney) (10/17/85)

Aw, poor Chuqui.  Someone criticized his grandiose proposal for large-scale
revision of USENET.  Chuqui would never mindlessly flame people who proposed
new features, would he?  No, of course he wouldn't.  After all, he's the
good guy.
-=-
Tim Maroney, CMU Center for Art and Technology
Tim.Maroney@k.cs.cmu.edu	uucp: {seismo,decwrl,etc.}!k.cs.cmu.edu!tim
CompuServe:	74176,1360	My name is Jones.  I'm one of the Jones boys.

henry@utzoo.UUCP (Henry Spencer) (10/19/85)

A friend of mine who recently attended a talk by Mike Lesk came back
with an interesting tidbit of information that is of some relevance to
this discussion.  Apparently Lesk has done some work on keyword choice,
and has made an interesting discovery:  of the various methods he tested
for automatic keyword selection, distinctly the most effective was to
use the first 30 words of the main text and *ignore* any author-supplied
keywords.  This is thirdhand info, and I don't know what other methods
he tried, but it's a nice comment on the ability of authors to pick good
keywords without careful guidance.
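The selection rule reported above (thirdhand, so the details may
differ from what Lesk actually did) is mechanical enough to sketch.
The sample article and the simplified header handling below are
invented for illustration:

```python
# Rough sketch of "take the first 30 words of the body and *ignore*
# any author-supplied keywords."  Header parsing is deliberately naive.
import re

def auto_keywords(article_text, n=30):
    # Split headers from body at the first blank line; everything in
    # the header block, including any "Keywords:" line, is discarded.
    header, _, body = article_text.partition("\n\n")
    words = re.findall(r"[A-Za-z']+", body.lower())
    return words[:n]

article = """From: someone@example.UUCP
Keywords: misleading, vague

Apparently Lesk has done some work on keyword choice and found that
the opening words of an article describe it better than its author's
own keyword list does."""

print(auto_keywords(article, 10))
```

Note that the author's "misleading, vague" keywords never enter the
result at all; only the opening words of the body do.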
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

lkk@teddy.UUCP (10/21/85)

In article <6061@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> Apparently Lesk has done some work on keyword choice,
>and has made an interesting discovery:  of the various methods he tested
>for automatic keyword selection, distinctly the most effective was to
>use the first 30 words of the main text and *ignore* any author-supplied
>keywords.


This is certainly true with mail.  Our system here informs you of
new incoming mail by printing the first two or three lines on your
screen.  It's almost always possible to tell exactly what the
message is about from that information.

-- 

Sport Death,
Larry Kolodney
(USENET) ...decvax!genrad!teddy!lkk
(INTERNET) lkk@mit-mc.arpa

Life is either a daring adventure,
or nothing.
- Helen Keller

lauren@vortex.UUCP (Lauren Weinstein) (10/22/85)

I think I'd agree that just using the words from the first part of
a message (if we could find and ignore all the included text in different 
forms from older articles, the cute opening lines and line eater bug lines)
might be better than people trying to pick their own keywords.  
(Actually, skipping the previously included text might not work, since
often people add comments on the end of such text that would have no
meaning without the included text.  This means that you're stuck
trying to keyword both the included text (again!) and the "new" text.)

But even if you did the above and did a fairly good job of it,
it's still not good enough.  It's only "better" since
letting people (in an uncontrolled keyword environment) pick their
own keywords is SOOOO bad.  

We (humans) can tell what the meaning of a message is (much of the time)
quite quickly because we do considerable analysis of the text while
we're reading!  We automatically ignore the "extraneous" words in a manner
that would be difficult for even a sophisticated program to accomplish.

And the big problems still remain.  People's random word choices
when they write their text result in massive keyword list expansion.
Without centralized keyword control, making good keyword search choices 
remains exceedingly difficult (even with such control, it's still very 
difficult).  Also, analysis of word forms, plurals, usage, etc.
still must be considered and are non-trivial problems.  All four of the
keyword error modalities still exist, as do the related control and
coordination problems.

Also, we must not forget the percentage of messages that will horribly
fail the "beginning of message" meaning test and inject all sorts
of "noise" into the keyword systems.

No, it just doesn't fly.  What I think was really being said was
that people's choices for keywords are SO BAD that EVEN taking words from 
the first part of text is better.  But that doesn't mean that taking those
words solves the fundamental problems with keyword systems (which
we've hashed over quite extensively in this group as of late!)

The only way to make a keyword system work at all, even "moderately"
well, in any environment, is to have a centrally controlled and organized
keyword base, with keywords being carefully selected and organized by
people who have the time and inclination to do such work.
I don't see this happening in the Usenet environment, for technical,
logistical, and also "sociological" reasons.

Also, the above doesn't even start to address the problems that any keyword 
system (that might be designed to replace newsgroups) would cause for 
traffic control at sites that needed to limit certain
kinds of traffic.  Nor does it address the fact that even WITH carefully
controlled and "professionally" chosen keywords such systems take
a great deal of time and practice to use even minimally well.

I'd like to second the idea that someone else already posted.  If you
think you like the idea of keyword systems but are personally
unfamiliar with the way REAL keyword systems work--go to a library
and try out their search services.  Keep in mind that they operate with
a VERY carefully controlled keyword base--and be sure to pick a topic
where you'll know how many articles you're MISSING during your searches,
and how many UNDESIRED ones you're getting as well.

It may be an interesting experience for you. 

--Lauren--