lauren@vortex.UUCP (Lauren Weinstein) (03/08/85)
While I personally am not in favor of keyword-based netnews, I might
point out that Chuqui's calculations are based on a somewhat erroneous
keyword model. The "right" way of designing keyword-based systems is not
necessarily to store keywords for each article item, but rather to
have a list of keywords and store the item numbers that correspond to
each keyword.
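The inverted-index design Lauren describes can be sketched in a few lines. This is a minimal illustration (the article texts and the stop-word list are invented for the example, not taken from NS): every non-"common" word maps to the set of article numbers that contain it.

```python
# Words excluded from the tables, per the "common words" idea above
# (this particular list is just an example).
COMMON_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def build_index(articles):
    """articles: dict of article_number -> text.
    Returns a dict of keyword -> set of article numbers."""
    index = {}
    for number, text in articles.items():
        for word in text.lower().split():
            word = word.strip(".,;:!?\"'()")
            if word and word not in COMMON_WORDS:
                index.setdefault(word, set()).add(number)
    return index

# Hypothetical sample data:
articles = {
    101: "Love and handcuffs in the news",
    102: "Sex scandal story",
    103: "A story about chuqui",
}
index = build_index(articles)
# index["story"] is now {102, 103}
```

Storage per article is just its number in each relevant posting list, which is why this layout is cheaper than storing a keyword list with every article.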
For example, Stanford's newswire scanning program ("NS") makes
EVERY word in EVERY newswire item (except for a list of "common"
words that are automatically excluded from the tables) a keyword.
You then pick out individual stories with boolean keyword
expressions. A randomly selected example:
		((love+sex)*handcuff)-chuqui
This expression would find all stories that mention
the words "love" OR "sex" that also mention the word "handcuff".
Also, it will exclude all stories that fit these criteria but
that also include the word "chuqui".
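Over an inverted index, these boolean operators fall out naturally as set operations: "+" is union (OR), "*" is intersection (AND), and "-" is difference (NOT). A sketch of the handcuff example, using an invented toy index rather than anything from NS:

```python
# Toy inverted index: keyword -> set of article numbers (made up).
keyword_index = {
    "love":     {1, 2, 5},
    "sex":      {2, 3},
    "handcuff": {2, 5, 7},
    "chuqui":   {5},
}

def hits(word):
    # Articles containing the word; empty set if the keyword is unknown.
    return keyword_index.get(word, set())

# ((love+sex)*handcuff)-chuqui
query_result = ((hits("love") | hits("sex")) & hits("handcuff")) - hits("chuqui")
# (love OR sex) = {1,2,3,5}; AND handcuff = {2,5}; NOT chuqui = {2}
```

Each operator touches only the posting lists it names, so query cost depends on list sizes rather than on the total number of stories.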
The software automatically tries to handle plurals and
special suffixes. There are some problems with this keyword
technique, admittedly. You can't currently specify that two
words should be next to each other in a story. And you still
tend to get lots of erroneous keyword matches that aren't
what you are looking for due to the strange places that some words
tend to pop up in stories. Still, it is pretty useful, *if* you
are good at picking the keywords to put into the search
expressions. This is something of an art, however, and is not
easily mastered. If you do it wrong, you can miss many
interesting stories.
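The plural-and-suffix handling mentioned above can be approximated with simple suffix stripping. This is a very rough sketch with invented rules, far cruder than whatever NS actually does, just to show the idea of folding word variants onto one keyword before indexing:

```python
def normalize(word):
    """Fold a word onto a base keyword by stripping common suffixes.
    (Invented rules for illustration; a real stemmer needs many
    special cases.)"""
    word = word.lower()
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            if suffix == "ies":
                return word[:-3] + "y"   # stories -> story
            return word[:-len(suffix)]
    return word

# normalize("handcuffs") -> "handcuff"
# normalize("stories")   -> "story"
```

Rules this crude will sometimes merge unrelated words or miss irregular plurals, which is one more source of the erroneous matches described above.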
Of course, this is a pretty big program and the database is still
non-trivial, to say the least. But frankly, I don't think that
systems based on users' selecting their own keywords will be
useful in our environment. The technique above is an alternative,
but probably not practical for smaller machines. So, I currently
feel that keyword-based news is not really the way to go.
--Lauren--

chuqui@nsc.UUCP (Chuq Von Rospach) (03/09/85)
In article <591@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>While I personally am not in favor of keyword-based netnews, I might
>point out that Chuqui's calculations are based on a somewhat erroneous
>keyword model. The "right" way of designing keyword-based systems is not
>necessarily to store keywords for each article item, but rather to
>have a list of keywords and store the item numbers that correspond to
>each keyword.
Agreed-- the experiments I discussed were just one implementation I
tried-- I also looked at using keyword->article_id lookups, and they
have similar problems in different ways. Some of the tradeoffs are
better; some, to me, aren't. Overall, I am still not sure that there is
a good keyword system for the amount of data we have with the number of
keywords we SHOULD keep to make the system really useful.

I'm especially worried about disk space and processor overhead-- two
things a lot of news systems already have in short supply. Even if we
can get the disk usage down to a 25% increase (my results showed about
a 50% increase with my preliminary designs), you're still talking about
3-5 megabytes of keyword database, and that would be a significant
problem for some sites. Generating and maintaining that data would also
be a significant processor load for many sites (not all of us have
Vaxen). Perhaps these problems can be worked around, and I'm still
looking at the situation, but I don't see any easy answers.

>Still, it is pretty useful, *if* you
>are good at picking the keywords to put into the search
>expressions. This is something of an art, however, and is not
>easily mastered. If you do it wrong, you can miss many
>interesting stories.
This is my other worry-- I don't want to see us moving in directions
that make usenet LESS useful. I want to see usenet made better and more
effective. Somehow. I think we all agree with that hope.
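As a back-of-envelope check on the figures above: a 25% overhead yielding a 3-5 megabyte keyword database implies a news spool somewhere in the 12-20 megabyte range. The spool size below is an assumption inferred from those numbers, not a quoted figure:

```python
spool_mb = 16            # assumed typical news spool size, in megabytes
overhead = 0.25          # the hoped-for 25% increase
index_mb = spool_mb * overhead
# index_mb is 4.0 -- squarely inside the 3-5 MB estimate;
# at the observed 50% overhead it would be 8 MB.
```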
chuq
--
Chuq Von Rospach, National Semiconductor
{cbosgd,fortune,hplabs,ihnp4,seismo}!nsc!chuqui   nsc!chuqui@decwrl.ARPA

Be seeing you!