lauren@vortex.UUCP (Lauren Weinstein) (09/30/85)
For quite a few years, I've been using a very elaborate keyword-based system for searching a large newswire story database. This database is in a centralized location, so there is no concern about COSTS associated with extra matches, unlike the Usenet situation.

One thing I learned long ago thanks to this system--it is almost IMPOSSIBLE to avoid major missed matches AND extra matches. If you try to make your keyword choices very specific and negate out topics of no interest, you frequently (*VERY* frequently) find that you're missing great numbers of stories that you really DID want to see, but where a particular keyword you specified wasn't used. Or you find that *MANY* stories you wanted to filter OUT still get through since the keywords you wanted to SKIP weren't used.

There are so many similar ways to specify keywords, and there are so many personal choices involved, that getting the proper match between the person choosing the article keywords and the person trying to find (or ignore) particular stories is very difficult. In a keyword-based news system, with users attempting to choose their own keywords (and probably spelling them wrong part of the time, or leaving typos in them, let's face it!) getting CORRECT matches without getting lots of ERRONEOUS matches would be a nightmare.

Let's say I wanted to see all stories that discussed TELEPHONES. But what if a story about AT&T was only keyworded with "PHONES" or "COMMUNICATIONS"? Well, you of course never see those stories. The same sort of problems can occur in the reverse direction when you're trying to avoid certain stories.

It is VERY hard to create flexible keyword-based systems that avoid these problems. The issues involved with parts-of-speech and word usage alone are very significant. Even the advanced systems won't match on PHONE when you want TELEPHONE... there are infinite similar examples.
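[Editor's note: Lauren's TELEPHONE example can be sketched in a few lines of Python. The story titles and keyword sets here are hypothetical illustrations, not data from any real newswire system; the point is only that exact keyword matching silently drops synonym-tagged stories.]

```python
# A minimal sketch (hypothetical data) of Lauren's TELEPHONE example:
# exact keyword matching misses stories tagged with synonyms.

stories = {
    "AT&T divestiture ruling":  {"PHONES", "COMMUNICATIONS"},
    "New long-distance rates":  {"TELEPHONES", "RATES"},
    "Modem standards update":   {"MODEMS", "COMMUNICATIONS"},
}

def select(stories, wanted):
    """Return titles whose author-chosen keywords intersect `wanted`."""
    return [title for title, kws in stories.items() if kws & wanted]

hits = select(stories, {"TELEPHONES"})
# The AT&T story is missed: it was keyworded PHONES, not TELEPHONES.
print(hits)
```

The reverse failure (extra matches) is the same mechanism run the other way: a kill-list of unwanted keywords only fires when the author happened to pick the exact word you listed.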
Even if you're willing to sit for five minutes trying to figure out all the "correct" keywords for an article when you submit it, you still frequently make personal choices that are not going to match another person's view of that same article. Two people will tend to keyword any given article in different ways. This means that matching is a serious problem.

Before people jump on the keyword bandwagon, I STRONGLY suggest that some time be spent looking at the numerous problems with existing keyword-based systems, such as DIALOG. I've used that service quite a bit, and it is very, very frustrating to wade through lots of junk you didn't want, and miss items you did want, due to keyword "mismatch" problems of various sorts.

For netnews sites trying to cut back on the phone bills by only sending, for example, technical items, the volume of erroneously matched stories could be massive. The odds are that about half the stories that would be sent would be "incorrect" and that about half of the stories you WANTED to send wouldn't get sent.

There is a lot of existing research in keyword systems that the proponents of keyword-based news seem to be ignoring. My own opinion is that in our distributed environment, with volumes of material and costs going up steadily (and many sites faced with cutting back on both, one way or another), keyword-based systems might make our current mess look like a paradise by comparison.

--Lauren--
chuqui@nsc.UUCP (Chuq Von Rospach) (10/02/85)
In article <820@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>For quite a few years, I've been using a very elaborate keyword-based
>system for searching a large newswire story database. This database
>is in a centralized location so there is no concern about COSTS associated
>with extra matches, unlike the Usenet situation.
>
>One thing I learned long ago thanks to this system--it is almost
>IMPOSSIBLE to avoid major missed matches AND extra matches. If you
>try to make your keyword choices very specific and negate out topics
>of no interest, you frequently (*VERY* frequently) find that you're missing
>great numbers of stories that you really DID want to see, but where
>a particular keyword you specified wasn't used. Or you find that *MANY*
>stories you wanted to filter OUT still get through since the keywords
>you wanted to SKIP weren't used.

Lauren has a point, but if this system is like all of the other newswire searching systems I've seen, it has limited applicability to a keyword-based news system. The problem is that doing keyword searches on a general database IS going to bring forward lots of silly matches, because the words just happen to be used in otherwise unrelated articles.

What I'm planning on doing for NNTN, though, is to have the author attach the appropriate keywords to the article. Rather than simply grepping text for the words, you look at only the keywords the author thinks are important. Even if the author is completely incompetent at this keyword selection, this should keep the accidental matches down to a minimum.

You can't ignore the problem, but you also have to realize that Lauren's example is to a great extent an apples-and-oranges comparison to USENET and its problems.
-- 
:From under the bar at Callahan's:   Chuq Von Rospach
nsc!chuqui@decwrl.ARPA  {decwrl,hplabs,ihnp4,pyramid}!nsc!chuqui

If you can't talk below a bellow, you can't talk...
lauren@vortex.UUCP (Lauren Weinstein) (10/04/85)
First of all, Chuqui, I noticed that you ignored the second part of my argument, where I pointed out how limited or poor keyword choices result in many MISSED articles. Ya see, that's the problem with keyword systems. Put in too many keywords, or "inappropriate" ones, and you get all sorts of mismatches. Put in too few, or (once again) "inappropriate" ones, and you miss most of the articles you really wanted to see. And both these points apply both to the person choosing the keywords to go with the article AND to the person searching for articles of interest. In other words, there are four different modalities of screwup in such systems, plus combinations, of course.

Ya want something with greater applicability to netnews? OK, try DIALOG or any of the other large commercial database systems where keywords are assigned on a carefully organized basis, and are kept fairly limited to (supposedly) *try* to avoid many mismatches. They still are horribly mucked up. It takes a great deal of real skill to choose correct keywords (either when posting an article or searching for one). And even with skill and practice, you end up with piles of junk AND missing items of real interest.

Many of the commercial services have people who do nothing all day but read articles and spend a great deal of time assigning keywords that will hopefully maximize correct matches. You know what? You STILL get floods of useless matches (you'd be amazed) and you still miss 80% of the stuff you really wanted. And that's with pros spending lots of time choosing appropriate words, not some frazzled netnews user trying to dash something off in a hurry.

I'm sorry, Chuqui, but I've used lots of keyword systems (both commercial and non-commercial, in all sorts of different applications) and I consider them to be a real mess. Even very elaborate, sophisticated systems are a royal pain to use.
And most of these systems don't have the additional consideration of trying to decide what material they can afford to pass on to other sites, and of avoiding mushrooming of discussions into all sorts of sidetracks that can massively increase costs. In other words, keyword systems tend not to work very well even in centralized environments where costs are not a significant factor. In our distributed environment, keywords cannot replace newsgroups without causing an immense amount of waste, hassle, and increased costs. Depending on keywords also brings forth the problems of unwanted and missed articles discussed above.

In a time when many sites are faced with either limiting traffic or dropping off the net entirely, keyword systems, apart from the hassles they cause the users ("pros" as well as casual users), could make attempts at thoughtful traffic limitation impossible, and the result could be the loss of hub sites and many other sites as the traffic continues to grow.

--Lauren--
mjl@ritcv.UUCP (Mike Lutz) (10/05/85)
In article <820@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>For quite a few years, I've been using a very elaborate keyword-based
>system for searching a large newswire story database ...
>
>One thing I learned long ago thanks to this system--it is almost
>IMPOSSIBLE to avoid major missed matches AND extra matches.

Lauren, as usual, is right on the money. This problem is known to the Information Retrieval folks as precision vs. recall. Precision is the fraction of retrieved items that are relevant; recall is the fraction of relevant articles retrieved. As Lauren has noted, you generally have to trade one off against the other. And this ignores entirely the subjective nature of 'relevancy.'

There's been a lot of work in this area (see, for example, the evolving SMART system from Salton's group at Cornell). However, the incremental CPU time needed to make even small gains in both precision and recall can be staggering. When combined with the volume of database updates represented by a day's worth of news, I don't see how the use of keywords at the transmission level is practical.
-- 
Mike Lutz	Rochester Institute of Technology, Rochester NY
UUCP:	{allegra,seismo}!rochester!ritcv!mjl
CSNET:	mjl%rit@csnet-relay.ARPA
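[Editor's note: Mike's two definitions translate directly into code. This is a small sketch with made-up article identifiers, showing how a retrieval run that looks reasonable can still score poorly on both measures.]

```python
# Precision vs. recall, exactly as defined above, on a hypothetical run.

def precision_recall(retrieved, relevant):
    """precision = |retrieved AND relevant| / |retrieved|
       recall    = |retrieved AND relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)

# 10 articles retrieved, 4 of them relevant; 8 relevant articles exist.
retrieved = [f"a{i}" for i in range(10)]
relevant = ["a0", "a1", "a2", "a3", "r4", "r5", "r6", "r7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)   # 0.4 0.5
```

Tightening the query to raise precision typically shrinks the retrieved set and drags recall down with it, which is the tradeoff Mike and Lauren both describe.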
chuqui@nsc.UUCP (Chuq Von Rospach) (10/05/85)
In article <825@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>First of all, Chuqui, I noticed that you ignored the second part of
>my argument, where I pointed out how limited or poor keyword choices
>result in many MISSED articles.

Actually, I didn't ignore it; I simply didn't respond to it, because I felt that adding more needless postings to the issue would get us about as far as the discussions about domains have gotten in net.mail -- nowhere. If I spent my time refuting every article on the net on a point-by-point basis, all I'd ever do is refute articles on a point-by-point basis. Where I think a clarification is necessary, I'll clarify. Where I think someone is off base, I'll try to explain the concepts better. When a discussion turns religious, I'll bow out, because I'd rather finish the design and get it implemented and SEE if it works.

We can argue forever on the subject (since you seem to be against it and I seem to be for it) without convincing each other or anyone else, and the end result is a lot of words, a few bruised egos, and absolutely no results. I'd rather at least TRY the silly thing, and if it doesn't work, get rid of it again. I don't ignore stuff, Lauren, I just don't always respond to things, because I don't want to waste the net's time and volume arguing aimlessly or reiterating things for the 400th time...

>Put in too many keywords, or "inappropriate" ones, and
>you get all sorts of mismatches. Put in too few, or (once again)
>"inappropriate," ones and you miss most of the articles you really wanted
>to see. And both these points apply both to the person choosing the
>keywords to go with the article AND to the person searching for articles
>of interest. In other words, there are four different modalities
>of screwup in such systems, plus combinations, of course.
Those modalities exist in the current news because of the way newsgroups are set up (newsgroups are just a funny flavor of keyword with very little flexibility, hooks into the database layer (sigh), and a distribution tacked on). If I don't solve that problem (and I very well may not) but simply carry them along into a new interface that gives you new flexibility and some added power (and/or is a bit easier to use and/or a bit harder to misuse), what have we lost? If I can solve SOME of netnews' problems, I will be more than happy. If I can solve ALL of them, I'll run for president or something. I think I have a chance to make things better. I'm not even trying to make things perfect...

>I'm sorry Chuqui, but I've used lots of keyword systems (both commercial
>and non-commercial in all sorts of different applications) and I consider
>them to be a real mess. Even very elaborate, sophisticated systems
>are a royal pain to use. And most of these systems don't have the
>additional consideration of trying to decide what material they
>can afford to pass on to other sites, and of avoiding mushrooming
>of discussions into all sorts of sidetracks that can massively
>increase costs. In other words, keyword systems tend not to work
>very well even in centralized environments where costs are not
>a significant factor. In our distributed environment, keywords
>cannot replace newsgroups without causing an immense amount of
>waste, hassle, and increased costs.

If you could prove your assertions, why don't we all just unplug our modems and go home? If things really ARE that bad, Lauren, why haven't you just split? Rather than simply doomsaying, why can't we recognize the problems (something you seem quite good at) and try to make things better? We may well fail horribly, but doing nothing guarantees failure, and doing something may help the odds out. If you spill some milk on the floor, how do you react? Do you slit your throat for being a clumsy oaf, or do you go find your cat?
Slitting your throat keeps you from spilling it again, but it creates all sorts of complications. If you find the cat, the spill can be cleaned up AND you can keep your cat happy. I'd rather go off looking for my cat, thank you -- that knife looks terribly sharp...
-- 
:From under the bar at Callahan's:   Chuq Von Rospach
nsc!chuqui@decwrl.ARPA  {decwrl,hplabs,ihnp4,pyramid}!nsc!chuqui

If you can't talk below a bellow, you can't talk...
lauren@vortex.UUCP (Lauren Weinstein) (10/08/85)
I almost couldn't stop laughing here when I saw Chuqui's article claiming "he didn't want to waste the network's time with point-by-point refutations of arguments." Uh huh. He has always seemed perfectly happy to post long point-by-point refutation articles when he thought he had some valid way to argue. But hey, if he has nothing to say on some of these points (whatever the reason), that's OK by me.

---

My arguments against keyword-based netnews are not new. I've pointed out these same issues whenever this topic has appeared (and it has appeared in the past, and has been argued about in netnews, long before Chuqui appeared on the scene). I consider keyword-based netnews, if implemented in the manner so far described, to be ANOTHER PROBLEM, not a solution.

My opinions regarding the way to solve our existing problems are pretty well known, and I think that most of the network knows how I personally am proceeding. I feel that moderated groups and alternate services (e.g. Stargate) show the most promise for providing useful services in the medium to long run. I feel that "solutions" that do not include some form of moderation are stopgaps that will have no long-term usefulness as the network continues to grow.

Keyword systems do nothing to help control our traffic and cost problems. In fact, they make it MORE likely that discussions will mushroom off in bizarre directions (no pun intended) and make it HARDER for individual sites to control what they wish to send or receive. Keyword systems don't even necessarily help find items of interest, due to the four error modalities already discussed in previous messages.

Now, if someone wants to implement keywords as an ADDITION to newsgroups (not to REPLACE newsgroups), the problems are less serious--but the four error modes still exist and will present major problems, especially with everyone attempting to choose their own keywords OR automatic systems trying to do so.
It takes pros to pick and manage keyword systems, and this implies a level of centralization we simply don't have and can't have in the Usenet environment. And even with pros, most of the same basic problems with keywords exist. Perhaps in Stargate some centrally organized keywords will be possible--but even in such a situation I would tend to discourage too much reliance on keyword systems. --Lauren--
preece@ccvaxa.UUCP (10/10/85)
> /* Written 9:16 am Oct 5, 1985 by mjl@ritcv.UUCP in ccvaxa:net.news */
> However, the incremental CPU time needed to make even small gains in
> both precision and recall can be staggering. When combined with the
> volume of database updates represented by a day's worth of news, I
> don't see how the use of keywords at the transmission level is
> practical.
----------

Well, you can visualize the problem more easily if you recognize that the present newsgroup system is isomorphic to a keyword system, with the name of the newsgroup being the keyword. We get lots of false hits because people cross-post without thinking about it, and we get lots of missed matches because people post things in the wrong groups. It's not clear to me that changing to an explicitly keyword-based system is going to have much effect on the human failings that cause the problems.

Everything in life is a tradeoff. Using keywords allows the author to specify more closely (by selecting more keywords) what the posting is about, but also makes the posting appear to be relevant to more topics: the difference is whether you consider the keywords to be ORed together or ANDed together. The best results come from using small numbers of keywords assigned from a large, carefully thought out vocabulary with hierarchical relationships among index terms, but that kind of vocabulary is hard to learn, easy to misuse, and not well attuned to change in usage over time (ask a librarian about de-superimposition).

There are also search systems based on associations between articles or between articles and queries. These offer a lot of promise, especially when (as on Usenet) the full text of the items is available. The user says "Article X is what I'm really interested in" and the system finds all the documents that are "like" article X, or the user provides a natural language description of the topic of interest and the system finds all the articles that are similar to that description.
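[Editor's note: Scott's OR-vs-AND point can be made concrete in a few lines. The article titles and keyword sets below are hypothetical; the sketch shows how the same query selects very different article sets under the two combining rules.]

```python
# Hypothetical sketch: the same keyword query, ORed vs. ANDed.

articles = {
    "A": {"unix", "news", "keywords"},
    "B": {"unix", "mail"},
    "C": {"news", "keywords"},
    "D": {"recipes"},
}
query = {"unix", "news"}

# OR: match if ANY query keyword appears -- broad, more false hits.
or_hits = {t for t, kw in articles.items() if kw & query}

# AND: match only if ALL query keywords appear -- narrow, more misses.
and_hits = {t for t, kw in articles.items() if query <= kw}

print(sorted(or_hits))    # broad
print(sorted(and_hits))   # narrow
```

Adding keywords to a posting widens its OR-footprint while narrowing its AND-footprint, which is exactly the author-side half of the tradeoff Scott describes.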
Similarity is usually based on similar vocabulary, weighted so that some words count more than others, or other common factors. Citation patterns are very strong tools for similarity, too. Articles that share many citations, or that are often cited in the same place, are very likely to be similar in content (don't bother sending counter-examples; all of this is biased by the law of large numbers).

It's not clear that these mechanisms, developed for searching a retrospective collection of documents, are applicable to running a newsletter, which is a better model of Usenet. The user might have specific discussions (tied together by citation) that were assumed to be of continuing interest (any new items citing any of a set of existing items would be displayed), but the basic mechanism for viewing new material would have to depend on a set of profiles of interest, which is probably too specific. The reader might very well want to read generally in the area of feminism, with no more specific topic, and that kind of connection is hard to make by association unless the author has specifically tagged the item with a descriptor that can be placed in a hierarchical subject space, which brings us right back to the original problem -- that keyword might as well be the name of a newsgroup.

Oh, one more problem with keyword-based approaches: speed. The big database systems depend on inverted indexes: given a word, you can get a list of all the items containing that word. Maintaining that kind of index for a database whose contents change daily would be very expensive; doing without an inverted index would slow the user interface to unusability. The present system, of course, has an inverted index: the list of newsgroups.
-- 
scott preece
gould/csd - urbana
ihnp4!uiucdcs!ccvaxa!preece
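[Editor's note: a toy version of the inverted index Scott describes, with hypothetical article ids and keywords. Each lookup is a single dictionary probe; the cost Scott worries about is keeping the whole mapping current as articles arrive and expire daily.]

```python
# A toy inverted index: keyword -> set of article ids containing it.

from collections import defaultdict

index = defaultdict(set)

def add_article(art_id, keywords):
    """Register one article under each of its keywords."""
    for kw in keywords:
        index[kw].add(art_id)

add_article(101, ["unix", "news"])
add_article(102, ["unix", "mail"])
add_article(103, ["news", "stargate"])

# Lookup is one dictionary probe per keyword:
print(sorted(index["unix"]))
print(sorted(index["news"]))
```

The newsgroup list is this same structure with exactly one "keyword" per article, which is why it stays cheap to maintain.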
sewilco@mecc.UUCP (Scot E. Wilcoxon) (10/11/85)
Keywords are the wrong term. What's really wanted is selection based on the concepts or ideas discussed in an article, as well as a few tags (adjectives?.. announcement, question, response, etc).

The English language does not have standard names for all concepts, and it is full of synonyms. These are two reasons for the fuzziness of keyword searches. Most people would use "doorknob" instead of "device which opens door", but English does not force either. Even for this little message I could not use
	Keywords: keyword concept news thesaurus
and had to resort to
	Keywords: keyword concept idea news thesaurus

In an AI system, concepts would be nodes in a knowledge base, but I haven't heard of knowledge bases which are as wide-ranging as the net is. What really is necessary is a widely available index of concepts which could be used by posters and moderators as a standard. Roget's Thesaurus has one possibility for abstract concepts, but falls short on the many objects which we discuss. Anyone seen anything better? Or start the monthly mod.keyword.list?
-- 
Scot E. Wilcoxon	Minn. Ed. Comp. Corp.	circadia!mecc!sewilco
45 03 N / 93 15 W	(612)481-3507	{ihnp4,uwvax}!dicomed!mecc!sewilco
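[Editor's note: the concept-index idea Scot raises can be sketched as a word-to-concept mapping. The thesaurus table and concept ids below are entirely hypothetical; the point is that synonymous surface words collapse to one shared identifier before matching.]

```python
# A sketch of a concept index: map surface words to a shared concept id
# via a hand-built (hypothetical) thesaurus, so "phone" and "telephone"
# select the same articles.

THESAURUS = {
    "telephone": "CONCEPT:TELEPHONY",
    "phone":     "CONCEPT:TELEPHONY",
    "phones":    "CONCEPT:TELEPHONY",
    "doorknob":  "CONCEPT:DOOR-HARDWARE",
}

def concepts(words):
    """Map words to concept ids, keeping unknown words as themselves."""
    return {THESAURUS.get(w.lower(), w.lower()) for w in words}

# Two differently-worded keyword lines now collide on the same concept:
a = concepts(["Telephone", "rates"])
b = concepts(["phones", "AT&T"])
print(a & b)
```

The hard part, as Scot notes, is not the lookup but building and maintaining a table wide enough to cover everything the net discusses.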
chuqui@nsc.UUCP (Chuq Von Rospach) (10/13/85)
In article <834@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>I almost couldn't stop laughing here when I saw Chuqui's article
>claiming "he didn't want to waste the network's time with point-by-point
>refutations of arguments." Uh huh. He has always seemed perfectly
>happy to post long point-by-point refutation articles when he thought
>he had some valid way to argue. But hey, if he has nothing to say
>on some of these points (whatever the reason), that's OK by me.

My hope on going public with my NNTN project was to try to get some reasonable feedback. As seems to be typical of most of the network, and of Lauren in particular, all I've gotten are rather childish attempts at minimizing any attempt to do something positive for this beast we laughingly call a network. I'm now, thanks largely to the continuing attempt by Lauren to ridicule anything I say, quite sorry I ever opened my mouth to try to make this place better.

If I continue on NNTN, and that is by no means sure, it'll be silently. If I have results I feel are promising, I'll bring them forward, but I'm no longer going to sit here and argue with braying idiots who think that facts are superfluous to their religious beliefs. I'll let my results, success or failure, speak for me in the future. I'm frankly tired of pretending to respect people who don't deserve respect, or wasting my time trying to hold a discussion with throwers of rotten tomatoes. If I want advice, I'll go ask a tree stump -- I think I'll get a LOT more information out of it than I get out of the idiots...

Well, I'm now VERY sorry I made any public comment at all, because all it has done is give Lauren another topic to practice his petty character assassinations on me. Given five or six more years, he might even get them right. Lauren, I don't think it is any secret that we don't like each other. I'm rather proud to admit publicly that I don't like you.
I also believe that technical discussions should be dealt with on a technical level, something you evidently haven't figured out. So, as my final posting on the subject, let me drop something down to your level and see if it sinks in.....

*PPHHHHHHHHHTTTTTTTT*

(I've been wanting to do that for over a year.... Ahhhhhh.......)

Now that I've gotten my obligatory childish behavior out of the way, I'd also like to point out that when Lauren DOES decide to act mature, he has the technical knowledge to really help pull things together (Stargate proves that point). I only wish he'd use his brain more often.

A final boon to my erstwhile loyal opposition: Since I'm leaving National and the front line of the network, I'm also going to drop this discussion, since it isn't going anywhere. Now that I've offended Lauren's manhood, he can post a followup ranting and raving at me, defend his ego, and, since he will have gotten in the last word, believe that he won the argument. The least I can do for everything Lauren has done for me over the years.... Enjoy yourself, Lauren. You won't have to worry about arguing with facts for a while -- most people seem to enjoy raw emotional outbursts...

Anyway, a few final points.... hopefully not in the copious detail Lauren seems to dislike (my apologies for that, but facts tend to take up space, and I like to have a basis in fact for my comments)

>My arguments against keyword-based netnews are not new. I've pointed
>out these same issues whenever this topic has appeared (and it has
>appeared in the past, and has been argued about in netnews, long
>before Chuqui appeared on the scene).

Well, there is an implied "Chuqui's a young whippersnapper, listen to me because I've been solving these problems since he was in diapers" comment in there. Well, I could make a snide comment about old and senile hackers, but that doesn't contribute to the situation... I'll even agree that his arguments aren't new.
Old arguments aren't necessarily right, they're just old....

>I consider keyword-based netnews,
>if implemented in the manner so far described, to be ANOTHER PROBLEM,
>not a solution. My opinions regarding the way to solve our
>existing problems are pretty well known, and I think that most
>of the network knows how I personally am proceeding. I feel that
>moderated groups and alternate services (e.g. Stargate) show the most
>promise for providing useful services in the medium to long-run.

Well, I have supported Stargate from day one, because I felt that it was something whose usefulness could only be proven by implementation and test. That was despite the fact that I personally (and until now, silently) feel that Stargate is dead wrong, and that the chance of taking it into a full and useful production scheme the size of our current network is impossible. Even though I'm completely against Stargate, I believe that we can learn from our failures (or be pleased by our successes) and that we should carry forward and see what happens.

Unfortunately, I seem to be unique in these thoughts, because the attitude of others is to stomp out anything that they don't personally agree with. Even if NNTN is a complete blowoff, we could learn something in the implementation that makes whatever comes next better, but Lauren and others seem to feel that if it isn't obviously perfect (by their frames of reference) in the first design discussions, it should simply be tossed away...

>I feel that "solutions" that do not include some form of moderation
>are stopgaps that will have no long-term usefulness as the network
>continues to grow.

I feel that "solutions" that DO include moderation may work on some nets, but won't work on USENET. You want a different net, fine, but I want the decision to read or not read an article in the hands of the reader.
I CERTAINLY wouldn't want a newsgroup moderated by Lauren, if only because he and I disagree on everything and therefore the stuff I'd consider interesting wouldn't get in.... What we need to do is build a system that makes it easier to screen messages and less likely to mispost messages. The only person I really trust as moderator is me, and I wouldn't trust me as moderator for anyone else.

The one thing I think we NEED to save if we want USENET to survive is the spontaneity, and Stargate and other moderated schemes lose that. Without the ability to post without supervision, you no longer have USENET, and I think you lose a lot of the appeal and wonder that makes this net as much fun as it is. (Wow, Lauren and I disagree again. Surprise, surprise.) Despite that, I've been happy to try to move Stargate forward and have kept my disagreements to myself -- let a project stand or fall on its own merits... Altruistic? Hell, no. Just interested in getting a better net, and treating people right. Ideas well ahead of their times, it seems.

>Keyword systems do nothing to help control our traffic and cost
>problems. In fact, they make it MORE likely that discussions will
>mushroom off in bizarre directions (no pun intended) and make it HARDER
>for individual sites to control what they wish to send or receive.

Who says? NNTN isn't nearly at the detail of design to say one way or another. Why can't we simply set things up so that specific keywords don't get forwarded? (things like flame, bizarre, jokes, lauren, whatever...) That is actually BETTER than the current situation, because you exclude anything posted to a specific keyword. Now, if something is cross-posted to an excluded group and a group you do get, you get the posting, which means that turning off a group doesn't do as well as you would hope -- especially with heavily cross-posted groups like net.flame.
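[Editor's note: the site-level exclusion Chuq suggests can be sketched as an any-match kill list. The keyword names are his own examples; the contrast with newsgroup-based exclusion is that one killed keyword suppresses the article even if its other keywords would otherwise be forwarded.]

```python
# Sketch of site-level keyword exclusion: drop an article if ANY of its
# keywords is on the site's kill list.

KILL = {"flame", "bizarre", "jokes"}

def forward(article_keywords, kill=KILL):
    """Forward only articles carrying no killed keyword."""
    return not (set(article_keywords) & kill)

print(forward(["unix", "keywords"]))   # forwarded
print(forward(["unix", "flame"]))      # dropped
```

Under newsgroup exclusion, the second article would still leak through via its net.unix cross-post; under keyword exclusion, the single "flame" tag kills it, which is the improvement Chuq claims.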
>Keyword systems don't even necessarily help find items of interest,
>due to the four error modalities already discussed in
>previous messages.

I don't see that this is better or worse with keywords than it is with the current situation. It is just as difficult to track down the appropriate place for things with the current group structure as it is with keywords. Keywords may well not be better than newsgroups, but they also aren't worse, and they give you a level of flexibility that is difficult to design into our current structure.

Well, have at me, Lauren. I've been looking forward to tossing a posting at you like the ones you've been lobbing at me for a long time. I don't expect it to be very productive, but I certainly feel better for it... Hope you don't mind a little of your own medicine...

Tired of trying to act like a gentleman in a room full of pigs, call me:
	Ishmael
-- 
:From the caverns of the Crystal Cave:   Chuq Von Rospach
Currently: nsc!chuqui@decwrl.ARPA  {decwrl,hplabs,ihnp4,pyramid}!nsc!chuqui
Soon to be: ..!sun!<somethingorother>

Our time is past -- it is a time for men, not magic. Come, let us leave
this world to the usurpers and rest our weary bones....
preece@ccvaxa.UUCP (10/14/85)
> /* Written 4:58 pm Oct 10, 1985 by sewilco@mecc.UUCP in ccvaxa:net.news */
> What really is necessary is a widely available index
> of concepts which could be used by posters and moderators as a
> standard. Roget's Thesaurus has one possibility for abstract concepts,
> but falls short on the many objects which we discuss. Anyone seen
> anything better? Or start the monthly mod.keyword.list?
----------

Roget's Thesaurus is, obviously, inadequate for technical topics, but it's in the right direction. The two major bibliographic databases in this area are Engineering Index and Inspec's Computer and Control Abstracts. Both use several indexing schemes, including controlled vocabulary and natural language indexing applied by indexers. The ACM has a more limited CS classification scheme used for Computing Reviews.

Indexing using a large controlled vocabulary is not a task for a beginner. It takes a LONG time to learn how the vocabulary works and how to determine the most appropriate indexing for a particular document.
-- 
scott preece
gould/csd - urbana
ihnp4!uiucdcs!ccvaxa!preece
lauren@vortex.UUCP (Lauren Weinstein) (10/14/85)
The temptation to ignore Chuqui's last outburst is considerable. While I've certainly not hidden my feelings that keyword-based news schemes are not appropriate for Usenet, I can't recall ever saying that Chuqui or others shouldn't work on it if they so desire. Nor can I recall calling him or others "idiots," "pigs," or other similar terms that he found it necessary to use in what he himself called his "obligatory childish behavior."

Upon reflection, I suspect that the biggest problem is that many persons are simply not familiar with the work already done, and the problems already encountered, in the areas of query/response and keyword systems. It isn't as if it's a new invention. Such systems have existed for quite a long time, and a considerable body of published work exists that clearly points out the positive and negative aspects of such systems. One previous poster on this topic mentioned some of the formal terms of these systems--I've steered clear of the formal terminology since I figured most people wouldn't be too interested, but perhaps some formal discussion of the theory and practice of these systems is in order at some point.

In any case, let's briefly address a few issues from Chuqui's message:

> My hope on going public with my NNTN project was to try to get some
> reasonable feedback. As seems to be typical of most of the network, and of
> Lauren in particular, all I've gotten are rather childish attempts at
> minimizing any attempt to do something positive for this beast we
> laughingly call a network.

How many people remember some of the incredible pressure I was under when I first proposed Stargate? It makes the sorts of messages we've seen here on the topic of keywords seem like nothing by comparison. I never saw any message that said, "People who work on keyword systems are idiots."
What I did see were messages (some of them written by me) that said, "Keyword systems (as described) won't work in the Usenet environment and may create far more problems than they would solve." Personal opinion, to be sure. Not based on a desire to see Usenet die, but rather on the desire to avoid seeing new problems created.

> Well, there is an implied "Chuqui's a young whippersnapper, listen to me
> because I've been solving these problems since he was in diapers" comment in
> there. Well, I could make a snide comment about old and senile hackers, but
> that doesn't contribute to the situation... I'll even agree that his
> arguments aren't new. Old arguments aren't necessarily right, they're just
> old....

Then again, old arguments might be right, too. Ignoring history is often a serious mistake. When topics have been discussed in the past, and when a body of technical work concerning the topic of interest already exists, a great deal of time can be wasted if a person chooses to simply ignore all that has come previously. Whether this is done on purpose or through naivete doesn't much matter--the result is usually the same. One important issue in such situations is how much time is spent "re-inventing" the wheel, only to come up against the same old problems. Another is whether well-intentioned efforts that might have a short-term benefit create additional long-term problems.

> Stargate...

Stargate is designed to be a medium- and long-term alternative for collecting and transmitting information. It isn't meant to provide exactly the same sorts of "services" we get now from Usenet. As an aside, the project is going quite well, and I hope to have some significant announcements regarding service organization and availability in the fairly near future.
I hope to have more hardware available soon to allow more sites to receive Stargate transmissions--mass production of the decoders is already underway, and the prototype "buffer box" is under construction now. At the same time, various non-technical discussions relating to the evolution of the project from an experiment to a service continue. More in net.news.stargate as developments warrant.

> I feel that "solutions" that DO include moderation may work on some nets,
> but won't work on USENET. You want a different net, fine, but I want the
> decision to read or not read an article in the hands of the reader. I
> CERTAINLY wouldn't want a newsgroup moderated by Lauren, if only because he
> and I disagree on everything and therefore the stuff I'd consider
> interesting wouldn't get in.... What we need to do is build a system that
> makes it easier to screen messages and less likely to mispost messages.

This is a fine short-term concept. And keywords used in ADDITION to newsgroups might be of use in that area. My primary objection to keyword systems appears when people want to REPLACE newsgroups with keywords. In essence, newsgroups provide something that has been found to be critical in real-world keyword systems--keyword list control. That is, a newsgroup is a base keyword, chosen from a controlled list, that every poster is required to supply, and it provides a conceptual "anchor" for the message. If posters want to add additional keywords as well (as some people do now), that's OK... people with the appropriate software may choose to use or ignore those keywords (of course, they should keep in mind the keyword error modalities we've discussed previously when making such a decision). But without some sort of "forced keyword selection" (which is what newsgroups really are) we're faced with a serious problem: sites, when trying to decide how to spend their time and money, are put at the mercy of the keyword choices of individual users.
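A rough sketch of the "anchor plus optional keywords" arrangement described above (all names and the API are hypothetical, not any actual netnews implementation): the newsgroup is mandatory and drawn from a controlled list, so site-level feed decisions can rely on it alone, while reader-level filtering may optionally consult the uncontrolled author keywords.

```python
# Hypothetical sketch: a newsgroup acts as a mandatory "anchor" keyword,
# so sites can make feed decisions on it alone; author keywords are an
# optional refinement that readers may use or ignore.
class Article:
    def __init__(self, newsgroup, keywords=()):
        if not newsgroup:
            raise ValueError("newsgroup (the forced base keyword) is required")
        self.newsgroup = newsgroup
        self.keywords = set(keywords)  # optional and uncontrolled

def site_accepts(article, subscribed_groups):
    # A site's feed decision depends only on the controlled anchor...
    return article.newsgroup in subscribed_groups

def reader_wants(article, wanted_keywords):
    # ...while a reader may additionally filter on the optional keywords.
    return not wanted_keywords or bool(article.keywords & wanted_keywords)

a = Article("net.news", keywords={"keywords", "stargate"})
print(site_accepts(a, {"net.news", "net.unix"}))  # True
print(reader_wants(a, {"stargate"}))              # True
print(reader_wants(a, {"telephony"}))             # False
```

The design point is that the expensive decision (whether to carry the traffic at all) never depends on the unreliable, uncontrolled keyword field.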
Variation of keywords has been shown to be one of the biggest problems with uncontrolled keyword-based systems. Not only does it make it difficult to find articles of interest, but it makes controlling what you DON'T want to see very difficult. There are just TOO MANY KEYWORD POSSIBILITIES in an uncontrolled system, even assuming that all users choose keywords conscientiously, accurately, and in detail.

Enough on this point. I'll just add that virtually all "successful" keyword-based systems have a central authority that controls keyword use. Often this authority actually chooses the keywords, or else "corrects" poor user keyword choices before letting articles enter the database. Official keyword lists are also frequently published or otherwise made available, so that users can see what sorts of words will appear, allowing intelligent usage of keywords on the part of both posters and readers of articles. Keyword control is CRITICAL. I can't emphasize this enough. I don't see any way to make this work in a distributed environment such as Usenet. Newsgroups are the underlying structure that holds the current net together, to the extent that it is held together now.

> The only person I really trust as moderator is me, and I wouldn't trust me
> as moderator for anyone else...

This would be OK if there were no costs associated with traffic. Let's take an example. Let's say we had a "perfect" AI program that could precisely and accurately filter all our incoming traffic and show us ONLY what we TRULY wanted to see. It's never fooled--it never makes mistakes. EVERYONE is running it (so we don't have to worry about the poor slobs running older software who have no way to do automatic filtering!). Would this solve Usenet's problems? Of course not. The problem is that getting the traffic to the AI program costs money, time, and other (hardware) resources (disk, dialups, CPU cycles, etc.). What happens when we have 100,000 sites on the net? Can't happen?
Well, probably not, for much the same reason that we're unlikely to reach a population of 100 billion on this planet--everything will collapse long before then. But traffic continues to grow, and the percentage of useful traffic, by virtually ANY reasonable definition, will continue to decline. When ANYBODY, ANYWHERE on the net can post a message to EVERYONE, we start to look into the face of problems that are nearly exponential in nature. To take an old example: someone asks the network what "foo" means. What do we do when a few thousand people respond? Or more? Even given the fancy AI program that can show us only the "meaningful" responses--we've still paid to send all those answers (99.9% of which will impart no new useful information) throughout the world. As the net grows, this sort of behavior will simply become impossible to support, from either a time or a resource standpoint.

---

The course that Usenet is taking is becoming very clear. It isn't necessarily even a BAD course--it's simply in keeping with evolution. What we're going to see is increased fragmentation. The recent announcement by utzoo regarding newsgroup cutoffs is an example of such fragmentation in action. As the volume of material continues to grow, more sites will be forced to make hard decisions about what they can afford, under various criteria, to support. To the extent that some sites feel wealthy enough to continue taking full traffic in an ever-expanding network with 10's of 1000's of sites, they will be free to do so. After all, any site can arrange to call any other site and pass whatever articles they wish. Other sites may wish to try alternatives (e.g. Stargate) which will hopefully offer a far more cost-effective and lower-noise information flow. The model of rapid-turnaround "information conduits," with users submitting items for publication through the conduits, is the one I like to use for Stargate.
To the extent that people like or dislike the way these conduits are managed, the service will evolve and change. Persons with the resources to support all or part of the free-for-all on Usenet can do so also, of course. Participating in Stargate doesn't require giving up everything else. It will always be up to the individual sites to make these decisions.

Usenet won't just DIE. But its nature will gradually continue to change as more sites join the fray and as the volume of postings continues to increase. The noise level WILL continue to rise in an unmoderated environment, and traffic will continue to grow rapidly to the extent that backbone cutoffs do not occur. Article filtering techniques may have some short-term benefit--but only if they do not make it MORE difficult for sites and users to accurately control their traffic, costs, and time. Traffic growth, however, will prevent such techniques from being a long-term solution to what are really systemic problems in Usenet itself--problems that have appeared as netnews grew far beyond the size envisioned by those who created it.

--Lauren--
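Lauren's traffic-growth argument can be made concrete with a little back-of-the-envelope arithmetic (the per-site posting rate below is an assumption for illustration, not a measurement): in a flood-fill network every article is delivered to every site, so if the number of articles scales with the number of sites, total deliveries scale with the square of the number of sites.

```python
# Illustrative arithmetic (assumed numbers): in a flood-fill net every
# article is delivered to every site, so when articles scale with sites,
# total daily deliveries scale with the square of the site count.
def daily_deliveries(sites, articles_per_site_per_day=0.5):
    articles = sites * articles_per_site_per_day
    return articles * sites  # every article goes to every site

for n in (1_000, 10_000, 100_000):
    print(n, int(daily_deliveries(n)))
# 1000 500000
# 10000 50000000
# 100000 5000000000
```

Growing the net 100-fold multiplies total delivery cost 10,000-fold, which is why per-reader filtering at the receiving end cannot, by itself, address the cost of moving the traffic.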
hes@ecsvax.UUCP (Henry Schaffer) (10/16/85)
> Upon reflection, I suspect that the biggest problem is that many
> persons are simply not familiar with the work and problems already
> done in the areas of query/response and keyword systems. It isn't
> as if it's a new invention. Such systems have existed for quite a long
> time, and a considerable body of published work exists that clearly
> points out the positive and negative aspects of such systems.
>
> ... I'll just add that virtually all "successful"
> keyword-based systems have a central authority that controls keyword
> use. Often this authority actually chooses the keywords, or else
> "corrects" poor user keyword choices before letting articles enter the
> database. Often official keyword lists are also published or made otherwise
> available so that users will see what sorts of words will appear and
> thus allow intelligent usage of keywords on the part of both posters
> and readers of articles.

Let me suggest something that each person can experience firsthand. Go to your university or company library and talk to the reference (or other appropriate) librarian about doing a computer-based search of some bibliographic database in a field in which you have an interest. Choose a topic with which you are familiar, so you will be able to estimate what percentage of the material you get is undesired, and what percentage of the relevant literature doesn't show up in answer to your request. You'll probably have to sit down with pages of rules and keywords, and with help, and after several iterations you may very well get what you know is a good answer.

To be reasonable about this, you have to choose a *topic* that has some coherence but is not too specific. It is easy to say "give me all articles about unix" or "give me all the articles authored by xyz since 1980." I'm thinking more of the type of topic on which you would actually be doing research--e.g., comparative performance of different datacom network architectures. What keywords would one use for that?
"Data communication" would include much of what you want, and a *ton* of other things. How about anything with "queue" or "queueing" as a keyword? Whoops--look at all that telephony and industrial engineering stuff. Let's add star, ring, ...; whoops again--let's qualify them with "data communication"; whoops, now we've lost many articles which were quite obviously on data communication but were labeled "computer communications," or whatever.

The last time I did a search, I was given a *book* of keywords, and was told that authors had been *required* to describe their research using those (and only those). Therefore, I could only request terms from that book, along with operators to allow generalizations such as alternate word endings. After you have gone through a few non-trivial searches, then let's continue this discussion.

--henry schaffer

Disclaimer: I really don't want to get into the ad hominem aspects of this discussion, and refuse to comment directly on which view(s) I personally favor.
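The truncation operators Henry mentions can be sketched as simple pattern matching against the controlled term list (the term list and function below are hypothetical illustrations, not the actual search service): a query like "queue*" matches every controlled term sharing that stem, which is how one query can cover "queueing theory" and "queue management" at once.

```python
# Sketch of a truncation operator: '*' in the query generalizes over
# word endings when matched against a controlled vocabulary.
# The term list here is hypothetical, for illustration only.
import fnmatch

controlled_terms = ["queueing theory", "queue management",
                    "data communication", "computer communications"]

def search(pattern, terms):
    """Match a query pattern (with '*' truncation) against indexed terms."""
    return [t for t in terms if fnmatch.fnmatch(t, pattern)]

print(search("queue*", controlled_terms))
# ['queueing theory', 'queue management']
print(search("*communication*", controlled_terms))
# ['data communication', 'computer communications']
```

Note how the second query illustrates Henry's complaint in miniature: to catch both "data communication" and "computer communications" you must generalize, and the broader the pattern, the more unrelated material it will sweep in on a real database.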
tim@k.cs.cmu.edu.ARPA (Tim Maroney) (10/17/85)
Aw, poor Chuqui. Someone criticized his grandiose proposal for large-scale revision of USENET. Chuqui would never mindlessly flame people who proposed new features, would he? No, of course he wouldn't. After all, he's the good guy.

-=- Tim Maroney, CMU Center for Art and Technology
Tim.Maroney@k.cs.cmu.edu
uucp: {seismo,decwrl,etc.}!k.cs.cmu.edu!tim
CompuServe: 74176,1360

My name is Jones. I'm one of the Jones boys.
henry@utzoo.UUCP (Henry Spencer) (10/19/85)
A friend of mine who recently attended a talk by Mike Lesk came back with an interesting tidbit of information that is of some relevance to this discussion. Apparently Lesk has done some work on keyword choice and has made an interesting discovery: of the various methods he tested for automatic keyword selection, distinctly the most effective was to use the first 30 words of the main text and *ignore* any author-supplied keywords. This is thirdhand information, and I don't know what other methods he tried, but it's a telling comment on the ability of authors to pick good keywords without careful guidance.

-- 
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
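If the report is accurate, the selection method Lesk tested could be approximated by something as simple as the following sketch (the exact tokenization rules are my assumption; Lesk's actual procedure is not described here): take the first 30 words of the body as the index terms and discard the author's Keywords line entirely.

```python
# Sketch of "first 30 words as keywords" (tokenization rules assumed):
# any author-supplied keywords are simply not consulted.
def auto_keywords(body, n=30):
    """Lowercased first-n distinct words of the article body, in order."""
    seen, result = set(), []
    for word in body.lower().split():
        word = word.strip(".,;:!?\"'()")
        if word and word not in seen:
            seen.add(word)
            result.append(word)
        if len(result) >= n:
            break
    return result

body = "Apparently Lesk has done some work on keyword choice."
print(auto_keywords(body, n=5))
# ['apparently', 'lesk', 'has', 'done', 'some']
```

The output makes the trade-off visible: the opening words do carry the topic, but they also carry function words ("has", "some") that a controlled index would never contain.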
lkk@teddy.UUCP (10/21/85)
In article <6061@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> Apparently Lesk has done some work on keyword choice,
> and has made an interesting discovery: of the various methods he tested
> for automatic keyword selection, distinctly the most effective was to
> use the first 30 words of the main text and *ignore* any author-supplied
> keywords.

This is certainly true with mail. Our system here informs you of new incoming mail by printing the first two or three lines on your screen. It's almost always possible to tell exactly what the message is about from that information.

-- 
Sport Death,
Larry Kolodney
(USENET) ...decvax!genrad!teddy!lkk
(INTERNET) lkk@mit-mc.arpa

Life is either a daring adventure, or nothing. - Helen Keller
lauren@vortex.UUCP (Lauren Weinstein) (10/22/85)
I think I'd agree that just using the words from the first part of a message might be better than people trying to pick their own keywords--if we could find and ignore all the included text (in its various forms) from older articles, the cute opening lines, and the line-eater-bug lines. (Actually, skipping the previously included text might not work, since people often add comments at the end of such text that would have no meaning without it. This means you're stuck trying to keyword both the included text (again!) and the "new" text.)

But even if you did all of the above, and did a fairly good job of it, it's still not good enough. It's only "better" because letting people (in an uncontrolled keyword environment) pick their own keywords is SOOOO bad. We humans can tell what the meaning of a message is (much of the time) quite quickly because we do considerable analysis of the text while we're reading! We automatically ignore the "extraneous" words in a manner that would be difficult for even a sophisticated program to accomplish.

And the big problems still remain. People's random word choices when they write their text result in massive keyword list expansion. Without centralized keyword control, making good keyword search choices remains exceedingly difficult (even with such control, it's still very difficult). Analysis of word forms, plurals, usage, etc. still must be considered, and these are non-trivial problems. All four of the keyword error modalities still exist, as do the related control and coordination problems. Nor must we forget the percentage of messages that will horribly fail the "beginning of message" meaning test and inject all sorts of "noise" into the keyword system.

No, it just doesn't fly. What I think was really being said was that people's choices for keywords are SO BAD that EVEN taking words from the first part of the text is better.
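The preprocessing Lauren describes--stripping included text before sampling the opening words--might look roughly like this (the heuristics for spotting included text and attribution lines are assumptions; as the discussion notes, real articles would defeat them in many ways):

```python
# Hypothetical sketch of stripping included ("> ...") text and attribution
# lines before sampling an article's opening words. The heuristics are
# deliberately naive; they fail on the cases Lauren raises, e.g. replies
# whose "new" text is meaningless without the quoted context.
import re

def strip_included_text(article):
    kept = []
    for line in article.splitlines():
        if line.lstrip().startswith(">"):
            continue                          # included text from an older article
        if re.match(r".*writes:\s*$", line):
            continue                          # "So-and-so writes:" attribution
        kept.append(line)
    return "\n".join(kept)

sample = ("In article <123@foo.UUCP> bar writes:\n"
          "> old included text\n"
          "New comment on the above.")
print(strip_included_text(sample))  # New comment on the above.
```

Note that the surviving line, "New comment on the above.", is exactly the kind of text Lauren warns about: with the quote removed, it no longer says what the article is about.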
But that doesn't mean that taking those words solves the fundamental problems with keyword systems (which we've hashed over quite extensively in this group of late!). The only way to make a keyword system work at all, even "moderately" well, in any environment, is to have a centrally controlled and organized keyword base, with keywords carefully selected and organized by people who have the time and inclination to do such work. I don't see this happening in the Usenet environment, for technical, logistical, and also "sociological" reasons.

Also, the above doesn't even start to address the problems that any keyword system designed to replace newsgroups would cause for traffic control at sites that need to limit certain kinds of traffic. Nor does it address the fact that even WITH carefully controlled and "professionally" chosen keywords, such systems take a great deal of time and practice to use even minimally well.

I'd like to second the idea that someone else already posted. If you think you like the idea of keyword systems but are personally unfamiliar with the way REAL keyword systems work--go to a library and try out their search services. Keep in mind that they operate with a VERY carefully controlled keyword base--and be sure to pick a topic where you'll know how many articles you're MISSING during your searches, and how many UNDESIRED ones you're getting as well. It may be an interesting experience for you.

--Lauren--