[news.admin] keyword-based news

henry@utzoo.UUCP (Henry Spencer) (07/07/87)

> A few years ago I, and some others were arguing fairly strenuously that
> some kind of keyword based news reader was required to cut down on the
> amount of chaff you have to search through to find the odd kernel of wheat.
> In the end, the discussion went the way of the Dodo...

Well, there were reasons for that.  I and some others were counter-arguing
fairly strenuously that keyword-based news readers will not work unless the
keywords are well-chosen, which they wouldn't be.  The successful keyword-
based retrieval services maintain tight central control over their keyword
list and often use experts to assign keywords to new material.  There is
just no way to make that work on Usenet.  What's more, the few studies that
have been done on retrieval efficiency show that users *think* they are
getting 80% or so of relevant material, while the real number is more like
25%.  That is, even well-run keyword-based systems show you only about one
in every four kernels of wheat.  It's pretty, but it just don't work.
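
To make the gap between perceived and actual retrieval concrete, here is a
minimal sketch of how recall is usually computed. The function name and the
sample numbers are illustrative assumptions, not data from the studies
mentioned above.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant articles that the search actually returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical case: four relevant articles exist, the keyword search
# finds only one of them (plus one irrelevant article).
relevant_ids = {"a1", "a2", "a3", "a4"}
retrieved_ids = {"a1", "x9"}
print(recall(retrieved_ids, relevant_ids))
```

A user who sees only the retrieved set has no way to count the three missed
articles, which is one plausible reason the perceived figure runs so high.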
-- 
Mars must wait -- we have un-         Henry Spencer @ U of Toronto Zoology
finished business on the Moon.     {allegra,ihnp4,decvax,pyramid}!utzoo!henry

brad@looking.UUCP (07/07/87)

In article <8262@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Well, there were reasons for that.  I and some others were counter-arguing
>fairly strenuously that keyword-based news readers will not work unless the
>keywords are well-chosen, which they wouldn't be.  The successful keyword-
>based retrieval services maintain tight central control over their keyword
>list and often use experts to assign keywords to new material.  There is
>just no way to make that work on Usenet.

My original proposal of K news (must be almost 5 years ago now!) did
suggest user generated keywords.  That idea comes from a smaller net, and
Henry has a point that it might not work well now.

What if we take the other features of K news but require some central
authority (a "moderator?") for the creation of keywords.

Right now there are three reasons to NOT create a newsgroup:

	1) The news software did not envision so many groups, so there
	   are hard memory limits on many machines on the number of groups
	   in the active file

	2) The net might get too confusing with too many groups

	3) People aren't interested enough in the idea to warrant spending
	   money sending the stuff all over the world

	(Note that "volume would be too low" is not a reason at all.  In
	fact, it's an anti-reason.  The lower the volume in a group the
	better.  Today's high-volume groups are useless to most people.
	They just don't have time to wade through them.)

K news was designed to get rid of reason #1.  With good reason, for the
high volume groups that are the result of reason #1 cost everybody a lot
of money, and waste a lot of time for the people who read them.

Reason #2 can't be solved well with software.  It is a trade-off we
must pay.  The more specific news classification is, the harder it is
to comprehend it all.  The less specific it is, the noisier groups get
with random postings and other crap.  You don't want a net with only
one group called "misc" and you don't want 20,000 groups either.

Reason #3 was solved by the use of K news's powerful subscription file
as a distribution file.  Because sites could subscribe conveniently, minor
keywords could be set up so that distribution is limited to only those
sites with readers.  This would make distribution MORE efficient than a
mailing list.
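
A minimal sketch of the idea of using subscriptions as a distribution filter.
The K news plan itself is not reproduced in this thread, so the function
name, the data layout, and the site keywords below are all assumptions made
for illustration.

```python
def sites_to_feed(article_keywords, site_subscriptions):
    """Return the neighbouring sites that subscribe to at least one keyword.

    site_subscriptions maps a site name to the set of keywords its readers
    have subscribed to (a hypothetical layout, not the K news file format).
    """
    wanted = set(article_keywords)
    return sorted(
        site for site, subs in site_subscriptions.items()
        if wanted & set(subs)
    )


# Hypothetical subscriptions for three neighbouring sites.
subscriptions = {
    "utzoo": {"unix", "space"},
    "looking": {"knews"},
    "ncoast": {"sources"},
}
print(sites_to_feed({"space", "moon"}, subscriptions))
```

An article whose keywords nobody subscribes to is simply not forwarded, so a
low-interest keyword costs nothing at sites without readers.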

-------------

So other than the S.M.O.P. involved, why not K news?  It solved (5 years
ago) just about every major problem we have today.  Because of the lack
of software restrictions, the keyword creation moderators would not have
to be particularly controversial.  Instead of asking "why create this
group?" the question would be "why not?"  Keyword moderators would mostly
ensure that keywords followed a good pattern, and that keyword association
dictionaries were kept up to date.

(Most people on this list have seen the K news plan, so I won't post it
unless I get a lot of requests.)
-- 
Brad Templeton, Looking Glass Software Ltd. - Waterloo, Ontario 519/884-7473

taylor@hplabsc.UUCP (07/07/87)

As a side note, I hacked up a newsreader that is based purely on keywords 
to see what it would be like.  It took all the words in the Subject: 
Summary: and Keywords: lines, 'uniq'ed them and removed 'noise' words (e.g. 
the, and, a, <nf>, etc) and then logged 'em in a file as they arrived on 
the machine.  Then the intrepid user would say "I want to read news about x,
y, and z" and be shown the news *independent of what groups they were posted 
in*.
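
A minimal sketch of a reader along these lines, not the original program.
The noise-word list beyond "the", "and", "a" and "<nf>", the class and
function names, and the choice to match ANY of the requested topics are
assumptions.

```python
import re
from collections import defaultdict

# Assumed noise list; the post names only "the", "and", "a" and "<nf>".
NOISE = {"the", "and", "a", "<nf>", "of", "to", "in", "re"}
HEADERS = ("subject:", "summary:", "keywords:")


def extract_keywords(article_text):
    """Return the set of non-noise words from the three header lines."""
    words = set()
    for line in article_text.splitlines():
        if line == "":  # a blank line ends the header block
            break
        lower = line.lower()
        for header in HEADERS:
            if lower.startswith(header):
                for w in re.split(r"[\s,;]+", lower[len(header):]):
                    w = w.strip(".!?:()\"'")
                    if w and w not in NOISE:
                        words.add(w)  # a set gives the 'uniq' step for free
    return words


class KeywordLog:
    """Keyword -> article ids, filled in as articles arrive."""

    def __init__(self):
        self.index = defaultdict(set)

    def log_article(self, article_id, article_text):
        for w in extract_keywords(article_text):
            self.index[w].add(article_id)

    def read_about(self, *topics):
        """Articles matching any topic, whatever group they were posted in."""
        hits = set()
        for t in topics:
            hits |= self.index.get(t.lower(), set())
        return sorted(hits)
```

Note that the Newsgroups: line is never consulted, which is the whole point:
selection depends only on the words the poster put in the three header lines.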

I used it for a few days and found that it worked QUITE well and that the 
biggest problem I could see was that it would become very difficult to 
figure out what group(s) to post a completely new article to since users 
of this knews system would unlearn the distinction between newsgroups.  
This isn't good because the program and users would have to live in harmony 
with the rest of the net...

It was a fun experiment, and it supports my theory that grouping articles
into a small number of newsgroups is indeed as archaic as it seems: I found
articles and discussions in groups that I had never even read, because
they were indeed keyworded (see above) correctly.

And as to the stuff that isn't keyworded correctly, well, if you think
about it, as more and more people were to use a system of this nature
the articles would become better and better keyworded since if you are
going to go to the trouble of WRITING an article, you certainly want to
make sure that the maximal number of people READ it, right?  (this can
be helped by some decent frontend software too - stuff that allows the
user to edit the subject line, prompts for a "summary line", and perhaps
does a crude first pass automatic keyword list).  The key is that it is a
lot easier for people to modify something than create it, typically.

*sigh*  I can imagine the hostile remarks this posting is going to 
generate.  We've had, as people have pointed out, this discussion before.
A great number of schemes have been proposed to the net, including this
keywording, Webber's multiple moderators, Fair's accolades, and such, and
somehow we keep ending up with these artificial newsgroup boundaries,
articles that are more likely to be cross-posted than not, and discussion
threads that are doomed to follow the 'base note' regardless of whether we are
still on topic or not...it's always the lowest common denominator.  Maybe
there's a lesson to be learned in all this??

Anyway, for what it's worth...I shall attempt to find a few free evenings
and get my knews reader up to a sufficient state to allow me to post it
to net.sources (errr, to whatever group is appropriate, since It Is Obvious
that Unmoderated Groups are Evil (even though I have proposed a scheme to
alleviate the problems cited with the old unmoderated newsgroups)).  *sigh*

				From the far corners of the universe,

						-- Dave Taylor

allbery@ncoast.UUCP (Brandon Allbery) (07/11/87)

As quoted from <2185@hplabsc.UUCP> by taylor@hplabs.HP.COM (Dave Taylor):
+---------------
| As a side note, I hacked up a newsreader that is based purely on keywords 
| 
| And as to the stuff that isn't keyworded correctly, well, if you think
| about it, as more and more people were to use a system of this nature
| the articles would become better and better keyworded since if you are
| going to go to the trouble of WRITING an article, you certainly want to
| make sure that the maximal number of people READ it, right?  (this can
+---------------

From experience:  Someone may, in an article keyworded to A, B, C, and D,
make a reference to E which is so minor as to not deserve keywording...
until it turns out that that reference answers another person's question,
but that person never gets to see it on a keyword search for E.  In fact,
you can change "may" to "will"; it happens all the time.

The only way I see to get keywords working is to potentially use every word
in an article (both header and body) that is not a syntactic particle as a
keyword, after standardizing case and attempting to deal with spelling and
prefixes/suffixes.  This doesn't strike me as being very fast, space con-
servative, or (without either a better AI program than we've got or a
(horrors!) moderator choosing the keywords) likely to be correct.  (And
even the moderator can mess up.)
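
A minimal sketch of the full-text scheme just described, mainly to show how
crude the normalization step is in practice. The particle list, the suffix
rules, and all names here are assumptions; spelling correction and prefix
handling are not attempted.

```python
import re
from collections import defaultdict

# Assumed list of "syntactic particles"; a real list would be much longer.
PARTICLES = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
             "that", "for", "on"}


def normalize(word):
    """Lower-case a word and strip one common suffix (very crude)."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def full_text_keywords(article_text):
    """Every word in header and body that is not a particle, normalized."""
    return {
        normalize(w)
        for w in re.findall(r"[A-Za-z0-9_]+", article_text)
        if w.lower() not in PARTICLES
    }


def build_index(articles):
    """Map each normalized keyword to the ids of articles containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for k in full_text_keywords(text):
            index[k].add(article_id)
    return index


def search(index, word):
    return sorted(index.get(normalize(word), set()))
```

With this, a passing mention of E in the body is indexed even though the
poster keyworded only A, B, C, and D. The cost is an index entry for nearly
every distinct word, and the suffix stripping still gets things wrong (it
maps "stops" and "stopped" to different keys, for instance).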

Of course, omitting syntactic particles makes it difficult to find the
article in what is now soc.lang.english (if there is such; I haven't checked)
on uses of the word "the"....

+---------------
| and get my knews reader up to a sufficient state to allow me to post it
| to net.sources (errr, to whatever group is appropriate, since It Is Obvious
| that Unmoderated Groups are Evil (even though I have proposed a scheme to
+---------------

comp.sources.misc

The problems with any unmoderated scheme are amply demonstrated by the
bogus posting by richard@bigtuna.UUCP of a month back.  It doesn't matter
WHAT you do, people will scream bloody murder if they can't use net.sources
as comp.sources.d.  (There were more discussions in net.sources than in
comp.sources.d during its final two weeks, even ignoring the discussions
about net.sources becoming moderated.  How do I know?  Erik Fair jumped the
gun and my mailbox was suddenly filled with 15 duplicate copies of every
message sent to net.sources.  Once I eliminated the duplicates, the amount
of non-source in net.sources was absolutely disgusting.)

I, too, will try to find time to implement the keyword scheme I discussed
above:  I'm interested in seeing how bad it really is.  I hate to imagine
the keyword database, though....
-- 
[Copyright 1987 Brandon S. Allbery, all rights reserved] \ ncoast 216 781 6201
[Redistributable only if redistribution is subsequently permitted.] \ 2400 bd.
Brandon S. Allbery, moderator of comp.sources.misc and comp.binaries.ibm.pc
{{ames,harvard,mit-eddie}!necntc,{well,ihnp4}!hoptoad,cbosgd}!ncoast!allbery
<<The opinions herein are those of my cat, therefore they must be correct!>>

allbery@ncoast.UUCP (Brandon Allbery) (07/11/87)

It has occurred to me that I should provide an example of my assertion that:

As quoted from <2855@ncoast.UUCP> by allbery@ncoast.UUCP (Brandon Allbery):
+---------------
| From experience:  Someone may, in an article keyworded to A, B, C, and D,
| make a reference to E which is so minor as to not deserve keywording...
| until it turns out that that reference answers another person's question,
| but that person never gets to see it on a keyword search for E.  In fact,
| you can change "may" to "will"; it happens all the time.
+---------------

Macro-example (bad keywording):  Find, in the permuted index for the SVR2
manuals, the man page for ftok().  (Hint:  the word "ftok" is not in the
permuted index at all.)

Micro-example (small missed reference; assume, for this case, that keywords
are cross-referenced via the "SEE ALSO" section of a manpage):  Find the
reason that a program under System V which wait()s for a child can act
strangely under multiprocess adb (Plexus).  (Hint:  there's no reference
to ptrace(2) in wait(2).  With keywords, selecting the right ones to look
up articles can require you to know the answer already beforehand.)  [All
right, this one's slightly unfair; I don't know that multiprocess adb is
in other variants of System V; but it's a clean miss; and wait(2) STILL
doesn't point to ptrace(2).  If anyone's interested, if you set a break-
point in the child, the parent might return from the wait() with a status
of Stopped under certain conditions.]