[net.news] Keyword News Usenet Request For Comment 001

bstempleton@watmath.UUCP (06/11/83)

Show Mercy - it's the first draft.

URFC 001                                              K NEWS


                    Keyword News System
                     By Brad Templeton
               USENET Request For Comment 001



     For some time people have  been  using  a  news  system
based on newsgroups.  This is a short outline of my proposal
for a news system based on a classification system I  called
keywords.  The only essential difference between a newsgroup
and a keyword is that the Keyword news system (or K news) is
designed  so  that  there  is a very small overhead for each
keyword.   It  is  thus  possible  to  have  thousands   and
thousands of active keywords with little overhead.

     It is my feeling that  several  problems  have  emerged
with the old newsgroup style system.  These follow:

(1)  Due to the limited number of groups, there is  a  great
     deal  of  traffic  concerning  what  articles belong in
     which groups  and  whether  certain  groups  should  be
     created  or  destroyed.  Under K news, there is no such
     discussion.  If you want a new keyword, you create  it.
     If you want to use a name that is long and descriptive,
     you can.  If discussions go under several keywords,  it
     is easy to add them to your list.

(2)  The limited number of groups also creates  groups  like
     "net.misc"  and  "net.general"  which  are difficult to
     work with.  K news eliminates the need for net.misc and
     allows easy renaming of net.general.

(3)  Current systems only allow an "or"ing of groups when
     dealing with multiple groups.  In K news, it is possible
     to request articles that deal with a set of keywords;
     i.e., one can ask to be shown only articles that contain
     both the "science fiction" keyword and the "movie"
     keyword.

(4)  Current systems do not allow grouping all followups  to
     a given article together, or sorting articles according
     to posting date.  K news provides this because it  uses
     sort(1) on the complete list of articles to be seen.

(5)  Current news systems are slower than they should be
     because they must scan each newsgroup a user subscribes
     to in order to see if there is news.  K news does not
     have this problem.



     One main idea behind K news stems from  the  fact  that
the  average  news  reader  normally reads the news that has
arrived since news was last read.  Thus, instead of scanning
directories  and keeping track of what has been read, K news
scans a history file and keeps track of what  has  NOT  been
read.   In a given session, the history file is scanned from
the point in time when news was last read.  In  addition,  a
file  of  articles  not  read  from  the previous session is
scanned.  The user may  request  to  see  the  old  articles
first, or to have them merged in with the newer ones.  Find-
ing out what to read is a simple matter of  scanning  a  few
files and should be quite fast.

1.  The Keyword Environment

     K news can solve the B news  problems  by  promoting  a
different environment with keywords.  First of all, the dis-
tribution of an article is taken out of  the  keyword  name.
This  means  all  keywords are valid over all distributions.
The fact that there is an "auto" keyword means you can  post
an  auto article to netwide, statewide or even local distri-
bution.  This should  cut  down  on  the  number  of  people
advertising  their  cars to "net.auto" because the only auto
group has netwide distribution.

     An article will have several keywords.  The K news
system will probably insist that members of certain sets of
keywords be present.  For example, there should be a
distribution keyword with any article that is not local.
There might be a "followup" keyword on any followup, although
these can be detected from their "References" string.
Keywords like "spoiler" and "flame" can be put with articles
so that people can request not to see them.  (Ridiculous
groups like net.flame go away.)

     It seems that all articles fall into a certain set of
classes.  These classes are "query", "original information",
"reprint", "opinion" and "followup".  There are
some  sub-classes,  such  as "flame" (a type of opinion) and
"source code" (a type of original information).  It might be
a  good idea to insist that all posters provide one of these
keywords, with the followup keyword being automatic.  Thus a
reader can shut off all queries or all opinion articles.

     Groups like "net.misc" will no longer be  needed.   Any
new discussion can easily rate a new keyword, from "big mac"
to "socks in hyperspace".  The group "net.general" is  still
a  bit  of  a problem, but it can now be replaced with some-
thing like "announcement for all users", and there  will  be
very  little  implicit cry to put the article in the netwide
distribution.  There will still be problems, but  they  will
be reduced.

     It's also possible that we will still get a lot of  the
"You  posted  that  to the wrong keyword" type stuff.  It is
hoped that since adding a keyword to your subscription  list
will  be quite easy, people will not complain too much about
this.  Even so, it is still possible that  some  utility  to
help users select keywords will be required.  Each site will
keep all known keywords in a DBM type file.  (This  will  be
the  total  overhead  for  each keyword.) The DBM file entry
might contain who first used the keyword, a one  line  entry
describing  it, and its newsgroup mapping on a B to K inter-
face system.  A simple utility might scan a  user's  article
for  any of the keywords that occur in the text of the arti-
cle and suggest them as possible entries.  In  addition,  if
the  user  suggests a new keyword when posting an article, a
search for keywords that the new one could be  an  incorrect
spelling of would be in order.
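
     The two helper utilities described above could look
something like the following.  This is only a rough sketch:
the function names are made up for illustration, a plain list
stands in for the DBM file of known keywords, and the 0.8
similarity cutoff is an arbitrary assumption.

```python
import difflib
import re


def keywords_in_text(text, known_keywords):
    """Return the known keywords that occur as whole phrases in the text."""
    # Compress runs of blanks and ignore case before searching.
    lowered = " ".join(text.lower().split())
    found = []
    for keyword in known_keywords:
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.append(keyword)
    return found


def possible_misspellings(new_keyword, known_keywords, cutoff=0.8):
    """Return known keywords that the new keyword may be a misspelling of."""
    return difflib.get_close_matches(new_keyword, known_keywords,
                                     n=3, cutoff=cutoff)


# Hypothetical keyword list; a real site would read it from its DBM file.
known = ["science fiction", "movie", "big mac", "space"]

print(keywords_in_text("A new science  fiction movie opens Friday.", known))
print(possible_misspellings("sceince fiction", known))
```

The posting program would show both lists to the user and
let him or her decide; nothing here changes the article.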

     Since the keywords are the  important  thing  that  get
copied  over in a followup, the subject line will not remain
the same.  One current problem under B news is that you  get
discussions  that  wander under the same subject line.  This
subject soon becomes meaningless.   Any  followup  generated
with  K  news  will have an entirely new subject line, since
both the keywords and the References string will provide  an
indication of what is a followup to what.




2.  The K news implementation

     The K news system consists of three major parts.

2.1.  News Input

     Part one is the news entry system, much like the  inews
part  of  the  B  news system.  With this inews, whenever an
article comes in, it gets placed in a  file  somewhere.   It
can be any file, and can even (on something like a unix sys-
tem where seeks are possible) be placed  on  the  end  of  a
file.   If another news system like B news is running, it is
highly likely that B inews will place the file in one of the
newsgroup  directories  in  /usr/spool/news.   If  not, some
other program will find a place for  the  file,  perhaps  by
hashing or by using something based on the article-id.

     Whatever the program is that places the  article  in  a
file,  the  filename  should  be passed to the K news pickup
program.  This program will take the  article,  and  examine
the header.  Important information about the article will be
written to a special history file.  This  will  include  the
keywords  associated  with  the  article,  the "References:"
string of the article plus its message-id, the date of post-
ing,  the  pathname  of the file containing the article with
optional seek address and length, and  finally  the  subject
line.  Note, by the way, that in the case of a followup, any
extra keywords that were not in the  original  article  will
have to be placed in an extra field so they are not involved
in the sorting that groups articles with their followups.
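
     One possible shape for a history file entry is sketched
below.  The proposal lists what goes into the entry but not
how it is laid out, so the tab-separated layout, the field
order and every name here are assumptions made for
illustration.  Commas separate keywords, since "," is one of
the characters a keyword may not contain.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HistoryRecord:
    keywords: List[str]        # keywords used for matching and sorting
    references: List[str]      # "References:" ids plus the article's own id
    posted: str                # posting date, e.g. "830611"
    path: str                  # file containing the article
    subject: str
    seek: Optional[int] = None     # optional offset of the article in the file
    length: Optional[int] = None   # optional length of the article
    extra_keywords: List[str] = field(default_factory=list)  # added by a followup

    def to_line(self) -> str:
        """Serialise to one tab-separated history line (hypothetical layout)."""
        return "\t".join([
            ",".join(self.keywords),
            " ".join(self.references),
            self.posted,
            self.path,
            "" if self.seek is None else str(self.seek),
            "" if self.length is None else str(self.length),
            ",".join(self.extra_keywords),
            self.subject,
        ])

    @classmethod
    def from_line(cls, line: str) -> "HistoryRecord":
        """Parse a line written by to_line()."""
        kws, refs, posted, path, seek, length, extra, subject = \
            line.rstrip("\n").split("\t", 7)
        return cls(
            keywords=kws.split(",") if kws else [],
            references=refs.split(),
            posted=posted,
            path=path,
            subject=subject,
            seek=int(seek) if seek else None,
            length=int(length) if length else None,
            extra_keywords=extra.split(",") if extra else [],
        )


# Made-up example entry.
record = HistoryRecord(
    keywords=["space"],
    references=["<123@watmath.UUCP>"],
    posted="830611",
    path="/usr/spool/news/net/space/42",
    subject="Shuttle launch dates",
)
print(record.to_line())
```

Keeping the followup's extra keywords in their own field is
what lets the sort treat a followup as having the same
keywords as the article it follows.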

     History files will be  maintained  on  a  one  per  day
basis,  in a special history directory.  Each history file's
name will be formed from the date  for  that  history  file.
(Perhaps  in  days since the birthday of the net, or perhaps
in the form yymmdd.)  There may be a new history  file  each
day,  each week, or even every hour as the site requires.  K
inews will query the date  and  time  from  the  system  and
decide which history file to append to.
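
     As an illustration, here is a sketch of how K inews might
pick the history file name in the yymmdd form, and how a
reader might list the files to scan since the last session.
The function names, the directory argument and the hourly
naming variant are assumptions; the proposal leaves these
details open.

```python
import datetime
import os


def history_path(history_dir, when, per_hour=False):
    """Name of the history file an article arriving at `when` is appended to."""
    # yymmdd for one file per day; an hour suffix is one way to split further.
    name = when.strftime("%y%m%d%H" if per_hour else "%y%m%d")
    return os.path.join(history_dir, name)


def history_files_since(history_dir, last_read, now):
    """Daily history files to scan, oldest first, from the last session on."""
    day = last_read.date()
    paths = []
    while day <= now.date():
        paths.append(os.path.join(history_dir, day.strftime("%y%m%d")))
        day += datetime.timedelta(days=1)
    return paths


arrival = datetime.datetime(1983, 6, 11, 14, 30)
print(history_path("history", arrival))
print(history_files_since("history", arrival,
                          datetime.datetime(1983, 6, 14, 9, 0)))
```

Within the first file the reader would still have to skip
entries older than the last session; that bookkeeping is not
shown here.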

     If working with B news, there is no need  to  have  any
system  for  forwarding  to  other sites.  If K news is done
properly, however, B news will go away  and  such  a  system
will  be needed.  It should be noted that on sites that sup-
port both B news and K news, a mapping table that maps news-
groups to associated keywords and vice versa will be needed.
With new modifications to uux possible, it is envisioned
that  each  site  receiving  news  from  a K news site would
essentially have a .newsrc like file on the forwarding site.
This  is to say that each site would be in the same position
as a user, with a keyword subscription list and  a  list  of
unread  articles.   Forwarding could either be done by using
the same process a user does to read news when a transfer is
made,  or  by  having  the K inews check each article in the
subscription files for  known  sites.   The  first  way,  of
course, is much more efficient.

2.1.1.  B to K interface

     The B to K news interface will be quite difficult while
they  co-exist.   Forwarding B articles into the K system is
the easier part.  All that is required is  a  mapping  table
that  maps  newsgroups  to  keywords.  A site that runs both
would have to have their administrator  figure  new  entries
for  the  table when a new newsgroup is created.  It is also
possible that any item in a newsgroup that does not have  an
entry  in  the  map  table  simply  be mapped to the keyword
"Newsgroup ngname".   For  example,  articles  in  net.space
would be mapped to "Newsgroup space".  If there is a mapping
table, they will likely be mapped to the keyword "space"  as
well.  It is possible you will want to have all B news arti-
cles tagged with a "B News" keyword  so  that  they  can  be
spotted.
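
     The B to K direction could be sketched roughly as below.
The function name and the dictionary standing in for the map
table are assumptions.  The fallback keyword follows the
example above (net.space becomes "Newsgroup space"), and the
sketch reads "as well" to mean that mapped keywords are added
alongside the fallback rather than replacing it.

```python
def newsgroup_to_keywords(newsgroup, map_table, tag_b_news=True):
    """Return the keywords to attach to an article arriving from B news."""
    keywords = []
    # Fallback keyword: drop the leading "net." style prefix, as in the
    # proposal's example where net.space maps to "Newsgroup space".
    short_name = newsgroup.split(".", 1)[-1]
    keywords.append("Newsgroup " + short_name)
    # Any entries the site administrator has put in the map table.
    keywords.extend(map_table.get(newsgroup, []))
    # Optional marker so that B news articles can be spotted.
    if tag_b_news:
        keywords.append("B News")
    return keywords


# Hypothetical map table for one site.
site_map = {"net.space": ["space"]}

print(newsgroup_to_keywords("net.space", site_map))
print(newsgroup_to_keywords("net.misc", site_map, tag_b_news=False))
```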

     The other way is more difficult,  since  K  news  users
will  be  creating new keywords all the time.  It would be a
very difficult task for the administrator of a site to  keep
adding  entries  to the map table.  It is thus proposed that
during the interim, whenever a K news  user  creates  a  new
keyword,  that user must provide a newsgroup mapping for it.
For example, if I am the first to use the "Frank Zappa" key-
word, I will be asked to provide a newsgroup for it, perhaps
net.music.  In this case, a new header item giving  the  map
would  be added to the article, and any interface site would
recognize this line and put the new mapping in the  K  to  B
map.  The problem with this is that it requires K news users
to be  aware  of  B  newsgroups  which  they  will  consider
anachronisms  in  their  own system.  On the other hand, the
idea of newsgroups as more global level  keywords  has  some
appeal,  since  it allows grouping of messages in one direc-
tory for doing greps.

2.1.2.  Distribution

     Distribution will be done by special keywords,  perhaps
those ending in the string "_Distribution".  For example, an
article  to   go   to   the   net   would   be   posted   to
"Netwide_Distribution"  and  one  for  Canada would go under
"Canada_Distribution".  Thus site subscription  files  would
be quite simple, involving only Distribution keywords.

2.2.  Reading News - Phase One

     Reading news with K news is broken up into  two  parts.
The  first  part is general, and contains no user interface.
This is done so that people who wish  to  design  their  own
user interfaces can do so easily.

2.2.1.  Subscription List

     The first program will maintain two files.   The  first
is  the  subscription  list.   This tells which keywords and
discussions the user is interested in.  This will be a  list
of keywords subscribed to and boolean expressions built from
them.  Keywords are actually text strings, but they may  not
contain  a  special set of characters which are used to del-
imit them.  These characters are ":", ",",  "!",  "[",  "]",
"&",  "|",  "*",  "(",  and ")" to start with.  (More may be
thought of while the program is being written.)   Each  line
consists  of  a  keyword  pattern  to  describe  the  user's
interests.  In addition, some special lines in the subscrip-
tion  file  will tell what the user wants done with articles
from the previous session, and possibly special options.
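
     A small sketch of how a keyword might be checked before
it is accepted.  It rejects the delimiter characters listed
above and compresses runs of blanks to one space, which the
proposal also suggests.  The function name is made up, and
the delimiter set is only the starting set given here.

```python
# The delimiter characters named in the proposal "to start with".
DELIMITERS = set(':,![]&|*()')


def normalize_keyword(raw):
    """Return the keyword with blanks compressed, or raise ValueError."""
    keyword = " ".join(raw.split())
    if not keyword:
        raise ValueError("empty keyword")
    bad = sorted(set(keyword) & DELIMITERS)
    if bad:
        raise ValueError("keyword contains reserved characters: "
                         + " ".join(bad))
    return keyword


print(normalize_keyword("  science   fiction "))
```

Whether case should also be folded is not settled by the
proposal, so this sketch leaves case alone.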

     A typical subscription line lists  a  keyword  pattern.
For example, the line:

    science fiction

Asks for all articles with the  keyword  "science  fiction".
Quotes  may be required, but this is a matter to be decided.
It also makes sense that any blank fields in  a  keyword  be
compressed to one space so that typos do not cause problems.
The line "!star wars" would ask that no  articles  with  the
keyword "star wars" be shown.  We can also use "Ronald
Reagan & taxation" to ask for all articles with both of the
keywords shown.  Similarly "Ronald Reagan & !taxation" shows
us all articles about old Ron that do not contain the
taxation keyword.  Or we could go for

    Ronald Reagan & !( taxation | star wars )

Which shows us articles about Ron that have  nothing  to  do
with  taxation  or  star  wars.  (If you haven't guessed, we
probably refer to ABM star wars in this case.)
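
     Patterns of this kind can be evaluated with a small
recursive descent parser.  What follows is a sketch, not a
definitive design: the function names are invented, "&" is
assumed to bind tighter than "|" (the proposal does not say),
and only "!", "&", "|" and parentheses are handled.

```python
import re

OPERATORS = {"!", "&", "|", "(", ")"}
TOKEN = re.compile(r"\s*([!&|()]|[^!&|()]+)")


def tokenize(pattern):
    """Split a pattern into operator tokens and blank-compressed keywords."""
    tokens = []
    for match in TOKEN.finditer(pattern):
        tok = match.group(1)
        if tok in OPERATORS:
            tokens.append(tok)
        else:
            word = " ".join(tok.split())
            if word:
                tokens.append(word)
    return tokens


def matches(pattern, article_keywords):
    """True if the article's keyword set satisfies the pattern."""
    tokens = tokenize(pattern)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        value = parse_and()
        while peek() == "|":
            advance()
            rhs = parse_and()      # always parse the right side
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        tok = peek()
        if tok == "!":
            advance()
            return not parse_not()
        if tok == "(":
            advance()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing ')'")
            advance()
            return value
        if tok is None or tok in OPERATORS:
            raise ValueError("expected a keyword")
        advance()
        return tok in article_keywords

    result = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: " + peek())
    return result


article = {"Ronald Reagan", "star wars"}
print(matches("Ronald Reagan & !taxation", article))
print(matches("Ronald Reagan & !( taxation | star wars )", article))
```

With the article above, the first pattern matches and the
second, with its negated group, does not.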

     The order in the file is  important.   When  phase  one
tries  to  figure  out if a user wants to see an article, it
scans through the information in the subscription  list,  in
order.   It  stops as soon as it finds some form of definite
information.  This  means  either  positive  information  or
negative  information.   If the first line in your subscrip-
tion file is "Ronald Reagan", you will see  all  such  arti-
cles,  even  if  they  contain other keywords that you hate.
Likewise, if the first line in the file is "!Ronald Reagan",
you will never see an article about him, even if it contains
a keyword you subscribe to later on.  (There is an alternate
system described below to change this.)

     The character "*" will match any keyword.  It would  be
placed  on  the last line of a subscription file to indicate
that any keyword not marked with an "!"  is  subscribed  to.
It  is  doubtful  anybody would use this after the number of
keywords grows.
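
     The "first definite answer wins" rule and the "*" line
can be sketched as follows.  To stay short, this sketch only
understands lines that are a single keyword, a single
"!keyword" or "*"; the function name is made up, and treating
an article that matches no line as unwanted is an assumption.

```python
def wants_article(subscription_lines, article_keywords):
    """Scan the subscription lines in order; the first definite answer wins."""
    for line in subscription_lines:
        line = " ".join(line.split())
        if not line:
            continue
        if line == "*":
            return True                  # wildcard: anything left is wanted
        if line.startswith("!"):
            if line[1:].strip() in article_keywords:
                return False             # definite negative information
        elif line in article_keywords:
            return True                  # definite positive information
    return False                         # assumption: no match means no


subs = ["Ronald Reagan", "!taxation", "*"]
print(wants_article(subs, {"Ronald Reagan", "taxation"}))
print(wants_article(subs, {"taxation", "big mac"}))
```

The first article is shown because the "Ronald Reagan" line
is reached before "!taxation"; the second is not.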

     Keywords may have "sort attributes" on them to indicate
which  keywords  you  would  like to see first in a session.
These are essentially ascii strings which will be passed  to
sort(1).   If  you  want to see articles about "system shut-
down" first, you give it a low value like "A".   If you want
to  see articles about "big mac" last you give a priority of
the form "zzzzzzzz".  The nice thing about this is that when
you  have  a  new keyword, you can easily give it a priority
between any two that exist, unless you have given  something
a  priority  like  "^@", in which case it would be first for
all time.  We now see lines like:

    system shutdown [AAA]
    space [bb] & challenger [cc]


     Thus phase one works as follows:  It first picks up the
last  time  you  read news and finds the appropriate spot in
the history files.  The subscription file  is  read  in  and
parsed  into  a tree.  Mappings from keywords to sort values
are recorded, and keywords without sort values are  assigned
values  just after the value of the keywords on the previous
line.  Keywords not in  the  file  will  get  a  sort  value
equivalent  to  something like their keyword name with a "z"
in front of it.  Lines from the history file  are  read  in,
and  matched  against the subscription list.  If they match,
the appropriate line is written out onto a  temporary  file.
Note that matching is applied both to keywords and to
message-ids in the "References" line.  It is  thus  possible
to  request  to  take  or shut off sub discussions as can be
done in notesfiles.  Instead of writing out the keywords  to
this  file,  we  write instead the sort values given to each
keyword.  These sort values are  themselves  sorted  on  the
line before being written.  The old keywords can be appended
to the line if they are required.

     Once the new  file  is  prepared  it  is  sent  off  to
sort(1),  possibly with the file of unread articles appended
to it.  The first sort key is the keyword sort values. Since
followups  all  have  the  same keywords, they will match as
equal in the first sort key.  Since we are  sorting  by  the
keyword  sort values, the output file will have the articles
sorted by  keywords  in  the  presentation  order  the  user
requested.   The  next  sort  key is the "References" chain,
which includes the message-id of the article  if  it  is  an
original article.  Thus all followups to a given article are
sorted in a nice tree.  The final sort key  is  the  posting
date,  so that followups to an article are shown in order of
the time they were posted.
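
     The three sort keys can be illustrated with the sketch
below.  It uses Python's built-in sort in place of sort(1),
and the dictionary layout and names are assumptions.  A
keyword without a sort value gets "z" plus its name, as
suggested above; the rule for values inherited from the
previous subscription line is not modelled.  The chain key is
the References list for a followup and the article's own
message-id for an original article.

```python
def sort_value(keyword, sort_values):
    """Sort attribute for a keyword; unknown ones sort after the rest."""
    return sort_values.get(keyword, "z" + keyword)


def presentation_order(articles, sort_values):
    """Order articles by keyword sort values, then thread, then date."""
    def key(article):
        values = sorted(sort_value(k, sort_values)
                        for k in article["keywords"])
        chain = article["references"] or [article["message_id"]]
        return (values, chain, article["posted"])
    return sorted(articles, key=key)


# Made-up articles; "posted" is yymmddhhmm so that it sorts as text.
articles = [
    {"message_id": "<3@utzoo>", "references": [],
     "keywords": ["big mac"], "posted": "8306100700"},
    {"message_id": "<2@spanky>", "references": ["<1@watmath>"],
     "keywords": ["space"], "posted": "8306120800"},
    {"message_id": "<5@decvax>", "references": ["<1@watmath>"],
     "keywords": ["space"], "posted": "8306111200"},
    {"message_id": "<4@ihnp4>", "references": [],
     "keywords": ["system shutdown"], "posted": "8306130600"},
    {"message_id": "<1@watmath>", "references": [],
     "keywords": ["space"], "posted": "8306110900"},
]
sort_values = {"system shutdown": "AAA", "space": "bb"}

for article in presentation_order(articles, sort_values):
    print(article["message_id"])
```

Here the shutdown notice comes first, the "space" article is
followed by its two followups in posting order, and the
"big mac" article comes last.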

     Once sort is called we will have a file which  has,  in
addition  to  a  lot of garbage, a list of pathnames for the
articles the user wishes to read.   The  keywords  on  these
articles may also be present.  This is passed to phase two.

2.3.  Phase Two

     Phase two is the user interface.  It can be  very  dumb
or  it can be quite fancy.  Since it gets passed a readymade
list of articles, there is not much work to do.  All a simple
one need do is go through the list, doing what msgs or
readnews currently does to each file.  These programs
will  handle replies, followups etc.  Special utilities will
be provided for cancelling etc.

     When a user skips an article, the program can write the
appropriate line to the unread article file noted above.  It
is hoped the average user will not let  this  file  get  too
big.   More sophisticated programs will keep track of a list
of seek addresses in the sort output file that mark articles
that  have  not  been  read, and output this at the end of a
session.  This lets programs allow users to skip back
and  forth  among  the articles since the information is not
written out until the end.  In fact, it might  be  a  useful
utility to provide for writers of user interfaces.

     User interfaces can get quite fancy, with  screen  sys-
tems like notesfiles and vnews.  It would be nice to provide
a feature so that unrecognized commands are  passed  to  the
shell  with a search path list including a special directory
for news commands.  (Perhaps an environment variable so  the
user  can  specify.)   In the news command directory you put
simple commands like "decrypt" and "undigest" with appropri-
ate  short  names.   It is expected that several user inter-
faces will be written, including one just like  B  news  and
one just like notesfiles.  All interfaces to the subscription
file by the user interface program should be through
other programs that are part of phase one if possible.  This
keeps things apart.

2.3.1.  Typical Session

     The typical user interface program will first check  to
see  what  new keywords have come in since the last session.
These will be recorded in a separate history file in which
the last position read must be recorded.  The user, if it is
requested by appropriate options, will then be given a  list
of  new keywords that have appeared since the last time news
was read.  Some systems will query the user and allow him or
her to place these new keywords in the subscription files.

     The user interface must now call the phase one program,
with appropriate options, and the name of a temp file to put
the sort output in.  It may also request the sort output  on
a pipe if that is all it needs.  (Most programs will want to
be able to seek back in the output file.) Articles will then
be  shown  in the order requested, grouped perhaps according
to followup discussions or major keywords.  At  the  end,  a
list  of unread articles will be written out.  Articles will
probably be grouped by discussions and higher priority  key-
words.   Followups  will  insist  on a change of subject and
allow an addition of keywords and a change of the  distribu-
tion.

3.  Alternate Subscription Idea

     It is possible users will require more control on which
subscription  lines get priority than the order in the file.
Thus it is proposed that keywords get points  based  on  how
much  a  user  wants to see a keyword.  Keywords you want to
see would get positive points and keywords you don't want to
see  would get negative points.  For example: "Ronald Reagan
: 5" would assign 5 points to any  article  containing  that
keyword.  On the other hand "star wars : -4" and "taxation :
-6" would assign negative points to those keywords.  In this
case,  you would see articles with Reagan and star wars, but
would not see articles with  Reagan  and  taxation.   Scores
would apply to whole lines.  For example:

    Ronald Reagan [abc] & taxation [cde] : 20

Would give 20 points to any article with both keywords.

     In this system, every article must be checked against the
whole list.  For every match we get, we add the points assigned
for that match to our sum.  If, at the end, the sum is >= 0,
the user sees the article.  If negative, it is not seen.  It
should also be possible to assign scores of "oo"  and  "-oo"
which  would  represent  infinite  scores  and stop the scan
right away.
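
     The point system could be sketched like this.  The names
are invented, every line is assumed to carry a score after a
":", and only lines made of keywords joined by "&" are
handled; sort attributes in "[...]" are ignored.

```python
import math
import re


def parse_score(text):
    """Turn the text after ':' into a number; "oo" means infinity."""
    text = text.strip()
    if text == "oo":
        return math.inf
    if text == "-oo":
        return -math.inf
    return int(text)


def shows_article(scored_lines, article_keywords):
    """Sum the points of every matching line; show the article if >= 0."""
    total = 0
    for line in scored_lines:
        pattern, _, score_text = line.rpartition(":")
        # Drop "[...]" sort attributes and compress blanks in each keyword.
        required = [" ".join(re.sub(r"\[[^\]]*\]", "", part).split())
                    for part in pattern.split("&")]
        if all(keyword in article_keywords for keyword in required):
            score = parse_score(score_text)
            if math.isinf(score):
                return score > 0     # infinite score stops the scan
            total += score
    return total >= 0


lines = ["Ronald Reagan : 5", "star wars : -4", "taxation : -6"]
print(shows_article(lines, {"Ronald Reagan", "star wars"}))
print(shows_article(lines, {"Ronald Reagan", "taxation"}))
```

As in the example above, the Reagan and star wars article
sums to 1 and is shown, while Reagan and taxation sums to -1
and is not.  Read literally, the rule also shows an article
that matches no line at all, since its sum is 0.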

     In any system, by the way, the whole subscription  file
must be read into RAM.  Since the phase one program has lit-
tle to do but read this file, however,  the  K  news  system
should  be  able  to handle large subscription files.  Since
followup message-ids will also be placed  in  this  file,  a
utility that deletes very old ones would be a good idea.

4.  Comments

     This is just a  draft  proposal,  and  lots  of  little
details are missing.  Comments are welcome.  Also welcome is
somebody to implement the thing since many  people  are  too
busy  to  do so.  The implementation could be done in spots,
and much of the code can be taken from the existing  B  news
since the same header formats etc.  would be used.  I can be
reached at watmath!bstempleton.  Watmath is called by ihnp4,
decvax, utzoo, allegra, utcsrgv, hcr and many others.

-- 
	Brad Templeton - Waterloo, Ont. (519) 886-7304

ka@spanky.UUCP (06/14/83)

Brad, there are two parts to your proposal--the design and an im-
plementation.  I will pass over the latter very quickly.  It
looks like your scheme is implementable but involves a lot of
work.  Most of the efficiency problems associated with large
numbers of groups have been fixed in 2.10, so don't expect K news
to run faster than B news (although it would be nice if it did).


One problem with having too many newsgroups is that they are dif-
ficult to keep track of.  For example, you get people posting to
net.general because they don't know where else to post it.
Switching to the keyword system you propose will eliminate the
net.general problem by eliminating net.general, but it may make
the total confusion worse.  You give an example which includes
"star wars" as a keyword referring to the Reagan proposal; if I
blithely unsubscribe to that keyword I'm likely to lose stuff
about the movie as well.  The same problem occurs under the
current system (e. g. is net.railroad for discussing real rail-
roads or model railroads?), but we at least have an official list
of newsgroups to resolve such disputes.  It's hard to say how big
the problem would be without trying it.  Probably there should be
an official list of "permanent" keywords which would define
commonly used keywords including article types like "flame" and
"joke" as well as common topics like "unix" and "c".  It is
especially important to keep track of keywords which are special
to the K news system, like keywords which control distribution.

A related problem involves updating subscription lists, since new
keywords will continuously be invented.  Providing a list of new
keywords and the opportunity to update the subscription file each
time news is read eliminates most of my concerns.  A lot depends
upon the rate of addition of new keywords.  I would expect quite
a few of them--every time I mention something silly like ham-
burgers or socks there is the chance that a huge discussion will
develop, so I will create a new keyword just in case.  My guess
is that the fancy feature you suggest will be only used for a few
common keywords.  Other keywords will just be either listed or
not listed because they change too often.

Another part of your proposal calls for linking articles and fol-
lowups together.  This is a good idea.  I wonder if this makes a
keyword system unnecessary because discussions are already
grouped together.  Probably not--the keyword system is more flex-
ible.  For example, you can post to a variety of keywords, but
you can't post an article that is a followup to two articles
simultaneously.  The use of additional keywords to identify
things like flames also seems helpful.

The keyword/newsgroup interface would have to remain around more
or less indefinitely.  My understanding is that USENET currently
runs on four basic systems: A news, B news, notesfile, and
[unidentified] BITNET software.  In addition, certain newsgroups
are gatewayed into ARPANET mailing lists.  K news may replace the
first three, but it can't replace the BITNET software (which runs
on IBM hardware) or ARPANET mailing lists (which go to *many*
different types of machines).  This might not be too bad, al-
though it would complicate life for the K news user.  The best
bet seems to be to include the newsgroups in the keyword list.
Keeping a translation table on each system would be very diffi-
cult to make function correctly; probably articles discussing
Reagan's star wars proposal would end up in net.sf-lovers.  How-
ever, a translation system which showed the user the generated
newsgroups and asked if they were correct would be OK.  At some
point we might get all of USENET converted to keywords; at that
point there would just be a list of keywords for the Arpanet
mailing lists.

One miscellaneous point--I don't like long newsgroup names much
and the keywords would be likely to be long.  In fact, the propo-
sal calls for keywords ending in "_Distribution" for distribu-
tion, so I would have to type that for each article I post.  More
importantly, I have to wait while the user interface displays
those keywords on my screen.  We could abbreviate this, e. g.
"Dist nj", although that potentially is more confusing to new
users.  I would simply use newsgroup names as keywords rather
than mapping "net.space" into "Newsgroup net.space"; the presence
of the dot should make plain "net.space" clear enough.

Another element of your proposal is requiring users to specify
titles for followups.  We went through this when developing the
USENET Interchange Standard.  This calls for interfaces to provide
a default title consisting of the original title, preceded
by "Re: " if the original title did not already have this.  It
does, however, allow the user to specify a different title if she
desires.  The reasoning behind this approach is that there is no
need to change the title if the topic is really the same, which
is usually the case.  (Can you think of a better title for this
article?)  I agree that there has been a tendency for people to
rely on the default title too much, but I expect that this is
largely because they don't know how rather than that they are too
lazy.  The readnews 'f' command will take a title as an argument
after the '-', if any.  In vnews you can edit the "Subject:" line
that appears when you enter the editor.
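
The default title rule is simple enough to sketch.  The function
name is made up, and this is only an illustration of the rule as
described above, not code from any news release.

```python
def default_followup_subject(original_subject):
    """Default followup title: the original, with "Re: " added at most once."""
    subject = original_subject.strip()
    if subject.startswith("Re: "):
        return subject
    return "Re: " + subject


print(default_followup_subject("Keyword News Usenet Request For Comment 001"))
```

The user would still be free to replace the default with a
different title.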

K news seems to fit in pretty well with the USENET Interchange
Standard.  The standard does not define a "Keyword:" line, but
code to support one is already in 2.10, and older sites should
pass it through unchanged.  I would be hesitant about adding a
second keywords line for keywords added to a followup, as I
understand you are proposing.  Generation of the "Newsgroup:" and
"Distribution:" lines can be done at the posting site.  (The
"Distribution:" line won't really work until almost all sites
have switched to 2.10; K news will have to wait for that.)

A lot of thought obviously went into your proposal.  As you can
tell from the length of this response, I am interested in the
idea.  I'll comment more on the implementation in a couple of
days.
                                       Kenneth Almquist