bstempleton@watmath.UUCP (06/11/83)
Show mercy - it's the first draft.

URFC 001                                K NEWS

Keyword News System
By Brad Templeton
USENET Request For Comment 001

For some time people have been using a news system based on newsgroups. This is a short outline of my proposal for a news system based on a classification system I call keywords. The only essential difference between a newsgroup and a keyword is that the keyword news system (or K news) is designed so that there is very little overhead for each keyword. It is thus possible to have many thousands of active keywords.

It is my feeling that several problems have emerged with the old newsgroup-style system. They are:

(1) Due to the limited number of groups, there is a great deal of traffic concerning which articles belong in which groups and whether certain groups should be created or destroyed. Under K news, there is no such discussion. If you want a new keyword, you create it. If you want to use a name that is long and descriptive, you can. If a discussion goes under several keywords, it is easy to add them to your list.

(2) The limited number of groups also creates groups like "net.misc" and "net.general" which are difficult to work with. K news eliminates the need for net.misc and allows easy renaming of net.general.

(3) Current systems only allow an "or"ing of groups when dealing with multiple groups. In K news, it is possible to request articles that deal with a set of keywords, i.e., one can ask to be shown only articles that contain both the "science fiction" keyword and the "movie" keyword.

(4) Current systems do not allow grouping all followups to a given article together, or sorting articles according to posting date. K news provides this because it uses sort(1) on the complete list of articles to be seen.

(5) Current news systems are slower than they should be because they must scan each newsgroup a user subscribes to in order to see if there is news. K news does not have this problem.
One main idea behind K news stems from the fact that the average news reader normally reads the news that has arrived since news was last read. Thus, instead of scanning directories and keeping track of what has been read, K news scans a history file and keeps track of what has NOT been read. In a given session, the history file is scanned from the point in time when news was last read. In addition, a file of articles left unread in the previous session is scanned. The user may request to see the old articles first, or to have them merged in with the newer ones. Finding out what to read is a simple matter of scanning a few files and should be quite fast.

1. The Keyword Environment

K news can solve the B news problems by promoting a different environment with keywords. First of all, the distribution of an article is taken out of the keyword name. This means all keywords are valid over all distributions. Because there is a single "auto" keyword, you can post an auto article with netwide, statewide or even local distribution. This should cut down on the number of people advertising their cars in "net.auto" simply because the only auto group has netwide distribution.

An article will have several keywords. The K news system will probably insist that members of certain sets of keywords be present. For example, there should be a distribution keyword on any article that is not local. There might be a "followup" keyword on any followup, although followups can also be detected from their "References" string. Keywords like "spoiler" and "flame" can be attached to articles so that people can request not to see them. (Ridiculous groups like net.flame go away.)

All articles seem to fall into a certain set of classes. These classes are "query", "original information", "reprint", "opinion" and "followup".
There are some sub-classes, such as "flame" (a type of opinion) and "source code" (a type of original information). It might be a good idea to insist that all posters provide one of these keywords, with the followup keyword being automatic. Thus a reader can shut off all queries or all opinion articles.

Groups like "net.misc" will no longer be needed. Any new discussion can easily rate a new keyword, from "big mac" to "socks in hyperspace". The group "net.general" is still a bit of a problem, but it can now be replaced with something like "announcement for all users", and there will be very little implicit urge to give the article netwide distribution. There will still be problems, but they will be reduced.

It's also possible that we will still get a lot of the "You posted that to the wrong keyword" type of article. It is hoped that since adding a keyword to your subscription list will be quite easy, people will not complain too much about this.

Even so, it is still possible that some utility to help users select keywords will be required. Each site will keep all known keywords in a DBM-type file. (This will be the total overhead for each keyword.) The DBM file entry might contain who first used the keyword, a one-line description of it, and its newsgroup mapping on a B to K interface system. A simple utility might scan a user's article for any known keywords that occur in the text of the article and suggest them as possible entries. In addition, if the user suggests a new keyword when posting an article, a search for existing keywords that the new one could be a misspelling of would be in order.

Since the keywords are the important thing that gets copied over in a followup, the subject line will not remain the same. One current problem under B news is that discussions wander while staying under the same subject line. This subject soon becomes meaningless.
Any followup generated with K news will have an entirely new subject line, since both the keywords and the References string will provide an indication of what is a followup to what.

2. The K News Implementation

The K news system consists of three major parts.

2.1. News Input

Part one is the news entry system, much like the inews part of the B news system. With this inews, whenever an article comes in, it gets placed in a file somewhere. It can be any file, and the article can even (on something like a Unix system where seeks are possible) be appended to the end of an existing file. If another news system like B news is running, it is highly likely that B inews will place the file in one of the newsgroup directories in /usr/spool/news. If not, some other program will find a place for the file, perhaps by hashing or by using something based on the article-id.

Whatever program places the article in a file, the filename should be passed to the K news pickup program. This program will take the article and examine the header. Important information about the article will be written to a special history file. This will include the keywords associated with the article, the "References:" string of the article plus its message-id, the date of posting, the pathname of the file containing the article (with optional seek address and length), and finally the subject line. Note, by the way, that in the case of a followup, any extra keywords that were not in the original article will have to be placed in an extra field so they are not involved in the sorting that groups articles with their followups.

History files will be maintained on a one-per-day basis, in a special history directory. Each history file's name will be formed from the date for that history file. (Perhaps in days since the birthday of the net, or perhaps in the form yymmdd.)
In practice there may be a new history file each day, each week, or even every hour, as the site requires. K inews will query the date and time from the system and decide which history file to append to.

If working with B news, there is no need to have any system for forwarding to other sites. If K news is done properly, however, B news will go away and such a system will be needed. It should be noted that on sites that support both B news and K news, a mapping table that maps newsgroups to associated keywords and vice versa will be needed.

With new modifications to uux possible, it is envisioned that each site receiving news from a K news site would essentially have a .newsrc-like file on the forwarding site. This is to say that each site would be in the same position as a user, with a keyword subscription list and a list of unread articles. Forwarding could either be done by using the same process a user does to read news when a transfer is made, or by having the K inews check each article against the subscription files for known sites. The first way, of course, is much more efficient.

2.1.1. B to K Interface

The B to K news interface will be quite difficult while the two systems co-exist. Forwarding B articles into the K system is the easier part. All that is required is a mapping table that maps newsgroups to keywords. The administrator of a site that runs both would have to figure out new entries for the table when a new newsgroup is created. It is also possible that any item in a newsgroup that does not have an entry in the map table simply be mapped to the keyword "Newsgroup ngname". For example, articles in net.space would be mapped to "Newsgroup space". If there is a mapping table, they will likely be mapped to the keyword "space" as well. It is possible you will want to have all B news articles tagged with a "B News" keyword so that they can be spotted.

The other way is more difficult, since K news users will be creating new keywords all the time.
It would be a very difficult task for the administrator of a site to keep adding entries to the map table. It is thus proposed that during the interim, whenever a K news user creates a new keyword, that user must provide a newsgroup mapping for it. For example, if I am the first to use the "Frank Zappa" keyword, I will be asked to provide a newsgroup for it, perhaps net.music. In this case, a new header item giving the map would be added to the article, and any interface site would recognize this line and put the new mapping in the K to B map.

The problem with this is that it requires K news users to be aware of B newsgroups, which they will consider anachronisms in their own system. On the other hand, the idea of newsgroups as more global-level keywords has some appeal, since it allows grouping of messages in one directory for doing greps.

2.1.2. Distribution

Distribution will be done by special keywords, perhaps those ending in the string "_Distribution". For example, an article to go to the net would be posted to "Netwide_Distribution" and one for Canada would go under "Canada_Distribution". Thus site subscription files would be quite simple, involving only Distribution keywords.

2.2. Reading News - Phase One

Reading news with K news is broken up into two parts. The first part is general, and contains no user interface. This is done so that people who wish to design their own user interfaces can do so easily.

2.2.1. Subscription List

The first program will maintain two files. The first is the subscription list. This tells which keywords and discussions the user is interested in. It will be a list of keywords subscribed to and boolean expressions built from them. Keywords are actually text strings, but they may not contain any of a special set of characters which are used to delimit them. These characters are ":", ",", "!", "[", "]", "&", "|", "*", "(", and ")" to start with.
(More may be thought of while the program is being written.) Each line consists of a keyword pattern to describe the user's interests. In addition, some special lines in the subscription file will tell what the user wants done with articles from the previous session, and possibly special options.

A typical subscription line lists a keyword pattern. For example, the line:

    science fiction

asks for all articles with the keyword "science fiction". Quotes may be required, but this is a matter to be decided. It also makes sense that any run of blanks in a keyword be compressed to one space so that typos do not cause problems.

The line "!star wars" would ask that no articles with the keyword star wars be shown. We can also ask for "Ronald Reagan & taxation" to ask for all articles with both of the keywords. Similarly "Ronald Reagan & !taxation" shows us all articles about old Ron that do not contain the taxation keyword. Or we could go for

    Ronald Reagan & !( taxation | star wars )

which shows us articles about Ron that have nothing to do with taxation or star wars. (If you haven't guessed, we probably refer to ABM star wars in this case.)

The order in the file is important. When phase one tries to figure out if a user wants to see an article, it scans through the information in the subscription list, in order. It stops as soon as it finds some form of definite information. This means either positive information or negative information. If the first line in your subscription file is "Ronald Reagan", you will see all such articles, even if they contain other keywords that you hate. Likewise, if the first line in the file is "!Ronald Reagan", you will never see an article about him, even if it contains a keyword you subscribe to later on. (There is an alternate system described below to change this.)

The character "*" will match any keyword.
It would be placed on the last line of a subscription file to indicate that any keyword not marked with an "!" is subscribed to. It is doubtful anybody would use this after the number of keywords grows.

Keywords may have "sort attributes" on them to indicate which keywords you would like to see first in a session. These are essentially ASCII strings which will be passed to sort(1). If you want to see articles about "system shutdown" first, you give it a low value like "A". If you want to see articles about "big mac" last, you give it a priority of the form "zzzzzzzz". The nice thing about this is that when you have a new keyword, you can easily give it a priority between any two that exist, unless you have given something a priority like "^@", in which case it would be first for all time. We now see lines like:

    system shutdown [AAA]
    space [bb] & challenger [cc]

Thus phase one works as follows. It first picks up the last time you read news and finds the appropriate spot in the history files. The subscription file is read in and parsed into a tree. Mappings from keywords to sort values are recorded, and keywords without sort values are assigned values just after the value of the keywords on the previous line. Keywords not in the file will get a sort value equivalent to something like their keyword name with a "z" in front of it.

Lines from the history file are read in and matched against the subscription list. If they match, the appropriate line is written out onto a temporary file. Note that matching is applied both to keywords and to message-ids in the "References" line. It is thus possible to request to take or shut off subdiscussions, as can be done in notesfiles.

Instead of writing out the keywords to this file, we write the sort values given to each keyword. These sort values are themselves sorted on the line before being written. The old keywords can be appended to the line if they are required.
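The ordered matching described above (scan the subscription lines in order and stop at the first one that gives definite information) could be sketched roughly as follows in Python. This is only an illustration, not part of the proposal: the function names `tokenize`, `matches` and `wants` are made up, a line that starts with "!" is assumed to be a reject rule for the rest of the line, an article that no line decides is assumed not to be shown, and sort attributes in brackets are not handled.

```python
import re

# Operators are single characters; anything else is keyword text.
TOKEN = re.compile(r"\s*([&|!()]|[^&|!()]+)")

def tokenize(pattern):
    """Split a pattern into operators and keyword strings."""
    tokens = []
    for tok in TOKEN.findall(pattern):
        tok = " ".join(tok.split())   # compress runs of blanks to one space
        if tok:
            tokens.append(tok)
    return tokens

def matches(pattern, keywords):
    """Evaluate a boolean keyword pattern against an article's keyword set."""
    tokens = tokenize(pattern)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():                       # expr := term ('|' term)*
        value = term()
        while peek() == "|":
            take()
            rhs = term()              # always parse the right-hand side
            value = value or rhs
        return value

    def term():                       # term := factor ('&' factor)*
        value = factor()
        while peek() == "&":
            take()
            rhs = factor()
            value = value and rhs
        return value

    def factor():                     # factor := '!' factor | '(' expr ')' | keyword
        tok = take()
        if tok == "!":
            return not factor()
        if tok == "(":
            value = expr()
            take()                    # consume the closing ')'
            return value
        if tok == "*":                # "*" matches any keyword
            return bool(keywords)
        return tok in keywords

    return expr()

def wants(subscription, keywords):
    """Scan lines in order; the first line giving definite information wins."""
    for line in subscription:
        line = line.strip()
        if line.startswith("!"):
            # Assumption: a leading '!' turns the whole line into a reject rule.
            if matches(line[1:], keywords):
                return False
        elif matches(line, keywords):
            return True
    return False                      # assumption: undecided articles are hidden

subscription = ["!star wars", "Ronald Reagan & !taxation", "science fiction"]
print(wants(subscription, {"Ronald Reagan", "budget"}))      # True
print(wants(subscription, {"Ronald Reagan", "star wars"}))   # False
```

Because "!star wars" comes first in this example list, it rejects the second article before the Reagan line is ever consulted, which is the order-dependence the text describes.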
Once the new file is prepared it is sent off to sort(1), possibly with the file of unread articles appended to it. The first sort key is the keyword sort values. Since followups all have the same keywords, they will match as equal in the first sort key. Since we are sorting by the keyword sort values, the output file will have the articles sorted by keywords in the presentation order the user requested. The next sort key is the "References" chain, which includes the message-id of the article if it is an original article. Thus all followups to a given article are sorted in a nice tree. The final sort key is the posting date, so that followups to an article are shown in order of the time they were posted.

Once sort is called we will have a file which has, in addition to a lot of garbage, a list of pathnames for the articles the user wishes to read. The keywords on these articles may also be present. This is passed to phase two.

2.3. Phase Two

Phase two is the user interface. It can be very dumb or it can be quite fancy. Since it gets passed a ready-made list of articles, there is not much work to do. All a simple one need do is go through the list, doing to each file what msgs or readnews currently does. These programs will handle replies, followups etc. Special utilities will be provided for cancelling etc.

When a user skips an article, the program can write the appropriate line to the unread article file noted above. It is hoped the average user will not let this file get too big. More sophisticated programs will keep track of a list of seek addresses in the sort output file that mark articles that have not been read, and output this at the end of a session. This lets programs allow users to skip back and forth among the articles, since the information is not written out until the end. In fact, it might be a useful utility to provide for writers of user interfaces.
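The three-key sort that produces the list handed to phase two could be sketched like this in Python, with the built-in `sorted` standing in for sort(1). The `Entry` record, its field names and the sample pathnames are hypothetical. The sketch also assumes that every chain ends with the article's own message-id (the text only says this for original articles), which is what makes followups land directly under their parent.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    sort_values: tuple   # keyword sort values, already sorted on the line
    references: tuple    # message-id chain, assumed to end with the article's own id
    posted: str          # posting date in a sortable form such as yymmdd
    path: str            # pathname of the file holding the article

def presentation_order(entries):
    """Order by keyword sort values, then References chain, then posting date."""
    return sorted(entries, key=lambda e: (e.sort_values, e.references, e.posted))

entries = [
    Entry(("bb",), ("<10@x>",), "830611", "a/10"),            # original, "space"
    Entry(("AAA",), ("<20@y>",), "830612", "a/20"),           # "system shutdown"
    Entry(("bb",), ("<10@x>", "<11@z>"), "830613", "a/11"),   # followup to <10@x>
    Entry(("bb",), ("<05@w>",), "830610", "a/05"),            # another "space" original
]

for entry in presentation_order(entries):
    print(entry.path)
# prints a/20, a/05, a/10, a/11 (one per line)
```

The "AAA" article comes first because its sort value is lowest, and the followup a/11 appears immediately after its parent a/10 because its chain begins with the parent's message-id.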
User interfaces can get quite fancy, with screen systems like notesfiles and vnews. It would be nice to provide a feature so that unrecognized commands are passed to the shell with a search path list including a special directory for news commands. (Perhaps an environment variable so the user can specify.) In the news command directory you put simple commands like "decrypt" and "undigest" with appropriate short names. It is expected that several user interfaces will be written, including one just like B news and one just like notesfiles. All access to the subscription file by the user interface program should be through other programs that are part of phase one if possible. This keeps things apart.

2.3.1. Typical Session

The typical user interface program will first check to see what new keywords have come in since the last session. These will be recorded in a separate history file in which the last position read must be recorded. The user, if it is requested by appropriate options, will then be given a list of new keywords that have appeared since the last time news was read. Some systems will query the user and allow him or her to place these new keywords in the subscription files.

The user interface must now call the phase one program, with appropriate options and the name of a temp file to put the sort output in. It may also request the sort output on a pipe if that is all it needs. (Most programs will want to be able to seek back in the output file.) Articles will then be shown in the order requested, grouped perhaps according to followup discussions or major keywords. At the end, a list of unread articles will be written out. Articles will probably be grouped by discussions and higher-priority keywords. Followups will insist on a change of subject and allow an addition of keywords and a change of the distribution.

3. Alternate Subscription Idea

It is possible users will require more control over which subscription lines get priority than the order in the file provides. Thus it is proposed that keywords get points based on how much a user wants to see a keyword. Keywords you want to see would get positive points and keywords you don't want to see would get negative points. For example, "Ronald Reagan : 5" would assign 5 points to any article containing that keyword. On the other hand "star wars : -4" and "taxation : -6" would assign negative points to those keywords. In this case, you would see articles with Reagan and star wars (5 - 4 = 1), but would not see articles with Reagan and taxation (5 - 6 = -1). Scores would apply to whole lines. For example:

    Ronald Reagan [abc] & taxation [cde] : 20

would give 20 points to any article with both keywords.

In this system, every article must be checked against the whole list. For every match we get, we add the points assigned for that match to our sum. If, at the end, the sum is >= 0, the user sees the article. If negative, it is not seen. It should also be possible to assign scores of "oo" and "-oo" which would represent infinite scores and stop the scan right away.

In any system, by the way, the whole subscription file must be read into RAM. Since the phase one program has little to do but read this file, however, the K news system should be able to handle large subscription files. Since followup message-ids will also be placed in this file, a utility that deletes very old ones would be a good idea.

4. Comments

This is just a draft proposal, and lots of little details are missing. Comments are welcome. Also welcome is somebody to implement the thing, since many people are too busy to do so. The implementation could be done in pieces, and much of the code can be taken from the existing B news since the same header formats etc. would be used. I can be reached at watmath!bstempleton.
Watmath is called by ihnp4, decvax, utzoo, allegra, utcsrgv, hcr and many others.

--
Brad Templeton - Waterloo, Ont. (519) 886-7304
ka@spanky.UUCP (06/14/83)
Brad, there are two parts to your proposal--the design and an implementation. I will pass over the latter very quickly. It looks like your scheme is implementable but involves a lot of work. Most of the efficiency problems associated with large numbers of groups have been fixed in 2.10, so don't expect K news to run faster than B news (although it would be nice if it did).

One problem with having too many newsgroups is that they are difficult to keep track of. For example, you get people posting to net.general because they don't know where else to post. Switching to the keyword system you propose will eliminate the net.general problem by eliminating net.general, but it may make the total confusion worse. You give an example which includes "star wars" as a keyword referring to the Reagan proposal; if I blithely unsubscribe to that keyword I'm likely to lose stuff about the movie as well. The same problem occurs under the current system (e.g., is net.railroad for discussing real railroads or model railroads?), but we at least have an official list of newsgroups to resolve such disputes. It's hard to say how big the problem would be without trying it. Probably there should be an official list of "permanent" keywords which would define commonly used keywords, including article types like "flame" and "joke" as well as common topics like "unix" and "c". It is especially important to keep track of keywords which are special to the K news system, like keywords which control distribution.

A related problem involves updating subscription lists, since new keywords will continuously be invented. Providing a list of new keywords and the opportunity to update the subscription file each time news is read eliminates most of my concerns. A lot depends upon the rate of addition of new keywords.
I would expect quite a few of them--every time I mention something silly like hamburgers or socks there is the chance that a huge discussion will develop, so I will create a new keyword just in case. My guess is that the fancy feature you suggest will only be used for a few common keywords. Other keywords will just be either listed or not listed because they change too often.

Another part of your proposal calls for linking articles and followups together. This is a good idea. I wonder if this makes a keyword system unnecessary because discussions are already grouped together. Probably not--the keyword system is more flexible. For example, you can post to a variety of keywords, but you can't post an article that is a followup to two articles simultaneously. The use of additional keywords to identify things like flames also seems helpful.

The keyword/newsgroup interface would have to remain around more or less indefinitely. My understanding is that USENET currently runs on four basic systems: A news, B news, notesfiles, and [unidentified] BITNET software. In addition, certain newsgroups are gatewayed into ARPANET mailing lists. K news may replace the first three, but it can't replace the BITNET software (which runs on IBM hardware) or ARPANET mailing lists (which go to *many* different types of machines). This might not be too bad, although it would complicate life for the K news user. The best bet seems to be to include the newsgroups in the keyword list. Keeping a translation table on each system would be very difficult to make function correctly; probably articles discussing Reagan's star wars proposal would end up in net.sf-lovers. However, a translation system which showed the user the generated newsgroups and asked if they were correct would be OK. At some point we might get all of USENET converted to keywords; at that point there would just be a list of keywords for the ARPANET mailing lists.
One miscellaneous point--I don't like long newsgroup names much, and the keywords would be likely to be long. In fact, the proposal calls for keywords ending in "_Distribution" for distribution, so I would have to type that for each article I post. More importantly, I have to wait while the user interface displays those keywords on my screen. We could abbreviate this, e.g., "Dist nj", although that is potentially more confusing to new users. I would simply use newsgroup names as keywords rather than mapping "net.space" into "Newsgroup net.space"; the presence of the dot should make plain "net.space" clear enough.

Another element of your proposal is requiring users to specify titles for followups. We went through this when developing the USENET Interchange Standard. This calls for interfaces to provide a default title consisting of the original title, preceded by "Re: " if the original title did not already have this. It does, however, allow the user to specify a different title if she desires. The reasoning behind this approach is that there is no need to change the title if the topic is really the same, which is usually the case. (Can you think of a better title for this article?) I agree that there has been a tendency for people to rely on the default title too much, but I expect that this is largely because they don't know how to change it rather than because they are too lazy. The readnews 'f' command will take a title as an argument after the '-', if any. In vnews you can edit the "Subject:" line that appears when you enter the editor.

K news seems to fit in pretty well with the USENET Interchange Standard. The standard does not define a "Keyword:" line, but code to support one is already in 2.10, and older sites should pass it through unchanged. I would be hesitant about adding a second keywords line for keywords added to a followup, as I understand you are proposing. Generation of the "Newsgroup:" and "Distribution:" lines can be done at the posting site.
(The "Distribution:" line won't really work until almost all sites have switched to 2.10; K news will have to wait for that.)

A lot of thought obviously went into your proposal. As you can tell from the length of this response, I am interested in the idea. I'll comment more on the implementation in a couple of days.

Kenneth Almquist