lauren@vortex.UUCP (Lauren Weinstein) (02/25/85)
If we keep creating new newsgroups whenever someone finds "more interest than he/she expected" in a topic, we're going to kill Usenet faster than really necessary. For many topics, we find a fairly small group of people who are interested in the discussion, and a MASSIVE group of people who couldn't care less. Yet the netnews goes pretty much everywhere, regardless of the actual interest.

For many of the topics now being proposed as groups, the use of coordinated mailing lists could be far more efficient both in terms of delivery time and costs. If we gave the matter some thought, we might find that some of the backbone sites, which have to bear the brunt of mass netnews postings, might be willing to help support some mailing lists. These might obviate the need to create new newsgroups whenever someone finds a new collection of people who want to discuss a particular topic.

The current netnews philosophy, which basically amounts to "send everything, everywhere, and damn the cost" in most cases, is becoming increasingly self-destructive.

--Lauren--
paul@mogwai.UUCP (Paul H. Mauritz) (02/26/85)
> If we keep creating new newsgroups whenever someone finds "more interest
> than he/she expected" in a topic, we're going to kill Usenet
> faster than really necessary. For many topics, we find a fairly
>
> If we gave the matter some thought, we might find that some of the
> backbone sites, which have to bear the brunt of mass netnews postings,
> might be willing to help support some mailing lists. These might

Bravo Lauren.

As an additional note, I was a member of a group proposing a new newsgroup. One thoughtful person suggested that we instead start a mailing list, which would include all articles which people would like to submit. One person volunteered as coordinator - NOTE: I said coordinator, not moderator. He/she simply mails out to all interested people all the stuff he receives on the topic. It has been around for several months, and is working out fine - in my opinion.

Mailing lists could effectively - and less expensively - replace a large number of the current newsgroups.
-- 
Paul H. Mauritz - Digital Equipment Corporation
UUCP: decvax!grendel!paul    ARPA: grendel!paul@seismo.ARPA
AT&T: (301) 459-7956
USPS: 8301 Professional Place, Landover MD USA 20785, MS-DCO/913
"We do not inherit the world from our ancestors, we borrow it from our children."
hokey@plus5.UUCP (Hokey) (02/28/85)
I would almost think it is time to re-examine the possibilities for a keyword-based news system. We could create all the keywords we wanted, and just come up with an aliases database to handle the similarities.

Sure would make things different!
-- 
Hokey           ..ihnp4!plus5!hokey
314-725-9492
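[A minimal sketch, in C, of what such an aliases database might look like: a flat file of "variant canonical" pairs consulted to normalize a keyword before it is used. The file path, format, and function name are all invented for illustration; nothing like this exists in the current news code.]

    /* Hypothetical alias database: a flat file of "variant canonical"
     * pairs, consulted to normalize a keyword before lookup. The path,
     * format, and function name are assumptions, not existing code. */
    #include <stdio.h>
    #include <string.h>

    #define ALIASFILE "/usr/lib/news/kwaliases"    /* assumed path */

    /* Put the canonical form of `word` into `canon`; a word with no
     * alias entry is its own canonical form. */
    void kwalias(char *word, char *canon)
    {
        FILE *fp = fopen(ALIASFILE, "r");
        char variant[64], target[64];

        strcpy(canon, word);           /* default: maps to itself */
        if (fp == NULL)
            return;
        while (fscanf(fp, "%63s %63s", variant, target) == 2) {
            if (strcmp(variant, word) == 0) {
                strcpy(canon, target); /* e.g. "f77" -> "fortran" */
                break;
            }
        }
        fclose(fp);
    }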
chuqui@nsc.UUCP (The Phantom) (03/03/85)
In article <614@plus5.UUCP> hokey@plus5.UUCP (Hokey) writes:
>I would almost think it is time to re-examine the possibilities for a
>keyword-based news system. We could create all the keywords we wanted,
>and just come up with an aliases database to handle the similarities.
>
>Sure would make things different!

My opinion would be that it would make life worse, not better. I actually have some 'facts' to base this comment on. I have been experimenting with a new archiving system for netnews-- I've needed it because I have something like 55 Megabytes of usenet archive from various groups that I'd like to find things in once in a while (as you might imagine, 'grep' doesn't really cut it on things of that size), since there IS a lot of good information out there, somewhere.

I started looking at this last spring, when I realized I was going to need it (this was about when 'grep <word> /usr/spool/oldnews/net/unix-wizards/*' failed from too many file names). I finally got a design I thought was workable, which got posted as an RFC to unix-wizards and net.news sometime around Christmas (I got no comments back, BTW). I've been playing with various parts of the design as prototypes since, looking at the design versus the reality, and trying to decide whether or not it will work. The answer, for now, is that it won't. Here are the problems I ran into. To some degree I'm simplifying, to some degree I've probably missed something obvious. Comments are welcome (preferably by mail) since I'm still trying to build an archiver...

My database is 50+ meg, and growing by 1-2 Meg a week. This is a LOT of data, and a lot of it is out of date, duplicated, or not exceptionally useful. Excepting net.sources, it is broken into tiny pieces (somewhere between 1000 and 3000 bytes seems to be the average) and so you have a LOT of files (estimate left as an exercise for the reader). Implementing a lot of these things on a standard news system means it hits a lot LESS data, but you also need to worry about expiration of data, a whole new ballgame.

One strong requirement for a keyword lookup is speed, which means that the keyword database must be created before it is needed. For me, this meant when an article was added to the archive; for news, when it is given to rnews (or whatever replaces it). Rnews is already a hog. Generating keywords is not what I would call a trivial exercise, and while I found a fairly elegant way of doing it we are still talking about a fair amount of cycles.

We are also talking about a fair amount of keywords. Dbm is out-- it doesn't like multiple keys, which means you get to figure out some way of storing them for quick access. You also need to store them, which takes space. I just took a quick look at my system: unix-wizards has 370 articles, singles has 312, jokes has 530, flame has 449, politics has 528 and music has 385. That is about 2574 articles in 6 of the larger groups. My history file shows 12,000 articles total. I was generating about 10 keywords average on an article by pulling significant words out of the first block of the article past the header (the Keywords: line is either ignored or used for 'cute' words often enough that I stopped playing with it for now). Assume 5 characters for an average word. That gives us, just for keyword storage, 600,000 characters (12,000 articles times 10 keywords times 5 characters), and there is no information about where those keywords point to. So, add a filename 'unix-wizards/10043' for another 20 characters, giving us 25 characters an entry, or about 3 megabytes.
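[For illustration, a rough sketch in C of the kind of keyword generation Chuq describes: pulling significant words out of the first block of the article past the header. The stopword list, the 4-character cutoff, and the names here are assumptions, not his actual archiver code.]

    #include <stdio.h>
    #include <ctype.h>
    #include <string.h>

    #define MAXKEYS 10      /* ~10 keywords per article, as observed */
    #define BLOCKSZ 1024    /* only look at the first block */

    static char *stopwords[] = { "the", "and", "that", "this", "with", 0 };

    static int significant(char *w)
    {
        char **s;

        if (strlen(w) < 4)          /* short words are rarely useful */
            return 0;
        for (s = stopwords; *s; s++)
            if (strcmp(w, *s) == 0)
                return 0;
        return 1;
    }

    /* Extract up to MAXKEYS keywords from the first BLOCKSZ bytes on
     * `fp`, assumed to be positioned just past the article header. */
    int getkeys(FILE *fp, char keys[][32])
    {
        char buf[BLOCKSZ + 1], *w, *p;
        int n, nkeys = 0;

        n = fread(buf, 1, BLOCKSZ, fp);
        buf[n] = '\0';
        for (w = strtok(buf, " \t\n"); w != NULL && nkeys < MAXKEYS;
             w = strtok(NULL, " \t\n")) {
            for (p = w; *p; p++)    /* fold to lower case */
                *p = tolower((unsigned char)*p);
            if (significant(w)) {
                strncpy(keys[nkeys], w, 31);
                keys[nkeys][31] = '\0';
                nkeys++;
            }
        }
        return nkeys;
    }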
What this means is we need to create, maintain, and update a 3 megabyte keyword database using a history-file-like format (each keyword in each article has a line in a file of the form '<keyword> <path>'). That is about as compact as you will get, but lookup is SLOW (linear searching) and record deletion is essentially impossible without regenerating the database (anyone seen how much time expire already takes? *grin*). What you end up doing is setting up a database of databases keyed on keywords, which helps lookup speed but not the deletion problem. Perhaps a set of dbm files, a single dbm file for each keyword, will work, but you run into the problem that the standard dbm only understands one file. mdbm is a solution, if it can be made generally available.

The other problem with this is that the more you do to make the data flexible, the more you are going to make the data spread out, taking up more disk. If you go to a full-fledged dbm scheme, you'll be lucky to store your three megabytes in less than 6 or 7. As far as I can tell, to get a system that allows fast keyword lookup and reasonably fast deletion for expire, with the overhead taken on by rnews, you are looking at something like an added 10-15% of CPU overhead to rnews for the keyword generation and database setup, about 5% to expire, and a 50% increase in disk storage. I simply can't afford that right now.

An added problem is simply that keywords are highly dependent upon the submitter. What is f77 to one person is fortran to another, ftn to a third, fortrash to a fourth, and obsolete to a fifth. This makes finding something in a keyword database a bit of a game-- you look for an obvious word, which gives you hints towards other words (maybe), and so on. I found myself running four or five iterations to find things, and either getting a lot of unrelated stuff (except incidentally) or not getting what I was looking for. Example: a while back there was a discussion in unix-wizards about Vaxen greater than 8 Megabytes, and the software changes needed to make it work. Look up megabyte. No luck. Look up 8. You'll be there for months. Look up meg, or mbyte, or vax or vaxen or memory-- I KNOW that those articles are in there somewhere, but I never DID find them.

So the ultimate usefulness of a keyword system lies in having a set of defined and used keywords. Trying to find things in the prototypes I hacked through simply proved to me that we don't have an agreement on the words we use for various things (you should see the number of words used to describe the C compiler, or f77, for example...). It is a lot like finding a store in the yellow pages when it isn't listed under what you would expect it to be (racquetball? Hmm.. how about sporting goods? retail stores? health food?)

I still have hopes I can do something for the archiver that doesn't require a dedicated 750 and an Eagle to do it with. I don't know-- comments are welcome. Keywords sound like a nice thing to do-- a lot like mice, pop-up menus, user friendly software, distributed systems, and other buzzwords. In the news environment, I can't find what I would consider a workable system-- perhaps we need to rethink news completely, but I'm not up to that one. It certainly looks to me like we can't just add it onto the side of what we have.

chuq
-- 
From behind the eight ball:      Chuq Von Rospach
{cbosgd,fortune,hplabs,ihnp4,seismo}!nsc!chuqui   nsc!chuqui@decwrl.ARPA

We'll be recording at the Paradise Friday night. Live, on the Death label.
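[A sketch, under the same assumptions, of the flat '<keyword> <path>' index format described above, showing why lookup is linear in the size of the index; the file path is invented.]

    #include <stdio.h>
    #include <string.h>

    #define KEYINDEX "/usr/spool/oldnews/keyindex"  /* assumed path */

    /* Print the path of every article indexed under `word`. */
    void kwlookup(char *word)
    {
        FILE *fp = fopen(KEYINDEX, "r");
        char key[64], path[256];

        if (fp == NULL)
            return;
        /* O(n) in the number of index lines: ~120,000 lines for
         * 12,000 articles at 10 keywords each. Deleting an expired
         * article means rewriting the entire file. */
        while (fscanf(fp, "%63s %255s", key, path) == 2)
            if (strcmp(key, word) == 0)
                printf("%s\n", path);
        fclose(fp);
    }

[Splitting this into one dbm file per keyword speeds up the lookup but, as noted, spreads the data out and leaves the deletion problem untouched.]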
bass@dmsd.UUCP (John Bass) (03/04/85)
Chuq, I'm afraid, is overcome by the magnitude of some of the numbers and appears not to have thought it through well enough. The dragging performance of rnews and expire comes from TWO main issues -- doing "system(blah)" to invoke the creation of an article -- and -- the tremendous amount of time doing nami's throughout the news lib and spool areas. As for estimates of the increased space, putting the news into a real N-key database format will REDUCE the total space used by most systems due to round off in 1k block filesystems.

On my 2.10.1 system each incoming batched news item appears to require in excess of 70 disk transactions (I will have exact numbers in about a week -- I think it may be double that). Done properly in an N-key database system with 5 keys average that number should be closer to 15 or so. 3/4 of the current disk traffic appears to be in nami and most of the rest in exec -- again I will have better numbers in a week or so. I have already experimented with replacing the "system(blah)" calls with "execl(aklj,aklj,...)" with a noticeable improvement.

I think that pushing news into one big database will actually reduce filespace requirements by about 20% or more. This comes from two main sources -- average wasted space is 512 bytes per article on a 1kbyte filesystem -- and -- the overhead per file is 16 bytes per directory entry, plus the directory overhead (tree plus wasted directory space that is allocated) of about 20 bytes, plus inode size of 64 bytes, plus first level indirect overhead on long directories. Thus we have over 600 bytes overhead per article in the current system, and a properly done database will likely have less than 100 bytes per item. That should be a net savings of near 1.3mb on a 2700 item news database.

News has outgrown the tools approach used to prototype it ... a serious rewrite is LONG overdue. With a rewrite I believe, from the work I have already done, that the news system can be reduced to 10% of its current disk traffic and under 50% of the current cpu time WITH significant increases in functionality. The experience gained by implementing the notes system should have been more carefully examined years ago.

John Bass
-- 
John Bass    DMS Design (System Performance and Arch Consultants)
{dual,fortune,idi,hpda}!dmsd!bass    (408) 996-0557
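[The substitution John describes -- replacing system() with a direct fork and execl, cutting out the intermediate shell -- might look roughly like this; the helper name and the single-argument shape are assumptions, not rnews's actual invocation.]

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Run `path` with a single argument `arg`, without the shell
     * that system() would fork just to run one command. */
    int run_direct(char *path, char *arg)
    {
        int status;
        pid_t pid = fork();

        if (pid < 0)
            return -1;                          /* fork failed */
        if (pid == 0) {
            execl(path, path, arg, (char *)0);  /* child: no shell */
            _exit(127);                         /* exec failed */
        }
        if (waitpid(pid, &status, 0) < 0)       /* parent: reap child */
            return -1;
        return status;
    }

[With system(), every article creation pays for an extra fork and exec of /bin/sh on top of the command itself; dropping that is where the noticeable improvement comes from.]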
hokey@plus5.UUCP (Hokey) (03/06/85)
There is a difference between using keywords instead of newsgroups, and using keywords for archival lookup. I am primarily interested in the former.

We could start with words like "request" and "inquiry", for example. These, coupled with other words (an operating system, hardware, language) would be of great help in cleaning up the way articles are plastered all over. I would suspect the first way to approach the problem is to enhance news posting programs to prompt (excuse the alliteration) for keywords in addition to a Subject line. We would have to "grow" the keyword and alias databases over time. Eventually, the keywords would be able to "take over" instead of newsgroups.

We already have a situation where newsgroups fragment because the readers wish to restrict the quantity of stuff they see. Likewise, people tend to post to multiple newsgroups just to reach a wider audience. These two trends are antithetical.

There are two ways to store the article index. One way is to maintain the index by message ID, the other way is by keyword. Data compression can be handled in either case. For starters, message IDs can be compressed by having a table of "registered" sites. Sites in this list would have their site name replaced by a compressed offset into the site table. Similar compression could be done to the sequence number. The keywords would be replaced by a compressed offset into the keyword table, and there would be an additional entry for each keyword for an alias.

If the compression were good enough, we could keep each keyword index in an ascii file with delimiters surrounding the article IDs. This would greatly ease the deletion/insertion problem of index maintenance. If an online article database were kept, it could be scanned for keywords by grep. If people didn't like the keywords, they could easily add or delete them.
-- 
Hokey           ..ihnp4!plus5!hokey
314-725-9492
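[A minimal sketch of the message-ID compression Hokey suggests: replace the site part of an ID like '<614@plus5.UUCP>' with its offset into a table of registered sites. The table contents and the compressed '#site:seq' output form are invented for illustration.]

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical registered-site table; real use would load this
     * from a file shared by all participating sites. */
    static char *sitetab[] = { "plus5.UUCP", "nsc.UUCP", "vortex.UUCP", 0 };

    /* Compress "<seq@site>" into buf as "#<siteno>:<seq>"; keep the
     * original ID if the site is not registered or the ID is odd. */
    void compress_id(char *msgid, char *buf)
    {
        char id[256], *seq, *site, *end;
        int i;

        strncpy(id, msgid, 255);
        id[255] = '\0';
        seq = id + 1;                   /* skip the leading '<' */
        site = strchr(seq, '@');
        end = strchr(seq, '>');
        if (*id != '<' || site == NULL || end == NULL) {
            strcpy(buf, msgid);
            return;
        }
        *site++ = '\0';                 /* seq  is now "614"        */
        *end = '\0';                    /* site is now "plus5.UUCP" */
        for (i = 0; sitetab[i]; i++) {
            if (strcmp(site, sitetab[i]) == 0) {
                sprintf(buf, "#%d:%s", i, seq);
                return;
            }
        }
        strcpy(buf, msgid);             /* unregistered: leave as is */
    }

[The IDs stay recoverable only as long as every site keeps an identical copy of the registered-site table, which is the real cost of the scheme.]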
chuqui@nsc.UUCP (Chuq Von Rospach) (03/06/85)
John's comments are well taken (I won't repeat them here) but look at the problem with a different solution set than I did. I specifically looked at integrating a keyword scheme into the existing software; he is talking about a complete rewrite. I'd love to see a complete rewrite because it would allow us to change a lot of things we didn't know about the first time around, but I don't know if we can really find the time in a volunteer effort to do it right.

If you want to look at a complete rewrite of news, I wish you luck-- that's a LOT of code, and it has to be compatible with the existing news systems at the transport layer (well, it REALLY doesn't, but you'd get really lonely if you didn't talk to all those people....). I can usually find time here and there to fix up parts of the existing software, but knowing the amount of free time I tend to have, a major development project like that would take much too long to get any results, which is why I wasn't looking at it.

chuq
-- 
Chuq Von Rospach, National Semiconductor
{cbosgd,fortune,hplabs,ihnp4,seismo}!nsc!chuqui   nsc!chuqui@decwrl.ARPA

Be seeing you!