emv@math.lsa.umich.edu (Edward Vielmetti) (01/11/90)
I'm trying to discover every announcement of new freely available software on the net, and repost said findings to comp.archives, with nice little headers that could be used to automatically fetch the posting. And I want to keep a master list of all anonymous FTP sites in the world and how to extract information out of them as to what they currently have. So the end result is, when I want to find a package, I grep a file, or send off a message to a server that looks for me.

This decomposes into several problems, some of which are more solved than others.

Problem one is finding new stuff as it's announced. Currently the best solution seems to be the 'gnus' newsreader, armed with a KILL file for every group that I care about. They look something like this:

    (gnus-kill "" "FTP" "u")
    (gnus-kill "Subject" ".")
    (gnus-expunge "X")

i.e. grep every article for "FTP", kill everything else off, clean up the results. It's a useful technique -- the same basic idea can be used to enormous advantage to, say, wipe out all of comp.lang.c except for a few interesting people. 'gnus' kill file technology isn't perfect, but in this regard it's much better than rn. For some groups I also "know" frequently asked questions, etc.

Does anyone have a way to have 'gnus' process all of its KILL files without asking, preferably from a batch job late at night (er, or perhaps at 9 am, considering my hours of late)? This could save me no end of time. I might want to use Brad's 'NewsClip' software, but it's hard to hack on software that someone else owns, so I haven't bothered yet.

Problem two: once an "interesting" article is found, I need to tidy it up, put on the right headers, and send it off to comp.archives. I should also save some information about the package the article refers to in a database of some kind as it goes out, and automagically add some cross-reference headers. The current technology for this is 'min', a perl script based on the 'mailintonews' script posted to alt.sources a long time ago, though it doesn't resemble it much now. It does the right thing when re-posting news to another group -- see the results on comp.archives, alt.sources, and alt.sources.patches. I mean to say, all of the headers are *right*, so that replies work, followups work, etc., unless the user has a broken mailer that replies through Path:. It also does host-name-to-address lookups, so that no one has to ask "my host table doesn't have foo.org, can you post the address for me?" And if the author of the article doesn't know the site's address, or gets it wrong, that's no problem either. (There's a rough sketch of this step a little further down.)

I don't do a lot of things here. First, I don't actually go out and FTP to the site and verify that it exists, that the files are actually there, and that they work when compiled on every system in the world -- hey, that's not my job. I do slap on an Archive-name: header; I don't drop that into a database right away, and I have no way of making sure that I don't put in a duplicate except for my own feeble memory. The Keywords: line is blank. The Summary: line is blank. I don't know whether to put the author's organization or my own in the Organization: field, so that's blank too.

Problem three is keeping track of where things are once comp.archives gets expired. Certainly it would be pointless to repost everything every month; there's some duplication as it is, and it looks like 5-10 interesting things are mentioned or announced DAILY. (That's amazing, really.)
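To make the problem-two step concrete, here's a rough sketch of the header rewriting and address lookup. This is not 'min' itself (that's perl, and it does a good deal more); it's just an illustration in Python, and the X-Ftp-Address: header and the sample article at the bottom are made up for the example.

# Illustrative sketch only -- not the actual 'min' perl script.
# Given the headers of an "interesting" article, build headers for a
# repost to comp.archives: keep From:/Subject:, point followups back
# at the original group, add an Archive-name: tag, and resolve the FTP
# host so readers without a full host table still get a numeric address.

import socket

def repost_headers(orig, archive_name, ftp_host):
    """orig is a dict of the original article's headers."""
    new = {
        "From": orig["From"],                  # credit the original poster
        "Subject": orig["Subject"],
        "Newsgroups": "comp.archives",
        "Followup-To": orig["Newsgroups"],     # discussion goes back home
        "References": orig["Message-ID"],      # so replies thread properly
        "Archive-name": archive_name,
    }
    try:
        # host-name-to-address lookup, so nobody has to ask for the number
        new["X-Ftp-Address"] = socket.gethostbyname(ftp_host)
    except socket.error:
        new["X-Ftp-Address"] = "unknown (lookup failed)"
    return new

if __name__ == "__main__":
    sample = {
        "From": "someone@foo.org (Some One)",
        "Subject": "nhfsstone 1.0 available",
        "Newsgroups": "comp.sources.d",
        "Message-ID": "<1234@foo.org>",
    }
    for k, v in repost_headers(sample, "nhfsstone", "bugs.cs.wisc.edu").items():
        print("%s: %s" % (k, v))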
The current solution is the anonymous FTP list that Jon Granrose keeps. I did the first version of this list, posted it, and lost interest and the time to keep it up. A piece of it was printed in the C Users Journal without telling me (thanks guys, but no thanks). Jon picked it up, rearranged it so that it's stored internally on a one-line-per-site basis, and wrote some code to reformat the one line per site into a more readable form. (Pick it up on pilot.njin.net.)

While that list is useful, it really doesn't tell you the right thing. Each line is an inventory of what the site has, but that changes rapidly and arbitrarily. A better approach might be to list every package, then all the sites that carry it; that's straightforward enough to untangle with a little work. You could even regularize things some, with the "primary" site for a package or home of a list going first, then other archives in an ordering which might be meaningful. I.e.:

    comp.sources.unix    uunet.uu.net, tut.cis.ohio-state.edu, wuarchive.wustl.edu, ...
    nhfsstone            bugs.cs.wisc.edu
    traceroute           ftp.ee.lbl.gov

Let's see, you could order from the center of the universe outward (Ann Arbor, home of NSFNET operations, of course), or from some other, inferior, starting point, or completely arbitrarily. If the list of sites on the right is kept orderly, it would not be hard for each of them to have instructions attached (in a different file?) for retrieving the README file or equivalent, i.e.

    nhfsstone    (ftp bugs.cs.wisc.edu, cd pub/nhfsstone, get nhfsstone.shar),
                 (mail archive-server@foo.org, subject nil, body send nhfsstone)

Here there is no README broken out, but if there were, we'd be all set.

The beauty of all this is that I can have my 'min' program write these databases (probably perl-based dbm's for speed, once I figure that out) as things are reposted to comp.archives, so with no extra effort on my part some of the work can be done; or a comp.archives archive site could extract the data from the headers I (or someone deputized as such) would add, and do the work post facto. (There's a rough sketch of such an index in the P.S. at the end of this message.)

If anyone would like to help on bits and pieces of this -- there are some details I've left out of the whole thing, as you could plainly imagine -- send me mail. It might be enough to argue for a separate alt group to discuss bits of the projects, since I get too much mail for my own good as it is now (and I read too much news... all of the comp groups in some fashion, except for the binaries groups and stuff I absolutely just don't care about).

If you're still with me, the other project to soak up spare time (again with Jon Granrose's spare time) is the idea of an RFC for anonymous FTP. There are enough issues there for a wide discussion, and this message is pretty long as is. Stuff to think about:

- vendors and hackers, what code to do anonymous ftp should look like: supporting ftp restart, a note on security, refusing to transmit binary files as ascii, logging of anon ftp commands, restriction of access by time of day or load, and others.

- sys admins, how to lay out an anonymous ftp area: README files, tree structure, keeping up to date on fast-changing software, letting the author know you're distributing their stuff, recommended file compression and archive types, a note on security, and others.

- users, how to do anonymous ftp: network etiquette, basic instructions, compression types, ascii vs. binary transfers, how to search the site lists, regional archive sites, etc.

This is, as I say, very brief; more could go into it, and each of these might take a paragraph or two to cover.

--Ed
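P.S. To make the package-index idea concrete, here's a rough sketch of the sort of database 'min' could write as articles go out. It keeps a dbm file the way the text above suggests, but the file name, the one-line record format, and the choice of Python over perl are all just for illustration, not a proposed standard.

# Illustrative sketch only: a package -> sites index kept in a dbm file,
# one record per package, primary site first.  The file name and record
# layout are made up for this example.

import dbm

INDEX_FILE = "comp.archives.index"        # assumed name

def add_package(name, sites):
    """Record (or replace) the list of sites carrying a package."""
    with dbm.open(INDEX_FILE, "c") as db:
        db[name] = " ".join(sites)        # primary site goes first

def where_is(name):
    """Return the list of sites for a package, primary first."""
    with dbm.open(INDEX_FILE, "c") as db:
        try:
            return db[name].decode().split()
        except KeyError:
            return []

if __name__ == "__main__":
    add_package("comp.sources.unix",
                ["uunet.uu.net", "tut.cis.ohio-state.edu", "wuarchive.wustl.edu"])
    add_package("nhfsstone", ["bugs.cs.wisc.edu"])
    add_package("traceroute", ["ftp.ee.lbl.gov"])
    print(where_is("nhfsstone"))          # -> ['bugs.cs.wisc.edu']

Looking a package up is then a single key fetch, no matter how long the list of sites gets.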
bill@twwells.com (T. William Wells) (01/12/90)
In article <10614@stag.math.lsa.umich.edu> emv@math.lsa.umich.edu (Edward Vielmetti) writes:
: Problem one is finding new stuff as it's announced. Currently the
: best solution seems to be the 'gnus' newsreader, armed with a KILL
: file for every group that I care about.
I run C news here. I added a line to my newsrun script, just
before the call to relaynews, to run my own filter on the
incoming batch. This is, right now, a tiny program to just read
each message from the file and check it for some keywords I'm
interested in. It mails me the message id's for anything that has
the keywords.
For a bigger site, I'd also modify the inews script. (Being the
only poster, I certainly don't need an automatic program to catch
my postings. :-)
For a B news site, you'd actually have to fiddle with the inews
program.
Why did I do it this way? The amount of disc crunching goes way
down if you process the incoming in one big batch, which is how I
get it.
The main problem with this is that it doesn't honor cancellations
and the like, but, then again, those aren't all that reliable
anyway.
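Here's a rough sketch of the kind of filter I mean. It is not the program I actually run -- just an illustration, in Python -- and the keyword list and the assumption of an uncompressed batch (each article preceded by a "#! rnews <byte-count>" line) are the only details it relies on.

# Sketch of a filter run over an incoming C news batch before relaynews.
# The keyword list and the plain-print output are just examples; a real
# setup would mail the message IDs somewhere.

import re, sys

KEYWORDS = ("FTP", "anonymous ftp", "archive")      # assumed list

def articles(batch):
    """Yield each article in an uncompressed rnews batch as a string."""
    with open(batch, "rb") as f:
        while True:
            header = f.readline()
            if not header:
                break
            m = re.match(rb"#!\s*rnews\s+(\d+)", header)
            if not m:
                continue                  # skip junk between articles
            yield f.read(int(m.group(1))).decode("latin-1", "replace")

def interesting(article):
    text = article.lower()
    return any(k.lower() in text for k in KEYWORDS)

if __name__ == "__main__":
    for art in articles(sys.argv[1]):
        if interesting(art):
            m = re.search(r"^Message-ID:\s*(<[^>]+>)", art,
                          re.MULTILINE | re.IGNORECASE)
            print(m.group(1) if m else "(no message-id?)")

Run it on the spooled batch and pipe the output into mail, and the effect is the same: a short list of message IDs worth a look.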
Now, the hard part is interpreting the postings. Having written a
grammar checker recently, I'm right into techniques for doing
this. I'll be giving it a whirl sometime and I'll let y'all know
how it goes in picking these things out.
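As a naive first cut -- nothing like real parsing yet -- something along these lines will pull a plausible FTP host and directory out of an announcement. The patterns are guesses and it will certainly be fooled; that is exactly the part that needs the real work.

# Naive first cut at picking the FTP host and directory out of an
# announcement.  The patterns are guesses; real announcements are much
# messier than this.

import re

HOST = re.compile(r"\b([a-z][\w-]*(?:\.[\w-]+){2,})\b", re.IGNORECASE)
PATH = re.compile(r"\b(pub(?:/[\w.+-]+)+)")

def guess_site(text):
    """Return (host, path) guesses; None where nothing matched."""
    hosts = HOST.findall(text)
    paths = PATH.findall(text)
    return (hosts[0] if hosts else None,
            paths[0] if paths else None)

if __name__ == "__main__":
    sample = ("nhfsstone is available by anonymous ftp from "
              "bugs.cs.wisc.edu in pub/nhfsstone.")
    print(guess_site(sample))    # -> ('bugs.cs.wisc.edu', 'pub/nhfsstone')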
: While that list is useful, it really doesn't tell you the right
: thing. Each line is an inventory of what the site has, but
: that changes rapidly and arbitrarily. A better approach might
: be to list every package, then all the sites that carry it;
: that's straightforward enough to untangle with a little work.
: You could even regularize things some, with the "primary" site
: for a package or home of a list going first, then other archives
: in an ordering which might be meaningful. I.e.:
I've got a format for this already defined. It isn't as compact
as yours, however.
---
Bill { uunet | novavax | ankh } !twwells!bill
bill@twwells.com