[alt.hackers] hacking comp.archives and anonymous FTP

emv@math.lsa.umich.edu (Edward Vielmetti) (01/11/90)

I'm trying to discover every announcement of new freely available
software on the net, and repost said findings to comp.archives, with
nice little headers that could be used to automatically fetch the
posting.  And, I want to keep a master list of all anonymous FTP
sites in the world and how to extract information out of them as
to what they currently have.  So the end result is, when I want
to find a package, I grep a file or send off a message to a
server that does the looking for me.

This decomposes into several problems, some of which are more
solved than others.

Problem one is finding new stuff as it's announced.  Currently the
best solution seems to be the 'gnus' newsreader, armed with a KILL
file for every group that I care about.  They look something like
this:

	(gnus-kill "" "FTP" "u")
	(gnus-kill "Subject" ".")
	(gnus-expunge "X")

i.e. grep every article for "FTP", kill everything else off, clean
up the results.  It's a useful technique -- the same basic idea can
be used to enormous advantage to, say, wipe out all of comp.lang.c
except for a few interesting people.  'gnus' kill file technology
isn't perfect, but in this regard it's much better than rn.  For
some groups I also "know" frequently asked questions, etc.

Does anyone have a way to have 'gnus' process all of its KILL files
without asking, preferably from a batch file late at night (er, or
perhaps at 9 am, considering my hours of late)?  This could save me
no end of time.

I might want to use Brad's 'NewsClip' software, but it's hard to
hack on software that someone else owns, so I haven't bothered yet.

Problem two, once an "interesting" article is found, I need to 
tidy it up, put on the right headers, and send it off to comp.archives.
I should also save some stuff about the package that the article
refers to in a database of some kind as it goes out, and 
automagically add some cross-reference headers.  

The current technology for this is called 'min', a perl script
based on the 'mailintonews' script posted to alt.sources a long
time ago, though it doesn't resemble it much now.  It does the
right thing when re-posting news to another group -- see the results
on comp.archives, alt.sources, and alt.sources.patches.  I
mean to say, all of the headers are *right* so that replies
work, followups work, etc. unless the user has a broken mailer
that replies through Path:.  It also does host-name-to-address
lookups, so that no one has to say "my host table doesn't have
foo.org, can you post the address for me?"  And if the author of
the article doesn't know the site's address or gets it wrong,
well, that's no problem either.
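
To give a flavor of it -- this is *not* the real 'min', just a rough
sketch of the header-fixing and address-lookup pieces, and the
X-Ftp-Site header is made up for the example:

    #!/usr/bin/perl
    # Sketch only: read an article on stdin, fix up a few headers for
    # a repost to comp.archives, and note the numeric address of the
    # FTP site named in a (made-up) X-Ftp-Site: header.

    undef $/;                               # slurp the whole article
    $art = <STDIN>;
    ($head, $body) = split(/\n\n/, $art, 2);

    # Drop headers that shouldn't survive a repost.
    @keep = grep(!/^(Path|Xref|Lines|Newsgroups):/i, split(/\n/, $head));

    # Turn a host name into a dotted quad, so nobody has to ask
    # "my host table doesn't have foo.org, can you post the address?"
    sub addr {
        local($name, $aliases, $type, $len, @addrs) = gethostbyname($_[0]);
        @addrs ? join(".", unpack("C4", $addrs[0])) : "unknown";
    }

    ($site) = $head =~ /^X-Ftp-Site:\s*(\S+)/mi;
    push(@keep, "X-Ftp-Address: " . &addr($site)) if $site;
    push(@keep, "Newsgroups: comp.archives");
    push(@keep, "Archive-name: fill-this-in");

    print join("\n", @keep), "\n\n", $body;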

I don't do a lot of things here.  For one, I don't actually go out
and FTP to the site to verify that it exists, that the files are
actually there, and that they work when compiled on every system in
the world -- hey, that's not my job.  I do slap on an Archive-name
header; I don't drop that into a database right away, though, and I
have no way of making sure I don't put in a duplicate except my own
feeble memory.  The Keywords: line is blank.  The Summary: line is
blank.  I don't know whether to put the author's organization or my
own in the Organization: field, so that's blank too.

Problem three is keeping track of where things are once comp.archives
gets expired.  Certainly it would be pointless to repost everything
every month; there's some duplication as it is, and it looks like 5-10
interesting things are mentioned or announced DAILY.  (That's amazing,
really.)

The current solution is the anonymous FTP list that Jon Granrose
keeps.  I did the first version of this list, posted it, and
lost interest and time to keep it up.  A piece of it was posted
in the C Users Journal without telling me (thanks guys but no
thanks).  Jon picked it up, rearranged it so that it's stored
internally on a one line per site basis, and wrote some code
to reformat the one line per site into a more readable form.
(Pick it up on pilot.njin.net.)

While that list is useful, it really doesn't tell you the right
thing.  Each line is an inventory of what the site has, but
that changes rapidly and arbitrarily.  A better approach might
be to list every package, then all the sites that carry it;
that's straightforward enough to untangle with a little work.
You could even regularize things some, with the "primary" site
for a package or home of a list going first, then other archives
in an ordering which might be meaningful.  I.e.:

comp.sources.unix	uunet.uu.net, tut.cis.ohio-state.edu, wuarchive.wustl.edu, ...
nhfsstone		bugs.cs.wisc.edu
traceroute		ftp.ee.lbl.gov
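
Once the list is in that shape, the "grep a file" step is nearly free.
A throwaway perl sketch, assuming one tab-separated package / site-list
pair per line as above (the "archive-list" filename is made up):

    #!/usr/bin/perl
    # where-is: look a package up in a one-line-per-package list,
    # formatted "package<TAB>site, site, ..." as in the example above.
    # Usage: where-is package [listfile]

    $pkg  = shift || die "usage: where-is package [listfile]\n";
    $list = shift || "archive-list";

    open(LIST, $list) || die "can't open $list: $!\n";
    while (<LIST>) {
        chop;
        ($name, $sites) = split(/\t+/, $_, 2);
        next unless $name eq $pkg;
        print "$pkg is available from:\n";
        foreach $site (split(/,\s*/, $sites)) {
            print "  $site\n";          # primary site comes out first
        }
        exit 0;
    }
    die "$pkg: not listed in $list\n";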

Let's see, you could order from the center of the universe outward
(Ann Arbor, home of NSFNET operations, of course), or from some other,
inferior, starting point, or completely arbitrarily.  If the list of words
on the right is kept orderly, it would not be hard for each of them
to have instructions attached (in a different file?) for retrieving
the README file or equivalent, i.e.

nhfsstone	(ftp bugs.cs.wisc.edu, cd pub/nhfsstone, get nhfsstone.shar),
		(mail archive-server@foo.org, subject nil, body send nhfsstone)

Here there's no README broken out, but if there were we'd be all set.
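
That instruction syntax is loose, but it's already nearly
machine-usable.  A sketch that turns the "(ftp ...)" clause into a
command script to feed to "ftp -n" -- the anonymous-login address is
a placeholder, and the "(mail ...)" alternative is ignored:

    #!/usr/bin/perl
    # fetch-it: read one instruction line like the nhfsstone entry
    # above and print an ftp command script for it.
    # Usage: fetch-it < entry | ftp -n

    $line = <STDIN>;
    ($ftp) = $line =~ /\(ftp ([^)]*)\)/;
    die "no (ftp ...) instructions found\n" unless $ftp;

    @steps = split(/,\s*/, $ftp);       # host, then cd/get commands
    $host  = shift(@steps);

    print "open $host\n";
    print "user anonymous yourname\@your.site\n";   # use your own address
    foreach $step (@steps) {
        print "$step\n";                # "cd ..." and "get ..." pass through
    }
    print "quit\n";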

The beauty of all this is that I can have my 'min' program write these
databases (probably perl-based dbm's for speed, once I figure that out)
as things are reposted to comp.archives, so some of the work gets done
with no extra effort on my part; or a comp.archives archive site
could extract the data from the headers I (or someone deputized for it)
would add, and do the work post facto.
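
The dbm end of it looks like the easy part in perl; here's a sketch
of the duplicate check (the "archive-names" database name is made up):

    #!/usr/bin/perl
    # As an article goes out, record its Archive-name in a dbm file
    # and complain if it has been seen before.

    dbmopen(%seen, "archive-names", 0644) || die "can't open dbm: $!\n";

    undef $/;                           # slurp the outgoing article
    $art = <STDIN>;
    ($name) = $art =~ /^Archive-name:\s*(\S+)/mi;
    die "no Archive-name header\n" unless $name;

    if ($seen{$name}) {
        warn "$name already posted: $seen{$name}\n";
    } else {
        $seen{$name} = join(" ", "posted", scalar(localtime));
    }
    dbmclose(%seen);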

If anyone would like to help on bits and pieces of this -- there are
some details I've left out of the whole thing, as you can plainly
imagine -- send me mail.  It might be enough to argue for a separate
alt group to discuss bits of the project, since I get too much mail
for my own good as it is (and I read too much news...all of the
comp groups in some fashion, except for the binaries groups and stuff
I absolutely just don't care about).

If you're still with me, the other project to soak up spare time
(again with Jon Granrose's spare time) is the idea of an RFC for
anonymous FTP.  There are enough issues there for a wide discussion,
and this message is pretty long as it is.  Stuff to think about:
- vendors and hackers, what should code to do anonymous ftp look like:
  supporting ftp restart, a note on security, refusing to transmit
  binary files as ascii, logging of anon ftp commands, restriction
  of access by time of day or load, and others.
- sys admins, how to lay out an anonymous ftp area:
  readme files, tree structure, keeping up to date on fast-changing
  software, letting the author know you're distributing stuff,
  recommended file compression and archive types, a note on security,
  and others.
- users, how to do anonymous ftp:
  network etiquette, basic instructions, compression types, ascii vs. 
  binary transfers, how to search the site lists, regional archive
  sites, etc.

This is, as I say, very brief; more could go into it, and each of
these might take a paragraph or two to cover.

--Ed
  

bill@twwells.com (T. William Wells) (01/12/90)

In article <10614@stag.math.lsa.umich.edu> emv@math.lsa.umich.edu (Edward Vielmetti) writes:
: Problem one is finding new stuff as it's announced.  Currently the
: best solution seems to be the 'gnus' newsreader, armed with a KILL
: file for every group that I care about.

I run C news here. I added a line to my newsrun script, just
before the call to relaynews, to run my own filter on the
incoming batch. This is, right now, a tiny program to just read
each message from the file and check it for some keywords I'm
interested in. It mails me the message id's for anything that has
the keywords.

For a bigger site, I'd also modify the inews script. (Being the
only poster, I certainly don't need an automatic program to catch
my postings. :-)

For a B news site, you'd actually have to fiddle with the inews
program.

Why did I do it this way? The amount of disc crunching goes way
down if you process the incoming in one big batch, which is how I
get it.

The main problem with this is that it doesn't honor cancellations
and the like, but, then again, those aren't all that reliable
anyway.
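
In outline (this isn't the actual program -- the keywords and address
are placeholders, and it assumes the usual "#! rnews <count>" batch
format), the idea looks like this in perl:

    #!/usr/bin/perl
    # Scan an incoming news batch and mail back the Message-IDs of
    # any articles containing interesting keywords.

    @keywords = ("FTP", "anonymous ftp", "archive");
    $me = "you\@your.site";

    open(BATCH, $ARGV[0]) || die "can't open batch: $!\n";
    while (<BATCH>) {
        next unless /^#! rnews (\d+)/;
        read(BATCH, $art, $1) == $1 || last;    # one whole article
        foreach $kw (@keywords) {
            if ($art =~ /\b$kw\b/i) {
                ($id) = $art =~ /^Message-ID:\s*(<[^>]*>)/mi;
                push(@hits, $id) if $id;
                last;
            }
        }
    }
    close(BATCH);

    if (@hits) {
        open(MAIL, "| mail $me") || die "can't run mail: $!\n";
        print MAIL map("$_\n", @hits);
        close(MAIL);
    }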

Now, the hard part is interpreting the postings. Having written a
grammar checker recently, I'm right into techniques for doing
this. I'll be giving it a whirl sometime and I'll let y'all know
how it goes in picking these things out.

: While that list is useful, it really doesn't tell you the right
: thing.  Each line is an inventory of what the site has, but
: that changes rapidly and arbitrarily.  A better approach might
: be to list every package, then all the sites that carry it;
: that's straightforward enough to untangle with a little work.
: You could even regularize things some, with the "primary" site
: for a package or home of a list going first, then other archives
: in an ordering which might be meaningful.  I.e.:

I've got a format for this already defined. It isn't as compact
as yours, however.

---
Bill                    { uunet | novavax | ankh } !twwells!bill
bill@twwells.com