[news.admin] UK Copyright libraries and Usenet

emv@ox.com (Ed Vielmetti) (05/16/91)

In article <1991May16.050935.29882@newshost.anu.edu.au> cmf851@anu.oz.au (Albert Langer) writes:

   Some library somewhere certainly OUGHT to be preserving archives of
   all USEnet news groups. Since they are serial publications like any other,
   perhaps certain libraries have a statutory obligation to do so (apart 
   from NSA). In that case it may be worth drawing their attention to
   the ease with which their obligations could be carried out. (DAT tape
   is less than $10 per Gigabyte).

If you wait another week, we can keep this discussion going in
comp.archives.admin; send your votes in to kent@uunet.uu.net (Kent
Landsfield).   For now I'm redirecting the archivist perspective to
comp.sources.d (as good a place as any under the circumstances).

I'm reasonably confident that every news article that ever got to
Toronto has been squirreled away on tape.  You're right, it's cheap to
store it, but read on....

   The current informal arrangements for volunteer ftp sites to hold
   simple dumps of certain news groups and mailing lists without any 
   indexing etc are quite inadequate. A library should do it properly 
   with guarantees of permanent access and appropriate classification 
   and indexing for retrieval.

Libraries and paper-based archivists are generally ill at ease with
providing efficient retrieval for huge amounts of full-text data.
Given a million usenet articles, do you have a good way to sort
through them?  

   In fact the same applies to ftp archives generally. They are being
   administered on an unfunded basis by computer system administrators
   with no special library skills.
   --
   Opinions disclaimed (Authoritative answer from opinion server)
   Header reply address wrong. Use cmf851@csc2.anu.edu.au


No library school teaches you how to run an anonymous FTP site.  I
would argue that most of the things we consider "archive sites" are
really much closer to "samizdat" houses or medieval pamphlet shops,
run by a proprietor who has a certain charter in mind and a mission to
collect together like-minded objects (software, text, nudie pix).  The
materials are uneven, mutable, of uncertain value, hard to catalog,
and difficult to present effectively; you can't say "it's down on the
bottom shelf, the binding is green, and the three or four books to the
left of it are also good".

There are on the order of 1000 anonymous ftp sites in the whole world;
I'd estimate that means on the order of 1500-3000 individuals involved
in the process of organizing, collecting, and maintaining ftp
archives.  

The process is not entirely unfunded, to be fair; most of the raw
network bandwidth overhead needed (in the USA, at least) is funded by
the National Science Foundation, and many of the archive sites are run
from systems which have been purchased in part by gov't money of one
form or another.  Other systems have been supported by the expectation
of making money off the venture (uunet), software support for
customers (apple), or archive services for internal non-connected
networks (gatekeeper.dec.com).

What has not been funded quite yet is the production of
meta-information about where anonymous ftp sites are, what they carry,
and methods of searching through them.  It's patently absurd that the
best way for someone in Colorado to find out about archives in
California is to send their queries to a system in Quebec,
Canada (archie), over completely saturated "wet pieces of string" that
the Canadians have to pay for.  The NSF should find a way to fund that
effort or similar efforts.  Be glad that someone needed a master's
thesis, and that McGill had some idle equipment.

In the same fashion, there is a huge full-text index of comp.archives
postings somewhere in Canada, but it's not accessible to the world at
large because there isn't funding for network bandwidth or server
hardware to keep it running on the volume of queries we'd reasonably
expect to see from it.  For that matter, no one is funding the
production of comp.archives yet -- I'm looking at it as a venture that
has to be self-supporting (i.e. making money from user fees or
provided under contract to some funding agency) within the next two
years or I'll pack up shop and stop doing it.  

-- 
Edward Vielmetti, vice president for research, MSEN Inc.  emv@msen.com

"(6) The Plan shall identify how agencies and departments can
collaborate to ... expand efforts to improve, document, and evaluate
unclassified public-domain software developed by federally-funded
researchers and other software, including federally-funded educational
and training software; "
			"High-Performance Computing Act of 1991, S. 272"

herrickd@iccgcc.decnet.ab.com (05/23/91)

In article <EMV.91May16024531@poe.aa.ox.com>, emv@ox.com (Ed Vielmetti) writes:
[probably I should wait for the archive administration group, but Ed
 raised the subject now - I'm trying to keep it in news.admin]
> expect to see from it.  For that matter, no one is funding the
> production of comp.archives yet -- I'm looking at it as a venture that
> has to be self-supporting (i.e. making money from user fees or
> provided under contract to some funding agency) within the next two
> years or I'll pack up shop and stop doing it.  

Ignoring the question of why you do it, I've been watching comp.archives
and wondering HOW you do it.  You can't possibly read 500 newsgroups
looking for postings that identify repositories.  Can you?  You recently
doubled the effort of making the posting by adding that verification
of the accessibility of the material.

So, is MSEN studying artificial intelligence?  Does a VP Research have
time to be working on a dissertation about information retrieval that
is analogous to snatching a drink from a fire hose?

I'm new here (my first toe wetting was nearly a year ago; how many years
will I feel obliged to say that).  However, even as a newcomer, I have
some feeling for the magnitude of the service to the community at large
represented by comp.archives.  Watching the traffic in it, I don't believe
half the authors know it exists - you bring in some odd things.

dan herrick
herrickd@iccgcc.decnet.ab.com

emv@msen.com (Ed Vielmetti) (05/24/91)

In article <4660.283b7d00@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:

   In article <EMV.91May16024531@poe.aa.ox.com>, emv@ox.com (Ed Vielmetti) writes:
   [probably I should wait for the archive administration group, but Ed
    raised the subject now - I'm trying to keep it in news.admin]

fine, news.admin it is.  I suppose I should x-post it over to
dnet.archiv or aus.archives or any of the other ``regional''
unmoderated groups.

   Ignoring the question of why you do it, I've been watching comp.archives
   and wondering HOW you do it.  You can't possibly read 500 newsgroups
   looking for postings that identify repositories.  Can you?  

Sure, why not?  It's a very simple first-pass filter: take in every
single article, grep it for some key words, and if the key words are in
there then save it aside for human processing.  Out of 10000 articles
in a day you can narrow things down to ca. 100-150, of which ca. 10-15
will be interesting once you've read them.  You could apply the same
filtering technique in about an hour's work with grep and a sys file
entry, and look for anything you can express in a shell script.  It's
quite reasonable to scan all of the various newsgroups if you have the
CPU handy.
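
In rough outline (the spool path and the key-word list here are only
illustrative), the first pass amounts to something like:

  # skim one day's batch of articles for likely archive announcements
  find /usr/spool/news -type f -print |
      xargs egrep -i -l 'anonymous ftp|archive site|mail server' \
      > /tmp/candidates
  wc -l /tmp/candidates    # typically ca. 100-150 out of ca. 10000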

Determining the set of key words is the slightly tricky part.  The ideal
phrase to look for, for comp.archives, is
  this whizzy new package is available for anonymous ftp from 
      host.domain.org:/pub/whizzy/package-1.0.tar.Z
but there are a multitude of variations on that theme.  The AI project
is going to disambiguate between that ideal phrase and "does anyone
know where i can ftp some gif files from?".  Fortunately, with a big
screen and a fast cpu a human can chunk through those in 15-30 seconds
each, so it's not so important.
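
As a crude illustration of the difference (the patterns below are made
up, and the real list of variations is much longer):

  # usually marks a real announcement: a host name after "ftp from/at"
  egrep -i 'anonymous ftp (from|at|on) [a-z][a-z0-9.-]*' article
  # usually marks a request rather than an announcement
  egrep -i 'does anyone know where|where can i (ftp|get|find)' article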

With a big (5000+) collection of articles which have already passed
the comp.archives eyeball test, I should be able to feed potential
matches into a filter that looks for whether there are any other
similar articles in the database.  That will categorize things into
three sets:
 - noise (requests, mostly, or the bitftp discussion)
 - things which I've seen before (new releases or reviews or updates)
 - things which I've never seen before but which might be interesting
then the above-mentioned human can go through a smaller set looking
for the interesting 10-15 articles.
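
A sketch of that three-way split, against a hypothetical index file
(one "hostname package" line per article that has already passed the
eyeball test):

  # sort one candidate into noise / seen-before / new-and-interesting
  article=$1
  host=`sed -n 's/.*ftp from *\([a-zA-Z0-9.-]*\).*/\1/p' $article | sed 1q`
  if [ -z "$host" ]
  then
    echo $article >> noise               # no host named; probably a request
  elif grep "^$host " ARCHIVE-INDEX > /dev/null
  then
    echo $article >> seen-before         # a new release, review, or update
  else
    echo $article >> new-and-interesting # never seen this host before
  fi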

   You recently
   doubled the effort of making the posting by adding that verification
   of the accessibility of the material.

No extra thought-power here, just a little more time.  All of the
packages thus far identified have their home sites identified in a
database; the verification step goes off and fetches the directory
information and sets it aside.  Some shell scripts make these easy to
type, but they're basically no-brainers.
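
The fetch itself is just a canned anonymous ftp session; a minimal
sketch (the host, directory, listings/ layout, and the
yourname@yoursite password are all placeholders):

  # usage: verify.sh host.domain.org /pub/whizzy
  # fetch a directory listing from an archive site and set it aside
  host=$1
  dir=$2
  ( echo "user anonymous yourname@yoursite"
    echo "cd $dir"
    echo "dir"
    echo "quit" ) | ftp -n "$host" > "listings/$host"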

   So, is MSEN studying artificial intelligence?  Does a VP Research have
   time to be working on a dissertation about information retrieval that
   is analogous to snatching a drink from a fire hose?

No AI here, just honest work.  None of the ``information retrieval''
stuff that I've looked at so far has been very well indexed or
cataloged, so my first-pass conclusion is that existing research
efforts aren't very good.  Most of the hard-core information retrieval
engines (that cost real money) that would do the sorts of things I'm
doing fail in interesting ways on the usenet problem; it would seem
more likely to be worthwhile to build a functioning "expert system
that reads netnews and looks for articles that belong in
comp.archives" than to go off degree-hunting.  Especially if that
expert system can be trained to go after things in the charter of
other arbitrary newsgroups.

Any pointers to other functioning "read all the news and find the good
parts" systems are welcomed -- I've heard of something called NewsPeek
apparently running out at MIT, but nothing that I know of offhand
that runs off the relatively dirty usenet wire rather than the tidy
AP, UPI, or Dow Jones feeds.

-- 
Edward Vielmetti, moderator, comp.archives, emv@msen.com

"He who hesitates is last;" "The point man takes the hits;" "It's easier to
get forgiveness than permission;" "There's no harm in asking."  Pick your
aphorism and live by it. 		             -- Stephen Wolff, NSF