emv@ox.com (Ed Vielmetti) (05/16/91)
In article <1991May16.050935.29882@newshost.anu.edu.au> cmf851@anu.oz.au (Albert Langer) writes:

> Some library somewhere certainly OUGHT to be preserving archives of
> all USEnet news groups. Since they are serial publications like any other,
> perhaps certain libraries have a statutory obligation to do so (apart
> from NSA). In that case it may be worth drawing their attention to
> the ease with which their obligations could be carried out. (DAT tape
> is less than $10 per Gigabyte).

If you wait another week, we can keep this discussion going in
comp.archives.admin; send your votes in to kent@uunet.uu.net (Kent
Landsfield). For now I'm redirecting the archivist perspective to
comp.sources.d (as good a place as any under the circumstances).
I'm reasonably confident that every news article that ever got to
Toronto has been squirreled away on tape. You're right, it's cheap to
store it, but read on....
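
[A back-of-envelope check on "cheap", using the $10 per Gigabyte DAT
figure above and guessing an average article size of about 3 KB -- an
assumption, not a measured number:]

    # a million articles at ~3 KB each, priced at $10/GB of DAT
    awk 'BEGIN { gb = 1000000 * 3 / (1024 * 1024);
                 printf "~%.1f GB, roughly $%.0f of tape\n", gb, gb * 10 }'
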

> The current informal arrangements for volunteer ftp sites to hold
> simple dumps of certain news groups and mailing lists without any
> indexing etc are quite inadequate. A library should do it properly
> with guarantees of permanent access and appropriate classification
> and indexing for retrieval.

Libraries and paper-based archivists are generally ill at ease with
providing efficient retrieval for huge amounts of full-text data.
Given a million usenet articles, do you have a good way to sort
through them?

> In fact the same applies to ftp archives generally. They are being
> administered on an unfunded basis by computer system administrators
> with no special library skills.
> --
> Opinions disclaimed (Authoritative answer from opinion server)
> Header reply address wrong. Use cmf851@csc2.anu.edu.au

No library school teaches you how to run an anonymous FTP site. I
would argue that most of the things we consider "archive sites" are
really much closer to "samizdat" houses or medieval pamphlet shops,
run by a proprietor who has a certain charter in mind and a mission to
collect together like-minded objects (software, text, nudie pix). The
materials are uneven, mutable, of uncertain value, hard to catalog,
and difficult to present effectively; you can't say "it's down on the
bottom shelf, the binding is green, and the three or four books to the
left of it are also good".
There are on the order of 1000 anonymous ftp sites in the whole world;
I'd estimate that means on the order of 1500-3000 individuals involved
in the process of organizing, collecting, and maintaining ftp
archives.
The process is not entirely unfunded, to be fair; most of the raw
network bandwidth overhead needed (in the USA, at least) is funded by
the National Science Foundation, and many of the archive sites are run
from systems which have been purchased in part by gov't money of one
form or another. Other systems have been supported by the expectation
of making money off the venture (uunet), software support for
customers (apple), or archive services for internal non-connected
networks (gatekeeper.dec.com).
What has not been funded quite yet is the production of
meta-information about where anonymous ftp sites are, what they carry,
and methods of searching through them. It's patently absurd that the
best way for someone in Colorado to find out about archives in
California is to send their queries to a system in Quebec, Canada
(archie), over completely saturated "wet pieces of string" that
the Canadians have to pay for. The NSF should find a way to fund that
effort or similar efforts. Be glad that someone needed a master's
thesis, and that McGill had some idle equipment.
In the same fashion, there is a huge full-text index of comp.archives
postings somewhere in Canada, but it's not accessible to the world at
large because there isn't funding for network bandwidth or server
hardware to keep it running on the volume of queries we'd reasonably
expect to see from it. For that matter, no one is funding the
production of comp.archives yet -- I'm looking at it as a venture that
has to be self-supporting (i.e. making money from user fees or
provided under contract to some funding agency) within the next two
years or I'll pack up shop and stop doing it.
--
Edward Vielmetti, vice president for research, MSEN Inc. emv@msen.com
"(6) The Plan shall identify how agencies and departments can
collaborate to ... expand efforts to improve, document, and evaluate
unclassified public-domain software developed by federally-funded
researchers and other software, including federally-funded educational
and training software; "
"High-Performance Computing Act of 1991, S. 272"
herrickd@iccgcc.decnet.ab.com (05/23/91)
In article <EMV.91May16024531@poe.aa.ox.com>, emv@ox.com (Ed Vielmetti) writes:

[probably I should wait for the archive administration group, but Ed
raised the subject now - I'm trying to keep it in news.admin]

> expect to see from it. For that matter, no one is funding the
> production of comp.archives yet -- I'm looking at it as a venture that
> has to be self-supporting (i.e. making money from user fees or
> provided under contract to some funding agency) within the next two
> years or I'll pack up shop and stop doing it.

Ignoring the question of why you do it, I've been watching
comp.archives and wondering HOW you do it. You can't possibly read
500 newsgroups looking for postings that identify repositories.
Can you?

You recently doubled the effort of making the posting by adding that
verification of the accessibility of the material.

So, is MSEN studying artificial intelligence? Does a VP Research have
time to be working on a dissertation about information retrieval that
is analogous to snatching a drink from a fire hose?

I'm new here (my first toe wetting was near a year ago; how many years
will I feel obliged to say that?). However, even as a newcomer, I have
some feeling for the magnitude of the service to the community at
large represented by comp.archives. Watching the traffic in it, I
don't believe half the authors know it exists - you bring in some odd
things.

dan herrick
herrickd@iccgcc.decnet.ab.com
emv@msen.com (Ed Vielmetti) (05/24/91)
In article <4660.283b7d00@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:

> In article <EMV.91May16024531@poe.aa.ox.com>, emv@ox.com (Ed Vielmetti) writes:
> [probably I should wait for the archive administration group, but Ed
> raised the subject now - I'm trying to keep it in news.admin]

Fine, news.admin it is. I suppose I should x-post it over to
dnet.archiv or aus.archives or any of the other ``regional''
unmoderated groups.

> Ignoring the question of why you do it, I've been watching
> comp.archives and wondering HOW you do it. You can't possibly read
> 500 newsgroups looking for postings that identify repositories.
> Can you?

Sure, why not? It's a very simple first-pass filter: take in every
single article, grep it for some key words, and if the key words are
in there then save it aside for human processing. Out of 10000
articles in a day you can narrow things down to ca. 100-150, of which
ca. 10-15 will be interesting once you've read them. You could apply
the same filtering technique in about an hour's work with grep and a
sys file entry, and look for anything you can express in a shell
script. It's quite reasonable to scan all of the various newsgroups
if you have the CPU handy. (A rough sketch of this first pass appears
below.)

Determining the set of key words is the slightly tricky part. The
ideal phrase to look for for comp.archives is

    this whizzy new package is available for anonymous ftp from
    host.domain.org:/pub/whizzy/package-1.0.tar.Z

but there are a multitude of variations on that theme. The AI project
is going to disambiguate between that ideal phrase and "does anyone
know where i can ftp some gif files from?". Fortunately, with a big
screen and a fast cpu a human can chunk through those in 15-30
seconds each, so it's not so important.

With a big (5000+) collection of articles which have already passed
the comp.archives eyeball test, I should be able to feed potential
matches into a filter that looks for whether there are any other
similar articles in the database. That will categorize things into
three sets:

  - noise (requests, mostly, or the bitftp discussion)
  - things which I've seen before (new releases or reviews or updates)
  - things which I've never seen before but which might be interesting

then the above-mentioned human can go through a smaller set looking
for the interesting 10-15 articles.

> You recently doubled the effort of making the posting by adding that
> verification of the accessibility of the material.

No extra thought-power here, just a little more time. All of the
packages thus far identified have their home sites identified in a
database; the verification step goes off and fetches the directory
information and sets it aside. Some shell scripts make these easy to
type, but they're basically no-brainers.

> So, is MSEN studying artificial intelligence? Does a VP Research have
> time to be working on a dissertation about information retrieval that
> is analogous to snatching a drink from a fire hose?

No AI here, just honest work. None of the ``information retrieval''
stuff that I've looked at so far has been very well indexed or
cataloged, so my first-pass conclusion is that existing research
efforts aren't very good. Most of the hard-core information retrieval
engines (that cost real money) that would do the sorts of things I'm
doing fail in interesting ways on the usenet problem; it would seem
more likely to be worthwhile to build a functioning "expert system
that reads netnews and looks for articles that belong in
comp.archives" than to go off degree-hunting.
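
[A sketch of the first pass described above, for the curious: it is
just a keyword scan over the day's articles followed by human review.
The spool path, hold directory, and key-word list are guesses for
illustration, not Ed's actual setup.]

    #!/bin/sh
    # First-pass filter (sketch): set aside any article that mentions
    # likely archive-announcement phrases, for a human to read later.
    SPOOL=${1:-/usr/spool/news/in.coming}   # assumption: one article per file
    HOLD=$HOME/archives-candidates
    mkdir -p "$HOLD"

    for art in "$SPOOL"/*
    do
        if egrep -i 'anonymous ftp|available (for|via) ftp|pub/.*\.tar\.Z' "$art" > /dev/null
        then
            cp "$art" "$HOLD"/   # ca. 100-150 survivors out of ca. 10000/day
        fi
    done
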
Building such an expert system would be especially worthwhile if it
could be trained to go after things in the charter of other arbitrary
newsgroups.

Any pointers to other functioning "read all the news and find the
good parts" systems are welcomed -- I've heard of something called
NewsPeek apparently running out at MIT, but nothing that I know of
offhand that runs off the relatively dirty usenet wire rather than
the tidy AP, UPI, or Dow Jones feeds.

--
Edward Vielmetti, moderator, comp.archives, emv@msen.com

"He who hesitates is last;" "The point man takes the hits;"
"It's easier to get forgiveness than permission;" "There's no harm in
asking." Pick your aphorism and live by it. -- Stephen Wolff, NSF
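
[Similarly, the verification step Ed mentions above -- fetching a
directory listing from the announced site and setting it aside --
presumably amounts to a scripted anonymous-ftp session along these
lines. Host, path, and file names are invented for illustration; this
is not Ed's actual script.]

    #!/bin/sh
    # verify-archive HOST DIR -- grab a directory listing from an
    # anonymous ftp site and file it away for the moderator to check.
    HOST=$1                      # e.g. host.domain.org
    DIR=$2                       # e.g. /pub/whizzy
    OUT=$HOME/archives-verify/`echo "$HOST$DIR" | tr / _`
    mkdir -p $HOME/archives-verify

    {
        echo "user anonymous archives@example.org"
        echo "cd $DIR"
        echo "dir"
        echo "quit"
    } | ftp -n "$HOST" > "$OUT" 2>&1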